Skip to content

Rewrite paolina using DataFrames#957

Merged
Ian0sborne merged 19 commits into
next-exp:masterfrom
gonzaponte:paolina-df
May 11, 2026
Merged

Rewrite paolina using DataFrames#957
Ian0sborne merged 19 commits into
next-exp:masterfrom
gonzaponte:paolina-df

Conversation

@gonzaponte
Copy link
Copy Markdown
Collaborator

@gonzaponte gonzaponte commented Apr 20, 2026

Paolina functions were written using the HitCollection and Hit classes, which are cumbersome. Pandas' Dataframes are a much better tool to handle this type of data and, in practice, we always use them. This PR rewrites some of the paolina functionality avoiding these unnecessary intermediate objects.

The idea is that it's easier to move this conversion down the workflow,
step by step
Useful when we have written a hypothesis strategy to generate complex
data, but we only need one example or a dummy value for property-based
testing
1. Beersheba hits don't have Q, for some reason, so we need to fill in
   the value
2. We need to drop some unused columns and reorder the rest
3. We need to force a type change because there is something wrong in the
   input file

1 should probably be revisited, 2 will hopefully be worked out once the
full event-model removal is done and 3 is a serious WTF with some
continuous variables having integer type and needs to be fixed.
This is due to the terrible bookkeeping of our test files. For now, this
is just a demonstration that the tests pass, i.e., the refactor does not
modify the output. In the future, test files will be updated to prevent
this from the beginning.
Copy link
Copy Markdown
Member

@jwaiton jwaiton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! This is a really important change that I'm excited to see implemented.

I've commented commit-by-commit, so some of the commits may no longer be useful (and some outdated ones may still be, for example I still think the dhits_from_files() docstring is a bit barebones).

Comment thread invisible_cities/cities/components.py Outdated
Comment thread invisible_cities/cities/components.py Outdated
Comment thread invisible_cities/cities/components.py Outdated
Comment thread invisible_cities/reco/paolina_functions.py Outdated
Comment thread invisible_cities/reco/paolina_functions.py
Comment thread invisible_cities/cities/components.py
Comment thread invisible_cities/cities/components.py
Comment thread invisible_cities/cities/components.py
Comment thread invisible_cities/cities/components.py
Comment thread invisible_cities/museum/penthesilea/penthesilea.py
Copy link
Copy Markdown
Member

@jwaiton jwaiton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR removes the use of the HitCollection class from paolina, retrofitting all relevant functions to instead work with dataframes. General improvements to the docstrings and code clarity were made in parallel.

Good work!

@Ian0sborne Ian0sborne merged commit 0037429 into next-exp:master May 11, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants