Rewrite paolina using DataFrames #957
Merged
The idea is that it's easier to move this conversion down the workflow, step by step
Useful when we have written a hypothesis strategy to generate complex data, but we only need one example or a dummy value for property-based testing
1. Beersheba hits don't have Q, for some reason, so we need to fill in the value
2. We need to drop some unused columns and reorder the rest
3. We need to force a type change because there is something wrong in the input file

1 should probably be revisited, 2 will hopefully be worked out once the full event-model removal is done and 3 is a serious WTF with some continuous variables having integer type and needs to be fixed.
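For readers following along, the three workarounds listed above could look roughly like the sketch below. It is only an illustration under stated assumptions: the column names, the `normalize_hits` helper and the NaN placeholder for Q are hypothetical and are not taken from the PR, whose actual schema and fill value may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical column layout; the real hit schema may differ.
EXPECTED_COLUMNS = ["event", "X", "Y", "Z", "Q", "E"]
# Hypothetical list of continuous variables that should be floats.
FLOAT_COLUMNS = ["X", "Y", "Z", "Q", "E"]


def normalize_hits(hits: pd.DataFrame, fill_q: float = np.nan) -> pd.DataFrame:
    """Apply the three workarounds to a hits dataframe (illustrative only)."""
    hits = hits.copy()

    # 1. Fill in Q when the input has no such column.
    #    The placeholder value is an assumption, not the PR's choice.
    if "Q" not in hits.columns:
        hits["Q"] = fill_q

    # 2. Drop unused columns and reorder the remaining ones.
    hits = hits[EXPECTED_COLUMNS]

    # 3. Force continuous variables stored as integers to float.
    hits = hits.astype({col: np.float64 for col in FLOAT_COLUMNS})
    return hits


# Toy input: no Q column, one unused column, X and Z stored as integers.
raw = pd.DataFrame({
    "npeak": [0, 0],
    "event": [1, 1],
    "X": [10, 20],
    "Y": [1.5, 2.5],
    "Z": [100, 110],
    "E": [0.5, 0.7],
})
clean = normalize_hits(raw)
```

Keeping the three steps in a single function makes each one easy to delete once the underlying issue (missing Q, leftover event-model columns, wrong dtypes in the test files) is fixed at the source.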
This is due to the terrible bookkeeping of our test files. For now, this is just a demonstration that the tests pass, i.e., the refactor does not modify the output. In the future, test files will be updated to prevent this from the beginning.
jwaiton
reviewed
Apr 27, 2026
Member
Good work! This is a really important change that I'm excited to see implemented.
I've commented commit-by-commit, so some of the commits may no longer be useful (and some outdated ones may still be, for example I still think the dhits_from_files() docstring is a bit barebones).
jwaiton
approved these changes
May 7, 2026
Member
This PR removes the use of the HitCollection class from paolina, retrofitting all relevant functions to instead work with dataframes. General improvements to the docstrings and code clarity were made in parallel.
Good work!
Paolina functions were written using the HitCollection and Hit classes, which are cumbersome. Pandas DataFrames are a much better tool to handle this type of data and, in practice, we always use them. This PR rewrites some of the paolina functionality, avoiding these unnecessary intermediate objects.