Make WD2 processing lazy #73
Conversation
Pull request overview
This PR updates the WaveDump 2 (WD2) decoding path to use the project’s lazy/iterative processing approach so WD2 output better matches the existing WD1 “RAW/…” HDF5 layout (addressing the format mismatch concerns raised in #57 and #69).
Changes:
- Added a WD2 lazy binary reader (`read_binary_lazy`) and a new WD2 lazy decoding entrypoint (`process_bin_WD2_lazy`).
- Switched the processing CLI (`packs/proc/proc.py`) to use the WD2 lazy decoder and updated WD2 config templates accordingly.
- Added a test intended to compare WD2 eager vs lazy reads; expanded `.gitignore` patterns for temporary `.h5` outputs.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `packs/proc/processing_utils.py` | Introduces WD2 lazy read + WD2 lazy decoding/writing logic. |
| `packs/proc/proc.py` | Routes WD2 decoding through the new lazy decoder. |
| `packs/tests/processing_test.py` | Adds WD2 lazy-vs-eager equivalence test and minor formatting tweaks. |
| `packs/tests/data/configs/process_WD2_1channel.conf` | Updates test config to use `print_mod` instead of `counts`. |
| `packs/tests/data/configs/process_WD2_3channel.conf` | Updates test config to use `print_mod` instead of `counts`. |
| `packs/configs/process_WD2_3channel.conf` | Updates example config to use `print_mod` instead of `counts`. |
| `.gitignore` | Broadens ignore pattern for temp `.h5` artifacts. |
```python
Reads the binary in with the expected format/offset, lazily,
depending on counts to break the data up.

NOTE:
The counts are hardset to 1, making this function relatively inefficient.
In the future, the logic should be revised to allow `np.fromfile`'s count
value to be set based on optimal read-in speed. The logic of the WD2 function
will have to accomodate this when indexing the files.

Parameters
----------
file (BufferedReader) : Opened file
wdtype (ndtype) : Custom data type for extracting information from
    binary files
counts (int) : How many events you want to read in. -1 sets it to take all events.
offset (int) : Offset at which to start reading the data. Used for chunking purposes
    and so should by default be set to zero if not chunking.

Returns
-------
data (ndarray) : Unformatted data from binary file
```
The `read_binary_lazy` docstring mentions `counts` and `offset` parameters and chunking behavior, but the function signature only accepts `(file, wdtype)` and always calls `np.fromfile(..., count=1)`. Update the docstring to match the actual API (or add the missing parameters).
Suggested change:

```diff
-Reads the binary in with the expected format/offset, lazily,
-depending on counts to break the data up.
-NOTE:
-The counts are hardset to 1, making this function relatively inefficient.
-In the future, the logic should be revised to allow `np.fromfile`'s count
-value to be set based on optimal read-in speed. The logic of the WD2 function
-will have to accomodate this when indexing the files.
-Parameters
-----------
-file (BufferedReader) : Opened file
-wdtype (ndtype) : Custom data type for extracting information from
-    binary files
-counts (int) : How many events you want to read in. -1 sets it to take all events.
-offset (int) : Offset at which to start reading the data. Used for chunking purposes
-    and so should by default be set to zero if not chunking.
-Returns
--------
-data (ndarray) : Unformatted data from binary file
+Lazily read binary data using the provided numpy dtype.
+
+This function repeatedly calls ``np.fromfile`` with ``count=1`` to read
+one record at a time from the open binary file and yields each record
+as it is read. This makes the function simple to use in streaming
+contexts but relatively inefficient due to the small fixed read size.
+
+NOTE:
+The read size is currently hard-set to 1 element per call to
+``np.fromfile``. In the future, the logic may be revised to allow the
+``count`` value to be tuned for optimal read-in speed; any calling code
+that indexes into the resulting data will need to accommodate such
+changes.
+
+Parameters
+----------
+file (BufferedReader) : Opened binary file object.
+wdtype (numpy.dtype) : Custom data type for extracting information
+    from binary files.
+
+Returns
+-------
+Generator[Tuple[bool, numpy.ndarray], None, None]
+    A generator yielding tuples of the form ``(has_data, data)``:
+    * ``has_data`` is ``True`` while records are being read, and
+      ``False`` once processing has finished.
+    * ``data`` is a numpy array containing the record(s) read by the
+      corresponding call to ``np.fromfile``.
```
```python
# yield 1 when finished
print('Processing Finished!')
yield (False, np.zeros(shape = (1,)))
```
`read_binary_lazy` prints "Processing Finished!" unconditionally and yields a final `(False, zeros)` sentinel. This introduces side effects even when callers set `print_mod=-1` and makes the generator harder to consume (callers must remember to ignore the last value). Prefer ending the generator with `return`/`StopIteration` and let the caller decide whether/when to print completion.
Suggested change:

```diff
-# yield 1 when finished
-print('Processing Finished!')
-yield (False, np.zeros(shape = (1,)))
+# when no more data is available, the generator stops naturally
+return
```
```python
waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
num_of_events = int(file_size / waveform_size)

write('event_info', evt_info, (True, num_of_events, i))
```
`evt_info` returned by `format_wfs()` is a length-1 ndarray when reading lazily (`count=1`). In fixed-size mode, `writer.write()` assigns a single row at index, so passing an ndarray here can raise a shape/type error. Write the scalar record instead (e.g., `evt_info[0]`) to match how `process_bin_WD1` passes a single structured row.
Suggested change:

```diff
-write('event_info', evt_info, (True, num_of_events, i))
+# In lazy mode, format_wfs can return a length-1 ndarray; unwrap to a single record
+if isinstance(evt_info, np.ndarray) and getattr(evt_info, "shape", ()) and evt_info.shape[0] == 1:
+    evt_row = evt_info[0]
+else:
+    evt_row = evt_info
+write('event_info', evt_row, (True, num_of_events, i))
```
```python
for i, (flag, array) in enumerate(read_binary_lazy(file, wdtype)):

    if (i % print_mod == 0) and (print_mod != -1):
```
The `print_mod` logic uses `i % print_mod` before validating `print_mod`. If a user sets `print_mod=0`, this will raise `ZeroDivisionError`. Consider validating `print_mod` (e.g., require -1 or a positive int) before entering the loop, or treat 0 as disabled printing.
Suggested change:

```diff
-if (i % print_mod == 0) and (print_mod != -1):
+if print_mod not in (-1, 0) and (i % print_mod == 0):
```
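The guarded condition can be sketched as a small helper (the helper name is invented for illustration; the real code inlines the check):

```python
def should_print(i, print_mod):
    """Treat -1 and 0 as 'printing disabled' so i % print_mod is never
    evaluated with a zero divisor."""
    return print_mod not in (-1, 0) and i % print_mod == 0

# 0 and -1 disable printing instead of raising ZeroDivisionError
checks = [should_print(0, 5), should_print(3, 5),
          should_print(10, 0), should_print(10, -1)]
```

Short-circuit evaluation guarantees the modulo is only computed for valid values.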
```python
# first run-through, collect the header information to extract table size
if i == 0:
    file_size = os.path.getsize(file_path)
    waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
```
The inline comment "can't remember why `*2`" undermines confidence in the `waveform_size` calculation and can mislead future changes. Either remove it or replace it with a concrete explanation of the byte-size formula (header bytes + `channels * samples * 4` for float32).
Suggested change:

```diff
-waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
+waveform_size = ((samples * channels * 4) + header_size)  # bytes per event: header bytes + channels * samples * 4 (float32 samples)
```
```python
lazy_data = []
with open(file_path) as f:
    binary_lazy_readout = read_binary_lazy(f, wdtype)
    for i in range(0, counts):
        _, lazy_wf = next(binary_lazy_readout)
        lazy_data.append(lazy_wf)

# open eager data
with open(file_path) as f:
    data = read_binary(f, wdtype, counts)

for i in range(0, counts):
    assert data[i] == lazy_data[i]
```
This test opens the binary files in text mode (`open(file_path)`), but `np.fromfile` expects a binary file handle; on some platforms this can corrupt reads. Also, `read_binary_lazy` yields an ndarray of length 1, so `lazy_data[i]` is not comparable to `data[i]` (a single structured record). Open with `'rb'` and compare `data[i]` to the scalar record from lazy loading (e.g., `lazy_wf[0]`).
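A corrected version of the equivalence check might look like the sketch below. The reader functions are simplified stand-ins for the project's `read_binary`/`read_binary_lazy` (the lazy stand-in uses `file.read` plus `np.frombuffer` instead of `np.fromfile`), and the dtype and record values are invented for the example:

```python
import os
import tempfile
import numpy as np

# invented record layout standing in for the WD2 dtype
wdtype = np.dtype([('evt', np.int32), ('adc', np.float32)])

def read_binary(f, dtype, counts):
    # eager stand-in: one call reads `counts` records
    return np.fromfile(f, dtype=dtype, count=counts)

def read_binary_lazy(f, dtype):
    # lazy stand-in: yields (flag, length-1 array) like the PR's generator
    itemsize = dtype.itemsize
    while True:
        chunk = f.read(itemsize)
        if len(chunk) < itemsize:
            return
        yield True, np.frombuffer(chunk, dtype=dtype)

records = np.zeros(3, dtype=wdtype)
records['evt'] = [1, 2, 3]
records['adc'] = [0.5, 1.5, 2.5]

with tempfile.NamedTemporaryFile(suffix='.bin', delete=False) as tmp:
    tmp.write(records.tobytes())
    path = tmp.name

counts = 3
# fix 1: open in binary mode ('rb'), not text mode
# fix 2: unwrap the length-1 lazy array (lazy_wf[0]) before comparing
with open(path, 'rb') as f:
    gen = read_binary_lazy(f, wdtype)
    lazy_data = [next(gen)[1][0] for _ in range(counts)]

with open(path, 'rb') as f:
    data = read_binary(f, wdtype, counts)

for i in range(counts):
    assert data[i]['evt'] == lazy_data[i]['evt']
    assert data[i]['adc'] == lazy_data[i]['adc']

os.remove(path)
```

Comparing individual fields sidesteps any ambiguity in structured-scalar `==` semantics across numpy versions.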
Type hinting in `def read_binary_lazy(file : BinaryIO, ...)` — should this be `<class '_io.BufferedReader'>`?
The `BinaryIO` typing object represents the `io.BufferedReader` class. Whoops!
```diff
-                process_bin_WD2(**conf_dict)
+                process_bin_WD2_lazy(**conf_dict)
             case other:
                 raise RuntimeError(f"wavedump edition {other} decoding isn't currently implemented.")
```
`proc()` now always routes WD2 decoding to `process_bin_WD2_lazy`, which does not accept the previously-supported `counts` argument used for chunking in `process_bin_WD2`. Any existing config files that still include `counts` will now fail with an unexpected keyword argument. Consider keeping backward compatibility (e.g., accept/ignore `counts`, or dispatch to the eager implementation when `counts` is provided).
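One backward-compatible option can be sketched as below. The decoder bodies are stubs (the real `process_bin_WD2`/`process_bin_WD2_lazy` live in `packs/proc/processing_utils.py`), and the dispatch helper name is invented for illustration:

```python
def process_bin_WD2(**kwargs):
    # stub standing in for the eager decoder
    return ('eager', kwargs.get('counts'))

def process_bin_WD2_lazy(**kwargs):
    # stub standing in for the lazy decoder
    return ('lazy', None)

def dispatch_wd2(conf_dict):
    """Route configs that still carry a legacy 'counts' key to the eager
    decoder; otherwise strip the key and use the lazy decoder."""
    if conf_dict.get('counts') is not None:
        return process_bin_WD2(**conf_dict)
    trimmed = {k: v for k, v in conf_dict.items() if k != 'counts'}
    return process_bin_WD2_lazy(**trimmed)

result_legacy = dispatch_wd2({'counts': 500})
result_new = dispatch_wd2({})
```

Stripping the key before the lazy call keeps old config files working without widening the lazy decoder's signature.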
MattZur left a comment:
Will need a rebase prior to approval, I think.
```python
Reads the binary in with the expected format/offset, lazily,
depending on counts to break the data up.

NOTE:
```
Has an issue already been opened for this?
No, but one should be opened after this review. I believe WD1 has the same issue.
```python
WAVEDUMP 2: Takes a binary file and outputs the containing waveform information in a h5 file.

For particularly large waveforms/number of events. You can 'chunk' the data such that
each dataset holds `counts` events.
```
As chunking has been deprecated, it should be removed from documentation except for backwards compatibility.
```python
print(f'\nData input : {file_path}\nData output : {save_path}')

# collect header info
wdtype, samples, sampling_period, channels = process_header(file_path)
```
Instead of opening the file twice, can `process_header` be changed to fit after:

```python
# open file for reading
with open(file_path, 'rb') as file:
```
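A sketch of that restructure, assuming `process_header` could be changed to accept an already-open file object (a hypothetical signature; the header layout and return values here are invented for illustration):

```python
import io
import numpy as np

def process_header(file):
    # hypothetical: read a small fixed-size header from the open handle
    samples, channels = np.frombuffer(file.read(8), dtype=np.int32)
    wdtype = np.dtype([('adc', np.float32, int(samples) * int(channels))])
    return wdtype, int(samples), int(channels)

# build a fake binary "file": 8-byte header + one event payload
buf = io.BytesIO()
buf.write(np.array([4, 2], dtype=np.int32).tobytes())
buf.write(np.zeros(8, dtype=np.float32).tobytes())
buf.seek(0)

# open once: header parsing and the lazy read share the same handle
wdtype, samples, channels = process_header(buf)
payload = np.frombuffer(buf.read(), dtype=wdtype)
```

The handle is left positioned just past the header, so the subsequent read starts at the first event without reopening the file.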
```python
def process_bin_WD2_lazy(file_path : str,
```
Is there a need to keep both `process_bin_WD2_lazy` and `process_bin_WD2`?
There's no reason to remove the eager methods yet, as they will be 'faster' than the lazy method. Once implementation of 'chunks' in the lazy method is done, we can remove `process_bin_WD2`.
```python
num_of_events = int(file_size / waveform_size)

write('event_info', evt_info, (True, num_of_events, i))
# writer only takes one row at a time, can't broadcast all three at once
```
I would perhaps explain here, or on another page, how lazy reading and lazy writing work in mule.
It uses a generator (`read_binary_lazy()` for WD2) that yields raw arrays from the binary file, and then reformats them and writes them to a file. I can add more comments if that's what you wish :)
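That flow can be illustrated with a minimal stand-alone sketch. The names here are stand-ins for mule's actual API: the 'format' step is a plain tuple in place of `format_wfs`, and the 'write' step is a Python list in place of the HDF5 writer.

```python
import io
import numpy as np

def read_lazy(file, dtype):
    """Stand-in for read_binary_lazy(): yield one raw record at a time."""
    itemsize = np.dtype(dtype).itemsize
    while True:
        chunk = file.read(itemsize)
        if len(chunk) < itemsize:
            return
        yield np.frombuffer(chunk, dtype=dtype)

raw = io.BytesIO(np.arange(4, dtype=np.int32).tobytes())
written_rows = []
for i, record in enumerate(read_lazy(raw, np.int32)):
    # 'format' step (stand-in for format_wfs), then row-by-row 'write'
    written_rows.append((i, int(record[0])))
```

Only one record is ever held in memory at a time, which is the whole point of the lazy pipeline.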
```python
# first run-through, collect the header information to extract table size
if i == 0:
    file_size = os.path.getsize(file_path)
    waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
```
Does this need a test? Perhaps an issue for technical debt.
Because it's not a function in itself, it would be hard to create a test for this. But yes, there should be a basic logic test for it.
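A basic logic test could look like this sketch (the `header_size` value of 24 bytes and the helper name are assumptions for the example, not WD2's actual header size):

```python
def waveform_size(samples, channels, header_size):
    # bytes per event: header bytes + channels * samples * 4 (float32)
    return (samples * channels * 4) + header_size

size = waveform_size(1024, 2, 24)
# a file holding exactly 10 events should decode to 10
num_of_events = int((10 * size) / size)
```

Extracting the arithmetic into a helper is what makes it testable; inline expressions inside the loop cannot be unit-tested directly.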
For completeness, parameters could be added to `test_decode_produces_expected_output` for WD2 as well. This has been done for the Lecroy scope as a precedent.
It's done as `test_WD1_decode_produces_expected_output`. We can keep combining these together, but actually looking at the `@mark.parametrize` for `test_decode_produces_expected_output`, I think they should all be separated (it's looking messy).
This is suboptimal; as described in the function's comment, an issue to resolve this will be opened.
As should have been addressed in nu-ZOO#67, but appears to have reared its head: temporary files are once again being created, with increasing numbers. This needs to be resolved, as the initial issue is no longer fixed!
Force-pushed from 297de17 to bd63930.
This PR introduces the standard lazy processing for WD2, as has been required for quite some time now!
This also means that the data formats should now match in an h5 context, resolving #57 and #69.
The PR has raised a few new concerns: