Make WD2 processing lazy #73
Conversation
Pull request overview
This PR updates the WaveDump 2 (WD2) decoding path to use the project’s lazy/iterative processing approach so WD2 output better matches the existing WD1 “RAW/…” HDF5 layout (addressing the format mismatch concerns raised in #57 and #69).
Changes:
- Added a WD2 lazy binary reader (`read_binary_lazy`) and a new WD2 lazy decoding entrypoint (`process_bin_WD2_lazy`).
- Switched the processing CLI (`packs/proc/proc.py`) to use the WD2 lazy decoder and updated WD2 config templates accordingly.
- Added a test intended to compare WD2 eager vs lazy reads; expanded `.gitignore` patterns for temporary `.h5` outputs.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `packs/proc/processing_utils.py` | Introduces WD2 lazy read + WD2 lazy decoding/writing logic. |
| `packs/proc/proc.py` | Routes WD2 decoding through the new lazy decoder. |
| `packs/tests/processing_test.py` | Adds WD2 lazy-vs-eager equivalence test and minor formatting tweaks. |
| `packs/tests/data/configs/process_WD2_1channel.conf` | Updates test config to use `print_mod` instead of `counts`. |
| `packs/tests/data/configs/process_WD2_3channel.conf` | Updates test config to use `print_mod` instead of `counts`. |
| `packs/configs/process_WD2_3channel.conf` | Updates example config to use `print_mod` instead of `counts`. |
| `.gitignore` | Broadens ignore pattern for temp `.h5` artifacts. |
```python
Reads the binary in with the expected format/offset, lazily,
depending on counts to break the data up.

NOTE:
The counts are hardset to 1, making this function relatively inefficient.
In the future, the logic should be revised to allow `np.fromfile`'s count
value to be set based on optimal read-in speed. The logic of the WD2 function
will have to accomodate this when indexing the files.

Parameters
----------
file (BufferedReader) : Opened file
wdtype (ndtype) : Custom data type for extracting information from
    binary files
counts (int) : How many events you want to read in. -1 sets it to take all events.
offset (int) : Offset at which to start reading the data. Used for chunking purposes
    and so should by default be set to zero if not chunking.

Returns
-------
data (ndarray) : Unformatted data from binary file
```
The `read_binary_lazy` docstring mentions `counts` and `offset` parameters and chunking behavior, but the function signature only accepts `(file, wdtype)` and always calls `np.fromfile(..., count=1)`. Update the docstring to match the actual API (or add the missing parameters).
Suggested change:

```diff
-Reads the binary in with the expected format/offset, lazily,
-depending on counts to break the data up.
-NOTE:
-The counts are hardset to 1, making this function relatively inefficient.
-In the future, the logic should be revised to allow `np.fromfile`'s count
-value to be set based on optimal read-in speed. The logic of the WD2 function
-will have to accomodate this when indexing the files.
-Parameters
-----------
-file (BufferedReader) : Opened file
-wdtype (ndtype) : Custom data type for extracting information from
-    binary files
-counts (int) : How many events you want to read in. -1 sets it to take all events.
-offset (int) : Offset at which to start reading the data. Used for chunking purposes
-    and so should by default be set to zero if not chunking.
-Returns
--------
-data (ndarray) : Unformatted data from binary file
+Lazily read binary data using the provided numpy dtype.
+
+This function repeatedly calls ``np.fromfile`` with ``count=1`` to read
+one record at a time from the open binary file and yields each record
+as it is read. This makes the function simple to use in streaming
+contexts but relatively inefficient due to the small fixed read size.
+
+NOTE:
+The read size is currently hard-set to 1 element per call to
+``np.fromfile``. In the future, the logic may be revised to allow the
+``count`` value to be tuned for optimal read-in speed; any calling code
+that indexes into the resulting data will need to accommodate such
+changes.
+
+Parameters
+----------
+file (BufferedReader) : Opened binary file object.
+wdtype (numpy.dtype) : Custom data type for extracting information
+    from binary files.
+
+Returns
+-------
+Generator[Tuple[bool, numpy.ndarray], None, None]
+    A generator yielding tuples of the form ``(has_data, data)``:
+    * ``has_data`` is ``True`` while records are being read, and
+      ``False`` once processing has finished.
+    * ``data`` is a numpy array containing the record(s) read by the
+      corresponding call to ``np.fromfile``.
```
```python
# yield 1 when finished
print('Processing Finished!')
yield (False, np.zeros(shape = (1,)))
```
`read_binary_lazy` prints "Processing Finished!" unconditionally and yields a final `(False, zeros)` sentinel. This introduces side effects even when callers set `print_mod=-1` and makes the generator harder to consume (callers must remember to ignore the last value). Prefer ending the generator with `return`/`StopIteration` and let the caller decide whether/when to print completion.
Suggested change:

```diff
-# yield 1 when finished
-print('Processing Finished!')
-yield (False, np.zeros(shape = (1,)))
+# when no more data is available, the generator stops naturally
+return
```
```python
waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
num_of_events = int(file_size / waveform_size)

write('event_info', evt_info, (True, num_of_events, i))
```
`evt_info` returned by `format_wfs()` is a length-1 ndarray when reading lazily (`count=1`). In fixed-size mode, `writer.write()` assigns a single row at index, so passing an ndarray here can raise a shape/type error. Write the scalar record instead (e.g., `evt_info[0]`) to match how `process_bin_WD1` passes a single structured row.
Suggested change:

```diff
-write('event_info', evt_info, (True, num_of_events, i))
+# In lazy mode, format_wfs can return a length-1 ndarray; unwrap to a single record
+if isinstance(evt_info, np.ndarray) and getattr(evt_info, "shape", ()) and evt_info.shape[0] == 1:
+    evt_row = evt_info[0]
+else:
+    evt_row = evt_info
+write('event_info', evt_row, (True, num_of_events, i))
```
```python
for i, (flag, array) in enumerate(read_binary_lazy(file, wdtype)):

    if (i % print_mod == 0) and (print_mod != -1):
```
The `print_mod` logic uses `i % print_mod` before validating `print_mod`. If a user sets `print_mod=0`, this will raise `ZeroDivisionError`. Consider validating `print_mod` (e.g., require -1 or a positive int) before entering the loop, or treat 0 as disabled printing.
Suggested change:

```diff
-if (i % print_mod == 0) and (print_mod != -1):
+if print_mod not in (-1, 0) and (i % print_mod == 0):
```
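The guarded condition can be sketched as a small helper (the helper name is invented for illustration; the real code inlines the check):

```python
def should_print(i, print_mod):
    """Treat -1 and 0 as 'printing disabled' so i % print_mod is never
    evaluated with a zero divisor."""
    return print_mod not in (-1, 0) and i % print_mod == 0

# 0 and -1 disable printing instead of raising ZeroDivisionError
checks = [should_print(0, 5), should_print(3, 5),
          should_print(10, 0), should_print(10, -1)]
```

Short-circuit evaluation guarantees the modulo is only computed for valid values.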
```python
# first run-through, collect the header information to extract table size
if i == 0:
    file_size = os.path.getsize(file_path)
    waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
```
The inline comment "can't remember why `*2`" undermines confidence in the `waveform_size` calculation and can mislead future changes. Either remove it or replace it with a concrete explanation of the byte-size formula (header bytes + `channels * samples * 4` for float32).
Suggested change:

```diff
-waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
+waveform_size = ((samples * channels * 4) + header_size)  # bytes per event: header bytes + channels * samples * 4 (float32 samples)
```
```python
lazy_data = []
with open(file_path) as f:
    binary_lazy_readout = read_binary_lazy(f, wdtype)
    for i in range(0, counts):
        _, lazy_wf = next(binary_lazy_readout)
        lazy_data.append(lazy_wf)

# open eager data
with open(file_path) as f:
    data = read_binary(f, wdtype, counts)

for i in range(0, counts):
    assert data[i] == lazy_data[i]
```
This test opens the binary files in text mode (`open(file_path)`), but `np.fromfile` expects a binary file handle; on some platforms this can corrupt reads. Also, `read_binary_lazy` yields an ndarray of length 1, so `lazy_data[i]` is not comparable to `data[i]` (a single structured record). Open with `'rb'` and compare `data[i]` to the scalar record from lazy loading (e.g., `lazy_wf[0]`).
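A corrected version of the equivalence check might look like the sketch below. The reader functions are simplified stand-ins for the project's `read_binary`/`read_binary_lazy` (the lazy stand-in uses `file.read` plus `np.frombuffer` instead of `np.fromfile`), and the dtype and record values are invented for the example:

```python
import os
import tempfile
import numpy as np

# invented record layout standing in for the WD2 dtype
wdtype = np.dtype([('evt', np.int32), ('adc', np.float32)])

def read_binary(f, dtype, counts):
    # eager stand-in: one call reads `counts` records
    return np.fromfile(f, dtype=dtype, count=counts)

def read_binary_lazy(f, dtype):
    # lazy stand-in: yields (flag, length-1 array) like the PR's generator
    itemsize = dtype.itemsize
    while True:
        chunk = f.read(itemsize)
        if len(chunk) < itemsize:
            return
        yield True, np.frombuffer(chunk, dtype=dtype)

records = np.zeros(3, dtype=wdtype)
records['evt'] = [1, 2, 3]
records['adc'] = [0.5, 1.5, 2.5]

with tempfile.NamedTemporaryFile(suffix='.bin', delete=False) as tmp:
    tmp.write(records.tobytes())
    path = tmp.name

counts = 3
# fix 1: open in binary mode ('rb'), not text mode
# fix 2: unwrap the length-1 lazy array (lazy_wf[0]) before comparing
with open(path, 'rb') as f:
    gen = read_binary_lazy(f, wdtype)
    lazy_data = [next(gen)[1][0] for _ in range(counts)]

with open(path, 'rb') as f:
    data = read_binary(f, wdtype, counts)

for i in range(counts):
    assert data[i]['evt'] == lazy_data[i]['evt']
    assert data[i]['adc'] == lazy_data[i]['adc']

os.remove(path)
```

Comparing individual fields sidesteps any ambiguity in structured-scalar `==` semantics across numpy versions.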
Type hinting in `def read_binary_lazy(file : BinaryIO, ...)` — should this be `<class '_io.BufferedReader'>`?
The `BinaryIO` typing object represents the `io.BufferedReader` class. Whoops!
```diff
-                process_bin_WD2(**conf_dict)
+                process_bin_WD2_lazy(**conf_dict)
             case other:
                 raise RuntimeError(f"wavedump edition {other} decoding isn't currently implemented.")
```
`proc()` now always routes WD2 decoding to `process_bin_WD2_lazy`, which does not accept the previously-supported `counts` argument used for chunking in `process_bin_WD2`. Any existing config files that still include `counts` will now fail with an unexpected keyword argument. Consider keeping backward compatibility (e.g., accept/ignore `counts`, or dispatch to the eager implementation when `counts` is provided).
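One backward-compatible option can be sketched as below. The decoder bodies are stubs (the real `process_bin_WD2`/`process_bin_WD2_lazy` live in `packs/proc/processing_utils.py`), and the dispatch helper name is invented for illustration:

```python
def process_bin_WD2(**kwargs):
    # stub standing in for the eager decoder
    return ('eager', kwargs.get('counts'))

def process_bin_WD2_lazy(**kwargs):
    # stub standing in for the lazy decoder
    return ('lazy', None)

def dispatch_wd2(conf_dict):
    """Route configs that still carry a legacy 'counts' key to the eager
    decoder; otherwise strip the key and use the lazy decoder."""
    if conf_dict.get('counts') is not None:
        return process_bin_WD2(**conf_dict)
    trimmed = {k: v for k, v in conf_dict.items() if k != 'counts'}
    return process_bin_WD2_lazy(**trimmed)

result_legacy = dispatch_wd2({'counts': 500})
result_new = dispatch_wd2({})
```

Stripping the key before the lazy call keeps old config files working without widening the lazy decoder's signature.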
MattZur left a comment:
Will need a rebase prior to approval, I think.
```python
Reads the binary in with the expected format/offset, lazily,
depending on counts to break the data up.

NOTE:
```
Has an issue already been opened for this?
No, but one should be opened after this review. I believe WD1 has the same issue.
```python
WAVEDUMP 2: Takes a binary file and outputs the containing waveform information in a h5 file.

For particularly large waveforms/number of events. You can 'chunk' the data such that
each dataset holds `counts` events.
```
As chunking has been deprecated, it should be removed from documentation except for backwards compatibility.
```python
print(f'\nData input : {file_path}\nData output : {save_path}')

# collect header info
wdtype, samples, sampling_period, channels = process_header(file_path)
```
Instead of opening the file twice, can `process_header` be changed to fit after:

```python
# open file for reading
with open(file_path, 'rb') as file:
```
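A sketch of that restructure, assuming `process_header` could be changed to accept an already-open file object (a hypothetical signature; the header layout and return values here are invented for illustration):

```python
import io
import numpy as np

def process_header(file):
    # hypothetical: read a small fixed-size header from the open handle
    samples, channels = np.frombuffer(file.read(8), dtype=np.int32)
    wdtype = np.dtype([('adc', np.float32, int(samples) * int(channels))])
    return wdtype, int(samples), int(channels)

# build a fake binary "file": 8-byte header + one event payload
buf = io.BytesIO()
buf.write(np.array([4, 2], dtype=np.int32).tobytes())
buf.write(np.zeros(8, dtype=np.float32).tobytes())
buf.seek(0)

# open once: header parsing and the lazy read share the same handle
wdtype, samples, channels = process_header(buf)
payload = np.frombuffer(buf.read(), dtype=wdtype)
```

The handle is left positioned just past the header, so the subsequent read starts at the first event without reopening the file.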
```python
def process_bin_WD2_lazy(file_path : str,
```
Is there a need to keep both `process_bin_WD2_lazy` and `process_bin_WD2`?
There's no reason to remove the eager methods yet, as they will be 'faster' than the lazy method. Once implementation of 'chunks' in the lazy method is done, we can remove `process_bin_WD2`.
```python
num_of_events = int(file_size / waveform_size)

write('event_info', evt_info, (True, num_of_events, i))
# writer only takes one row at a time, can't broadcast all three at once
```
I would perhaps explain here, or on another page, how lazy reading and lazy writing work in mule.
It uses a generator (`read_binary_lazy()` for WD2) that yields raw arrays from the binary file, and then reformats them and writes them to a file. I can add more comments if that's what you wish :)
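That flow can be illustrated with a minimal stand-alone sketch. The names here are stand-ins for mule's actual API: the 'format' step is a plain tuple in place of `format_wfs`, and the 'write' step is a Python list in place of the HDF5 writer.

```python
import io
import numpy as np

def read_lazy(file, dtype):
    """Stand-in for read_binary_lazy(): yield one raw record at a time."""
    itemsize = np.dtype(dtype).itemsize
    while True:
        chunk = file.read(itemsize)
        if len(chunk) < itemsize:
            return
        yield np.frombuffer(chunk, dtype=dtype)

raw = io.BytesIO(np.arange(4, dtype=np.int32).tobytes())
written_rows = []
for i, record in enumerate(read_lazy(raw, np.int32)):
    # 'format' step (stand-in for format_wfs), then row-by-row 'write'
    written_rows.append((i, int(record[0])))
```

Only one record is ever held in memory at a time, which is the whole point of the lazy pipeline.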
```python
# first run-through, collect the header information to extract table size
if i == 0:
    file_size = os.path.getsize(file_path)
    waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this
```
Does this need a test? Perhaps an issue for technical debt.
Because it's not a function in itself, it would be hard to create a test for this. But yes, there should be a basic logic test for it.
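A basic logic test could look like this sketch (the `header_size` value of 24 bytes and the helper name are assumptions for the example, not WD2's actual header size):

```python
def waveform_size(samples, channels, header_size):
    # bytes per event: header bytes + channels * samples * 4 (float32)
    return (samples * channels * 4) + header_size

size = waveform_size(1024, 2, 24)
# a file holding exactly 10 events should decode to 10
num_of_events = int((10 * size) / size)
```

Extracting the arithmetic into a helper is what makes it testable; inline expressions inside the loop cannot be unit-tested directly.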
For completeness, parameters could be added to `test_decode_produces_expected_output` for WD2 as well. This has been done for the Lecroy scope as a precedent.
It's done as `test_WD1_decode_produces_expected_output`. We can keep combining these together, but actually looking at the `@mark.parametrize` for `test_decode_produces_expected_output`, I think they should all be separated (it's looking messy).
This is suboptimal; as described in the function's comment, an issue to resolve this will be opened.
As should have been addressed in nu-ZOO#67, but appears to have reared its head: temporary files are once again being created, with increasing numbers. This needs to be resolved, as the initial issue is no longer fixed!
Force-pushed from 297de17 to bd63930.
This PR introduces the standard lazy processing for WD2, as has been required for quite some time now!
This also means that the data formats should now match in an h5 context, resolving #57 and #69.
The PR has raised a few new concerns: