
Compress waveform data with flacarray #98

Open
Copilot wants to merge 25 commits into pegasus from copilot/compress-waveform-with-flacarray

Conversation


Copilot AI commented Mar 15, 2026

I did allow AI to begin writing this PR, but I ended up throwing away 90% of what it produced: it did not meaningfully improve the resulting compression ratios, wrote some pretty janky code, and introduced significant errors into the compressed data.

What does this PR do?

This PR adds FLAC compression of waveform data, producing significantly smaller waveforms without significant loss of precision. This is useful for publications, archival storage, and for researchers creating figures or inspecting results where the size of the original data is simply too big to work with. It achieves compression ratios of around 8.75x (and higher, as shown below) compared to the original waveform, with a relatively straightforward implementation and error close to the machine-precision noise floor.

For a test, I took one of the broadband waveforms from ayushi's dataset and compressed it using the standard options.

.rw-r--r--  5.4G jake 16 Mar 01:45  realisation.bb
.rw-r--r--  5.4G jake 16 Mar 01:43  realisation.lf
.rw-r--r--  345M jake 16 Mar 16:27  realisation_compressed.lf
.rw-r--r--  510M jake 16 Mar 16:26  realisation_compressed_bb.h5

From this we can see that the broadband data achieved a >10x compression ratio, while the lf data achieved a ratio of 15.6x. The low-frequency data will always compress better than broadband data because it is smoother, and FLAC's polynomial-prediction codec handles smooth data extremely well.
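This intuition can be checked with a quick NumPy sketch (illustrative only, not code from this PR): FLAC's fixed predictors are low-order polynomial predictors whose residuals amount to repeated first differences of the samples, and differencing shrinks a smooth signal's residual far more than a noisy one's, so Rice coding needs fewer bits per sample.

```python
import numpy as np

# A smooth (low-frequency) signal vs. a noisy (broadband) one,
# both quantized to integers as FLAC sees them.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000)
smooth = np.round(1000 * np.sin(2 * np.pi * 3 * t)).astype(np.int64)
noisy = rng.integers(-1000, 1000, 1000)

# Residual of an order-2 fixed predictor = second difference of the samples.
smooth_residual = np.abs(np.diff(smooth, n=2)).max()
noisy_residual = np.abs(np.diff(noisy, n=2)).max()
print(smooth_residual, noisy_residual)  # the smooth residual is far smaller
```

The smooth signal's residuals are a handful of counts, while the noisy signal's residuals stay in the thousands, which is why the lf files compress so much better.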

To actually read the array, I have introduced a waveform module that uses some xarray machinery to produce a dataset that looks like an ordinary dataset but decompresses the waveform on the fly when you select data.

>>> from workflow import waveform
>>> dset = waveform.load_waveform_dataset('/home/jake/tmp/292203/realisation_compressed.h5')
>>> dset
<xarray.Dataset> Size: 5GB
Dimensions:    (component: 3, station: 14485, time: 30820)
Coordinates:
  * component  (component) <U1 12B 'x' 'y' 'z'
  * station    (station) <U7 406kB 'utMuUSA' 'u4aAgSA' ... 'pkz3SHB' 'vmp4YSA'
    x          (station) int32 58kB ...
    y          (station) int32 58kB ...
    latitude   (station) float32 58kB ...
    longitude  (station) float32 58kB ...
  * time       (time) float64 247kB -3.0 -2.995 -2.99 ... 151.1 151.1 151.1
Data variables:
    waveform   (component, station, time) float32 5GB dask.array<chunksize=(3, 3344, 3344), meta=np.ndarray>
Attributes:
    units:    g

Note that although xarray reports the data as 5 GB, the Python process actually uses far less than this:
(screenshot: process memory usage)
This ~10x reduction in memory usage comes from reading only the compressed data from disk. Because this is an xarray dataset you can do all the normal things we do with xarray datasets; you just need to add .compute() to tell xarray to actually go ahead and do the decompression.

>>> real_stations = dset.waveform.sel(station=dset.station.str.match(r'^\w{2,4}$')) # happens instantly, selects all real stations
>>> real_stations
<xarray.DataArray 'waveform' (component: 3, station: 64, time: 30820)> Size: 24MB
dask.array<getitem, shape=(3, 64, 30820), dtype=float32, chunksize=(3, 64, 3344), chunktype=numpy.ndarray>
Coordinates:
  * component  (component) <U1 12B 'x' 'y' 'z'
  * station    (station) <U7 2kB 'NCHS' 'TOZ' 'WFSS' ... 'KUZ' 'TDHS' 'OPCS'
    x          (station) int32 256B ...
    y          (station) int32 256B ...
    latitude   (station) float32 256B ...
    longitude  (station) float32 256B ...
  * time       (time) float64 247kB -3.0 -2.995 -2.99 ... 151.1 151.1 151.1
>>> actual_values = real_stations.compute() # converts the dask array into a numpy array, takes some time to decompress the waveforms
>>> actual_values
<xarray.DataArray 'waveform' (component: 3, station: 64, time: 30820)> Size: 24MB
array([[[ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.9850486e-06, -1.9850486e-06, -1.9850486e-06],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          5.1158713e-07,  5.1158713e-07,  5.1158713e-07],
        [ 2.7117494e-08,  2.7117494e-08,  2.7117494e-08, ...,
          2.4404289e-07,  2.4404289e-07,  2.4404289e-07],
        ...,
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.2280907e-07, -1.2280907e-07, -1.2280907e-07],
        [-4.3233740e-08, -4.3233740e-08, -4.3233740e-08, ...,
          1.2970122e-07,  1.2970122e-07,  1.2970122e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          7.4622221e-07,  7.4622221e-07,  7.4622221e-07]],

       [[ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -9.3293784e-07, -9.3293784e-07, -9.3293784e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.3694080e-07, -1.3694080e-07, -1.3694080e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -8.7871740e-08, -8.7871740e-08, -8.7871740e-08],
...
         -3.9649149e-07, -3.9649149e-07, -3.9649149e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -4.6030618e-07, -4.6030618e-07, -4.6030618e-07],
        [-1.4156103e-07, -1.4156103e-07, -1.4156103e-07, ...,
         -1.4156103e-07, -1.4156103e-07, -1.4156103e-07]],

       [[ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -2.9094554e-07, -2.9094554e-07, -2.9094554e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -4.2664396e-08, -4.2664396e-08, -4.2664396e-08],
        [ 1.0498866e-08,  1.0498866e-08,  1.0498866e-08, ...,
          6.8242593e-07,  6.8242593e-07,  6.8242593e-07],
        ...,
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          2.7099122e-07,  2.7099122e-07,  2.7099122e-07],
        [ 1.6589183e-08,  1.6589183e-08,  1.6589183e-08, ...,
         -1.3269164e-07, -1.3269164e-07, -1.3269164e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -6.6182110e-08, -6.6182110e-08, -6.6182110e-08]]],
      shape=(3, 64, 30820), dtype=float32)
Coordinates:
  * component  (component) <U1 12B 'x' 'y' 'z'
  * station    (station) <U7 2kB 'NCHS' 'TOZ' 'WFSS' ... 'KUZ' 'TDHS' 'OPCS'
    x          (station) int32 256B 3235 1294 3429 3821 ... 3415 994 3538 2850
    y          (station) int32 256B 2491 1221 1823 839 2463 ... 403 148 63 839
    latitude   (station) float32 256B -39.51 -37.73 -39.03 ... -37.63 -38.01
    longitude  (station) float32 256B 176.9 175.5 177.4 ... 175.7 178.4 177.3
  * time       (time) float64 247kB -3.0 -2.995 -2.99 ... 151.1 151.1 151.1

Reading the real stations from the original dataset and comparing shows a maximum difference below the 32-bit floating-point noise floor, demonstrating that the values are recovered essentially losslessly.

>>> import numpy as np
>>> np.abs(orig_real_stations - actual_values).max()
<xarray.DataArray 'waveform' ()> Size: 4B
array(9.872019e-08, dtype=float32)

Copilot AI and others added 2 commits March 15, 2026 23:50
- New workflow script: workflow/scripts/compress_waveform.py
- Reads bb_sim HDF5 output, compresses waveform data with FlacArray
- Preserves coordinates and attributes in the compressed output
- Registered as compress-waveform CLI command
- Added flacarray dependency to pyproject.toml

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
…r string detection, remove unused numpy import

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title [WIP] Add workflow stage to compress waveform hdf5 file with FlacArray Add compress-waveform stage using FlacArray Mar 15, 2026
Copilot AI requested a review from lispandfound March 15, 2026 23:53
Copilot AI and others added 2 commits March 16, 2026 00:09
…k chunking

- Scale waveform data to int16 range for efficient FLAC bit-shunting
- Apply delta encoding (first differences) to reduce variance for Rice coding
- Use dask chunking when loading input dataset
- Preserve all xarray metadata (coords, attrs, dims, dtype)
- Add decompress_waveform() to reconstruct xr.Dataset from compressed file
- Add tests for roundtrip correctness, compression efficiency, and metadata
- Add dask dependency for chunked dataset loading

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
…extract test constant

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Add compress-waveform stage using FlacArray Rewrite compress_waveform with int16 scaling, delta encoding, and roundtrip to xarray Mar 16, 2026
Copilot AI and others added 2 commits March 16, 2026 00:26
…ap_blocks

- Chunk by station (not time) so each dask task gets complete
  component-triples with full timeseries
- Use map_blocks(functools.partial(_encode_chunk, ...)) pattern
  for parallel per-station encoding
- Add component-wise delta encoding (x, y−x, z−y) to exploit
  strong inter-component correlation in seismic data
- Keep existing time-delta and int16 scaling
- Update decompress to undo component delta (cumsum axis=0)
- Update test data to have correlated components

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
…spacing fix

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Rewrite compress_waveform with int16 scaling, delta encoding, and roundtrip to xarray Fix dask chunking to station axis, add component delta encoding via map_blocks Mar 16, 2026
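The component-delta idea in the commit above can be sketched in a few lines (hypothetical helper names, not the PR's exact code): store the x component as-is plus the differences y − x and z − y, and invert with a cumulative sum along the component axis.

```python
import numpy as np

def component_delta_encode(block: np.ndarray) -> np.ndarray:
    """Delta-encode along the component axis: store x, y - x, z - y.

    `block` has shape (component, station, time). Hypothetical helper,
    not this PR's exact implementation.
    """
    out = block.copy()
    out[1:] = np.diff(block, axis=0)  # y - x, z - y
    return out

def component_delta_decode(deltas: np.ndarray) -> np.ndarray:
    # A cumulative sum along the component axis undoes the differences;
    # with only 3 components there is no room for drift to accumulate.
    return np.cumsum(deltas, axis=0, dtype=np.int64)
```

When x, y, and z are strongly correlated, the encoded y and z rows have much smaller magnitudes than the raw components, which FLAC can store in fewer bits.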
Copilot stopped work on behalf of lispandfound due to an error March 16, 2026 01:04
…nt16

Replace int16 scaling (32767 levels, 15-bit precision) with 23-bit
scaling (8388607 levels) that matches the float32 mantissa width.
This is 256x more precise and gives effectively lossless round-trip
for single-precision waveform data, while keeping delta-encoded
values safely within int32 range for FlacArray compression.

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Fix dask chunking to station axis, add component delta encoding via map_blocks Replace int16 quantization with 23-bit scaling for waveform compression Mar 16, 2026
…only

Remove the explicit time-axis delta encoding (np.diff/np.cumsum along
the time axis) which could accumulate errors over tens of thousands of
timesteps. FLAC's built-in linear prediction already handles temporal
smoothness internally, so the time delta provided no compression
benefit while adding complexity and drift risk.

Changes:
- _encode_chunk: remove time-axis np.diff, store component deltas directly
- decompress_waveform: remove time-axis np.cumsum, use int64 dtype for
  the remaining component-axis cumsum (only 3 elements, no drift risk)
- Add test_no_drift: seismic-like waveform with quiet→active→quiet
  pattern, checks that the tail doesn't drift from the original

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Replace int16 quantization with 23-bit scaling for waveform compression Remove time-axis delta encoding to fix cumsum drift in waveform decompression Mar 16, 2026
@lispandfound lispandfound changed the title Remove time-axis delta encoding to fix cumsum drift in waveform decompression Compress waveform data with flacarray Mar 16, 2026
@lispandfound
Contributor

Tests will fail due to an hf_sim issue that is resolved in a different PR.


Copilot AI left a comment


Pull request overview

Adds FLAC-based compression for waveform datasets and a loader that exposes compressed waveforms as a lazy, dask-backed xarray variable, enabling much smaller on-disk waveforms with on-demand decompression for analysis workflows.

Changes:

  • Introduces workflow.waveform.load_waveform_dataset() and an HDF5-backed wrapper to lazily decompress FLAC-compressed waveforms into xarray/dask.
  • Adds a compress-waveform CLI to write compressed waveform data into an _flac_compressed_waveform HDF5 group.
  • Adds tests covering waveform roundtrip integrity and basic compression behavior; updates dependencies to include flacarray, dask, and h5py.
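The lazy-loading idea can be illustrated with a self-contained toy (zlib stands in for FlacArray so the sketch runs anywhere, and LazyWaveform is a hypothetical name, not the PR's class): keep only compressed bytes in memory and decompress when the data is actually indexed.

```python
import zlib
import numpy as np

class LazyWaveform:
    """Toy array-like wrapper holding only compressed bytes in memory.

    zlib stands in for FlacArray here so the sketch is self-contained; the
    real loader is workflow.waveform.load_waveform_dataset, which further
    wraps the data in a chunked dask array for xarray.
    """

    def __init__(self, data: np.ndarray):
        self._blob = zlib.compress(np.ascontiguousarray(data).tobytes())
        self.shape = data.shape
        self.dtype = data.dtype

    def __getitem__(self, key):
        # Decompression happens only when a selection is actually made.
        flat = np.frombuffer(zlib.decompress(self._blob), dtype=self.dtype)
        return flat.reshape(self.shape)[key]
```

dask's per-chunk machinery applies the same trick per station chunk, so only the chunks a selection touches are ever decompressed.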

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • workflow/waveform.py: New loader + array wrapper to expose the compressed waveform as a dask-backed xarray variable.
  • workflow/scripts/compress_waveform.py: New CLI script to write FLAC-compressed waveform streams into an HDF5 group.
  • workflow/scripts/hf_sim.py: Adds hashlib import (used for deterministic hashing later in the file).
  • tests/test_compress_waveform.py: Adds roundtrip and compression-size tests for the new compression/loader path.
  • pyproject.toml: Registers the compress-waveform CLI and adds required dependencies.
  • uv.lock: Locks the new dependency set (dask/flacarray/h5py and transitive deps).


lispandfound and others added 2 commits March 17, 2026 12:41
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
