
Compress waveform data with flacarray #98

Open
Copilot wants to merge 25 commits into pegasus from copilot/compress-waveform-with-flacarray

Conversation


Copilot AI commented Mar 15, 2026

I did allow AI to begin writing this PR, but I ended up throwing away 90% of what it produced: it did not meaningfully improve the resulting compression ratios, wrote some pretty janky code, and introduced significant errors into the compressed data.

What does this PR do?

This PR adds FLAC compression of waveform data, producing significantly smaller waveforms without significant loss of precision. This is useful for publications, archival storage, and for researchers creating figures or inspecting results where the size of the original data is simply too big to work with. It achieves compression ratios of around 8.75x (and higher, as shown below) compared to the original waveform, with a relatively straightforward implementation and error close to the machine-precision noise floor.

For a test, I took one of the broadband waveforms from ayushi's dataset and compressed it using the standard options.

.rw-r--r--  5.4G jake 16 Mar 01:45  realisation.bb
.rw-r--r--  5.4G jake 16 Mar 01:43  realisation.lf
.rw-r--r--  345M jake 16 Mar 16:27  realisation_compressed.lf
.rw-r--r--  510M jake 16 Mar 16:26  realisation_compressed_bb.h5

From this we can see that the broadband data achieved a >10x compression ratio, while the lf data achieved a ratio of 15.6x. The low-frequency data will always compress better than broadband data because it is smoother, and FLAC's polynomial-prediction codec handles smooth data extremely well.
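This intuition can be checked with a quick NumPy sketch (illustrative only, not code from this PR): FLAC's fixed predictors are low-order polynomial predictors whose residuals amount to repeated first differences of the samples, and differencing shrinks a smooth signal's residual far more than a noisy one's, so Rice coding needs fewer bits per sample.

```python
import numpy as np

# A smooth (low-frequency) signal vs. a noisy (broadband) one,
# both quantized to integers as FLAC sees them.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000)
smooth = np.round(1000 * np.sin(2 * np.pi * 3 * t)).astype(np.int64)
noisy = rng.integers(-1000, 1000, 1000)

# Residual of an order-2 fixed predictor = second difference of the samples.
smooth_residual = np.abs(np.diff(smooth, n=2)).max()
noisy_residual = np.abs(np.diff(noisy, n=2)).max()
print(smooth_residual, noisy_residual)  # the smooth residual is far smaller
```

The smooth signal's residuals are a handful of counts, while the noisy signal's residuals stay in the thousands, which is why the lf files compress so much better.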

To actually read the array, I have introduced a waveform module that uses some xarray machinery to produce a dataset that looks like an ordinary dataset but decompresses the waveform on the fly when you select data.

>>> from workflow import waveform
>>> dset = waveform.load_waveform_dataset('/home/jake/tmp/292203/realisation_compressed.h5')
>>> dset
<xarray.Dataset> Size: 5GB
Dimensions:    (component: 3, station: 14485, time: 30820)
Coordinates:
  * component  (component) <U1 12B 'x' 'y' 'z'
  * station    (station) <U7 406kB 'utMuUSA' 'u4aAgSA' ... 'pkz3SHB' 'vmp4YSA'
    x          (station) int32 58kB ...
    y          (station) int32 58kB ...
    latitude   (station) float32 58kB ...
    longitude  (station) float32 58kB ...
  * time       (time) float64 247kB -3.0 -2.995 -2.99 ... 151.1 151.1 151.1
Data variables:
    waveform   (component, station, time) float32 5GB dask.array<chunksize=(3, 3344, 3344), meta=np.ndarray>
Attributes:
    units:    g

Note that although xarray reports the data as 5 GB, the Python process actually uses far less than this:
(screenshot: process memory usage)
This ~10x reduction in memory usage comes from reading only the compressed data from disk. Because this is an xarray dataset you can do all the normal things we do with xarray datasets; you just need to add .compute() to tell xarray to actually go ahead and do the decompression.

>>> real_stations = dset.waveform.sel(station=dset.station.str.match(r'^\w{2,4}$')) # happens instantly, selects all real stations
>>> real_stations
<xarray.DataArray 'waveform' (component: 3, station: 64, time: 30820)> Size: 24MB
dask.array<getitem, shape=(3, 64, 30820), dtype=float32, chunksize=(3, 64, 3344), chunktype=numpy.ndarray>
Coordinates:
  * component  (component) <U1 12B 'x' 'y' 'z'
  * station    (station) <U7 2kB 'NCHS' 'TOZ' 'WFSS' ... 'KUZ' 'TDHS' 'OPCS'
    x          (station) int32 256B ...
    y          (station) int32 256B ...
    latitude   (station) float32 256B ...
    longitude  (station) float32 256B ...
  * time       (time) float64 247kB -3.0 -2.995 -2.99 ... 151.1 151.1 151.1
>>> actual_values = real_stations.compute() # converts the dask array into a numpy array, takes some time to decompress the waveforms
>>> actual_values
<xarray.DataArray 'waveform' (component: 3, station: 64, time: 30820)> Size: 24MB
array([[[ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.9850486e-06, -1.9850486e-06, -1.9850486e-06],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          5.1158713e-07,  5.1158713e-07,  5.1158713e-07],
        [ 2.7117494e-08,  2.7117494e-08,  2.7117494e-08, ...,
          2.4404289e-07,  2.4404289e-07,  2.4404289e-07],
        ...,
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.2280907e-07, -1.2280907e-07, -1.2280907e-07],
        [-4.3233740e-08, -4.3233740e-08, -4.3233740e-08, ...,
          1.2970122e-07,  1.2970122e-07,  1.2970122e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          7.4622221e-07,  7.4622221e-07,  7.4622221e-07]],

       [[ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -9.3293784e-07, -9.3293784e-07, -9.3293784e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.3694080e-07, -1.3694080e-07, -1.3694080e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -8.7871740e-08, -8.7871740e-08, -8.7871740e-08],
...
         -3.9649149e-07, -3.9649149e-07, -3.9649149e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -4.6030618e-07, -4.6030618e-07, -4.6030618e-07],
        [-1.4156103e-07, -1.4156103e-07, -1.4156103e-07, ...,
         -1.4156103e-07, -1.4156103e-07, -1.4156103e-07]],

       [[ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -2.9094554e-07, -2.9094554e-07, -2.9094554e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -4.2664396e-08, -4.2664396e-08, -4.2664396e-08],
        [ 1.0498866e-08,  1.0498866e-08,  1.0498866e-08, ...,
          6.8242593e-07,  6.8242593e-07,  6.8242593e-07],
        ...,
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          2.7099122e-07,  2.7099122e-07,  2.7099122e-07],
        [ 1.6589183e-08,  1.6589183e-08,  1.6589183e-08, ...,
         -1.3269164e-07, -1.3269164e-07, -1.3269164e-07],
        [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -6.6182110e-08, -6.6182110e-08, -6.6182110e-08]]],
      shape=(3, 64, 30820), dtype=float32)
Coordinates:
  * component  (component) <U1 12B 'x' 'y' 'z'
  * station    (station) <U7 2kB 'NCHS' 'TOZ' 'WFSS' ... 'KUZ' 'TDHS' 'OPCS'
    x          (station) int32 256B 3235 1294 3429 3821 ... 3415 994 3538 2850
    y          (station) int32 256B 2491 1221 1823 839 2463 ... 403 148 63 839
    latitude   (station) float32 256B -39.51 -37.73 -39.03 ... -37.63 -38.01
    longitude  (station) float32 256B 176.9 175.5 177.4 ... 175.7 178.4 177.3
  * time       (time) float64 247kB -3.0 -2.995 -2.99 ... 151.1 151.1 151.1

Reading the real stations from the original dataset and comparing shows a maximum difference below the 32-bit floating-point noise floor, demonstrating that the values are recovered essentially losslessly.

>>> import numpy as np
>>> np.abs(orig_real_stations - actual_values).max()
<xarray.DataArray 'waveform' ()> Size: 4B
array(9.872019e-08, dtype=float32)

Copilot AI and others added 2 commits March 15, 2026 23:50
- New workflow script: workflow/scripts/compress_waveform.py
- Reads bb_sim HDF5 output, compresses waveform data with FlacArray
- Preserves coordinates and attributes in the compressed output
- Registered as compress-waveform CLI command
- Added flacarray dependency to pyproject.toml

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
…r string detection, remove unused numpy import

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title [WIP] Add workflow stage to compress waveform hdf5 file with FlacArray Add compress-waveform stage using FlacArray Mar 15, 2026
Copilot AI requested a review from lispandfound March 15, 2026 23:53
Copilot AI and others added 2 commits March 16, 2026 00:09
…k chunking

- Scale waveform data to int16 range for efficient FLAC bit-shunting
- Apply delta encoding (first differences) to reduce variance for Rice coding
- Use dask chunking when loading input dataset
- Preserve all xarray metadata (coords, attrs, dims, dtype)
- Add decompress_waveform() to reconstruct xr.Dataset from compressed file
- Add tests for roundtrip correctness, compression efficiency, and metadata
- Add dask dependency for chunked dataset loading

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
…extract test constant

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Add compress-waveform stage using FlacArray Rewrite compress_waveform with int16 scaling, delta encoding, and roundtrip to xarray Mar 16, 2026
Copilot AI and others added 2 commits March 16, 2026 00:26
…ap_blocks

- Chunk by station (not time) so each dask task gets complete
  component-triples with full timeseries
- Use map_blocks(functools.partial(_encode_chunk, ...)) pattern
  for parallel per-station encoding
- Add component-wise delta encoding (x, y−x, z−y) to exploit
  strong inter-component correlation in seismic data
- Keep existing time-delta and int16 scaling
- Update decompress to undo component delta (cumsum axis=0)
- Update test data to have correlated components

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
…spacing fix

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Rewrite compress_waveform with int16 scaling, delta encoding, and roundtrip to xarray Fix dask chunking to station axis, add component delta encoding via map_blocks Mar 16, 2026
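The component-delta idea in the commit above can be sketched in a few lines (hypothetical helper names, not the PR's exact code): store the x component as-is plus the differences y − x and z − y, and invert with a cumulative sum along the component axis.

```python
import numpy as np

def component_delta_encode(block: np.ndarray) -> np.ndarray:
    """Delta-encode along the component axis: store x, y - x, z - y.

    `block` has shape (component, station, time). Hypothetical helper,
    not this PR's exact implementation.
    """
    out = block.copy()
    out[1:] = np.diff(block, axis=0)  # y - x, z - y
    return out

def component_delta_decode(deltas: np.ndarray) -> np.ndarray:
    # A cumulative sum along the component axis undoes the differences;
    # with only 3 components there is no room for drift to accumulate.
    return np.cumsum(deltas, axis=0, dtype=np.int64)
```

When x, y, and z are strongly correlated, the encoded y and z rows have much smaller magnitudes than the raw components, which FLAC can store in fewer bits.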
Copilot stopped work on behalf of lispandfound due to an error March 16, 2026 01:04
…nt16

Replace int16 scaling (32767 levels, 15-bit precision) with 23-bit
scaling (8388607 levels) that matches the float32 mantissa width.
This is 256x more precise and gives effectively lossless round-trip
for single-precision waveform data, while keeping delta-encoded
values safely within int32 range for FlacArray compression.

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Fix dask chunking to station axis, add component delta encoding via map_blocks Replace int16 quantization with 23-bit scaling for waveform compression Mar 16, 2026
…only

Remove the explicit time-axis delta encoding (np.diff/np.cumsum along
the time axis) which could accumulate errors over tens of thousands of
timesteps. FLAC's built-in linear prediction already handles temporal
smoothness internally, so the time delta provided no compression
benefit while adding complexity and drift risk.

Changes:
- _encode_chunk: remove time-axis np.diff, store component deltas directly
- decompress_waveform: remove time-axis np.cumsum, use int64 dtype for
  the remaining component-axis cumsum (only 3 elements, no drift risk)
- Add test_no_drift: seismic-like waveform with quiet→active→quiet
  pattern, checks that the tail doesn't drift from the original

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title Replace int16 quantization with 23-bit scaling for waveform compression Remove time-axis delta encoding to fix cumsum drift in waveform decompression Mar 16, 2026
@lispandfound lispandfound changed the title Remove time-axis delta encoding to fix cumsum drift in waveform decompression Compress waveform data with flacarray Mar 16, 2026
@lispandfound
Contributor

Tests will fail due to an hf_sim issue that is resolved in a different PR.


Copilot AI left a comment


Pull request overview

Adds FLAC-based compression for waveform datasets and a loader that exposes compressed waveforms as a lazy, dask-backed xarray variable, enabling much smaller on-disk waveforms with on-demand decompression for analysis workflows.

Changes:

  • Introduces workflow.waveform.load_waveform_dataset() and an HDF5-backed wrapper to lazily decompress FLAC-compressed waveforms into xarray/dask.
  • Adds a compress-waveform CLI to write compressed waveform data into an _flac_compressed_waveform HDF5 group.
  • Adds tests covering waveform roundtrip integrity and basic compression behavior; updates dependencies to include flacarray, dask, and h5py.
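The lazy-loading idea can be illustrated with a self-contained toy (zlib stands in for FlacArray so the sketch runs anywhere, and LazyWaveform is a hypothetical name, not the PR's class): keep only compressed bytes in memory and decompress when the data is actually indexed.

```python
import zlib
import numpy as np

class LazyWaveform:
    """Toy array-like wrapper holding only compressed bytes in memory.

    zlib stands in for FlacArray here so the sketch is self-contained; the
    real loader is workflow.waveform.load_waveform_dataset, which further
    wraps the data in a chunked dask array for xarray.
    """

    def __init__(self, data: np.ndarray):
        self._blob = zlib.compress(np.ascontiguousarray(data).tobytes())
        self.shape = data.shape
        self.dtype = data.dtype

    def __getitem__(self, key):
        # Decompression happens only when a selection is actually made.
        flat = np.frombuffer(zlib.decompress(self._blob), dtype=self.dtype)
        return flat.reshape(self.shape)[key]
```

dask's per-chunk machinery applies the same trick per station chunk, so only the chunks a selection touches are ever decompressed.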

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • workflow/waveform.py: New loader + array wrapper to expose the compressed waveform as a dask-backed xarray variable.
  • workflow/scripts/compress_waveform.py: New CLI script to write FLAC-compressed waveform streams into an HDF5 group.
  • workflow/scripts/hf_sim.py: Adds hashlib import (used for deterministic hashing later in the file).
  • tests/test_compress_waveform.py: Adds roundtrip and compression-size tests for the new compression/loader path.
  • pyproject.toml: Registers the compress-waveform CLI and adds required dependencies.
  • uv.lock: Locks the new dependency set (dask/flacarray/h5py and transitive deps).


lispandfound and others added 2 commits March 17, 2026 12:41
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
