- New workflow script: workflow/scripts/compress_waveform.py
- Reads bb_sim HDF5 output, compresses waveform data with FlacArray
- Preserves coordinates and attributes in the compressed output
- Registered as compress-waveform CLI command
- Added flacarray dependency to pyproject.toml

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>

…r string detection, remove unused numpy import

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add workflow stage to compress waveform hdf5 file with FlacArray" to "Add compress-waveform stage using FlacArray" on Mar 15, 2026.
…k chunking

- Scale waveform data to int16 range for efficient FLAC bit-shifting
- Apply delta encoding (first differences) to reduce variance for Rice coding
- Use dask chunking when loading input dataset
- Preserve all xarray metadata (coords, attrs, dims, dtype)
- Add decompress_waveform() to reconstruct xr.Dataset from compressed file
- Add tests for roundtrip correctness, compression efficiency, and metadata
- Add dask dependency for chunked dataset loading

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>

…extract test constant

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
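The scale-then-delta scheme described in this commit can be sketched in plain NumPy. This is a minimal sketch with hypothetical helper names, not the PR's actual code: the peak of the signal is mapped onto the int16 maximum, and the quantized integers are first-differenced so the residual stream has low variance for FLAC's Rice coding.

```python
import numpy as np

def encode_int16_delta(wave: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a 1-D float waveform to int16 range, then first-difference it."""
    # Map the peak amplitude onto the int16 maximum (guard against a zero peak).
    scale = 32767.0 / max(float(np.abs(wave).max()), np.finfo(np.float32).tiny)
    quantized = np.round(wave * scale).astype(np.int32)
    # First differences shrink residual variance, which helps Rice coding.
    deltas = np.empty_like(quantized)
    deltas[0] = quantized[0]
    deltas[1:] = np.diff(quantized)
    return deltas, scale

def decode_int16_delta(deltas: np.ndarray, scale: float) -> np.ndarray:
    # Integer cumsum exactly inverts the integer diff; only the
    # quantization step loses information.
    return np.cumsum(deltas) / scale
```

Because the diff/cumsum pair operates on integers, the round trip is exact up to the quantization step, whose error is bounded by half a level, i.e. 0.5 / scale.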
Copilot AI changed the title from "Add compress-waveform stage using FlacArray" to "Rewrite compress_waveform with int16 scaling, delta encoding, and roundtrip to xarray" on Mar 16, 2026.
…ap_blocks

- Chunk by station (not time) so each dask task gets complete component-triples with full timeseries
- Use map_blocks(functools.partial(_encode_chunk, ...)) pattern for parallel per-station encoding
- Add component-wise delta encoding (x, y−x, z−y) to exploit strong inter-component correlation in seismic data
- Keep existing time-delta and int16 scaling
- Update decompress to undo component delta (cumsum axis=0)
- Update test data to have correlated components

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>

…spacing fix

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
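The component-wise delta described here is a small transform along the component axis. A minimal NumPy sketch with hypothetical helper names (in the PR the real transform runs per-station inside dask `map_blocks`):

```python
import numpy as np

def component_delta_encode(q: np.ndarray) -> np.ndarray:
    """q: integer array of shape (3, n_samples) for components x, y, z.

    Stores (x, y - x, z - y) so that strongly correlated components
    leave small residuals for FLAC to compress.
    """
    out = q.copy()
    out[1] = q[1] - q[0]
    out[2] = q[2] - q[1]
    return out

def component_delta_decode(d: np.ndarray) -> np.ndarray:
    # A cumulative sum along axis 0 (only 3 elements) undoes (x, y-x, z-y).
    return np.cumsum(d, axis=0)
```

When the three components track each other closely, the second and third rows of the encoded array are near zero, which compresses far better than the raw channels.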
Copilot AI changed the title from "Rewrite compress_waveform with int16 scaling, delta encoding, and roundtrip to xarray" to "Fix dask chunking to station axis, add component delta encoding via map_blocks" on Mar 16, 2026.
Copilot stopped work on behalf of lispandfound due to an error on March 16, 2026 at 01:04.
…nt16

Replace int16 scaling (32767 levels, 15-bit precision) with 23-bit scaling (8388607 levels) that matches the float32 mantissa width. This is 256x more precise and gives an effectively lossless round-trip for single-precision waveform data, while keeping delta-encoded values safely within int32 range for FlacArray compression.

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Copilot AI changed the title from "Fix dask chunking to station axis, add component delta encoding via map_blocks" to "Replace int16 quantization with 23-bit scaling for waveform compression" on Mar 16, 2026.
…only

Remove the explicit time-axis delta encoding (np.diff/np.cumsum along the time axis), which could accumulate errors over tens of thousands of timesteps. FLAC's built-in linear prediction already handles temporal smoothness internally, so the time delta provided no compression benefit while adding complexity and drift risk.

Changes:
- _encode_chunk: remove time-axis np.diff, store component deltas directly
- decompress_waveform: remove time-axis np.cumsum; use int64 dtype for the remaining component-axis cumsum (only 3 elements, no drift risk)
- Add test_no_drift: seismic-like waveform with quiet→active→quiet pattern, checks that the tail doesn't drift from the original

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
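After this change the decode path reduces to a 3-element cumulative sum along the component axis followed by rescaling. A minimal sketch with hypothetical names, assuming the (3, n_samples) layout from the earlier commits:

```python
import numpy as np

def decompress_sketch(deltas: np.ndarray, scale: float) -> np.ndarray:
    """deltas: integer array of shape (3, n_samples) storing (x, y-x, z-y)."""
    # int64 cumsum over only 3 components: exact integer arithmetic, no drift.
    quantized = np.cumsum(deltas.astype(np.int64), axis=0)
    # No time-axis cumsum anymore; FLAC's linear prediction handles
    # temporal smoothness inside the codec instead.
    return quantized / scale
```

The key point is that the only remaining cumulative sum runs over three elements, so there is no long accumulation chain along which rounding or overflow drift could build up.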
Copilot AI changed the title from "Replace int16 quantization with 23-bit scaling for waveform compression" to "Remove time-axis delta encoding to fix cumsum drift in waveform decompression" on Mar 16, 2026.
Contributor

Tests will fail due to an hf sim issue that is resolved in a different PR.
Contributor
Pull request overview
Adds FLAC-based compression for waveform datasets and a loader that exposes compressed waveforms as a lazy, dask-backed xarray variable, enabling much smaller on-disk waveforms with on-demand decompression for analysis workflows.
Changes:
- Introduces `workflow.waveform.load_waveform_dataset()` and an HDF5-backed wrapper to lazily decompress FLAC-compressed waveforms into xarray/dask.
- Adds a `compress-waveform` CLI to write compressed waveform data into an `_flac_compressed_waveform` HDF5 group.
- Adds tests covering waveform roundtrip integrity and basic compression behavior; updates dependencies to include `flacarray`, `dask`, and `h5py`.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `workflow/waveform.py` | New loader + array wrapper to expose compressed waveforms as a dask-backed xarray variable. |
| `workflow/scripts/compress_waveform.py` | New CLI script to write FLAC-compressed waveform streams into an HDF5 group. |
| `workflow/scripts/hf_sim.py` | Adds `hashlib` import (used for deterministic hashing later in the file). |
| `tests/test_compress_waveform.py` | Adds roundtrip and compression-size tests for the new compression/loader path. |
| `pyproject.toml` | Registers the `compress-waveform` CLI and adds required dependencies. |
| `uv.lock` | Locks the new dependency set (dask/flacarray/h5py and transitive deps). |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
I did allow AI to begin writing this PR, but I ended up throwing away 90% of what it produced: it did not meaningfully improve the compression ratios, wrote some pretty janky code, and introduced significant errors in the compressed data.
What does this PR do?
This PR adds FLAC compression of waveform data, producing significantly smaller waveforms with minimal loss of precision. This is useful for publications, archival storage, and for researchers creating figures or inspecting results where the size of the original data is simply too big to work with. It achieves compression ratios around 8.75x relative to the original waveform with a relatively straightforward implementation, and error close to the machine-precision noise floor.
For a test, I took one of the broadband waveforms from Ayushi's dataset and compressed it using the standard options.

From this we can see that the broadband data achieved a >10x compression ratio, while the LF data achieved 15.6x. Low-frequency data will always compress better than broadband data because it is smoother, and the polynomial-fitting predictors in the FLAC codec love smooth signals.
To actually read the array, I have introduced a `waveform` module that does some xarray magic, producing a dataset that looks like a real dataset but decompresses the waveform on the fly when you select data. Note that despite xarray saying the data consumes 5 GB, the Python process actually uses far less than this:

This ~10x reduction in memory usage comes from reading only the compressed data from disk. Because this is an xarray dataset, you can do all the normal things we do with xarray datasets; you just need to add `.compute()` to tell xarray to actually go ahead and do the decompression.

Reading the real stations from the original dataset shows a difference below the 32-bit floating-point noise floor, demonstrating that the values are recovered essentially losslessly.