generated from ucgmsim/template
Compress waveform data with flacarray #98
Open
Copilot wants to merge 25 commits into pegasus from copilot/compress-waveform-with-flacarray
+413
−2
Commits:
b713d91 Initial plan (Copilot)
a055112 Add compress-waveform stage using FlacArray compression (Copilot)
29c7d3b Address code review: use dtype.kind check instead of np.issubdtype fo… (Copilot)
928491a Rewrite compress_waveform with int16 scaling, delta encoding, and das… (Copilot)
a238447 Address code review: rename constant, document dask materialization, … (Copilot)
c20a76c Fix dask chunking to station axis, add component delta encoding via m… (Copilot)
c542e55 Address code review: explicit .compute(), clearer scale_factor cast, … (Copilot)
cbbdfc5 Improve waveform recovery: scale to int32 range (23-bit) instead of i… (Copilot)
f30674c Fix waveform drift: remove time-delta encoding, keep component-delta … (Copilot)
40091b9 fix: do not delta encode waveform components (lispandfound)
a5930d2 Use level argument (lispandfound)
4401979 actually use the precision argument (lispandfound)
3511c09 tests(compress-waveform): simplify tests to reflect simplified code (lispandfound)
03bf833 fix(compress-waveform): satisfy type checker (lispandfound)
80d018d tests(compress-waveform): remove ds make test dataset (lispandfound)
05ff473 fix(hf-sim): fix import (lispandfound)
0568c17 remove unused import (lispandfound)
034183d include h5py as a dependency and remove dask (lispandfound)
a086d55 feat: add xarray shim for waveform reading (lispandfound)
f434df7 fix: ci nitpicks (lispandfound)
40f4786 docs: reduce documentation (lispandfound)
cd8d660 docs: make CLI docs simpler (lispandfound)
4fe3930 docs: precision accuracy (lispandfound)
b313109 fix slice indexing (lispandfound)
32d785a feat: support stepped timestepping values (lispandfound)
@@ -0,0 +1,74 @@
```python
from pathlib import Path

import numpy as np
import xarray as xr

from workflow import waveform
from workflow.scripts.compress_waveform import (
    compress_waveform,
)

# Constants for test data generation
N_COMPONENTS, N_STATIONS, N_TIME = 3, 5, 1000
DT = 0.05


def _make_test_dataset() -> xr.Dataset:
    """Create a simple synthetic waveform dataset for testing."""
    time = np.arange(N_TIME) * DT
    waveform = (
        np.sin(time * 2 * np.pi * 1.0)
        + np.random.default_rng(42).standard_normal((N_COMPONENTS, N_STATIONS, N_TIME))
        * 0.1
    )

    return xr.Dataset(
        {"waveform": (["component", "station", "time"], waveform.astype(np.float32))},
        coords={
            "component": ["x", "y", "z"],
            "station": [f"STA{i:02d}" for i in range(N_STATIONS)],
            "time": time,
            "lat": ("station", np.linspace(-45, -43, N_STATIONS)),
        },
        attrs={"units": "m/s", "source": "test_gen"},
    )


def test_waveform_roundtrip_integrity(tmp_path: Path) -> None:
    """Verify waveform values and metadata survive the compression roundtrip."""
    with _make_test_dataset() as ds:
        input_path = tmp_path / "input.h5"
        original_attrs = ds.attrs
        ds.to_netcdf(input_path, engine="h5netcdf")
        output_path = tmp_path / "output.h5"

    compress_waveform(input_path, output_path)
    restored = waveform.load_waveform_dataset(output_path).compute()

    restored_subset = {k: v for k, v in restored.attrs.items() if k in original_attrs}
    assert restored_subset == original_attrs, (
        "Restored attributes do not match original attributes."
    )

    for coord in ds.coords:
        np.testing.assert_array_equal(restored[coord].values, ds[coord].values)

    xr.testing.assert_allclose(restored, ds, atol=5e-4)


def test_compression_efficiency(tmp_path: Path) -> None:
    """Verify the compressed file is actually smaller than the raw values."""
    input_path = tmp_path / "input.h5"
    output_path = tmp_path / "output.h5"

    with _make_test_dataset() as ds:
        ds.to_netcdf(input_path, engine="h5netcdf")

    compress_waveform(input_path, output_path)

    raw_size = input_path.stat().st_size
    compressed_size = output_path.stat().st_size

    assert compressed_size < raw_size, (
        f"Compression failed to reduce size: {compressed_size} >= {raw_size}"
    )
```
@@ -0,0 +1,89 @@
```python
"""Compress Waveform.

Description
-----------
Compress a broadband waveform HDF5 file using FLAC compression.

Inputs
------
1. A broadband waveform file (HDF5/NetCDF4 format, output of ``bb-sim``).

Outputs
-------
A compressed waveform file in HDF5 format with FlacArray-encoded waveform data.

Environment
-----------
Can be run in the cybershake container. Can also be run from your own
computer using the ``compress-waveform`` command which is installed after running
``pip install workflow@git+https://github.com/ucgmsim/workflow``.

Usage
-----
``compress-waveform WAVEFORM_FFP OUTPUT_FFP``

For More Help
-------------
See the output of ``compress-waveform --help``.
"""

from pathlib import Path
from typing import Annotated

import flacarray.hdf5
import h5py
import typer
import xarray as xr

from qcore import cli
from workflow import log_utils

app = typer.Typer()


@cli.from_docstring(app)
@log_utils.log_call()
def compress_waveform(
    waveform_ffp: Annotated[Path, typer.Argument(dir_okay=False, exists=True)],
    output_ffp: Annotated[Path, typer.Argument(dir_okay=False, writable=True)],
    level: Annotated[int, typer.Option(min=0, max=8)] = 5,
    precision: Annotated[int, typer.Option(min=1)] = 4,
) -> None:
    """Compress a broadband waveform file using FLAC.

    Parameters
    ----------
    waveform_ffp : Path
        Path to the input broadband waveform file (HDF5/NetCDF4).
    output_ffp : Path
        Path to the output compressed HDF5 file.
    level : int, optional
        FLAC compression level (0-8). Higher values compress more but
        are slower. Defaults to 5.
    precision : int, optional
        FLAC precision level (in significant digits of input data). Higher values compress less but
        have more precision. Defaults to 4.
    """
    with (
        xr.open_dataset(waveform_ffp, engine="h5netcdf") as broadband,
    ):
        broadband.drop_vars("waveform").to_netcdf(output_ffp, engine="h5netcdf")
        with h5py.File(output_ffp, "a") as hdf:
            group = hdf.create_group("_flac_compressed_waveform")
            group.attrs["flac_array"] = True
            group.attrs["name"] = "waveform"
            group.attrs["shape"] = broadband.waveform.shape
            group.attrs["dims"] = broadband.waveform.dims
            group.attrs["dtype"] = str(broadband.waveform.dtype)

            flacarray.hdf5.write_array(
                broadband.waveform.values,
                group,
                precision=precision,
                level=level,
                use_threads=True,
            )


if __name__ == "__main__":
    app()
```
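The `precision` option is documented as a number of significant digits, and the round-trip test accepts an absolute error of `5e-4` on synthetic data whose peak amplitude is of order 1. The sketch below shows one quantization scheme that would give exactly that bound at the default `precision` of 4. It assumes the step size is derived from the peak magnitude of the input; this is an illustration only, not a description of how flacarray performs its float-to-integer conversion. The helper names `quantize_significant_digits` and `dequantize` are hypothetical.

```python
import math


def quantize_significant_digits(samples, precision):
    """Hypothetical helper: map floats to integer codes on a uniform grid.

    The grid step is one unit in the last retained decimal digit of the
    peak magnitude (an assumption made for this illustration).
    """
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        # All-zero input: any step works, every code is zero.
        return [0] * len(samples), 1.0
    step = 10.0 ** (math.floor(math.log10(peak)) - precision + 1)
    codes = [round(s / step) for s in samples]
    return codes, step


def dequantize(codes, step):
    """Recover approximate floats from the integer codes."""
    return [c * step for c in codes]


samples = [1.2345678, -0.5, 0.000123]
codes, step = quantize_significant_digits(samples, precision=4)
restored = dequantize(codes, step)
max_error = max(abs(r - s) for r, s in zip(restored, samples))
print(codes, step, max_error)
```

Under this assumption the worst-case rounding error is half a step, so a peak between 1 and 10 with four significant digits gives a step of `1e-3` and an error of at most `5e-4`. Samples much smaller than the peak lose relative accuracy, because the grid is uniform in absolute terms.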