Vs30 refactor #42

Draft

AndrewRidden-Harper wants to merge 195 commits into master from vs30_refactor
Conversation

AndrewRidden-Harper (Contributor) commented Jan 27, 2026

This PR is a major refactor of the Vs30 package to improve readability, modularity, and testability. The mathematical expressions are equivalent to the original version, but use clearer variable names and simplified forms. Regression tests confirm that the output from the refactored codebase is consistent with the original version.

Design changes

Centralized configuration

  • All parameters that influence the calculated Vs30 values are defined in a single config.yaml file, rather than being hardcoded across various source files as in the original codebase.
  • Parameters can optionally be overridden via command-line arguments.
  • Configuration is validated at load time using Pydantic.
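As a rough illustration of load-time validation, the sketch below uses Pydantic models with nested sections and shows a command-line override taking precedence over the file value. The field names (`vs30_min`, `max_points`, `cluster_subsample_step`) are illustrative, not the package's actual schema.

```python
# Illustrative sketch only: field names are hypothetical, not the real config.yaml schema.
from pydantic import BaseModel, PositiveFloat, PositiveInt, ValidationError


class SpatialFitConfig(BaseModel):
    """Hypothetical subset of spatial-adjustment parameters."""
    max_points: PositiveInt = 50
    cluster_subsample_step: PositiveInt = 100


class Vs30Config(BaseModel):
    vs30_min: PositiveFloat
    spatial_fit: SpatialFitConfig = SpatialFitConfig()


# values loaded from config.yaml, with a CLI override applied on top
cfg = Vs30Config(vs30_min=240.0, spatial_fit={"cluster_subsample_step": 1})
print(cfg.spatial_fit.cluster_subsample_step)  # the override wins: 1

try:
    Vs30Config(vs30_min=-1.0)  # a negative velocity fails at load time
except ValidationError:
    print("rejected invalid vs30_min")
```

Failing fast at load time means a typo in `config.yaml` surfaces before any raster work starts, rather than deep inside the pipeline.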

Modular CLI

  • A Typer-based CLI (vs30) exposes the full pipeline as well as individual stages (update-categorical-vs30-models, make-initial-vs30-raster, spatial-fit, etc.), making it possible to run or re-run specific steps independently.
  • A new compute-at-locations command computes Vs30 at specific latitude/longitude points without generating full raster grids, which is efficient for querying a small number of sites.
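A minimal Typer app along these lines might look as follows; the command bodies and option names here are placeholders, not the real `vs30` implementation.

```python
# Hypothetical sketch of a stage-per-command Typer CLI; bodies are placeholders.
import typer
from typer.testing import CliRunner

app = typer.Typer(help="Vs30 pipeline stages (illustrative)")


@app.command("make-initial-vs30-raster")
def make_initial_vs30_raster(config: str = "config.yaml") -> None:
    """Placeholder stage: build the prior Vs30 raster."""
    typer.echo(f"building initial raster from {config}")


@app.command("compute-at-locations")
def compute_at_locations(lat: float, lon: float) -> None:
    """Placeholder stage: Vs30 at a single point, no raster grids needed."""
    typer.echo(f"computing Vs30 at ({lat}, {lon})")


# each stage can be exercised in isolation, as the CLI design intends
result = CliRunner().invoke(app, ["compute-at-locations", "41.3", "174.8"])
print(result.output.strip())
```

Registering each stage as its own command is what makes re-running a single step (after tweaking one parameter) cheap.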

Input Vs30 data

  • The original version had multiple modes of operation that selected which observed Vs30 dataset to use (e.g., "original" combined three hardcoded sources with dataset-specific filtering and downsampling; "cpt" loaded CPT data), with different processing and Bayesian update paths for each mode.
  • In the refactored version, data preparation and filtering are the user's responsibility. All observed Vs30 values that are passed in are used directly.
  • Observations are provided in one or both of two categories:
    • Clustered observations (typically dense CPT-inferred Vs30 values) — processed with DBSCAN clustering so that spatially grouped measurements are not over-weighted in the Bayesian update.
    • Independent observations — treated as individual measurements.
  • Both kinds of dataset can be used in the same run; at least one is required.
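The clustered/independent split can be sketched with scikit-learn's DBSCAN, which the description above names for grouping dense CPT points. The coordinates, `eps`, and `min_samples` below are illustrative, not the package's settings.

```python
# Sketch of DBSCAN de-clustering for dense CPT-inferred Vs30 points.
# eps/min_samples are illustrative values, not the package defaults.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two tight synthetic "CPT" clusters plus one isolated site (metre coordinates)
cluster_a = rng.normal([0.0, 0.0], 20.0, size=(50, 2))
cluster_b = rng.normal([5000.0, 5000.0], 20.0, size=(40, 2))
isolated = np.array([[20000.0, 20000.0]])
coords = np.vstack([cluster_a, cluster_b, isolated])

labels = DBSCAN(eps=200.0, min_samples=5).fit_predict(coords)
# label -1 marks isolated observations; points sharing a label form one
# spatial cluster that can be down-weighted in the Bayesian update
print(sorted(set(labels)))
```

Grouping the dense points first is what prevents, say, fifty co-located CPT soundings from counting as fifty independent pieces of evidence.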

Vs30 map generation

  • When spatially adjusting the Vs30 map by fitting multivariate normal (MVN) distributions to observed Vs30 values, the original package looped over all pixels in the raster (processing them in blocks), computing the distance from each pixel to all observations to determine which (if any) observations would affect it.
  • The refactored version reverses the approach: it first loops over observations, using bounding boxes to identify which map pixels will be affected, then loops only over those affected pixels for the more expensive MVN conditioning calculations.
  • When the number of observations is much smaller than the number of pixels in the map, this is substantially faster because the majority of pixels (those far from any observation) are never processed.
  • For clustered observations, the bounding box search uses a subsampled set of observations (by default, every 100th observation within each cluster). This is a good approximation because all observations within a spatial cluster affect nearly the same set of map pixels. Unclustered (isolated) observations are always included in full. The subsampling step is configurable and can be set to 1 to use all observations, at the cost of a slower bounding box search.
  • Separately, the max_points parameter caps the number of nearest observations used for the per-pixel MVN update, limiting the cost of the matrix inversion at each pixel.
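The observation-first strategy can be illustrated with a toy bounding-box search: gather the pixels inside each observation's influence box, then run the expensive per-pixel work only on that set. The function name and influence radius below are made up for the example.

```python
# Toy version of the observation-first pixel search described above.
# `affected_pixels` and `radius` are illustrative names, not the real API.
def affected_pixels(obs_xy, x0, y0, cell, nx, ny, radius):
    """Return the unique (row, col) pixels inside each observation's box."""
    affected = set()
    for x, y in obs_xy:
        c0 = max(int((x - radius - x0) // cell), 0)
        c1 = min(int((x + radius - x0) // cell), nx - 1)
        r0 = max(int((y - radius - y0) // cell), 0)
        r1 = min(int((y + radius - y0) // cell), ny - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                affected.add((r, c))
    return affected


# 1000x1000 grid, one observation: only an 11x11 neighbourhood is touched,
# so the other ~999,879 pixels are never visited at all
pixels = affected_pixels([(500.0, 500.0)], 0.0, 0.0, 1.0, 1000, 1000, 5.0)
print(len(pixels))  # 121
```

The speed-up claimed above falls out directly: cost scales with observations times box area rather than with the full pixel count.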

Multiprocessing

  • Both the raster pipeline and compute-at-locations support multiprocessing. The spatial adjustment work is divided into chunks of affected pixels distributed across worker processes.
  • BLAS thread oversubscription is managed at runtime using threadpoolctl, rather than requiring environment variables to be set before import.
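A minimal sketch of the threadpoolctl pattern: cap BLAS threads inside a scoped context at runtime, instead of exporting `OMP_NUM_THREADS`-style variables before numpy is imported. The matrix computation is a stand-in for the per-pixel MVN linear algebra.

```python
# Sketch: cap BLAS threads at runtime with threadpoolctl instead of
# requiring environment variables before import. The linear algebra here
# is a stand-in for the per-pixel MVN conditioning work.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(1).normal(size=(200, 200))

# inside a worker process: single-threaded BLAS for this region only,
# so N workers do not each spawn a full BLAS thread pool
with threadpool_limits(limits=1, user_api="blas"):
    inv = np.linalg.inv(a @ a.T + 200.0 * np.eye(200))

print(inv.shape)  # the thread cap ends when the context manager exits
```

Scoping the limit to the worker's hot loop avoids oversubscription (workers × BLAS threads > cores) without constraining BLAS elsewhere in the process.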

Test suite

  • A comprehensive pytest suite covers all modules: configuration, category assignment, Bayesian updates, raster creation, hybrid modifications, spatial adjustment, the CLI, parallelism, and full-pipeline regression tests.

AndrewRidden-Harper and others added 30 commits March 23, 2026 12:23
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rences

Remove 11 tests that merely exercised numpy/scipy library behavior (exponential
decay, monotonicity, shape checks). Keep 3 tests for non-obvious correctness:
Matérn κ=0.5 mathematical identity and picklability of resolved correlation
callables. Fix test_compute_at_locations to reference modified_foster_2019
resource files matching the benchmark CSV.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The modified_foster_2019 and foster_2019 categorical files were identical.
Point both configs (and test fixtures) to the foster_2019 versions and
delete the duplicate modified_foster_2019 categorical CSVs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file was already referenced by the modified_foster_2019 config but
was untracked. Contains 459 independent observations including 47 Q3
broadband stations retained by the modified model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematic experiment matrix (200-point comparison against Wellington
test subgrid) determined the actual parameters Jaehwi used for
V1.0_26Mar.tif:

- Bayesian update from priors (NOT Foster 2019 posteriors despite
  being the code default — Foster posteriors give 199 m/s error)
- Ratio 1:1 combination (50/50 geometric mean, not stdv weighting)
- No coastal distance modification (code was commented out)
- GID 4 slope interpolation skipped (mod6=True default)
- 671 reconstructed observations (McGann 276 + Wotherspoon 36 +
  Kaiser 359)

Add skip_alluvium_slope parameter to decouple GID 4 slope skip from
coastal distance flag (Jaehwi had these independent: mod6=True while
coastal distance was disabled separately). Thread through pipeline,
parallel, and CLI config loading.

Update jaehwi_v1p0.yaml: combination_method ratio, combine_ratio 1.0,
apply_coastal_distance_mod false, skip_alluvium_slope true.

Current match: 76.5% of points within 1 m/s, 18.03 m/s mean abs diff
(residual from MVN spatial adjustment sensitivity in high-observation
categories).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
compute_grid → grid_pipeline
compute_at_locations → points_pipeline
run_in_memory_pipeline_for_model_type → compute_model_grid

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…26Mar.tif either

Even with the same codebase and nproc=1, 15% of pixels (MVN-affected
near observations) differ by ~2.8% due to float32/caching artifacts
that depend on processing order. Confirms our pipeline reproduces the
intended model correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…duction artifacts

demonstrate_grid_artifacts.py: runnable tests showing float32 precision loss,
distance caching, and combined effects on MVN spatial adjustment.

why_we_cannot_reproduce_v1p0_grid_exactly.md: non-technical explanation for
colleagues unfamiliar with the codebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ran both pipelines in points mode on 200 random NZ locations. Geology and
terrain IDs match 100%. Median Vs30 diff is 0.0002 m/s. The 14.5% of points
with >1% diff are near observations where the reconstructed 671-station set
diverges slightly from Jaehwi's internal loader.

Also discovered and documented a terrain raster pixel boundary artifact:
V1.0_26Mar.tif pixel centres coincide with IwahashiPike.tif pixel boundaries
(50 m grid offset), so the NZTM-WGS84-NZTM coordinate roundtrip (~5 µm
shift) can flip terrain IDs at those exact locations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…solution

The identical IwahashiPike.tif raster (MD5 match) produced 35% terrain ID
disagreement due to a 50 m grid offset between V1.0_26Mar.tif and the terrain
raster. Sample points at reference TIF pixel centres fall exactly on terrain
raster pixel boundaries, where a 3-5 µm NZTM-WGS84-NZTM coordinate roundtrip
shift flips the floor() pixel assignment. Shifting by 50 m resolves it
completely (200/200 terrain IDs match).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous 608-station file was a filtered reconstruction. This replaces
it with the exact 671-station output of sites_load_NSHM2022.load_vs()
from /home/arr65/src/Vs30_2026/vs30/, ensuring our pipeline uses identical
observation data to Jaehwi's code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mod6 flag conflated two independent modifications: alluvium slope
interpolation and coastal distance modification. When skip_alluvium_slope
was True and apply_coastal_distance_mod was False, mod6 evaluated to True,
causing coastal distance modification to run with dummy zero distances and
clamping all GID 4 (alluvium) pixels to vs30_min=240.0.

Fix: replace mod6/mod13 booleans with independent apply_alluvium_slope_mod
and apply_coastal_distance_mod controls. Rename skip_alluvium_slope to
apply_alluvium_slope_mod (positive logic, consistent _mod suffix). Remove
defaults so configs must explicitly specify both parameters.

200-point comparison improved from 82% to 99.5% within 0.01% of Jaehwi's
output (mean diff 8.9 → 0.032 m/s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shorten jaehwi_v1p0_reproduction_investigation.md to conclusions,
confirmed config, and 200-point comparison results. Condense
legacy_python_code_bug_analysis.md to a summary table and brief
descriptions of each bug (useful context for developers comparing
against Viktor's Python port).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…test config loading

- Consolidate mp.get_context("spawn") into parallel.py as single source
- Extract _default_correlation_functions() in pipeline.py
- Extract _collect_observation_csvs() in pipeline.py
- Move shared test config loading to conftest.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>