Vs30 refactor #42

Draft

AndrewRidden-Harper wants to merge 195 commits into master from vs30_refactor
Conversation

AndrewRidden-Harper (Contributor) commented Jan 27, 2026

This PR is a major refactor of the Vs30 package to improve readability, modularity, and testability. The mathematical expressions are equivalent to the original version, but use clearer variable names and simplified forms. Regression tests confirm that the output from the refactored codebase is consistent with the original version.

Design changes

Centralized configuration

  • All parameters that influence the calculated Vs30 values are defined in a single config.yaml file, rather than being hardcoded across various source files as in the original codebase.
  • Parameters can optionally be overridden via command-line arguments.
  • Configuration is validated at load time using Pydantic.
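As a rough illustration of load-time validation, the sketch below uses Pydantic models with nested sections and shows a command-line override taking precedence over the file value. The field names (`vs30_min`, `max_points`, `cluster_subsample_step`) are illustrative, not the package's actual schema.

```python
# Illustrative sketch only: field names are hypothetical, not the real config.yaml schema.
from pydantic import BaseModel, PositiveFloat, PositiveInt, ValidationError


class SpatialFitConfig(BaseModel):
    """Hypothetical subset of spatial-adjustment parameters."""
    max_points: PositiveInt = 50
    cluster_subsample_step: PositiveInt = 100


class Vs30Config(BaseModel):
    vs30_min: PositiveFloat
    spatial_fit: SpatialFitConfig = SpatialFitConfig()


# values loaded from config.yaml, with a CLI override applied on top
cfg = Vs30Config(vs30_min=240.0, spatial_fit={"cluster_subsample_step": 1})
print(cfg.spatial_fit.cluster_subsample_step)  # the override wins: 1

try:
    Vs30Config(vs30_min=-1.0)  # a negative velocity fails at load time
except ValidationError:
    print("rejected invalid vs30_min")
```

Failing fast at load time means a typo in `config.yaml` surfaces before any raster work starts, rather than deep inside the pipeline.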

Modular CLI

  • A Typer-based CLI (vs30) exposes the full pipeline as well as individual stages (update-categorical-vs30-models, make-initial-vs30-raster, spatial-fit, etc.), making it possible to run or re-run specific steps independently.
  • A new compute-at-locations command computes Vs30 at specific latitude/longitude points without generating full raster grids, which is efficient for querying a small number of sites.
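A minimal Typer app along these lines might look as follows; the command bodies and option names here are placeholders, not the real `vs30` implementation.

```python
# Hypothetical sketch of a stage-per-command Typer CLI; bodies are placeholders.
import typer
from typer.testing import CliRunner

app = typer.Typer(help="Vs30 pipeline stages (illustrative)")


@app.command("make-initial-vs30-raster")
def make_initial_vs30_raster(config: str = "config.yaml") -> None:
    """Placeholder stage: build the prior Vs30 raster."""
    typer.echo(f"building initial raster from {config}")


@app.command("compute-at-locations")
def compute_at_locations(lat: float, lon: float) -> None:
    """Placeholder stage: Vs30 at a single point, no raster grids needed."""
    typer.echo(f"computing Vs30 at ({lat}, {lon})")


# each stage can be exercised in isolation, as the CLI design intends
result = CliRunner().invoke(app, ["compute-at-locations", "41.3", "174.8"])
print(result.output.strip())
```

Registering each stage as its own command is what makes re-running a single step (after tweaking one parameter) cheap.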

Input Vs30 data

  • The original version had multiple modes of operation that selected which observed Vs30 dataset to use (e.g., "original" combined three hardcoded sources with dataset-specific filtering and downsampling; "cpt" loaded CPT data), with different processing and Bayesian update paths for each mode.
  • In the refactored version, data preparation and filtering are the user's responsibility. All observed Vs30 values that are passed in are used directly.
  • Observations are provided in one or both of two categories:
    • Clustered observations (typically dense CPT-inferred Vs30 values) — processed with DBSCAN clustering so that spatially grouped measurements are not over-weighted in the Bayesian update.
    • Independent observations — treated as individual measurements.
  • Both kinds of dataset can be used in the same run; at least one is required.
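The clustered/independent split can be sketched with scikit-learn's DBSCAN, which the description above names for grouping dense CPT points. The coordinates, `eps`, and `min_samples` below are illustrative, not the package's settings.

```python
# Sketch of DBSCAN de-clustering for dense CPT-inferred Vs30 points.
# eps/min_samples are illustrative values, not the package defaults.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two tight synthetic "CPT" clusters plus one isolated site (metre coordinates)
cluster_a = rng.normal([0.0, 0.0], 20.0, size=(50, 2))
cluster_b = rng.normal([5000.0, 5000.0], 20.0, size=(40, 2))
isolated = np.array([[20000.0, 20000.0]])
coords = np.vstack([cluster_a, cluster_b, isolated])

labels = DBSCAN(eps=200.0, min_samples=5).fit_predict(coords)
# label -1 marks isolated observations; points sharing a label form one
# spatial cluster that can be down-weighted in the Bayesian update
print(sorted(set(labels)))
```

Grouping the dense points first is what prevents, say, fifty co-located CPT soundings from counting as fifty independent pieces of evidence.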

Vs30 map generation

  • When spatially adjusting the Vs30 map by fitting multivariate normal (MVN) distributions to observed Vs30 values, the original package looped over all pixels in the raster (processing them in blocks), computing the distance from each pixel to all observations to determine which (if any) observations would affect it.
  • The refactored version reverses the approach: it first loops over observations, using bounding boxes to identify which map pixels will be affected, then loops only over those affected pixels for the more expensive MVN conditioning calculations.
  • When the number of observations is much smaller than the number of pixels in the map, this is substantially faster because the majority of pixels (those far from any observation) are never processed.
  • For clustered observations, the bounding box search uses a subsampled set of observations (by default, every 100th observation within each cluster). This is a good approximation because all observations within a spatial cluster affect nearly the same set of map pixels. Unclustered (isolated) observations are always included in full. The subsampling step is configurable and can be set to 1 to use all observations, at the cost of a slower bounding box search.
  • Separately, the max_points parameter caps the number of nearest observations used for the per-pixel MVN update, limiting the cost of the matrix inversion at each pixel.
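The observation-first strategy can be illustrated with a toy bounding-box search: gather the pixels inside each observation's influence box, then run the expensive per-pixel work only on that set. The function name and influence radius below are made up for the example.

```python
# Toy version of the observation-first pixel search described above.
# `affected_pixels` and `radius` are illustrative names, not the real API.
def affected_pixels(obs_xy, x0, y0, cell, nx, ny, radius):
    """Return the unique (row, col) pixels inside each observation's box."""
    affected = set()
    for x, y in obs_xy:
        c0 = max(int((x - radius - x0) // cell), 0)
        c1 = min(int((x + radius - x0) // cell), nx - 1)
        r0 = max(int((y - radius - y0) // cell), 0)
        r1 = min(int((y + radius - y0) // cell), ny - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                affected.add((r, c))
    return affected


# 1000x1000 grid, one observation: only an 11x11 neighbourhood is touched,
# so the other ~999,879 pixels are never visited at all
pixels = affected_pixels([(500.0, 500.0)], 0.0, 0.0, 1.0, 1000, 1000, 5.0)
print(len(pixels))  # 121
```

The speed-up claimed above falls out directly: cost scales with observations times box area rather than with the full pixel count.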

Multiprocessing

  • Both the raster pipeline and compute-at-locations support multiprocessing. The spatial adjustment work is divided into chunks of affected pixels distributed across worker processes.
  • BLAS thread oversubscription is managed at runtime using threadpoolctl, rather than requiring environment variables to be set before import.
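A minimal sketch of the threadpoolctl pattern: cap BLAS threads inside a scoped context at runtime, instead of exporting `OMP_NUM_THREADS`-style variables before numpy is imported. The matrix computation is a stand-in for the per-pixel MVN linear algebra.

```python
# Sketch: cap BLAS threads at runtime with threadpoolctl instead of
# requiring environment variables before import. The linear algebra here
# is a stand-in for the per-pixel MVN conditioning work.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(1).normal(size=(200, 200))

# inside a worker process: single-threaded BLAS for this region only,
# so N workers do not each spawn a full BLAS thread pool
with threadpool_limits(limits=1, user_api="blas"):
    inv = np.linalg.inv(a @ a.T + 200.0 * np.eye(200))

print(inv.shape)  # the thread cap ends when the context manager exits
```

Scoping the limit to the worker's hot loop avoids oversubscription (workers × BLAS threads > cores) without constraining BLAS elsewhere in the process.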

Test suite

  • A comprehensive pytest suite covers all modules: configuration, category assignment, Bayesian updates, raster creation, hybrid modifications, spatial adjustment, the CLI, parallelism, and full-pipeline regression tests.

AndrewRidden-Harper and others added 30 commits March 23, 2026 12:23
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rences

Remove 11 tests that merely exercised numpy/scipy library behavior (exponential
decay, monotonicity, shape checks). Keep 3 tests for non-obvious correctness:
Matérn κ=0.5 mathematical identity and picklability of resolved correlation
callables. Fix test_compute_at_locations to reference modified_foster_2019
resource files matching the benchmark CSV.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The modified_foster_2019 and foster_2019 categorical files were identical.
Point both configs (and test fixtures) to the foster_2019 versions and
delete the duplicate modified_foster_2019 categorical CSVs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file was already referenced by the modified_foster_2019 config but
was untracked. Contains 459 independent observations including 47 Q3
broadband stations retained by the modified model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematic experiment matrix (200-point comparison against Wellington
test subgrid) determined the actual parameters Jaehwi used for
V1.0_26Mar.tif:

- Bayesian update from priors (NOT Foster 2019 posteriors despite
  being the code default — Foster posteriors give 199 m/s error)
- Ratio 1:1 combination (50/50 geometric mean, not stdv weighting)
- No coastal distance modification (code was commented out)
- GID 4 slope interpolation skipped (mod6=True default)
- 671 reconstructed observations (McGann 276 + Wotherspoon 36 +
  Kaiser 359)

Add skip_alluvium_slope parameter to decouple GID 4 slope skip from
coastal distance flag (Jaehwi had these independent: mod6=True while
coastal distance was disabled separately). Thread through pipeline,
parallel, and CLI config loading.

Update jaehwi_v1p0.yaml: combination_method ratio, combine_ratio 1.0,
apply_coastal_distance_mod false, skip_alluvium_slope true.

Current match: 76.5% of points within 1 m/s, 18.03 m/s mean abs diff
(residual from MVN spatial adjustment sensitivity in high-observation
categories).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
compute_grid → grid_pipeline
compute_at_locations → points_pipeline
run_in_memory_pipeline_for_model_type → compute_model_grid

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…26Mar.tif either

Even with the same codebase and nproc=1, 15% of pixels (MVN-affected
near observations) differ by ~2.8% due to float32/caching artifacts
that depend on processing order. Confirms our pipeline reproduces the
intended model correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…duction artifacts

demonstrate_grid_artifacts.py: runnable tests showing float32 precision loss,
distance caching, and combined effects on MVN spatial adjustment.

why_we_cannot_reproduce_v1p0_grid_exactly.md: non-technical explanation for
colleagues unfamiliar with the codebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ran both pipelines in points mode on 200 random NZ locations. Geology and
terrain IDs match 100%. Median Vs30 diff is 0.0002 m/s. The 14.5% of points
with >1% diff are near observations where the reconstructed 671-station set
diverges slightly from Jaehwi's internal loader.

Also discovered and documented a terrain raster pixel boundary artifact:
V1.0_26Mar.tif pixel centres coincide with IwahashiPike.tif pixel boundaries
(50 m grid offset), so the NZTM-WGS84-NZTM coordinate roundtrip (~5 µm
shift) can flip terrain IDs at those exact locations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…solution

The identical IwahashiPike.tif raster (MD5 match) produced 35% terrain ID
disagreement due to a 50 m grid offset between V1.0_26Mar.tif and the terrain
raster. Sample points at reference TIF pixel centres fall exactly on terrain
raster pixel boundaries, where a 3-5 µm NZTM-WGS84-NZTM coordinate roundtrip
shift flips the floor() pixel assignment. Shifting by 50 m resolves it
completely (200/200 terrain IDs match).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous 608-station file was a filtered reconstruction. This replaces
it with the exact 671-station output of sites_load_NSHM2022.load_vs()
from /home/arr65/src/Vs30_2026/vs30/, ensuring our pipeline uses identical
observation data to Jaehwi's code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mod6 flag conflated two independent modifications: alluvium slope
interpolation and coastal distance modification. When skip_alluvium_slope
was True and apply_coastal_distance_mod was False, mod6 evaluated to True,
causing coastal distance modification to run with dummy zero distances and
clamping all GID 4 (alluvium) pixels to vs30_min=240.0.

Fix: replace mod6/mod13 booleans with independent apply_alluvium_slope_mod
and apply_coastal_distance_mod controls. Rename skip_alluvium_slope to
apply_alluvium_slope_mod (positive logic, consistent _mod suffix). Remove
defaults so configs must explicitly specify both parameters.

200-point comparison improved from 82% to 99.5% within 0.01% of Jaehwi's
output (mean diff 8.9 → 0.032 m/s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shorten jaehwi_v1p0_reproduction_investigation.md to conclusions,
confirmed config, and 200-point comparison results. Condense
legacy_python_code_bug_analysis.md to a summary table and brief
descriptions of each bug (useful context for developers comparing
against Viktor's Python port).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…test config loading

- Consolidate mp.get_context("spawn") into parallel.py as single source
- Extract _default_correlation_functions() in pipeline.py
- Extract _collect_observation_csvs() in pipeline.py
- Move shared test config loading to conftest.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>