
Improved benchmark PR check & faster validation #281

Merged
hmgaudecker merged 23 commits into main from improved-benchmark-pr-check
Mar 23, 2026
Conversation

@hmgaudecker
Member

@hmgaudecker hmgaudecker commented Mar 21, 2026

Summary

Split the single asv-run-and-publish task into two purpose-specific workflows
and overhaul the PR benchmark comment format.

  • PR branches: pixi run -e tests-cuda13 asv-run-and-pr-comment — runs benchmarks
    and posts a grouped comparison table as a PR comment
  • Main branch: pixi run -e tests-cuda13 asv-run-and-publish-main — runs benchmarks
    and publishes to the dashboard

Benchmark naming & table format

  • Rename benchmark classes to descriptive names (e.g. TimeMahlerYum → MahlerYum,
    TimeSolve → PrecautionarySavingsSolve) and standardize methods to generic names
    (time_execution, peakmem_execution, track_compilation_time)
  • Rewrite pr_comment.py to parse ASV's pipe-separated markdown output into a grouped
    6-column table: Benchmark, Statistic, before, after, Ratio, Alert
  • Alert column shows a cross mark when ratio > 1.10
  • Commit hashes rendered as clickable links
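As a sketch, the alert rule above could look like this (illustrative only; the actual logic lives in pr_comment.py and may differ in detail):

```python
ALERT_THRESHOLD = 1.10  # from the PR description: cross mark when ratio > 1.10


def format_row(benchmark: str, statistic: str, before: str, after: str, ratio: float) -> str:
    """Format one row of the 6-column comparison table (illustrative helper)."""
    alert = "❌" if ratio > ALERT_THRESHOLD else ""
    return f"| {benchmark} | {statistic} | {before} | {after} | {ratio:.2f} | {alert} |"
```
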

Streamlined benchmark parameters

  • Remove PrecautionarySavingsGridLookup (redundant with SimulateWithSolve)
  • Add PrecautionarySavingsSimulateWithSolveIrreg for irregular grid coverage
  • Bump subject counts: Simulate to 1M, SimulateWithSolve to 500k, Mortality to 100k
  • Skip check_initial_conditions in benchmarks (measures simulation, not validation)

Vectorize validation (138x speedup)

Replace O(n) Python loops over JAX arrays in validation.py with vectorized
jnp.isin / jnp.unique / jnp.searchsorted operations. Validation for 100k
subjects drops from 30s to 0.2s. This makes large subject counts (1M) practical
for both benchmarks and user code.
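A minimal sketch of the change, using numpy (whose API jnp mirrors for these functions); the actual validation.py code differs in detail:

```python
import numpy as np  # jnp.isin has the same signature; numpy used for illustration


def valid_mask_loop(values, allowed):
    # Before: an O(n) Python loop. On JAX arrays, each element access
    # forces a host-device round trip, which is what made this slow.
    return np.array([v in allowed for v in values])


def valid_mask_vectorized(values, allowed):
    # After: one vectorized membership test, executed on-device under JAX.
    return np.isin(values, allowed)


values = np.array([0, 1, 2, 5])
allowed = np.array([0, 1, 2, 3])
assert (valid_mask_loop(values, allowed) == valid_mask_vectorized(values, allowed)).all()
```
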

Other fixes

  • Fix hash collision: use head_sha_full[:8] instead of git rev-parse --short=8
  • Paginate workflow comment listing (github.paginate)
  • Handle missing gh CLI with helpful error message
  • Backfill merge-base result file when benchmark names or params change
  • Ensure HEAD result file exists under HEAD's hash when ASV ignores --set-commit-hash
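The hash fix hinges on a subtle git behavior: --short=8 returns the shortest unique abbreviation of *at least* 8 characters, so its output can lengthen when a prefix is ambiguous, while slicing the full SHA is always deterministic. A minimal illustration (the full SHA below is an invented example):

```python
# `git rev-parse --short=8` may emit more than 8 characters when the
# 8-character prefix is ambiguous in the repo; slicing the full SHA never does.
head_sha_full = "c3dd9f47e1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6"  # example, not a real commit

short = head_sha_full[:8]
assert short == "c3dd9f47" and len(short) == 8
```
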

CI workflow

The benchmark-check workflow looks for the <!-- benchmark-check --> comment
with an embedded commit hash and classifies it as current, stale, or missing.
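A sketch of that classification (the marker text is from the PR; the exact embedded-hash format and the helper function are assumptions):

```python
import re

# Assumed format: the 8-character head SHA embedded in the marker comment.
MARKER_RE = re.compile(r"<!-- benchmark-check:\s*([0-9a-f]{8}) -->")


def benchmark_comment_state(comment_bodies, head_sha_full):
    """Classify the benchmark comment as 'current', 'stale', or 'missing'."""
    for body in comment_bodies:
        match = MARKER_RE.search(body)
        if match:
            return "current" if match.group(1) == head_sha_full[:8] else "stale"
    return "missing"
```
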

To satisfy the benchmark check, run:

pixi run -e tests-cuda13 asv-run-and-pr-comment

Will be updated/automated once we have a server for this.

@read-the-docs-community

read-the-docs-community bot commented Mar 21, 2026

Documentation build overview

📚 pylcm | 🛠️ Build #31919517 | 📁 Comparing c3dd9f4 against latest (ca71893)


🔍 Preview build

Show files changed (32 files in total): 📝 32 modified | ➕ 0 added | ➖ 0 deleted
| File | Status |
| --- | --- |
| index.html | 📝 modified |
| approximating-continuous-shocks/index.html | 📝 modified |
| benchmarking/index.html | 📝 modified |
| benchmarking-1/index.html | 📝 modified |
| beta-delta/index.html | 📝 modified |
| conventions/index.html | 📝 modified |
| debugging/index.html | 📝 modified |
| defining-models/index.html | 📝 modified |
| dispatchers/index.html | 📝 modified |
| function-representation/index.html | 📝 modified |
| grids/index.html | 📝 modified |
| index-1/index.html | 📝 modified |
| index-2/index.html | 📝 modified |
| index-3/index.html | 📝 modified |
| index-4/index.html | 📝 modified |
| installation/index.html | 📝 modified |
| interpolation/index.html | 📝 modified |
| mahler-yum-2024/index.html | 📝 modified |
| mortality/index.html | 📝 modified |
| pandas-interop/index.html | 📝 modified |
| parameters/index.html | 📝 modified |
| precautionary-savings/index.html | 📝 modified |
| precautionary-savings-health/index.html | 📝 modified |
| regimes/index.html | 📝 modified |
| setup/index.html | 📝 modified |
| shocks/index.html | 📝 modified |
| solving-and-simulating/index.html | 📝 modified |
| stochastic-transitions/index.html | 📝 modified |
| tiny/index.html | 📝 modified |
| tiny-example/index.html | 📝 modified |
| transitions/index.html | 📝 modified |
| write-economics/index.html | 📝 modified |

@hmgaudecker
Member Author

hmgaudecker commented Mar 21, 2026

Benchmark comparison (main → HEAD)

Comparing ca718939 (main) → c3dd9f47 (HEAD)

| Benchmark | Statistic | before | after | Ratio | Alert |
| --- | --- | --- | --- | --- | --- |
| Mahler-Yum | execution time | 3.93±0.05s | 3.71±0.01s | 0.94 | |
| | compilation time | 1.33m | 1.32m | 1.00 | |
| | peak mem usage | 2.14G | 2.14G | 1.00 | |
| Mortality | execution time | 145±4ms | 144±1ms | 0.99 | |
| | compilation time | 7.47s | 7.45s | 1.00 | |
| | peak mem usage | 1.2G | 1.21G | 1.00 | |
| Precautionary Savings - Solve (200) | execution time | 19.1±0.6ms | 20.1±3ms | 1.05 | |
| | compilation time | 2.74s | 2.73s | 0.99 | |
| | peak mem usage | 1.03G | 1.03G | 1.00 | |
| Precautionary Savings - Solve (500) | execution time | 19.4±1ms | 21.3±1ms | 1.09 | |
| | compilation time | 3.33s | 3.30s | 0.99 | |
| | peak mem usage | 1.04G | 1.03G | 1.00 | |
| Precautionary Savings - Simulate (1000000) | execution time | 77.3±0.8ms | 80.5±3ms | 1.04 | |
| | compilation time | 4.55s | 4.56s | 1.00 | |
| | peak mem usage | 1.16G | 1.16G | 1.00 | |
| Precautionary Savings - Solve & Simulate | execution time | 92.3±2ms | 93.8±2ms | 1.02 | |
| | compilation time | 7.37s | 7.40s | 1.00 | |
| | peak mem usage | 1.17G | 1.17G | 1.00 | |
| Precautionary Savings - Solve & Simulate (irreg) | execution time | 422±1ms | 422±1ms | 1.00 | |
| | compilation time | 8.23s | 8.22s | 1.00 | |
| | peak mem usage | 1.22G | 1.23G | 1.00 | |

hmgaudecker and others added 18 commits March 21, 2026 14:26
Rename benchmark classes to descriptive names (e.g. TimeMahlerYum → MahlerYum,
TimeSolve → PrecautionarySavingsSolve) and standardize methods to generic names
(time_execution, peakmem_execution, track_compilation_time). Rewrite the PR
comment formatter to parse ASV compare output into a grouped table with alert
column. Fix hash collision issue (use full hash sliced to 8 chars instead of
git rev-parse --short=8), paginate workflow comment listing, and handle missing
gh CLI gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hod.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parse ASV's pipe-separated markdown output (was expecting whitespace-
separated). Fall through to raw results when comparison has only n/a
ratios (e.g. after benchmark renames).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use grouped 3-column table (Benchmark, Statistic, Value) for raw results
with display names, matching the comparison table layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure HEAD result file exists under HEAD's hash (copy from latest when
ASV ignores --set-commit-hash). Backfill merge-base result file with
current benchmark names so asv compare finds matching entries after
renames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Filter out old benchmark names from comparison table so only current
benchmarks with display names appear.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop small parameter variants (Solve 50, GridLookup 500), replace
Simulate 1k with 100k subjects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also sync base entries when parameter sets change (not just when
benchmark names are missing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace O(n) Python loops over JAX arrays in validation.py with
vectorized jnp.isin/jnp.unique/jnp.searchsorted operations. This
reduces validation time from 30s to 0.2s for 100k subjects (138x).

Also add check_initial_conditions=False to all benchmark simulate()
calls so benchmarks measure simulation performance, not validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With validation vectorized and benchmarks skipping validation,
1M subjects runs in 0.09s (hot) / 5s (compilation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Mortality: bump to 1M subjects
- Remove PrecautionarySavingsGridLookup (redundant with SimulateWithSolve)
- Add PrecautionarySavingsSimulateWithSolveIrreg for irreg grid coverage
- Bump SimulateWithSolve classes to 1M subjects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The check was already performed upstream in validate_initial_conditions
which raises eagerly before _collect_structural_errors is called.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use setdefault instead of get for base_results so updates are reflected
in base_data when "results" key is missing. Also show benchmarks with
unmapped class names (using raw name) instead of silently dropping them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1M subjects failed in ASV due to GPU memory pressure when run after
Mahler-Yum (2.14G peak). 100k is still 100x the previous 1k and runs
in 0.3s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hmgaudecker hmgaudecker changed the title Improved benchmark PR check Improved benchmark PR check & faster validation Mar 22, 2026
@hmgaudecker hmgaudecker requested a review from timmens March 22, 2026 14:34
hmgaudecker and others added 2 commits March 22, 2026 17:08
@hmgaudecker
Member Author

hmgaudecker commented Mar 23, 2026

BTW, we do now see a difference between grid types, as expected by @timmens:

| Benchmark | Statistic | before | after | Ratio | Alert |
| --- | --- | --- | --- | --- | --- |
| Precautionary Savings - Solve & Simulate | execution time | 92.3±2ms | 93.8±2ms | 1.02 | |
| | compilation time | 7.37s | 7.40s | 1.00 | |
| | peak mem usage | 1.17G | 1.17G | 1.00 | |
| Precautionary Savings - Solve & Simulate (irreg) | execution time | 422±1ms | 422±1ms | 1.00 | |
| | compilation time | 8.23s | 8.22s | 1.00 | |
| | peak mem usage | 1.22G | 1.23G | 1.00 | |

A 4.5-fold difference in execution time, but probably dwarfed in many models by other things.

@mj023, you may want to merge this branch into #280 for the timings already, so we'll have one harness for benchmarking going forward. Probably most useful to add a benchmark of the batched version to the precautionary savings model.

@hmgaudecker hmgaudecker merged commit 811e2ed into main Mar 23, 2026
10 checks passed
@hmgaudecker hmgaudecker deleted the improved-benchmark-pr-check branch March 23, 2026 08:00