Improved benchmark PR check & faster validation #281
Merged
hmgaudecker merged 23 commits into main on Mar 23, 2026
Conversation
Benchmark comparison (main → HEAD)
Rename benchmark classes to descriptive names (e.g. TimeMahlerYum → MahlerYum, TimeSolve → PrecautionarySavingsSolve) and standardize methods to generic names (time_execution, peakmem_execution, track_compilation_time). Rewrite the PR comment formatter to parse ASV compare output into a grouped table with alert column. Fix hash collision issue (use full hash sliced to 8 chars instead of git rev-parse --short=8), paginate workflow comment listing, and handle missing gh CLI gracefully. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
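The hash fix described above can be illustrated in isolation. A minimal sketch (the helper name is ours, not the PR's code): `git rev-parse --short=8` lengthens the abbreviation when 8 characters would be ambiguous in the local object database, so two checkouts can disagree, whereas slicing the full hash is deterministic.

```python
def short_hash(full_sha: str) -> str:
    """Deterministic 8-char prefix of a full commit hash.

    Unlike `git rev-parse --short=8`, which may emit MORE than 8 chars
    when the abbreviation is ambiguous in a given repository, the slice
    is identical on every machine. (Illustrative helper only.)
    """
    return full_sha[:8]

print(short_hash("3f786850e387550fdab836ed7e6dc881de23001b"))  # 3f786850
```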
…hod.
Parse ASV's pipe-separated markdown output (was expecting whitespace-separated). Fall through to raw results when comparison has only n/a ratios (e.g. after benchmark renames).
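A minimal sketch of parsing a pipe-separated table like ASV's markdown compare output (column layout shown here is illustrative; the real formatter lives in `pr_comment.py`):

```python
def parse_pipe_table(text: str) -> list[list[str]]:
    """Parse rows from a pipe-separated markdown table.

    Skips non-table lines and the header separator row. A plain
    whitespace split would break on cell values containing spaces.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        if set(line) <= {"|", "-", ":", " "}:
            continue  # header separator like |----|----|
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    return rows

table = """\
| Change | Before | After | Ratio | Benchmark |
|--------|--------|-------|-------|-----------|
| +      | 1.20s  | 2.40s | 2.00  | MahlerYum.time_execution |
"""
print(parse_pipe_table(table)[1])
```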
Use grouped 3-column table (Benchmark, Statistic, Value) for raw results with display names, matching the comparison table layout.
Ensure HEAD result file exists under HEAD's hash (copy from latest when ASV ignores --set-commit-hash). Backfill merge-base result file with current benchmark names so asv compare finds matching entries after renames.
Filter out old benchmark names from comparison table so only current benchmarks with display names appear.
Drop small parameter variants (Solve 50, GridLookup 500), replace Simulate 1k with 100k subjects.
Also sync base entries when parameter sets change (not just when benchmark names are missing).
Replace O(n) Python loops over JAX arrays in validation.py with vectorized jnp.isin/jnp.unique/jnp.searchsorted operations. This reduces validation time from 30s to 0.2s for 100k subjects (138x). Also add check_initial_conditions=False to all benchmark simulate() calls so benchmarks measure simulation performance, not validation.
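The vectorization pattern can be sketched with hypothetical names (this is not the actual `validation.py` code): a membership check written as a Python loop forces one host/device round trip per element, while `jnp.isin` performs the whole test in a single device operation.

```python
import jax.numpy as jnp

def invalid_values_loop(observed, valid):
    # O(n) Python loop: iterating over a JAX array pulls every
    # element back to the host one at a time.
    valid_set = set(valid.tolist())
    return [v for v in observed.tolist() if v not in valid_set]

def invalid_mask_vectorized(observed, valid):
    # One vectorized membership test; the data stays on device.
    return ~jnp.isin(observed, valid)

observed = jnp.array([0, 1, 5, 2])
valid = jnp.array([0, 1, 2, 3])
print(invalid_mask_vectorized(observed, valid))
```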
With validation vectorized and benchmarks skipping validation, 1M subjects runs in 0.09s (hot) / 5s (compilation).
- Mortality: bump to 1M subjects
- Remove PrecautionarySavingsGridLookup (redundant with SimulateWithSolve)
- Add PrecautionarySavingsSimulateWithSolveIrreg for irreg grid coverage
- Bump SimulateWithSolve classes to 1M subjects
The check was already performed upstream in validate_initial_conditions which raises eagerly before _collect_structural_errors is called.
Use setdefault instead of get for base_results so updates are reflected in base_data when "results" key is missing. Also show benchmarks with unmapped class names (using raw name) instead of silently dropping them.
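The `setdefault` fix is a one-liner worth spelling out (variable names follow the commit message; the surrounding code is not shown):

```python
# Buggy pattern: when "results" is missing, .get returns a NEW dict
# that is never attached to base_data, so writes to it vanish.
base_data = {}
base_results = base_data.get("results", {})
base_results["MahlerYum.time_execution"] = 1.23
assert "results" not in base_data  # the update was lost

# Fixed pattern: .setdefault inserts the new dict into base_data
# first, so writes through base_results are visible via base_data.
base_data = {}
base_results = base_data.setdefault("results", {})
base_results["MahlerYum.time_execution"] = 1.23
assert base_data["results"]["MahlerYum.time_execution"] == 1.23
```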
1M subjects failed in ASV due to GPU memory pressure when run after Mahler-Yum (2.14G peak). 100k is still 100x the previous 1k and runs in 0.3s.
BTW, we do now see a difference between grid types, as expected by @timmens:
4.5-fold, but probably dwarfed in many models by other things. @mj023, you may want to merge this branch into #280 for the timings already, so we'll have one harness for benchmarking going forward. Probably most useful to add a benchmark of the batched version to the precautionary savings model.
timmens approved these changes, Mar 23, 2026
Summary
Split the single `asv-run-and-publish` task into two purpose-specific workflows and overhaul the PR benchmark comment format:

- `pixi run -e tests-cuda13 asv-run-and-pr-comment` — runs benchmarks and posts a grouped comparison table as a PR comment
- `pixi run -e tests-cuda13 asv-run-and-publish-main` — runs benchmarks and publishes to the dashboard
Benchmark naming & table format
- Rename benchmark classes to descriptive names (`TimeMahlerYum` → `MahlerYum`, `TimeSolve` → `PrecautionarySavingsSolve`) and standardize methods to generic names (`time_execution`, `peakmem_execution`, `track_compilation_time`)
- Rewrite `pr_comment.py` to parse ASV's pipe-separated markdown output into a grouped 6-column table: Benchmark, Statistic, before, after, Ratio, Alert
Streamlined benchmark parameters
- Remove `PrecautionarySavingsGridLookup` (redundant with SimulateWithSolve)
- Add `PrecautionarySavingsSimulateWithSolveIrreg` for irregular grid coverage
- Skip `check_initial_conditions` in benchmarks (measures simulation, not validation)

Vectorize validation (138x speedup)
Replace O(n) Python loops over JAX arrays in `validation.py` with vectorized `jnp.isin`/`jnp.unique`/`jnp.searchsorted` operations. Validation for 100k subjects drops from 30s to 0.2s. This makes large subject counts (1M) practical for both benchmarks and user code.
Other fixes
- Fix hash collisions in result filenames: use `head_sha_full[:8]` instead of `git rev-parse --short=8`
- Paginate workflow comment listing (`github.paginate`)
- Handle missing `gh` CLI with a helpful error message
- Ensure the HEAD result file exists under HEAD's hash when ASV ignores `--set-commit-hash`

CI workflow
The `benchmark-check` workflow checks for the `<!-- benchmark-check -->` comment with an embedded commit hash (current / stale / missing).
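A hypothetical sketch of the current/stale/missing classification (the exact marker format with the embedded hash is an assumption, not the workflow's actual code):

```python
import re

# Assumed marker shape: <!-- benchmark-check <8-char-hash> -->
MARKER = re.compile(r"<!-- benchmark-check ([0-9a-f]{8}) -->")

def comment_status(comment_body, head_sha_full):
    """Classify the benchmark comment relative to the PR's HEAD commit."""
    if comment_body is None:
        return "missing"
    match = MARKER.search(comment_body)
    if match is None:
        return "missing"
    return "current" if head_sha_full[:8] == match.group(1) else "stale"
```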
To satisfy the benchmark check
Will be updated/automated once we have a server for this.