Improved benchmark PR check & faster validation #281
Merged
hmgaudecker merged 23 commits into main on Mar 23, 2026
Conversation
Benchmark comparison (main → HEAD)
Rename benchmark classes to descriptive names (e.g. TimeMahlerYum → MahlerYum, TimeSolve → PrecautionarySavingsSolve) and standardize methods to generic names (time_execution, peakmem_execution, track_compilation_time). Rewrite the PR comment formatter to parse ASV compare output into a grouped table with alert column. Fix hash collision issue (use full hash sliced to 8 chars instead of git rev-parse --short=8), paginate workflow comment listing, and handle missing gh CLI gracefully. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
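The hash fix described above can be illustrated in isolation. A minimal sketch (the helper name is ours, not the PR's code): `git rev-parse --short=8` lengthens the abbreviation when 8 characters would be ambiguous in the local object database, so two checkouts can disagree, whereas slicing the full hash is deterministic.

```python
def short_hash(full_sha: str) -> str:
    """Deterministic 8-char prefix of a full commit hash.

    Unlike `git rev-parse --short=8`, which may emit MORE than 8 chars
    when the abbreviation is ambiguous in a given repository, the slice
    is identical on every machine. (Illustrative helper only.)
    """
    return full_sha[:8]

print(short_hash("3f786850e387550fdab836ed7e6dc881de23001b"))  # 3f786850
```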
…hod.
Parse ASV's pipe-separated markdown output (was expecting whitespace-separated). Fall through to raw results when comparison has only n/a ratios (e.g. after benchmark renames).
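A minimal sketch of parsing a pipe-separated table like ASV's markdown compare output (column layout shown here is illustrative; the real formatter lives in `pr_comment.py`):

```python
def parse_pipe_table(text: str) -> list[list[str]]:
    """Parse rows from a pipe-separated markdown table.

    Skips non-table lines and the header separator row. A plain
    whitespace split would break on cell values containing spaces.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        if set(line) <= {"|", "-", ":", " "}:
            continue  # header separator like |----|----|
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    return rows

table = """\
| Change | Before | After | Ratio | Benchmark |
|--------|--------|-------|-------|-----------|
| +      | 1.20s  | 2.40s | 2.00  | MahlerYum.time_execution |
"""
print(parse_pipe_table(table)[1])
```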
Use grouped 3-column table (Benchmark, Statistic, Value) for raw results with display names, matching the comparison table layout.
Ensure HEAD result file exists under HEAD's hash (copy from latest when ASV ignores --set-commit-hash). Backfill merge-base result file with current benchmark names so asv compare finds matching entries after renames.
Filter out old benchmark names from comparison table so only current benchmarks with display names appear.
Drop small parameter variants (Solve 50, GridLookup 500), replace Simulate 1k with 100k subjects.
Also sync base entries when parameter sets change (not just when benchmark names are missing).
Replace O(n) Python loops over JAX arrays in validation.py with vectorized jnp.isin/jnp.unique/jnp.searchsorted operations. This reduces validation time from 30s to 0.2s for 100k subjects (138x). Also add check_initial_conditions=False to all benchmark simulate() calls so benchmarks measure simulation performance, not validation.
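The vectorization pattern can be sketched with hypothetical names (this is not the actual `validation.py` code): a membership check written as a Python loop forces one host/device round trip per element, while `jnp.isin` performs the whole test in a single device operation.

```python
import jax.numpy as jnp

def invalid_values_loop(observed, valid):
    # O(n) Python loop: iterating over a JAX array pulls every
    # element back to the host one at a time.
    valid_set = set(valid.tolist())
    return [v for v in observed.tolist() if v not in valid_set]

def invalid_mask_vectorized(observed, valid):
    # One vectorized membership test; the data stays on device.
    return ~jnp.isin(observed, valid)

observed = jnp.array([0, 1, 5, 2])
valid = jnp.array([0, 1, 2, 3])
print(invalid_mask_vectorized(observed, valid))
```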
With validation vectorized and benchmarks skipping validation, 1M subjects runs in 0.09s (hot) / 5s (compilation).
- Mortality: bump to 1M subjects
- Remove PrecautionarySavingsGridLookup (redundant with SimulateWithSolve)
- Add PrecautionarySavingsSimulateWithSolveIrreg for irreg grid coverage
- Bump SimulateWithSolve classes to 1M subjects
The check was already performed upstream in validate_initial_conditions which raises eagerly before _collect_structural_errors is called.
Use setdefault instead of get for base_results so updates are reflected in base_data when "results" key is missing. Also show benchmarks with unmapped class names (using raw name) instead of silently dropping them.
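The `setdefault` fix is a one-liner worth spelling out (variable names follow the commit message; the surrounding code is not shown):

```python
# Buggy pattern: when "results" is missing, .get returns a NEW dict
# that is never attached to base_data, so writes to it vanish.
base_data = {}
base_results = base_data.get("results", {})
base_results["MahlerYum.time_execution"] = 1.23
assert "results" not in base_data  # the update was lost

# Fixed pattern: .setdefault inserts the new dict into base_data
# first, so writes through base_results are visible via base_data.
base_data = {}
base_results = base_data.setdefault("results", {})
base_results["MahlerYum.time_execution"] = 1.23
assert base_data["results"]["MahlerYum.time_execution"] == 1.23
```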
1M subjects failed in ASV due to GPU memory pressure when run after Mahler-Yum (2.14G peak). 100k is still 100x the previous 1k and runs in 0.3s.
BTW, we do now see a difference between grid types, as expected by @timmens:
4.5-fold, but probably dwarfed in many models by other things. @mj023, you may want to merge this branch into #280 for the timings already, so we'll have one harness for benchmarking going forward. Probably most useful to add a benchmark of the batched version to the precautionary savings model.
timmens approved these changes, Mar 23, 2026
Summary
Split the single `asv-run-and-publish` task into two purpose-specific workflows and overhaul the PR benchmark comment format:

- `pixi run -e tests-cuda13 asv-run-and-pr-comment` — runs benchmarks and posts a grouped comparison table as a PR comment
- `pixi run -e tests-cuda13 asv-run-and-publish-main` — runs benchmarks and publishes to the dashboard
Benchmark naming & table format
- Rename benchmark classes to descriptive names (`TimeMahlerYum` → `MahlerYum`, `TimeSolve` → `PrecautionarySavingsSolve`) and standardize methods to generic names (`time_execution`, `peakmem_execution`, `track_compilation_time`)
- Rewrite `pr_comment.py` to parse ASV's pipe-separated markdown output into a grouped 6-column table: Benchmark, Statistic, before, after, Ratio, Alert
Streamlined benchmark parameters
- Remove `PrecautionarySavingsGridLookup` (redundant with SimulateWithSolve)
- Add `PrecautionarySavingsSimulateWithSolveIrreg` for irregular grid coverage
- Skip `check_initial_conditions` in benchmarks (measures simulation, not validation)

Vectorize validation (138x speedup)
Replace O(n) Python loops over JAX arrays in `validation.py` with vectorized `jnp.isin`/`jnp.unique`/`jnp.searchsorted` operations. Validation for 100k subjects drops from 30s to 0.2s. This makes large subject counts (1M) practical for both benchmarks and user code.
Other fixes
- Fix hash collisions in result filenames: use `head_sha_full[:8]` instead of `git rev-parse --short=8`
- Paginate workflow comment listing (`github.paginate`)
- Handle missing `gh` CLI with a helpful error message
- Ensure the HEAD result file exists under HEAD's hash when ASV ignores `--set-commit-hash`

CI workflow
The `benchmark-check` workflow checks for the `<!-- benchmark-check -->` comment with an embedded commit hash (current / stale / missing).
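A hypothetical sketch of the current/stale/missing classification (the exact marker format with the embedded hash is an assumption, not the workflow's actual code):

```python
import re

# Assumed marker shape: <!-- benchmark-check <8-char-hash> -->
MARKER = re.compile(r"<!-- benchmark-check ([0-9a-f]{8}) -->")

def comment_status(comment_body, head_sha_full):
    """Classify the benchmark comment relative to the PR's HEAD commit."""
    if comment_body is None:
        return "missing"
    match = MARKER.search(comment_body)
    if match is None:
        return "missing"
    return "current" if head_sha_full[:8] == match.group(1) else "stale"
```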
To satisfy the benchmark check
Will be updated/automated once we have a server for this.