
Formalize test inference with enriched metrics and auto-analysis#106

Draft
forklady42 wants to merge 7 commits into main from test/formalize-inference

Conversation

@forklady42
Collaborator

Summary

  • Enrich metrics.csv from 3 columns (rank, index, nmae) to 10, adding loss, max_pred, max_target, mean_pred, mean_target, num_electrons, and duration_ms — all computed per-sample over spatial dims
  • Flexible checkpoint resolution in test.py: checks ckpt_file > last.ckpt > best.ckpt > glob fallback, replacing the hardcoded last.ckpt
  • New summarize.py module: computes NMAE stats (mean/median/P95/P99/max), threshold counts, generates histogram + CDF plots, and optionally logs to W&B (image, table, histogram, scalar stats)
  • Auto-chain analysis after trainer.test(): summary + distribution plots always run; saturation and tail analysis run when applicable

Test plan

  • All 25 tests on main pass (uv run pytest)
  • Pre-commit (ruff lint + format) passes on all changed files
  • Run test inference on a checkpoint and verify metrics.csv has all 10 columns
  • Verify summary.txt and nmae_distribution.png are generated in log_dir
  • Verify analyze_saturation works on the enriched CSV (no more missing column errors)
  • Verify W&B logging works when wandb_mode: online
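The summary statistics being verified above (mean/median/P95/P99/max plus threshold counts) can be sketched with NumPy. The function name `nmae_summary` and the default thresholds are assumptions, not the actual summarize.py API.

```python
import numpy as np


def nmae_summary(nmae: np.ndarray, thresholds=(0.1, 0.2, 0.5)) -> dict:
    """Per-run NMAE summary stats: central tendency, tail percentiles, threshold counts.

    Illustrative sketch; threshold values are hypothetical defaults.
    """
    stats = {
        "mean": float(np.mean(nmae)),
        "median": float(np.median(nmae)),
        "p95": float(np.percentile(nmae, 95)),
        "p99": float(np.percentile(nmae, 99)),
        "max": float(np.max(nmae)),
    }
    # Count how many samples exceed each quality threshold.
    for t in thresholds:
        stats[f"count_above_{t}"] = int(np.sum(nmae > t))
    return stats
```

Scalar stats like these are also what gets logged to W&B, so the same dict can feed both summary.txt and the dashboard.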

🤖 Generated with Claude Code

forklady42 and others added 7 commits March 27, 2026 14:00
Enrich metrics.csv from 3 columns (rank, index, nmae) to 10 columns
adding loss, max_pred, max_target, mean_pred, mean_target,
num_electrons, and duration_ms. Add flexible checkpoint resolution
(ckpt_file > last.ckpt > best.ckpt > glob fallback) and automatic
post-test summary statistics with distribution plots. This unblocks
analyze_saturation.py which already expected max_pred/max_target
columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compute max_pred, max_target, mean_pred, mean_target, and
num_electrons per-sample by reducing over spatial dimensions only
(keeping the batch dimension). Previously these were batch-level
scalars that happened to be correct only with batch_size=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
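The per-sample reduction described in this commit can be sketched with NumPy. Array shape `(batch, H, W)` and the function name `per_sample_stats` are assumptions; in particular, deriving `num_electrons` from the target sum is a guess at the intent, not confirmed by the source.

```python
import numpy as np


def per_sample_stats(pred: np.ndarray, target: np.ndarray) -> dict:
    """Reduce over spatial axes only, keeping the batch axis.

    Assumes (batch, H, W) arrays. Reducing over *all* axes instead
    would give batch-level scalars that are only correct for batch_size=1.
    """
    spatial = tuple(range(1, pred.ndim))  # every axis except batch
    return {
        "max_pred": pred.max(axis=spatial),
        "max_target": target.max(axis=spatial),
        "mean_pred": pred.mean(axis=spatial),
        "mean_target": target.mean(axis=spatial),
        "num_electrons": target.sum(axis=spatial),  # assumption: count = target sum
    }
```

Each returned array has shape `(batch,)`, so every row of metrics.csv gets its own value regardless of batch size.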
The glob fallback picks the latest epoch by lexicographic sort,
not the lowest val_loss. Fix the docstring and comment to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
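The lexicographic-vs-numeric distinction this commit documents is easy to trip over once epoch numbers gain a digit. A small illustration (filenames and the `epoch_num` helper are hypothetical):

```python
import re

names = ["epoch=9.ckpt", "epoch=10.ckpt", "epoch=2.ckpt"]

# Lexicographic sort compares character by character, so "epoch=9" > "epoch=10".
latest_lex = sorted(names)[-1]


def epoch_num(name: str) -> int:
    """Extract the numeric epoch from a checkpoint filename (hypothetical format)."""
    m = re.search(r"epoch=(\d+)", name)
    return int(m.group(1)) if m else -1


# Numeric key gives the truly latest epoch.
latest_num = max(names, key=epoch_num)
```

With single-digit epochs both orderings agree, which is why the mismatch only shows up in the docstring rather than in behavior for short runs.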
No need for a separate normmae_fn when the only loss function is
NormMAE — both compute the same thing. Uses loss_fn for both the
nmae and loss columns in metrics.csv (they'll be identical for now).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When wandb_mode != "disabled", log to W&B after test inference:
- Distribution PNG as wandb.Image
- Per-sample metrics as wandb.Table for interactive filtering
- Native histogram for the overview panel
- Scalar summary stats (mean, median, P95, P99, max)

W&B is wired into the Trainer so Lightning's built-in test_loss
metric also appears in the dashboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
analyze_metrics does not create its output directory. The test
entrypoint now mkdir's saturation_dir before calling it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
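The fix in this commit amounts to creating the output directory idempotently before handing it to analyze_metrics. A sketch (the `ensure_dir` helper name is an assumption):

```python
from pathlib import Path


def ensure_dir(path: str) -> Path:
    """Create the directory (and any missing parents) if needed; safe to call repeatedly."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p
```

`parents=True` covers nested log-dir layouts and `exist_ok=True` makes re-runs on an existing log_dir a no-op instead of an error.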
