Add calibration package checkpointing, target config, and hyperparameter CLI#538
Draft
Conversation
juaristi22
reviewed
Feb 18, 2026
Collaborator
Minor comments, but generally LGTM. I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!
Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?
Force-pushed from 4c51b32 to 61523d8
Force-pushed from 61523d8 to 6744481
…ter CLI
- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Modal calibration runner was missing --lambda-l0 passthrough. Also fix KeyError: Ellipsis when load_dataset() returns dicts instead of h5py datasets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upload a pre-built calibration package to Modal and run only the fitting phase, skipping HuggingFace download and matrix build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv, epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of creating a new Microsimulation per clone (~3 min each, 22 hours for 436 clones), precompute values for all 51 states on one sim object (~3 min total), then assemble per-clone values via numpy fancy indexing (~microseconds per clone). New methods: _build_state_values, _assemble_clone_values, _evaluate_constraints_from_values, _calculate_target_values_from_values. DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds. Takeup re-randomization deferred to future post-processing layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
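The precompute-then-index pattern above can be sketched as follows. This is an illustrative stand-in, not the PR's implementation: the array shapes, the random "simulation" values, and the function bodies are hypothetical, though the method names follow the commit message.

```python
import numpy as np

def build_state_values(n_states: int, n_records: int) -> np.ndarray:
    # Stand-in for the expensive per-state simulation: in the real code
    # this is ~3 minutes total for all 51 states on one sim object.
    rng = np.random.default_rng(0)
    return rng.normal(size=(n_states, n_records))

def assemble_clone_values(state_values: np.ndarray,
                          clone_state_ids: np.ndarray) -> np.ndarray:
    # numpy fancy indexing: each clone's row is copied from its state's
    # precomputed row, so no per-clone re-simulation is needed.
    return state_values[clone_state_ids]

state_values = build_state_values(n_states=51, n_records=10)
clone_states = np.array([0, 0, 5, 50])  # states assigned to 4 clones
clone_values = assemble_clone_values(state_values, clone_states)
assert clone_values.shape == (4, 10)
assert np.array_equal(clone_values[1], state_values[0])
```

The speedup comes from replacing 436 Microsimulation constructions with one precompute pass plus cheap array indexing per clone.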
- Modal runner: add --package-volume flag to read the calibration package from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments to prevent CUDA memory fragmentation during the L0 backward pass
- docs/calibration.md: rewrite to lead with the lightweight build-then-fit workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 59b27a8 to 0a0f167
- target_config.yaml: exclude everything except person_count/age (~8,766 targets) to isolate fitting issues from zero-target and zero-row-sum problems in policy variables
- target_config_full.yaml: backup of the previous full config
- unified_calibration.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments to fix CUDA memory fragmentation during the backward pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_target_config: support 'include' rules (keep only matching targets) in addition to 'exclude' rules; geo_level is now optional
- target_config.yaml: a 3-line include config replaces the 90-line exclusion list for age demographics (person_count with age domain, ~8,784 targets)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
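The include/exclude semantics described above can be sketched as a small filter. This is a hypothetical reconstruction: the real apply_target_config's signature and the rule keys may differ, and the sample targets are invented.

```python
def apply_target_config(targets, config):
    """Filter a list of target dicts by 'include' or 'exclude' rules."""
    def matches(target, rule):
        # geo_level is optional: a rule that omits it matches any geography.
        for key in ("variable", "domain", "geo_level"):
            if key in rule and target.get(key) != rule[key]:
                return False
        return True

    if "include" in config:
        # Keep only targets matching at least one include rule.
        return [t for t in targets
                if any(matches(t, r) for r in config["include"])]
    rules = config.get("exclude", [])
    return [t for t in targets
            if not any(matches(t, r) for r in rules)]

targets = [
    {"variable": "person_count", "domain": "age", "geo_level": "district"},
    {"variable": "snap", "domain": None, "geo_level": "state"},
]
kept = apply_target_config(
    targets, {"include": [{"variable": "person_count", "domain": "age"}]}
)
assert [t["variable"] for t in kept] == ["person_count"]
```

An include rule inverts the filter's default: instead of enumerating everything to drop, the config names the one family of targets to keep, which is why three lines can replace a 90-line exclusion list.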
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The roth_ira_contributions target has zero row sum (no CPS records), making it impossible to calibrate. Remove it from target_config.yaml so Modal runs don't waste epochs on an unachievable target. Also adds `python -m policyengine_us_data.calibration.validate_package` CLI tool for pre-upload package validation, with automatic validation on --build-only runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
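The zero-row-sum condition motivating this removal is easy to check mechanically, and is presumably the kind of check validate_package performs. A minimal sketch (function name and matrix layout are illustrative, with one row per target and one column per record):

```python
import numpy as np

def find_unachievable_targets(M: np.ndarray, names: list) -> list:
    # A target whose matrix row sums to zero has no supporting records:
    # no choice of nonnegative weights can move its estimate off zero.
    row_sums = np.abs(M).sum(axis=1)
    return [name for name, s in zip(names, row_sums) if s == 0]

M = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 0.0]])  # second target has no supporting records
bad = find_unachievable_targets(M, ["age", "roth_ira_contributions"])
assert bad == ["roth_ira_contributions"]
```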
Achievability analysis showed 9 district-level IRS dollar variables have per-household values 5-27x too high in the extended CPS, making them irreconcilable with count targets (needed_w ~0.04-0.2 vs ~26). Drop salt, AGI, income_tax, dividend/interest vars, QBI deduction, taxable IRA distributions, income_tax_positive, traditional IRA. Add ACA PTC district targets (aca_ptc + tax_unit_count). Save calibration package BEFORE target_config filtering so the full matrix can be reused with different configs without rebuilding. Also: population-based initial weights from age targets per CD, cumulative epoch numbering in chunked logging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
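One plausible reading of the needed_w diagnostic above is the uniform household weight that would exactly hit a target: the target total divided by the matrix row sum. Under that assumption (the real analysis may compute it differently), a sketch:

```python
import numpy as np

def needed_w(target_totals: np.ndarray, M: np.ndarray) -> np.ndarray:
    # needed_w[i] = uniform weight that would exactly hit target i.
    # Values far from the ~26 implied by count targets flag dollar
    # targets that cannot be reconciled with the counts.
    row_sums = M.sum(axis=1)
    return np.divide(target_totals, row_sums,
                     out=np.full_like(target_totals, np.nan),
                     where=row_sums != 0)

M = np.array([[10.0, 10.0],       # count-like target
              [5000.0, 5000.0]])  # dollar values overstated in the data
totals = np.array([520.0, 2000.0])
w = needed_w(totals, M)
assert np.isclose(w[0], 26.0)
assert w[1] < 1.0  # incompatible with the ~26 needed by the count target
```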
PUF cloning already happens upstream in extended_cps.py, so the --puf-dataset flag in the calibration pipeline was redundant (and would have doubled the data a second time). Removed the flag, _build_puf_cloned_dataset function, and all related params. Added 4 compatible national targets: child_support_expense, child_support_received, health_insurance_premiums_without_medicare_part_b, and rent (all needed_w 27-37, compatible with count targets at ~26). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add module-level picklable worker functions (_process_single_clone, _init_clone_worker) and standalone helpers for constraint evaluation and target-value calculation usable by worker processes
- Pre-extract variable_entity_map to avoid pickling TaxBenefitSystem
- Branch the clone loop on the workers param: parallel (workers > 1) uses ProcessPoolExecutor with the initializer pattern; sequential is unchanged
- Add parallel state/county precomputation with per-state fresh sims
- Add tests for picklability, pool creation, parallel branching, and clone loop infrastructure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Migrate from changelog_entry.yaml to towncrier fragments
* Format bump_version.py with black
* Replace old changelog workflows with towncrier fragment check
  - Replace pr_changelog.yaml fork-check + reusable changelog check with a simple towncrier fragment existence check
  - Delete reusable_changelog_check.yaml (no longer needed)
  - Delete check-changelog-entry.sh (checked for the old changelog_entry.yaml)
  - Update versioning.yaml to use towncrier build instead of yaml-changelog

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Runs all ETL scripts (create_database_tables, create_initial_strata, etl_national_targets, etl_age, etl_medicaid, etl_snap, etl_state_income_tax, etl_irs_soi, validate_database) in sequence and validates the resulting SQLite database for:
- Expected tables (strata, stratum_constraints, targets)
- National targets include key variables (snap, social_security, ssi)
- State income tax targets cover 42+ states with CA > $100B
- Congressional district strata for 435+ districts
- All target variables exist in policyengine-us
- Total target count > 1000

This prevents API mismatches and import errors from going undetected when ETL scripts are modified.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion/ Move all calibration code from datasets/cps/local_area_calibration/ to calibration/, update imports across the codebase, add validate_staging module, and improve unified calibration with target config support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After adding main_promote as a second entrypoint, Modal can no longer infer which function to run without an explicit specifier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build functions (build_state_h5, etc.) print banners to stdout, which gets captured by the subprocess and mixed with the JSON output. This caused json.loads() to fail with "Failed to parse output" for all 8 workers, returning empty completed/failed lists. The pipeline then silently continued past the error check (total_failed == 0) and uploaded stale files. Fix: redirect stdout to stderr during worker processing, restore for JSON output. Also fail the build when errors exist but nothing completed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
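The fix can be sketched with contextlib.redirect_stdout. The worker below is a hypothetical stand-in (banner text and JSON payload are invented), but the mechanism matches the description: redirect stdout to stderr while build functions run, then print JSON on the restored stdout.

```python
import contextlib
import io
import json
import sys

def run_worker():
    # Banners printed during processing are diverted to stderr so the
    # parent process reads nothing but JSON from stdout.
    with contextlib.redirect_stdout(sys.stderr):
        print("building state h5...")  # would otherwise pollute the JSON
    print(json.dumps({"completed": ["CA"], "failed": []}))

# Simulate the parent capturing the worker's two streams.
stdout_buf, stderr_buf = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(stdout_buf), \
        contextlib.redirect_stderr(stderr_buf):
    run_worker()

assert json.loads(stdout_buf.getvalue()) == {"completed": ["CA"], "failed": []}
assert "building state h5" in stderr_buf.getvalue()
```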
Instead of trusting worker JSON results alone (which broke when stdout was polluted), now reload the volume after builds and count actual h5 files. The build fails if the volume has fewer files than expected, regardless of what workers reported. This makes the checkpoint system the source of truth for build completeness. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
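A minimal sketch of that recount check, assuming a directory of per-district .h5 files (the function name and error message are illustrative):

```python
import tempfile
from pathlib import Path

def verify_build(volume_dir: Path, expected: int) -> None:
    # Count files actually present on the volume; fail if short,
    # regardless of what the worker JSON reported.
    built = sorted(volume_dir.glob("*.h5"))
    if len(built) < expected:
        raise RuntimeError(
            f"expected {expected} h5 files, found {len(built)}"
        )

with tempfile.TemporaryDirectory() as d:
    vol = Path(d)
    for name in ("CA-01.h5", "CA-02.h5"):
        (vol / name).touch()
    verify_build(vol, expected=2)  # passes: both files exist
    try:
        verify_build(vol, expected=3)
        raise AssertionError("should have failed")
    except RuntimeError:
        pass  # correctly detects the missing file
```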
At-large districts (AK, DE, ND, SD, VT, WY) have GEOID ending in 00 (e.g., DE=1000) but display as XX-01 via max(cd%100, 1). The worker naively converted DE-01 back to 1001 which didn't exist in the DB. Now tries the direct conversion first, then falls back to finding the sole CD for that state's FIPS prefix (at-large case). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
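The direct-then-fallback lookup can be sketched as below. The GEOID set and FIPS table are tiny illustrative stand-ins for the database query the worker actually performs.

```python
STATE_FIPS = {"DE": 10, "CA": 6}
KNOWN_GEOIDS = {1000, 601, 602}  # DE at-large = 1000; CA-01, CA-02

def display_to_geoid(name: str) -> int:
    state, district = name.split("-")
    fips = STATE_FIPS[state]
    direct = fips * 100 + int(district)
    if direct in KNOWN_GEOIDS:
        return direct  # normal case: DB has this exact GEOID
    # At-large fallback: the display name XX-01 maps to the sole
    # GEOID sharing this state's FIPS prefix (e.g. DE-01 -> 1000).
    candidates = [g for g in KNOWN_GEOIDS if g // 100 == fips]
    if len(candidates) == 1:
        return candidates[0]
    raise KeyError(name)

assert display_to_geoid("CA-02") == 602
assert display_to_geoid("DE-01") == 1000  # 1001 doesn't exist in the DB
```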
The Modal volume was caching old calibration inputs from previous runs. The code only checked file existence, not freshness, so new model fits on HuggingFace were never pulled. Also clear the version build directory to prevent stale h5 files (built from old weights) from being treated as completed by the volume checkpoint system. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DC (GEOID 1198, district 98) and at-large states (GEOID XX00, district 00) should all display as XX-01. Previously max(d, 1) only handled 00, producing DC-98.h5 instead of DC-01.h5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers now always re-draw takeup using block-level seeded draws, matching the calibration matrix builder's computation. This fixes H5 files producing aca_ptc values 6-40x off from calibration targets.

Pipeline changes:
- publish_local_area: thread rerandomize_takeup/blocks/filter params
- worker_script: always rerandomize, optionally use calibration blocks
- local_area: pass blocks path to workers when available
- huggingface: optionally download stacked_blocks.npy
- unified_calibration: print BLOCKS_PATH for Modal collection
- remote_calibration_runner: collect, save, and upload blocks to HF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
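The property the fix relies on is that block-level seeded draws are deterministic, so builder and workers derive identical takeup masks. A sketch under an assumed seed scheme (the real derivation of seeds from blocks may differ):

```python
import numpy as np

def draw_takeup(block_ids: np.ndarray, p_takeup: float,
                base_seed: int = 0) -> np.ndarray:
    # One RNG per block, seeded from the block id: any process holding
    # the same block assignments reproduces the same draws.
    out = np.empty(block_ids.shape[0], dtype=bool)
    for block in np.unique(block_ids):
        rng = np.random.default_rng(base_seed + int(block))
        mask = block_ids == block
        out[mask] = rng.random(mask.sum()) < p_takeup
    return out

blocks = np.array([7, 7, 3, 3, 7])
a = draw_takeup(blocks, p_takeup=0.6)
b = draw_takeup(blocks, p_takeup=0.6)
assert np.array_equal(a, b)  # builder and worker agree exactly
```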
- Rename w_district_calibration.npy and unified_weights.npy to calibration_weights.npy everywhere (HF paths, local defaults, docs)
- Add upload_calibration_artifacts() to huggingface.py for atomic multi-file HF uploads (weights + blocks + logs in one commit)
- Add --upload flag (replaces --upload-logs) and --trigger-publish flag to remote_calibration_runner.py
- Add _trigger_repository_dispatch() for GitHub workflow auto-trigger
- Remove dead _upload_logs_to_hf() and _upload_calibration_artifact()
- Add scripts/upload_calibration.py CLI + make upload-calibration target
- Update modal_app/README.md with new flags and artifact table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Chains make data, upload-dataset (API direct to HF), calibrate-modal (GPU fit + upload weights), and stage-h5s (build + stage H5s). Configurable via GPU, EPOCHS, BRANCH, NUM_WORKERS variables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes #533
Fixes #534
Fixes #558
Fixes #559
Fixes #562
Summary
- `--build-only` saves the expensive matrix build as a pickle; `--package-path` loads it for fast re-fitting with different hyperparameters or target sets
- Target config YAML files (`target_config.yaml`) replace hardcoded target filtering; the checked-in config reproduces the junkyard's 22 excluded groups
- `--beta`, `--lambda-l2`, `--learning-rate` are now tunable from the command line and Modal runner
- `docs/calibration.md` covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
- District filenames use `XX-01` (conventional 1-based) instead of `XX-00`

Test plan
- `pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py` — CLI arg parsing tests
- `pytest policyengine_us_data/tests/test_calibration/test_target_config.py` — target config filtering + package round-trip tests
- `make calibrate-build` produces a package; `--package-path` loads it and fits

🤖 Generated with Claude Code