
Add calibration package checkpointing, target config, and hyperparameter CLI #538

Draft
baogorek wants to merge 57 commits into main from calibration-pipeline-improvements

Conversation

Collaborator

@baogorek baogorek commented Feb 17, 2026

Fixes #533
Fixes #534
Fixes #558
Fixes #559
Fixes #562

Summary

  • Calibration package checkpointing: --build-only saves the expensive matrix build as a pickle, --package-path loads it for fast re-fitting with different hyperparameters or target sets
  • Target config YAML: Declarative exclusion rules (target_config.yaml) replace hardcoded target filtering; checked-in config reproduces the junkyard's 22 excluded groups
  • Hyperparameter CLI flags: --beta, --lambda-l2, --learning-rate are now tunable from the command line and Modal runner
  • Modal runner improvements: Streaming subprocess output, support for new flags
  • Documentation: docs/calibration.md covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
  • At-large district naming fix: H5 filenames for at-large districts now use XX-01 (conventional 1-based) instead of XX-00
  • GCS staging fix: GCS uploads moved from staging phase to promotion phase, so both GCS and HuggingFace are updated together during promote

Note: This branch includes commits from #537 (PUF impute) since the calibration pipeline depends on that work. The calibration-specific changes are in the top commit.

Test plan

  • pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — CLI arg parsing tests
  • pytest policyengine_us_data/tests/test_calibration/test_target_config.py — target config filtering + package round-trip tests
  • Manual: make calibrate-build produces package, --package-path loads it and fits

🤖 Generated with Claude Code

Collaborator

@juaristi22 juaristi22 left a comment


Minor comments, but generally LGTM. I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!

Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?

@juaristi22 juaristi22 force-pushed the calibration-pipeline-improvements branch from 4c51b32 to 61523d8 on February 18, 2026 14:46
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch from 61523d8 to 6744481 on February 18, 2026 16:47
baogorek and others added 10 commits February 19, 2026 14:33
…ter CLI

- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Modal calibration runner was missing --lambda-l0 passthrough.
Also fix KeyError: Ellipsis when load_dataset() returns dicts
instead of h5py datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upload a pre-built calibration package to Modal and run only the
fitting phase, skipping HuggingFace download and matrix build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv,
  epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of creating a new Microsimulation per clone (~3 min each,
22 hours for 436 clones), precompute values for all 51 states on
one sim object (~3 min total), then assemble per-clone values via
numpy fancy indexing (~microseconds per clone).

New methods: _build_state_values, _assemble_clone_values,
_evaluate_constraints_from_values, _calculate_target_values_from_values.
DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds.
Takeup re-randomization deferred to future post-processing layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
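The precompute-then-index idea in the commit above can be sketched in a few lines. This is a toy illustration, not the actual `_assemble_clone_values` implementation: values for all states are computed once into one array, and each clone's values are then pure numpy fancy indexing:

```python
import numpy as np

def assemble_clone_values(state_values, clone_state_idx, record_idx):
    """Select a clone's values by fancy indexing into precomputed data.

    state_values : array of shape (n_states, n_records), computed once
    clone_state_idx : the clone's state index
    record_idx : the clone's record indices within that state
    """
    return state_values[clone_state_idx, record_idx]

# Toy data: 3 states x 4 records, precomputed once (~3 min in the real
# pipeline), then each of the 436 clones is assembled in microseconds.
state_values = np.arange(12.0).reshape(3, 4)
vals = assemble_clone_values(state_values, 2, np.array([0, 3]))
```

The speedup comes from replacing a per-clone Microsimulation build (~3 minutes each) with a single precomputation plus cheap indexing.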
- Modal runner: add --package-volume flag to read calibration package
  from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to prevent CUDA memory fragmentation during L0 backward pass
- docs/calibration.md: rewrite to lead with lightweight build-then-fit
  workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch from 59b27a8 to 0a0f167 Compare February 19, 2026 23:07
baogorek and others added 11 commits February 19, 2026 18:26
- target_config.yaml: exclude everything except person_count/age
  (~8,766 targets) to isolate fitting issues from zero-target and
  zero-row-sum problems in policy variables
- target_config_full.yaml: backup of the previous full config
- unified_calibration.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to fix CUDA memory fragmentation during backward pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_target_config: support 'include' rules (keep only matching
  targets) in addition to 'exclude' rules; geo_level now optional
- target_config.yaml: 3-line include config replaces 90-line exclusion
  list for age demographics (person_count with age domain, ~8,784 targets)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
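For illustration, a 3-line "include" rule of the kind described above might look like the following; the key names are assumptions for the sketch, not the actual target_config.yaml schema:

```yaml
# Hypothetical shape: keep only person_count targets with an age domain,
# replacing a long list of per-variable exclusion rules.
include:
  - variable: person_count
    domain: age
```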
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The roth_ira_contributions target has zero row sum (no CPS records),
making it impossible to calibrate. Remove it from target_config.yaml
so Modal runs don't waste epochs on an unachievable target.

Also adds `python -m policyengine_us_data.calibration.validate_package`
CLI tool for pre-upload package validation, with automatic validation
on --build-only runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Achievability analysis showed 9 district-level IRS dollar variables
have per-household values 5-27x too high in the extended CPS, making
them irreconcilable with count targets (needed_w ~0.04-0.2 vs ~26).
Drop salt, AGI, income_tax, dividend/interest vars, QBI deduction,
taxable IRA distributions, income_tax_positive, traditional IRA.

Add ACA PTC district targets (aca_ptc + tax_unit_count).

Save calibration package BEFORE target_config filtering so the full
matrix can be reused with different configs without rebuilding.

Also: population-based initial weights from age targets per CD,
cumulative epoch numbering in chunked logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PUF cloning already happens upstream in extended_cps.py, so the
--puf-dataset flag in the calibration pipeline was redundant (and
would have doubled the data a second time). Removed the flag,
_build_puf_cloned_dataset function, and all related params.

Added 4 compatible national targets: child_support_expense,
child_support_received, health_insurance_premiums_without_medicare_part_b,
and rent (all needed_w 27-37, compatible with count targets at ~26).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
juaristi22 and others added 3 commits February 26, 2026 14:52
- Add module-level picklable worker functions (_process_single_clone,
  _init_clone_worker) and standalone helpers for constraint evaluation
  and target-value calculation usable by worker processes
- Pre-extract variable_entity_map to avoid pickling TaxBenefitSystem
- Branch clone loop on workers param: parallel (workers>1) uses
  ProcessPoolExecutor with initializer pattern; sequential unchanged
- Add parallel state/county precomputation with per-state fresh sims
- Add tests for picklability, pool creation, parallel branching, and
  clone loop infrastructure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@juaristi22 juaristi22 marked this pull request as ready for review February 26, 2026 17:54
@juaristi22 juaristi22 marked this pull request as draft February 26, 2026 17:54
MaxGhenis and others added 25 commits February 27, 2026 11:04
* Migrate from changelog_entry.yaml to towncrier fragments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Format bump_version.py with black

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Replace old changelog workflows with towncrier fragment check

- Replace pr_changelog.yaml fork-check + reusable changelog check with
  simple towncrier fragment existence check
- Delete reusable_changelog_check.yaml (no longer needed)
- Delete check-changelog-entry.sh (checked for old changelog_entry.yaml)
- Update versioning.yaml to use towncrier build instead of yaml-changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Runs all ETL scripts (create_database_tables, create_initial_strata,
etl_national_targets, etl_age, etl_medicaid, etl_snap,
etl_state_income_tax, etl_irs_soi, validate_database) in sequence
and validates the resulting SQLite database for:
- Expected tables (strata, stratum_constraints, targets)
- National targets include key variables (snap, social_security, ssi)
- State income tax targets cover 42+ states with CA > $100B
- Congressional district strata for 435+ districts
- All target variables exist in policyengine-us
- Total target count > 1000

This prevents API mismatches and import errors from going undetected
when ETL scripts are modified.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion/

Move all calibration code from datasets/cps/local_area_calibration/ to
calibration/, update imports across the codebase, add validate_staging
module, and improve unified calibration with target config support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After adding main_promote as a second entrypoint, Modal can no longer
infer which function to run without an explicit specifier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build functions (build_state_h5, etc.) print banners to stdout, which
gets captured by the subprocess and mixed with the JSON output. This
caused json.loads() to fail with "Failed to parse output" for all 8
workers, returning empty completed/failed lists. The pipeline then
silently continued past the error check (total_failed == 0) and
uploaded stale files.

Fix: redirect stdout to stderr during worker processing, restore for
JSON output. Also fail the build when errors exist but nothing completed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
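The fix described above is the classic keep-stdout-clean pattern for subprocess protocols: anything chatty goes to stderr, so the parent can parse stdout as pure JSON. A minimal sketch under that assumption (function names are illustrative):

```python
import contextlib
import json
import sys

def run_worker(build_fn):
    """Run a chatty build function while keeping stdout JSON-clean.

    Banners printed by build_fn are redirected to stderr, so the parent
    process can call json.loads() on the worker's stdout safely.
    """
    with contextlib.redirect_stdout(sys.stderr):
        result = build_fn()
    return json.dumps(result)

def noisy_build():
    print("=== building DE-01 ===")  # would have corrupted the JSON
    return {"completed": ["DE-01"], "failed": []}

out = run_worker(noisy_build)
```

Before the fix, the banner and the JSON shared stdout, json.loads() failed for every worker, and the empty result lists slipped past the `total_failed == 0` check.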
Instead of trusting worker JSON results alone (which broke when stdout
was polluted), now reload the volume after builds and count actual h5
files. The build fails if the volume has fewer files than expected,
regardless of what workers reported. This makes the checkpoint system
the source of truth for build completeness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At-large districts (AK, DE, ND, SD, VT, WY) have GEOID ending in 00
(e.g., DE=1000) but display as XX-01 via max(cd%100, 1). The worker
naively converted DE-01 back to 1001 which didn't exist in the DB.

Now tries the direct conversion first, then falls back to finding the
sole CD for that state's FIPS prefix (at-large case).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
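The try-direct-then-fall-back logic from the commit above can be sketched as follows. This is a toy reconstruction, not the worker's actual code; the GEOID arithmetic (state FIPS × 100 + district) follows the convention described in the commit:

```python
def label_to_geoid(state_fips, district_num, known_geoids):
    """Convert a display district number back to a database GEOID.

    Tries the direct conversion first. For at-large states, display
    district 01 corresponds to a GEOID ending in 00 (e.g. DE=1000), so
    fall back to the sole GEOID sharing the state's FIPS prefix.
    """
    direct = state_fips * 100 + district_num
    if direct in known_geoids:
        return direct
    candidates = [g for g in known_geoids if g // 100 == state_fips]
    if len(candidates) == 1:  # at-large: exactly one CD for the state
        return candidates[0]
    raise KeyError(f"No GEOID for FIPS {state_fips}, district {district_num}")

# Toy GEOID set: DE at-large (1000) plus CA districts 01 and 02.
KNOWN_GEOIDS = {1000, 601, 602}
```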
The Modal volume was caching old calibration inputs from previous runs.
The code only checked file existence, not freshness, so new model fits
on HuggingFace were never pulled. Also clear the version build directory
to prevent stale h5 files (built from old weights) from being treated
as completed by the volume checkpoint system.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DC (GEOID 1198, district 98) and at-large states (GEOID XX00, district
00) should all display as XX-01. Previously max(d, 1) only handled 00,
producing DC-98.h5 instead of DC-01.h5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
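The display-naming rule above (both the at-large 00 and DC's 98 map to 01) can be captured in one small function. A sketch consistent with the GEOIDs quoted in the commit, with an illustrative function name:

```python
def district_filename(state_abbr, geoid):
    """Display filename for a congressional district H5 file.

    At-large states store district 00 in the GEOID and DC stores 98;
    both conventionally display as district 01. Previously only 00 was
    handled (via max(d, 1)), which produced DC-98.h5.
    """
    d = geoid % 100
    if d in (0, 98):
        d = 1
    return f"{state_abbr}-{d:02d}.h5"
```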
Workers now always re-draw takeup using block-level seeded draws,
matching the calibration matrix builder's computation. This fixes
H5 files producing aca_ptc values 6-40x off from calibration targets.

Pipeline changes:
- publish_local_area: thread rerandomize_takeup/blocks/filter params
- worker_script: always rerandomize, optionally use calibration blocks
- local_area: pass blocks path to workers when available
- huggingface: optionally download stacked_blocks.npy
- unified_calibration: print BLOCKS_PATH for Modal collection
- remote_calibration_runner: collect, save, and upload blocks to HF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
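Block-level seeded draws, as described above, make the re-randomization reproducible: the same block always yields the same draw, so workers and the matrix builder agree without sharing state. A toy sketch (the actual seeding scheme in the pipeline is not specified here and this one is purely illustrative):

```python
import numpy as np

def redraw_takeup(block_ids, base_seed=0):
    """Deterministic per-block takeup draws.

    Seeding an RNG per block means any process recomputes identical
    draws, keeping H5 outputs consistent with calibration targets.
    """
    draws = np.empty(len(block_ids))
    for i, block in enumerate(block_ids):
        rng = np.random.default_rng(base_seed + int(block))
        draws[i] = rng.random()
    return draws
```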
- Rename w_district_calibration.npy and unified_weights.npy to
  calibration_weights.npy everywhere (HF paths, local defaults, docs)
- Add upload_calibration_artifacts() to huggingface.py for atomic
  multi-file HF uploads (weights + blocks + logs in one commit)
- Add --upload flag (replaces --upload-logs) and --trigger-publish flag
  to remote_calibration_runner.py
- Add _trigger_repository_dispatch() for GitHub workflow auto-trigger
- Remove dead _upload_logs_to_hf() and _upload_calibration_artifact()
- Add scripts/upload_calibration.py CLI + make upload-calibration target
- Update modal_app/README.md with new flags and artifact table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Chains make data, upload-dataset (API direct to HF), calibrate-modal
(GPU fit + upload weights), and stage-h5s (build + stage H5s).
Configurable via GPU, EPOCHS, BRANCH, NUM_WORKERS variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>