
Add calibration package checkpointing, target config, and hyperparameter CLI #538

Draft
baogorek wants to merge 57 commits into main from calibration-pipeline-improvements

Conversation

Collaborator

@baogorek baogorek commented Feb 17, 2026

Fixes #533
Fixes #534
Fixes #558
Fixes #559
Fixes #562

Summary

  • Calibration package checkpointing: --build-only saves the expensive matrix build as a pickle, --package-path loads it for fast re-fitting with different hyperparameters or target sets
  • Target config YAML: Declarative exclusion rules (target_config.yaml) replace hardcoded target filtering; checked-in config reproduces the junkyard's 22 excluded groups
  • Hyperparameter CLI flags: --beta, --lambda-l2, --learning-rate are now tunable from the command line and Modal runner
  • Modal runner improvements: Streaming subprocess output, support for new flags
  • Documentation: docs/calibration.md covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
  • At-large district naming fix: H5 filenames for at-large districts now use XX-01 (conventional 1-based) instead of XX-00
  • GCS staging fix: GCS uploads moved from staging phase to promotion phase, so both GCS and HuggingFace are updated together during promote

Note: This branch includes commits from #537 (PUF impute) since the calibration pipeline depends on that work. The calibration-specific changes are in the top commit.

Test plan

  • pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — CLI arg parsing tests
  • pytest policyengine_us_data/tests/test_calibration/test_target_config.py — target config filtering + package round-trip tests
  • Manual: make calibrate-build produces package, --package-path loads it and fits

🤖 Generated with Claude Code

Collaborator

@juaristi22 juaristi22 left a comment


Minor comments, but generally LGTM. I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!

Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?

@juaristi22 juaristi22 force-pushed the calibration-pipeline-improvements branch from 4c51b32 to 61523d8 on February 18, 2026 14:46
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch from 61523d8 to 6744481 on February 18, 2026 16:47
baogorek and others added 10 commits February 19, 2026 14:33
…ter CLI

- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Modal calibration runner was missing --lambda-l0 passthrough.
Also fix KeyError: Ellipsis when load_dataset() returns dicts
instead of h5py datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upload a pre-built calibration package to Modal and run only the
fitting phase, skipping HuggingFace download and matrix build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv,
  epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of creating a new Microsimulation per clone (~3 min each,
22 hours for 436 clones), precompute values for all 51 states on
one sim object (~3 min total), then assemble per-clone values via
numpy fancy indexing (~microseconds per clone).

New methods: _build_state_values, _assemble_clone_values,
_evaluate_constraints_from_values, _calculate_target_values_from_values.
DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds.
Takeup re-randomization deferred to future post-processing layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
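The precompute-then-index idea in the commit above can be sketched in a few lines. This is a toy illustration, not the actual `_assemble_clone_values` implementation: values for all states are computed once into one array, and each clone's values are then pure numpy fancy indexing:

```python
import numpy as np

def assemble_clone_values(state_values, clone_state_idx, record_idx):
    """Select a clone's values by fancy indexing into precomputed data.

    state_values : array of shape (n_states, n_records), computed once
    clone_state_idx : the clone's state index
    record_idx : the clone's record indices within that state
    """
    return state_values[clone_state_idx, record_idx]

# Toy data: 3 states x 4 records, precomputed once (~3 min in the real
# pipeline), then each of the 436 clones is assembled in microseconds.
state_values = np.arange(12.0).reshape(3, 4)
vals = assemble_clone_values(state_values, 2, np.array([0, 3]))
```

The speedup comes from replacing a per-clone Microsimulation build (~3 minutes each) with a single precomputation plus cheap indexing.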
- Modal runner: add --package-volume flag to read calibration package
  from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to prevent CUDA memory fragmentation during L0 backward pass
- docs/calibration.md: rewrite to lead with lightweight build-then-fit
  workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch from 59b27a8 to 0a0f167 Compare February 19, 2026 23:07
baogorek and others added 11 commits February 19, 2026 18:26
- target_config.yaml: exclude everything except person_count/age
  (~8,766 targets) to isolate fitting issues from zero-target and
  zero-row-sum problems in policy variables
- target_config_full.yaml: backup of the previous full config
- unified_calibration.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to fix CUDA memory fragmentation during backward pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_target_config: support 'include' rules (keep only matching
  targets) in addition to 'exclude' rules; geo_level now optional
- target_config.yaml: 3-line include config replaces 90-line exclusion
  list for age demographics (person_count with age domain, ~8,784 targets)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
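For illustration, a 3-line "include" rule of the kind described above might look like the following; the key names are assumptions for the sketch, not the actual target_config.yaml schema:

```yaml
# Hypothetical shape: keep only person_count targets with an age domain,
# replacing a long list of per-variable exclusion rules.
include:
  - variable: person_count
    domain: age
```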
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The roth_ira_contributions target has zero row sum (no CPS records),
making it impossible to calibrate. Remove it from target_config.yaml
so Modal runs don't waste epochs on an unachievable target.

Also adds `python -m policyengine_us_data.calibration.validate_package`
CLI tool for pre-upload package validation, with automatic validation
on --build-only runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Achievability analysis showed 9 district-level IRS dollar variables
have per-household values 5-27x too high in the extended CPS, making
them irreconcilable with count targets (needed_w ~0.04-0.2 vs ~26).
Drop salt, AGI, income_tax, dividend/interest vars, QBI deduction,
taxable IRA distributions, income_tax_positive, traditional IRA.

Add ACA PTC district targets (aca_ptc + tax_unit_count).

Save calibration package BEFORE target_config filtering so the full
matrix can be reused with different configs without rebuilding.

Also: population-based initial weights from age targets per CD,
cumulative epoch numbering in chunked logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PUF cloning already happens upstream in extended_cps.py, so the
--puf-dataset flag in the calibration pipeline was redundant (and
would have doubled the data a second time). Removed the flag,
_build_puf_cloned_dataset function, and all related params.

Added 4 compatible national targets: child_support_expense,
child_support_received, health_insurance_premiums_without_medicare_part_b,
and rent (all needed_w 27-37, compatible with count targets at ~26).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
juaristi22 and others added 3 commits February 26, 2026 14:52
- Add module-level picklable worker functions (_process_single_clone,
  _init_clone_worker) and standalone helpers for constraint evaluation
  and target-value calculation usable by worker processes
- Pre-extract variable_entity_map to avoid pickling TaxBenefitSystem
- Branch clone loop on workers param: parallel (workers>1) uses
  ProcessPoolExecutor with initializer pattern; sequential unchanged
- Add parallel state/county precomputation with per-state fresh sims
- Add tests for picklability, pool creation, parallel branching, and
  clone loop infrastructure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@juaristi22 juaristi22 marked this pull request as ready for review February 26, 2026 17:54
@juaristi22 juaristi22 marked this pull request as draft February 26, 2026 17:54
MaxGhenis and others added 25 commits February 27, 2026 11:04
* Migrate from changelog_entry.yaml to towncrier fragments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Format bump_version.py with black

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Replace old changelog workflows with towncrier fragment check

- Replace pr_changelog.yaml fork-check + reusable changelog check with
  simple towncrier fragment existence check
- Delete reusable_changelog_check.yaml (no longer needed)
- Delete check-changelog-entry.sh (checked for old changelog_entry.yaml)
- Update versioning.yaml to use towncrier build instead of yaml-changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Runs all ETL scripts (create_database_tables, create_initial_strata,
etl_national_targets, etl_age, etl_medicaid, etl_snap,
etl_state_income_tax, etl_irs_soi, validate_database) in sequence
and validates the resulting SQLite database for:
- Expected tables (strata, stratum_constraints, targets)
- National targets include key variables (snap, social_security, ssi)
- State income tax targets cover 42+ states with CA > $100B
- Congressional district strata for 435+ districts
- All target variables exist in policyengine-us
- Total target count > 1000

This prevents API mismatches and import errors from going undetected
when ETL scripts are modified.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion/

Move all calibration code from datasets/cps/local_area_calibration/ to
calibration/, update imports across the codebase, add validate_staging
module, and improve unified calibration with target config support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After adding main_promote as a second entrypoint, Modal can no longer
infer which function to run without an explicit specifier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build functions (build_state_h5, etc.) print banners to stdout, which
gets captured by the subprocess and mixed with the JSON output. This
caused json.loads() to fail with "Failed to parse output" for all 8
workers, returning empty completed/failed lists. The pipeline then
silently continued past the error check (total_failed == 0) and
uploaded stale files.

Fix: redirect stdout to stderr during worker processing, restore for
JSON output. Also fail the build when errors exist but nothing completed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
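The fix described above is the classic keep-stdout-clean pattern for subprocess protocols: anything chatty goes to stderr, so the parent can parse stdout as pure JSON. A minimal sketch under that assumption (function names are illustrative):

```python
import contextlib
import json
import sys

def run_worker(build_fn):
    """Run a chatty build function while keeping stdout JSON-clean.

    Banners printed by build_fn are redirected to stderr, so the parent
    process can call json.loads() on the worker's stdout safely.
    """
    with contextlib.redirect_stdout(sys.stderr):
        result = build_fn()
    return json.dumps(result)

def noisy_build():
    print("=== building DE-01 ===")  # would have corrupted the JSON
    return {"completed": ["DE-01"], "failed": []}

out = run_worker(noisy_build)
```

Before the fix, the banner and the JSON shared stdout, json.loads() failed for every worker, and the empty result lists slipped past the `total_failed == 0` check.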
Instead of trusting worker JSON results alone (which broke when stdout
was polluted), now reload the volume after builds and count actual h5
files. The build fails if the volume has fewer files than expected,
regardless of what workers reported. This makes the checkpoint system
the source of truth for build completeness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At-large districts (AK, DE, ND, SD, VT, WY) have GEOID ending in 00
(e.g., DE=1000) but display as XX-01 via max(cd%100, 1). The worker
naively converted DE-01 back to 1001 which didn't exist in the DB.

Now tries the direct conversion first, then falls back to finding the
sole CD for that state's FIPS prefix (at-large case).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
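The try-direct-then-fall-back logic from the commit above can be sketched as follows. This is a toy reconstruction, not the worker's actual code; the GEOID arithmetic (state FIPS × 100 + district) follows the convention described in the commit:

```python
def label_to_geoid(state_fips, district_num, known_geoids):
    """Convert a display district number back to a database GEOID.

    Tries the direct conversion first. For at-large states, display
    district 01 corresponds to a GEOID ending in 00 (e.g. DE=1000), so
    fall back to the sole GEOID sharing the state's FIPS prefix.
    """
    direct = state_fips * 100 + district_num
    if direct in known_geoids:
        return direct
    candidates = [g for g in known_geoids if g // 100 == state_fips]
    if len(candidates) == 1:  # at-large: exactly one CD for the state
        return candidates[0]
    raise KeyError(f"No GEOID for FIPS {state_fips}, district {district_num}")

# Toy GEOID set: DE at-large (1000) plus CA districts 01 and 02.
KNOWN_GEOIDS = {1000, 601, 602}
```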
The Modal volume was caching old calibration inputs from previous runs.
The code only checked file existence, not freshness, so new model fits
on HuggingFace were never pulled. Also clear the version build directory
to prevent stale h5 files (built from old weights) from being treated
as completed by the volume checkpoint system.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DC (GEOID 1198, district 98) and at-large states (GEOID XX00, district
00) should all display as XX-01. Previously max(d, 1) only handled 00,
producing DC-98.h5 instead of DC-01.h5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
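The display-naming rule above (both the at-large 00 and DC's 98 map to 01) can be captured in one small function. A sketch consistent with the GEOIDs quoted in the commit, with an illustrative function name:

```python
def district_filename(state_abbr, geoid):
    """Display filename for a congressional district H5 file.

    At-large states store district 00 in the GEOID and DC stores 98;
    both conventionally display as district 01. Previously only 00 was
    handled (via max(d, 1)), which produced DC-98.h5.
    """
    d = geoid % 100
    if d in (0, 98):
        d = 1
    return f"{state_abbr}-{d:02d}.h5"
```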
Workers now always re-draw takeup using block-level seeded draws,
matching the calibration matrix builder's computation. This fixes
H5 files producing aca_ptc values 6-40x off from calibration targets.

Pipeline changes:
- publish_local_area: thread rerandomize_takeup/blocks/filter params
- worker_script: always rerandomize, optionally use calibration blocks
- local_area: pass blocks path to workers when available
- huggingface: optionally download stacked_blocks.npy
- unified_calibration: print BLOCKS_PATH for Modal collection
- remote_calibration_runner: collect, save, and upload blocks to HF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
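Block-level seeded draws, as described above, make the re-randomization reproducible: the same block always yields the same draw, so workers and the matrix builder agree without sharing state. A toy sketch (the actual seeding scheme in the pipeline is not specified here and this one is purely illustrative):

```python
import numpy as np

def redraw_takeup(block_ids, base_seed=0):
    """Deterministic per-block takeup draws.

    Seeding an RNG per block means any process recomputes identical
    draws, keeping H5 outputs consistent with calibration targets.
    """
    draws = np.empty(len(block_ids))
    for i, block in enumerate(block_ids):
        rng = np.random.default_rng(base_seed + int(block))
        draws[i] = rng.random()
    return draws
```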
- Rename w_district_calibration.npy and unified_weights.npy to
  calibration_weights.npy everywhere (HF paths, local defaults, docs)
- Add upload_calibration_artifacts() to huggingface.py for atomic
  multi-file HF uploads (weights + blocks + logs in one commit)
- Add --upload flag (replaces --upload-logs) and --trigger-publish flag
  to remote_calibration_runner.py
- Add _trigger_repository_dispatch() for GitHub workflow auto-trigger
- Remove dead _upload_logs_to_hf() and _upload_calibration_artifact()
- Add scripts/upload_calibration.py CLI + make upload-calibration target
- Update modal_app/README.md with new flags and artifact table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Chains make data, upload-dataset (API direct to HF), calibrate-modal
(GPU fit + upload weights), and stage-h5s (build + stage H5s).
Configurable via GPU, EPOCHS, BRANCH, NUM_WORKERS variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>