Summary
State-level datasets built from the Feb 20 calibration inputs (stratified_extended_cps.h5 + w_district_calibration.npy) produce ~19x inflated income aggregates. These datasets are live on GCS production (gs://policyengine-us-data/states/*.h5) and are being served by the production API.
Example: Louisiana baseline household_net_income.sum() = $3,147B (production/GCS) vs $166B (HuggingFace v1.62.0). National weighted employment income = $59T (should be ~$11T).
Root Cause
PR #537 correctly removed the ~$6.26M AGI ceiling from PUF imputation (fixing #530), but the calibration weights were not re-tuned to account for the much wider income range. The result is that a handful of ultra-high-income PUF records get calibration weights that massively inflate national totals.
Evidence from the new calibration inputs
Base dataset (stratified_extended_cps.h5):
- employment_income_before_lsr max: $2.8M (old) → $132.6M (new)
- long_term_capital_gains max: $2.1M (old) → $164.3M (new)
- 30+ financial variables inflated 5x–4,270x in the upper tail (full list below)
Calibration weights (w_district_calibration.npy):
- Overall weight sums are similar (ratio 1.03x) — the problem isn't the total weight mass
- But extreme-income records get substantial weights across many CDs:
| Income | Total calibrated weight | CDs with nonzero weight | Weighted contribution |
| --- | --- | --- | --- |
| $132.6M | 546 | 114 | $72.4B |
| $106.5M | 3,601 | 194 | $383.4B |
| $87.2M | 2,292 | 176 | $199.9B |
Top 20 earners alone contribute $901B of national weighted employment income. Total national weighted employment income = $59T vs the correct ~$11T.
For comparison, in the old (capped) data, the highest earner had an income of $2.78M with weight 15,725, contributing $43.8B: still large, but roughly 9x smaller than the largest new contribution ($383.4B).
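The "weighted contribution" figures above are income multiplied by the record's total calibrated weight, summed across CDs. A minimal sketch of that arithmetic follows. The per-CD weights are made up for illustration, and nothing here assumes the actual layout of w_district_calibration.npy:

```python
def summarize_record(income, cd_weights):
    """Summarize one record's footprint across congressional districts.

    cd_weights holds the record's calibrated weight in each CD
    (a hypothetical layout, used only for this illustration).
    """
    total_weight = sum(cd_weights)
    nonzero_cds = sum(1 for w in cd_weights if w > 0)
    return {
        "total_weight": total_weight,
        "nonzero_cds": nonzero_cds,
        # Dollars this single record adds to the national weighted total.
        "contribution": income * total_weight,
    }


# Made-up weights for a single $132.6M record spread over five CDs.
summary = summarize_record(132.6e6, [0.0, 200.0, 0.0, 300.0, 46.0])
print(summary["nonzero_cds"], f"${summary['contribution'] / 1e9:.1f}B")
```

With a total weight of 546, this reproduces the first row of the table up to rounding.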
Variables with >5x inflation in new base dataset
| Variable | Old abs sum | New abs sum | Ratio |
| --- | --- | --- | --- |
| estate_income | $2.6M | $7,514M | 2,862x |
| general_business_credit | $0.1M | $373M | 4,270x |
| foreign_tax_credit | $1.9M | $2,941M | 1,580x |
| unadjusted_basis_qualified_property | $54M | $23,871M | 440x |
| unrecaptured_section_1250_gain | $5.4M | $2,381M | 444x |
| long_term_capital_gains | $477M | $128,251M | 269x |
| amt_foreign_tax_credit | $1.2M | $302M | 247x |
| miscellaneous_income | $2.4M | $560M | 231x |
| salt_refund_income | $8.1M | $1,349M | 166x |
| charitable_non_cash_donations | $12M | $2,221M | 185x |
| charitable_cash_donations | $51M | $5,680M | 113x |
| partnership_s_corp_income | $135M | $14,260M | 106x |
| qualified_dividend_income | $50M | $5,150M | 104x |
| domestic_production_ald | $10M | $571M | 55x |
| non_qualified_dividend_income | $68M | $2,506M | 37x |
| rental_income | $32M | $1,044M | 32x |
| employment_income_before_lsr | $1,408M | $19,375M | 14x |
CPS-native variables (age, household_weight, disability, rent, etc.) are all unchanged (ratio ~1.0).
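A rough sketch of how a comparison like the table above could be produced, assuming "abs sum" means the unweighted sum of absolute values per variable. The function names and the toy arrays are illustrative only; real inputs would come from the two stratified_extended_cps.h5 revisions:

```python
def abs_sum(values):
    """Unweighted sum of absolute values (assumed meaning of 'abs sum')."""
    return sum(abs(v) for v in values)


def inflation_ratios(old, new, threshold=5.0):
    """Return {variable: new/old abs-sum ratio} for variables above threshold."""
    flagged = {}
    for name in old.keys() & new.keys():
        old_total = abs_sum(old[name])
        if old_total == 0:
            continue  # ratio is undefined when the old total is zero
        ratio = abs_sum(new[name]) / old_total
        if ratio > threshold:
            flagged[name] = ratio
    # Largest inflation first.
    return dict(sorted(flagged.items(), key=lambda kv: kv[1], reverse=True))


# Toy data: a CPS-native variable that is unchanged and a PUF-imputed one that is not.
old = {"age": [30, 45, 60], "estate_income": [0, 10, -5]}
new = {"age": [30, 45, 60], "estate_income": [0, 10_000, -5_000]}
flagged = inflation_ratios(old, new)
print(flagged)
```

Only estate_income is flagged in the toy data; age has a ratio of 1.0 and falls below the threshold.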
Production Impact
The upload_to_staging() function in modal_app/local_area.py uploads files directly to GCS production paths before staging on HuggingFace. The v1.69.3 state files went to GCS on ~Feb 20 but the Promote workflow was never run, so:
- GCS (production): v1.69.3 state files with inflated incomes ← live, broken
- HuggingFace production: v1.62.0 state files (correct)
- HuggingFace staging/: v1.69.3 files (7.43 GB, unpromoted)
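One way to keep a staging upload from touching production paths is a guard on the destination URI. This is only a sketch: the production prefix is the one named in this issue, while the staging prefix and both function names are hypothetical and not taken from modal_app/local_area.py:

```python
PRODUCTION_PREFIX = "gs://policyengine-us-data/states/"  # path named in this issue
STAGING_PREFIX = "gs://policyengine-us-data/staging/states/"  # hypothetical layout


def assert_not_production(uri):
    """Raise if a staging step is about to write to a production path."""
    if uri.startswith(PRODUCTION_PREFIX):
        raise ValueError(f"refusing to write to production path from staging: {uri}")


def staging_destination(filename):
    """Build the staging URI for a state file and verify it is not production."""
    uri = STAGING_PREFIX + filename
    assert_not_production(uri)
    return uri


print(staging_destination("LA.h5"))
```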
The production API (policyengine-api) calls get_default_dataset() which returns gs://policyengine-us-data/states/{STATE}.h5 with data_version=None, so it always gets the latest GCS blob — the broken v1.69.3 data.
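For illustration, a pinned lookup could refuse an unset version instead of falling back to the latest blob. The function name and the versioned path layout below are hypothetical; this is not how get_default_dataset() is implemented:

```python
PRODUCTION_BUCKET = "gs://policyengine-us-data"  # bucket named in this issue


def resolve_state_dataset(state, data_version):
    """Return a versioned URI for a state dataset, refusing unpinned requests.

    The 'states/v<version>/' layout is a hypothetical convention.
    """
    if not data_version:
        raise ValueError("data_version must be pinned; refusing to serve 'latest'")
    return f"{PRODUCTION_BUCKET}/states/v{data_version}/{state}.h5"


print(resolve_state_dataset("LA", "1.62.0"))
```

With a rule like this, uploading new files cannot change what the API serves until the pinned version is bumped deliberately.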
Suggested Fixes
- Immediate: Roll back GCS state files to the v1.62.0 data to restore correct production behavior
- Calibration: Add constraints to the L0 optimizer to prevent extreme-income records from getting weights that inflate national totals beyond known aggregates (e.g., cap per-record weighted income contribution, or add an income-total constraint)
- Pipeline: upload_to_staging() should not write to GCS production paths directly, since this defeats the staging/promote safety pattern
- Versioning: Add dataset version pinning in the API so state datasets can't be silently updated
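To make the per-record cap from the Calibration item concrete, here is a simplified post-hoc version. It clips weights after the fact rather than adding a constraint inside the L0 optimizer, and it does not redistribute the removed weight mass, so treat it as an illustration of the idea rather than a proposed implementation. The function name and the cap value are assumptions:

```python
def cap_weighted_contribution(incomes, weights, max_contribution):
    """Shrink weights so that no record contributes more than max_contribution."""
    capped = []
    for income, weight in zip(incomes, weights):
        if income > 0 and income * weight > max_contribution:
            # Largest weight that keeps income * weight at the cap.
            weight = max_contribution / income
        capped.append(weight)
    return capped


incomes = [50_000.0, 106.5e6]
weights = [1_000.0, 3_601.0]
# Illustrative cap of $10B per record.
capped = cap_weighted_contribution(incomes, weights, max_contribution=10e9)
print(capped)
```

The ordinary record keeps its weight, while the $106.5M record's weight falls from 3,601 to about 94.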
Reproduction
import h5py
import numpy as np
from huggingface_hub import hf_hub_download

REPO = "policyengine/policyengine-us-data"
OLD_REVISION = "1c91d3b"  # revision holding the old (capped) calibration inputs

# Load new calibration weights from HuggingFace
w = np.load(hf_hub_download(
    REPO, "calibration/w_district_calibration.npy", repo_type="model"))
# Load old weights for comparison
w_old = np.load(hf_hub_download(
    REPO, "calibration/w_district_calibration.npy", repo_type="model",
    revision=OLD_REVISION))

# Load the new and old base datasets
ds_new = h5py.File(hf_hub_download(
    REPO, "calibration/stratified_extended_cps.h5", repo_type="model"), "r")
ds_old = h5py.File(hf_hub_download(
    REPO, "calibration/stratified_extended_cps.h5", repo_type="model",
    revision=OLD_REVISION), "r")

# Compare the maximum employment income in each dataset
emp_new = ds_new["employment_income_before_lsr"]["2024"][:].astype(float)
emp_old = ds_old["employment_income_before_lsr"]["2024"][:].astype(float)
print(f"Old max income: ${emp_old.max():,.0f}")  # $2,783,732
print(f"New max income: ${emp_new.max():,.0f}")  # $132,596,760
Related