Problem
The calibration pipeline (unified_calibration.py) and the stacked dataset builder (stacked_dataset_builder.py) must use the same takeup draws for their outputs to be consistent. Currently this works only because takeup re-randomization is disabled — both read the original draws from the base .h5. But if re-randomization is enabled in calibration (via rerandomize_takeup() or #532), the two will diverge:
- Calibration: generates new takeup draws per block/clone using seeded_rng(var_name, salt=block_geoid), builds the X matrix with those draws, and optimizes weights
- Stacked builder: loads the original base .h5, copies the original draws (not the re-randomized ones), and saves them to the output .h5
The resulting weights would be optimized for takeup patterns that don't exist in the final dataset.
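To make the divergence concrete, here is a minimal NumPy-only sketch. The two generators, the record count, and the takeup rate are made up for illustration and do not come from either module; they only stand in for "draws stored in the base .h5" and "draws regenerated during calibration".

```python
import numpy as np

n_records = 1_000
takeup_rate = 0.6  # illustrative rate, not taken from the pipeline

# Stand-in for the draws stored in the base .h5
# (what the stacked builder copies today).
base_draws = np.random.default_rng(0).random(n_records)

# Stand-in for fresh draws generated during calibration
# when re-randomization is switched on.
calib_draws = np.random.default_rng(1).random(n_records)

# A record takes up the program when its draw falls below the rate.
base_takeup = base_draws < takeup_rate
calib_takeup = calib_draws < takeup_rate

# Share of records whose takeup flag differs between the two paths.
mismatch_share = float(np.mean(base_takeup != calib_takeup))
print(f"records with different takeup: {mismatch_share:.1%}")
```

For independent draws at rate p, the expected share of mismatched records is 2p(1 - p), which is 48% at p = 0.6. Weights fitted against one set of flags are then applied to records that carry the other set.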
What needs to happen
When the stacked dataset builder assembles each CD (around line 389, after geography is set and before to_input_dataframe()), it must call the same rerandomize_takeup() logic with the same seeds used during calibration. This means:
- After setting state_fips, county, block_geoid etc. on cd_sim
- Call rerandomize_takeup(cd_sim, block_geoids, time_period) (or equivalent)
- The seeded RNG contract (seeded_rng(var_name, salt=block_geoid)) guarantees the same draws as calibration
For the category-dependent variables (#532: EITC, WIC), the stacked builder would also need to run the simulation to determine categories before re-drawing.
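The seeded RNG contract can be sketched as follows. This is not the repository's code: the bodies of seeded_rng and rerandomize_takeup are not shown in this issue, so the seeded_rng below is a hypothetical stand-in with the same call shape, and redraw_takeup, the variable name takes_up_snap, the block GEOIDs, and the rate are all assumptions made for illustration.

```python
import hashlib

import numpy as np


def seeded_rng(var_name, salt):
    """Hypothetical stand-in: a generator fully determined by (var_name, salt)."""
    # hashlib is used instead of hash() because str hashes vary between processes.
    digest = hashlib.sha256(f"{var_name}:{salt}".encode("utf-8")).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "little"))


def redraw_takeup(var_name, block_geoids, rate):
    """Redraw one takeup flag per record, seeding by the record's block (illustrative)."""
    block_geoids = np.asarray(block_geoids)
    takeup = np.zeros(len(block_geoids), dtype=bool)
    for block in np.unique(block_geoids):
        mask = block_geoids == block
        # One generator per (variable, block); records in a block consume
        # draws in their existing order.
        draws = seeded_rng(var_name, salt=str(block)).random(int(mask.sum()))
        takeup[mask] = draws < rate
    return takeup


# The same inputs on the calibration side and the builder side give the same flags.
blocks = ["010010201001", "010010201001", "010010202002", "010010201001"]
calibration_side = redraw_takeup("takes_up_snap", blocks, rate=0.6)
builder_side = redraw_takeup("takes_up_snap", blocks, rate=0.6)
assert np.array_equal(calibration_side, builder_side)
```

Under this scheme the draws match only if both sides present the same records in the same order within each block, so the builder would need to preserve whatever record ordering calibration used.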
Current state
- --skip-takeup-rerandomize is effectively a no-op (hardcoded skip at lines 1058-1062)
- The stacked builder has no re-randomization code at all
- This is self-consistent today but blocks enabling takeup re-randomization
Acceptance criteria
- [...] .h5, verify takeup draws match what the calibration matrix used
Ref: #532, #531