Lagged prediction pipeline: predict wave 8 depression (CES-D–based binary outcome) from wave 7 predictors. Wave choice (6–7–8) and variable list should match your group’s audit and UKDA documentation.
Data governance: Do not commit ELSA microdata to GitHub. This repository contains code only. Use data under your UK Data Service licence.
- Python 3.10+ (Google Colab is fine)
- Install:

      pip install -r requirements.txt

From this folder (elsa_ml_project):

    python main.py --demo

This writes outputs/metrics.csv, outputs/metrics.json, and ROC PNGs. Use this to verify Colab/local setup before pointing at real data.
- Obtain core wave .dta files via UK Data Service (authorised user only).
- Point to the directory that contains files named like wave_7_elsa_data*.dta (often .../UKDA-5050-stata/stata/stata13_se).

Linux / macOS / Colab:

    export ELSA_STATA_DIR="/path/to/stata13_se"
    python main.py

Windows (PowerShell):

    $env:ELSA_STATA_DIR = "C:\path\to\stata13_se"
    python main.py

Google Colab: mount Drive, upload or clone this repo, then:

    import os
    os.environ["ELSA_STATA_DIR"] = "/content/drive/MyDrive/ELSA/data/UKDA-5050-stata/stata/stata13_se"
    !cd /content/path/to/elsa_ml_project && pip install -r requirements.txt && python main.py

If ELSA_STATA_DIR is unset and you are not on Colab, the code looks for ./data/stata13_se under this project (optional local layout).
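
The lookup order described above can be sketched roughly as follows. This is an illustration only, not the code in data_load.py: the helper names `resolve_stata_dir` and `find_wave_files` are hypothetical, and the Colab-specific branch is left out.

```python
import os
from pathlib import Path


def resolve_stata_dir(project_root: Path) -> Path:
    """Return the directory expected to hold the ELSA .dta files.

    Hypothetical helper: prefers the ELSA_STATA_DIR environment variable,
    otherwise falls back to <project_root>/data/stata13_se.
    """
    env_value = os.environ.get("ELSA_STATA_DIR", "").strip()
    if env_value:
        return Path(env_value).expanduser()
    return project_root / "data" / "stata13_se"


def find_wave_files(stata_dir: Path, wave: int) -> list[Path]:
    """List files named like wave_<n>_elsa_data*.dta, sorted by name."""
    return sorted(stata_dir.glob(f"wave_{wave}_elsa_data*.dta"))
```

Calling `find_wave_files(resolve_stata_dir(Path.cwd()), 7)` returns an empty list when the directory is missing or holds no matching files, which is a cheap check to run before loading anything.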
Edit config.py:
| Setting | Purpose |
|---|---|
| WAVE_FEATURES, WAVE_OUTCOME | Default 7 → 8 |
| CESD_BINARY_THRESHOLD | Cut-off for case definition (confirm with ELSA docs) |
| CESD_MAP_12_TO_01 | Map item codes 1/2 to 1/0 before summing (typical ELSA pattern) |
| PREDICTOR_COLUMNS | Explicit predictor names from the feature wave (empty = auto numeric) |
| AUTO_SELECT_NUMERIC_PREDICTORS | If True and list empty, use all numeric columns except ID/outcome |
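
To make the two CES-D settings concrete, here is a minimal sketch of how item recoding and the cut-off might combine. It is not the code in recode.py: the function name `cesd_binary_outcome` is hypothetical, the threshold value of 4 is only a placeholder to be confirmed against ELSA documentation, and two simplifying assumptions are made (any reverse-scored items are not handled, and a respondent with any missing item gets a missing outcome).

```python
import pandas as pd

# Placeholder values for illustration; confirm against ELSA/UKDA documentation.
CESD_BINARY_THRESHOLD = 4
CESD_MAP_12_TO_01 = True


def cesd_binary_outcome(items: pd.DataFrame) -> pd.Series:
    """Sum CES-D items and dichotomise the total at the threshold.

    Assumption: a row with any missing item yields a missing outcome.
    """
    scored = items.astype("float64")
    if CESD_MAP_12_TO_01:
        # Code 1 -> 1 and code 2 -> 0; every other value (including
        # negative survey missing codes) becomes NaN.
        scored = scored.apply(lambda col: col.map({1.0: 1.0, 2.0: 0.0}))
    total = scored.sum(axis=1, skipna=False)
    outcome = (total >= CESD_BINARY_THRESHOLD).astype("float64")
    # Keep the outcome missing wherever the total could not be computed.
    return outcome.where(total.notna())
```

If your items are already coded 0/1, set `CESD_MAP_12_TO_01 = False`; otherwise the mapping would turn every 0 into a missing value.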

| File | Role |
|---|---|
| config.py | Paths, seeds, wave numbers, outcome definition |
| data_load.py | Find and load .dta, normalise IDs |
| recode.py | Survey missing codes, CES-D sum, binary outcome |
| build_panel.py | Merge waves; make_demo_panel() for tests |
| features.py | Optional composites and numeric column helpers |
| train_eval.py | Logistic regression + random forest, metrics, ROC |
| main.py | CLI entry point |
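
As a rough picture of what the wave merge and the automatic numeric predictor selection involve, the sketch below joins feature-wave rows to outcome-wave rows on a respondent ID. It is not the code in build_panel.py or features.py: the column names `idauth` and `depressed` and both function names are assumptions made for illustration, and it assumes the feature wave has no column already called `depressed`.

```python
import pandas as pd

ID_COL = "idauth"          # assumed ID column name; check your files
OUTCOME_COL = "depressed"  # hypothetical name for the binary outcome


def build_lagged_panel(features_wave: pd.DataFrame,
                       outcome_wave: pd.DataFrame) -> pd.DataFrame:
    """Keep respondents present in both waves with a non-missing outcome."""
    outcome = outcome_wave[[ID_COL, OUTCOME_COL]].dropna(subset=[OUTCOME_COL])
    # validate raises MergeError if an ID is duplicated in either wave.
    return features_wave.merge(outcome, on=ID_COL, how="inner",
                               validate="one_to_one")


def numeric_predictors(panel: pd.DataFrame) -> list[str]:
    """All numeric columns except the ID and the outcome."""
    numeric = panel.select_dtypes(include="number").columns
    return [c for c in numeric if c not in (ID_COL, OUTCOME_COL)]
```

The inner join silently drops respondents who left the study between waves, so comparing row counts before and after the merge gives a first view of attrition.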
- Logistic regression (balanced class weights, median imputation, scaling)
- Random forest (balanced class weights, median imputation)
Outputs (written to outputs/): test-set accuracy, precision, recall, F1, ROC-AUC, a confusion matrix, and ROC curves.
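
The two models and the test-set metrics can be sketched with scikit-learn as below. This is an illustrative outline rather than the contents of train_eval.py: the seed, the 25% test split, `n_estimators=200`, `max_iter=1000`, and the function names are all assumptions, and ROC plotting is omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # hypothetical seed; the project keeps its own in config.py


def make_models():
    """Build both pipelines with median imputation and balanced weights."""
    logit = make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    forest = make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(n_estimators=200, class_weight="balanced",
                               random_state=SEED),
    )
    return {"logistic_regression": logit, "random_forest": forest}


def evaluate(models, X, y):
    """Fit each model on a stratified train split and score the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=SEED)
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
            "roc_auc": roc_auc_score(y_test, prob),
            "confusion_matrix": confusion_matrix(y_test, pred).tolist(),
        }
    return results
```

Placing the imputer and scaler inside each pipeline means they are fitted on the training split only, so no information from the test rows leaks into preprocessing.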
- Fix predictor variable names to match your codebook and theory.
- Validate CES-D item coding (0/1 vs 1/2) against UKDA documentation.
- Document roles, meeting minutes, and limitations (observational data, attrition, missingness) in the presentation and report.