ELSA mental health modelling (MSc AI & Healthcare)

Lagged prediction pipeline: predicts wave 8 depression (a binary outcome derived from CES-D) from wave 7 predictors. The choice of waves (6–7–8) and the variable list should match your group’s audit and the UKDA documentation.

Data governance: Do not commit ELSA microdata to GitHub. This repository contains code only. Use data under your UK Data Service licence.

Requirements

Python 3.10+ (Google Colab is fine)
Install:

pip install -r requirements.txt

Quick check (no ELSA files)

From this folder (elsa_ml_project):

python main.py --demo

This writes outputs/metrics.csv, outputs/metrics.json, and ROC PNGs. Use this to verify Colab/local setup before pointing at real data.

Real ELSA run

Obtain core wave .dta files via UK Data Service (authorised user only).
Point to the directory that contains files named like wave_7_elsa_data*.dta (often .../UKDA-5050-stata/stata/stata13_se).

Linux / macOS / Colab:

export ELSA_STATA_DIR="/path/to/stata13_se"
python main.py

Windows (PowerShell):

$env:ELSA_STATA_DIR = "C:\path\to\stata13_se"
python main.py

Google Colab: mount Drive, upload or clone this repo, then:

import os
os.environ["ELSA_STATA_DIR"] = "/content/drive/MyDrive/ELSA/data/UKDA-5050-stata/stata/stata13_se"

!cd /content/path/to/elsa_ml_project && pip install -r requirements.txt && python main.py

If ELSA_STATA_DIR is unset and you are not on Colab, the code looks for ./data/stata13_se under this project (optional local layout).
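The lookup order described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `resolve_stata_dir` and `find_wave_files` are hypothetical helper names, and the Colab-specific branch is left out because this README does not describe it.

```python
import os
from pathlib import Path


def resolve_stata_dir(project_root: Path) -> Path:
    """Return the directory expected to hold the ELSA .dta files.

    Hypothetical helper: prefers the ELSA_STATA_DIR environment variable,
    otherwise falls back to <project_root>/data/stata13_se.
    """
    env_value = os.environ.get("ELSA_STATA_DIR", "").strip()
    if env_value:
        return Path(env_value).expanduser()
    return project_root / "data" / "stata13_se"


def find_wave_files(stata_dir: Path, wave: int) -> list[Path]:
    """List files named like wave_<n>_elsa_data*.dta, sorted by name."""
    return sorted(stata_dir.glob(f"wave_{wave}_elsa_data*.dta"))


if __name__ == "__main__":
    stata_dir = resolve_stata_dir(Path.cwd())
    print(f"Looking in: {stata_dir}")
    for wave in (7, 8):
        names = [p.name for p in find_wave_files(stata_dir, wave)]
        print(f"wave {wave}: {names or 'no files found'}")
```

Running a check like this before `python main.py` makes a wrong path show up as "no files found" rather than as a later loading error.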

Configuration

Edit config.py:

| Setting | Purpose |
| --- | --- |
| `WAVE_FEATURES`, `WAVE_OUTCOME` | Wave numbers for the predictors and the outcome (default 7 → 8) |
| `CESD_BINARY_THRESHOLD` | Cut-off for case definition (confirm with ELSA docs) |
| `CESD_MAP_12_TO_01` | Map item codes 1/2 to 1/0 before summing (typical ELSA pattern) |
| `PREDICTOR_COLUMNS` | Explicit predictor names from the feature wave (empty = auto numeric) |
| `AUTO_SELECT_NUMERIC_PREDICTORS` | If True and `PREDICTOR_COLUMNS` is empty, use all numeric columns except ID/outcome |
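To make the interaction between `CESD_MAP_12_TO_01` and `CESD_BINARY_THRESHOLD` concrete, here is a minimal sketch of the recode-sum-threshold step in plain Python. It is an illustration under stated assumptions, not the code in `recode.py`: the function names are hypothetical, negative codes are assumed to mark survey missingness, a score at or above the threshold is assumed to count as a case, respondents with any missing item are assumed to get no score, and the threshold of 4 in the example is a placeholder to confirm against the ELSA documentation. Reverse-scored items, if any, are not handled.

```python
import math
from typing import Optional, Sequence


def recode_item(value: Optional[float], map_12_to_01: bool = True) -> Optional[int]:
    """Recode one CES-D item to 1 (symptom reported) or 0 (not reported).

    Assumption: with map_12_to_01, raw code 1 means "yes" and 2 means "no".
    Missing, negative, or unexpected codes are returned as None.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if value < 0:  # assumed survey missing code
        return None
    mapping = {1: 1, 2: 0} if map_12_to_01 else {0: 0, 1: 1}
    return mapping.get(value)


def cesd_score(items: Sequence[Optional[float]], map_12_to_01: bool = True) -> Optional[int]:
    """Sum the recoded items; None if any item is missing (assumed rule)."""
    recoded = [recode_item(v, map_12_to_01) for v in items]
    if any(r is None for r in recoded):
        return None
    return sum(recoded)


def cesd_case(score: Optional[int], threshold: int) -> Optional[int]:
    """Binary outcome: 1 if score >= threshold, 0 otherwise, None if no score."""
    if score is None:
        return None
    return int(score >= threshold)


if __name__ == "__main__":
    raw = [1, 2, 2, 1, 1, 2, 1, 2]  # made-up responses coded 1 = yes, 2 = no
    score = cesd_score(raw)
    print(score, cesd_case(score, threshold=4))  # prints: 4 1
```

If the items in your files are already coded 0/1, passing `map_12_to_01=False` leaves them unchanged; applying the 1/2 mapping to 0/1 data would turn every 0 into a missing value, which is why the coding needs to be checked first.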

Project layout

| File | Role |
| --- | --- |
| `config.py` | Paths, seeds, wave numbers, outcome definition |
| `data_load.py` | Find and load `.dta`, normalise IDs |
| `recode.py` | Survey missing codes, CES-D sum, binary outcome |
| `build_panel.py` | Merge waves; `make_demo_panel()` for tests |
| `features.py` | Optional composites and numeric column helpers |
| `train_eval.py` | Logistic regression + random forest, metrics, ROC |
| `main.py` | CLI entry point |
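The wave merge and the automatic predictor selection can be pictured with a small pandas sketch. Everything named here is an assumption for illustration: `idauth` as the respondent ID column, `depressed_w8` as the outcome column, and the two helper functions, none of which are confirmed by this README. The data are invented.

```python
import pandas as pd

ID_COL = "idauth"             # assumed ID column name; check data_load.py
OUTCOME_COL = "depressed_w8"  # hypothetical outcome column name


def build_lagged_panel(features_df: pd.DataFrame, outcome_df: pd.DataFrame) -> pd.DataFrame:
    """Inner-join feature-wave predictors with the outcome-wave label on ID.

    Respondents missing from either wave, or without an outcome, drop out.
    """
    outcome = outcome_df[[ID_COL, OUTCOME_COL]].dropna(subset=[OUTCOME_COL])
    return features_df.merge(outcome, on=ID_COL, how="inner", validate="one_to_one")


def auto_numeric_predictors(panel: pd.DataFrame) -> list[str]:
    """All numeric columns except the ID and the outcome."""
    numeric = panel.select_dtypes(include="number").columns
    return [c for c in numeric if c not in (ID_COL, OUTCOME_COL)]


if __name__ == "__main__":
    wave7 = pd.DataFrame(
        {"idauth": [1, 2, 3], "age": [64, 71, 58], "sex": ["f", "m", "f"], "cesd_w7": [2, 5, 0]}
    )
    wave8 = pd.DataFrame({"idauth": [2, 3, 4], "depressed_w8": [1, 0, 1]})
    panel = build_lagged_panel(wave7, wave8)
    print(panel[ID_COL].tolist())          # prints: [2, 3]
    print(auto_numeric_predictors(panel))  # prints: ['age', 'cesd_w7']
```

The inner join silently drops respondents who leave the study between waves, so comparing row counts before and after the merge gives a direct measure of attrition for the report.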

Models

Logistic regression (balanced class weights, median imputation, scaling)
Random forest (balanced class weights, median imputation)
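A minimal sketch of how these two models might be set up, assuming scikit-learn is what `train_eval.py` uses (this README does not say so explicitly). The helper names, the hyperparameters (`max_iter`, `n_estimators`, the seed) and the synthetic data are placeholders, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_models(seed: int = 42) -> dict:
    """Build the two pipelines described above (hypothetical helper)."""
    logistic = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=seed)),
    ])
    forest = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("clf", RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=seed)),
    ])
    return {"logistic_regression": logistic, "random_forest": forest}


def demo(seed: int = 42) -> dict:
    """Fit both models on synthetic, imbalanced data with missing cells."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(400, 4))
    y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0.8).astype(int)
    X[rng.random(X.shape) < 0.1] = np.nan  # inject roughly 10% missing cells
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    aucs = {}
    for name, model in make_models(seed).items():
        model.fit(X_train, y_train)
        aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return aucs


if __name__ == "__main__":
    for name, auc in demo().items():
        print(f"{name}: ROC-AUC = {auc:.3f}")
```

Keeping imputation and scaling inside the pipeline means the medians and scaling parameters are learned from the training split only, so no test-set information leaks into preprocessing.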

Outputs (written to outputs/): test-set accuracy, precision, recall, F1, ROC-AUC, a confusion matrix, and ROC curves.
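For reference, the threshold-based metrics listed above all follow from the four confusion-matrix counts. The sketch below computes them in plain Python on made-up labels; the function names are illustrative and are not taken from `train_eval.py`, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Count true/false positives and negatives for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Accuracy, precision, recall and F1; 0.0 where a denominator is zero."""
    c = confusion_counts(y_true, y_pred)
    total = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    denom = precision + recall
    return {
        "accuracy": (c["tp"] + c["tn"]) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
    print(confusion_counts(y_true, y_pred))        # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
    print(classification_metrics(y_true, y_pred))  # all four metrics equal 0.75
```

When cases are the minority class, accuracy can look high for a model that rarely predicts a case, so recall and precision are worth reading alongside it.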

Next steps for the coursework

Fix predictor variable names to match your codebook and theory.
Validate CES-D item coding (0/1 vs 1/2) against UKDA documentation.
Document roles, meeting minutes, and limitations (observational data, attrition, missingness) in the presentation and report.
