TheFinix13/Elsa_Sample_Project

ELSA mental health modelling (MSc AI & Healthcare)

Lagged prediction pipeline: predicts wave 8 depression (a binary outcome derived from CES-D) from wave 7 predictors. Confirm the wave choice (6–7–8) and the variable list against your group’s audit and the UKDA documentation.

Data governance: Do not commit ELSA microdata to GitHub. This repository contains code only. Use data under your UK Data Service licence.

Requirements

  • Python 3.10+ (Google Colab is fine)
  • Install:
pip install -r requirements.txt

Quick check (no ELSA files)

From this folder (elsa_ml_project):

python main.py --demo

The demo run writes outputs/metrics.csv, outputs/metrics.json, and ROC curve PNGs. Use it to verify your Colab or local setup before pointing the pipeline at real data.

Real ELSA run

  1. Obtain core wave .dta files via UK Data Service (authorised user only).
  2. Set ELSA_STATA_DIR to the directory that contains files named like wave_7_elsa_data*.dta (often .../UKDA-5050-stata/stata/stata13_se).

Linux / macOS / Colab:

export ELSA_STATA_DIR="/path/to/stata13_se"
python main.py

Windows (PowerShell):

$env:ELSA_STATA_DIR = "C:\path\to\stata13_se"
python main.py

Google Colab: mount Drive, upload or clone this repo, then:

import os
os.environ["ELSA_STATA_DIR"] = "/content/drive/MyDrive/ELSA/data/UKDA-5050-stata/stata/stata13_se"
!cd /content/path/to/elsa_ml_project && pip install -r requirements.txt && python main.py

If ELSA_STATA_DIR is unset and you are not on Colab, the code looks for ./data/stata13_se under this project (optional local layout).
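
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in data_load.py: the function names resolve_stata_dir and list_wave_files are hypothetical, and any Colab-specific handling is left out because the README does not describe it.

```python
import os
from pathlib import Path


def resolve_stata_dir(project_root: Path) -> Path:
    """Return the directory expected to hold the wave .dta files.

    Hypothetical helper: prefers ELSA_STATA_DIR, otherwise falls back to
    <project_root>/data/stata13_se (the optional local layout).
    """
    env_value = os.environ.get("ELSA_STATA_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return project_root / "data" / "stata13_se"


def list_wave_files(stata_dir: Path, wave: int) -> list[Path]:
    """List files matching the wave_<n>_elsa_data*.dta naming pattern."""
    return sorted(stata_dir.glob(f"wave_{wave}_elsa_data*.dta"))
```

Calling list_wave_files(resolve_stata_dir(Path.cwd()), 7) before a full run is a quick way to confirm that the directory is the right one: an empty list usually means the path points one level too high or too low.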

Configuration

Edit config.py:

| Setting | Purpose |
| --- | --- |
| WAVE_FEATURES, WAVE_OUTCOME | Feature wave and outcome wave (default 7 → 8) |
| CESD_BINARY_THRESHOLD | Cut-off for case definition (confirm with ELSA docs) |
| CESD_MAP_12_TO_01 | Map item codes 1/2 to 1/0 before summing (typical ELSA pattern) |
| PREDICTOR_COLUMNS | Explicit predictor names from the feature wave (empty = auto numeric) |
| AUTO_SELECT_NUMERIC_PREDICTORS | If True and the list is empty, use all numeric columns except ID/outcome |
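
To make the two CES-D settings concrete, here is a rough sketch of how a binary outcome could be derived from item columns. It is not the code in recode.py. The function name, the item column names, the treatment of negative codes as missing, and the use of "greater than or equal to" for the cut-off are all assumptions to check against the ELSA documentation.

```python
import pandas as pd


def cesd_binary_outcome(
    items: pd.DataFrame, threshold: int, map_12_to_01: bool = True
) -> pd.Series:
    """Sum CES-D items and flag rows whose total is >= threshold.

    Assumptions (verify against the codebook):
    - negative values are survey missing codes;
    - when map_12_to_01 is True, code 1 means 'yes' (1) and code 2 means 'no' (0);
    - a row with any missing item gets a missing outcome.
    """
    cleaned = items.where(items >= 0)  # negative codes become NaN
    if map_12_to_01:
        # Values other than 1 or 2 are unmapped and therefore become NaN.
        cleaned = cleaned.apply(lambda col: col.map({1.0: 1.0, 2.0: 0.0}))
    total = cleaned.sum(axis=1, skipna=False)  # NaN if any item is missing
    return (total >= threshold).astype(float).where(total.notna())


# Hypothetical item names and an illustrative threshold, not a recommended cut-off.
demo_items = pd.DataFrame(
    {"item_a": [1, 2, 1], "item_b": [1, 2, -9], "item_c": [1, 1, 2]}
)
demo_outcome = cesd_binary_outcome(demo_items, threshold=2)
```

In the demo frame the first row sums to 3 and is flagged, the second sums to 1 and is not, and the third has a missing code, so its outcome stays missing rather than being counted as a non-case.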

Project layout

| File | Role |
| --- | --- |
| config.py | Paths, seeds, wave numbers, outcome definition |
| data_load.py | Find and load .dta, normalise IDs |
| recode.py | Survey missing codes, CES-D sum, binary outcome |
| build_panel.py | Merge waves; make_demo_panel() for tests |
| features.py | Optional composites and numeric column helpers |
| train_eval.py | Logistic regression + random forest, metrics, ROC |
| main.py | CLI entry point |
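
The wave merge at the centre of the lagged design can be illustrated with a short pandas sketch. This is an assumption about the general shape of the step, not the contents of build_panel.py: the function name, the ID column idauniq, and the outcome column depressed_next_wave are placeholders.

```python
import pandas as pd


def build_lagged_panel(
    features: pd.DataFrame,
    outcomes: pd.DataFrame,
    id_col: str = "idauniq",  # placeholder ID column name
    outcome_col: str = "depressed_next_wave",  # placeholder outcome name
) -> pd.DataFrame:
    """Attach the later-wave outcome to earlier-wave predictors.

    Keeps only respondents present in both waves with a non-missing outcome.
    validate='one_to_one' raises if an ID is duplicated in either frame.
    """
    panel = features.merge(
        outcomes[[id_col, outcome_col]],
        on=id_col,
        how="inner",
        validate="one_to_one",
    )
    return panel.dropna(subset=[outcome_col]).reset_index(drop=True)
```

The inner join silently drops anyone who left the study between waves, so comparing len(features) with len(panel) gives a first estimate of attrition to report.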

Models

  • Logistic regression (balanced class weights, median imputation, scaling)
  • Random forest (balanced class weights, median imputation)

Outputs (written to outputs/): test-set accuracy, precision, recall, F1, ROC-AUC, a confusion matrix, and ROC curves.
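
For orientation, the two model set-ups listed above can be approximated with scikit-learn pipelines. The sketch below runs on synthetic data from make_classification and is not the code in train_eval.py; the hyperparameters, split size, and seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real panel (no ELSA data involved).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Median imputation, then scaling, then a class-weighted logistic regression.
    "logistic_regression": make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    # Trees do not need scaling, so only imputation precedes the forest.
    "random_forest": make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(class_weight="balanced", random_state=0),
    ),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the predicted probability of the positive class.
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Putting the imputer and scaler inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model.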

Next steps for the coursework

  1. Fix predictor variable names to match your codebook and theory.
  2. Validate CES-D item coding (0/1 vs 1/2) against UKDA documentation.
  3. Document roles, meeting minutes, and limitations (observational data, attrition, missingness) in the presentation and report.
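
Steps 1 and 2 lend themselves to a quick programmatic check before any modelling. The helpers below are a hypothetical sketch (none of these names come from the repository): one reports configured predictors that are absent from the loaded data, the others summarise the codes observed in each CES-D item so that 0/1 and 1/2 coding can be told apart. Negative values are assumed to be missing codes.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


def summarise_item_codes(df: pd.DataFrame, item_cols: list[str]) -> dict[str, list]:
    """Map each item column to the sorted list of distinct non-null codes."""
    return {col: sorted(df[col].dropna().unique().tolist()) for col in item_cols}


def looks_like_12_coding(codes: list) -> bool:
    """True if the non-negative codes are drawn from {1, 2} and include a 2."""
    valid = {code for code in codes if code >= 0}
    return valid <= {1, 2} and 2 in valid


# Hypothetical column names, for illustration only.
demo = pd.DataFrame({"item_a": [1, 2, -9, 1], "item_b": [0, 1, 1, 0]})
code_summary = summarise_item_codes(demo, ["item_a", "item_b"])
```

Here item_a would be treated as 1/2-coded and item_b as 0/1-coded. A mix like this within one scale is a signal to go back to the UKDA documentation before setting CESD_MAP_12_TO_01.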
