CHB-MIT Scalp EEG Dataset

My goal is to become more acquainted/familiar with time-series data, digital signal processing, preprocessing pipelines, and ML/AI solutions that utilize these type of data. The CHB-MIT scalp EEG dataset contains real time-series data of pediatric subjects. The focus of this project is to create a classifier that can distinguish seizure and non-seizure states.

Technology Used

python 3.11, mne (EEG/MEG library), jupyter lab

Data Acquisition

Anyone can access the data, download via wget in terminal.

wget -r -N -c -np https://physionet.org/files/chbmit/1.0.0/

# or with s2 (this is way quicker)
aws s3 sync --no-sign-request s3://physionet-open/chbmit/1.0.0/ DESTINATION

# Note: The initial download creates many subfolders, move them up as many levels as you'd like

Setup

# install dependencies and run any project command
uv sync

# or just run directly (uv sync is implicit)
uv run jupyter lab

Running Jupyter lab on a remote server

uv run jupyter lab --no-browser --ip=<your_server_ip> --port=8080 > logs/jupyter.log 2>&1 &

Inspiration

Professor Millan was the instructor for one of my last undergraduate classes at UT. His class, neural engineering, covered many advanced topics: cochlear implants, robotic prosthesis, TMS/TACS, BCI's, Deep brain stimulation, etc.. Every lecture was eye opening, inspiring, and interesting. My favorite part involved a homework where we created an ML model (SVM) to predict whether a rat's sciatic nerve was experiencing a pinch, flexion, or at rest. Multiclass classification. In order to whet that interest, I decided (as an ML/AI hobbyist) to take a real EEG dataset and try my hand at building a model, so binary classification of seizure states seemed most similar and doable.

Progress

Exploratory data analysis — understanding data volume, which patients experience the most seizures, most/least seizure time, total interictal vs ictal time, basic visualization
Parsing summary files — each patient's data files came with summary files detailing seizure onsets and end times; wrote a parser to extract these for downstream labeling
Single-patient workflow — selected the patient with the highest seizure time and one of their files to develop an end-to-end workflow:
- Preprocessing — dropping dead channels, annotating seizure onsets/duration, segmenting data with configurable window size and overlap, creating a label vector for training
- Feature extraction — time-domain (mean, variance, MAV, skewness) and frequency-domain (absolute/relative band power, spectral entropy, peak frequency)
Generalizing the pipeline — extended the single-file workflow to process all patient files
Model experiments — SVM and neural network classifiers (see models/)
- SVM failed to predict ictal states due to extreme class imbalance (~1,144:1 non-ictal to ictal)
- Deep NN with class-weighted cross-entropy loss had the same problem — 0% recall on ictal class
- Added resampling (random oversampling of ictal + undersampling of non-ictal) to the pipeline, which got the model to finally produce ictal predictions
Current challenge: feature quality — the model predicts ictal states now but with very low recall. Visual inspection of feature distributions shows that separation between classes is highly channel-dependent (e.g., beta/theta band power on C3-P3 shows clear signal, while other channels show none). Next steps:
- Feature selection — drop low-signal features and channels that add noise
- Focus on the most discriminative features (band power features, especially beta and theta)
- Potentially reduce from 345 features to a smaller, higher-quality set

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
analysis		analysis
extraction		extraction
models		models
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
nomenclature.md		nomenclature.md
pyproject.toml		pyproject.toml
run_jupyter.sh		run_jupyter.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CHB-MIT Scalp EEG Dataset

Technology Used

Data Acquisition

Setup

Running Jupyter lab on a remote server

Inspiration

Progress

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CHB-MIT Scalp EEG Dataset

Technology Used

Data Acquisition

Setup

Running Jupyter lab on a remote server

Inspiration

Progress

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages