Automated pipeline for collecting and structuring weekly CDC measles surveillance data. The scraper captures data from the CDC Measles Cases and Outbreaks page — both the live site and historical snapshots via the Wayback Machine — and parses it into a structured CSV.
`measles_structured.csv` contains one row per CDC update week, with the following columns:
| Column | Description |
|---|---|
| `snapshot_date` | Date the page was captured (YYYY-MM-DD) |
| `update_date` | CDC's stated data update date |
| `total_cases` | Total confirmed measles cases |
| `age_under5_n`, `age_under5_pct` | Cases among children under 5 (count and %) |
| `age_5_19_n`, `age_5_19_pct` | Cases among ages 5–19 |
| `age_20plus_n`, `age_20plus_pct` | Cases among adults 20+ |
| `age_unknown_n`, `age_unknown_pct` | Cases with unknown age |
| `vax_unvax_or_unknown_pct` | % unvaccinated or with unknown vaccination status |
| `vax_one_mmr_pct` | % with one MMR dose |
| `vax_two_mmr_pct` | % with two MMR doses |
| `hosp_total_n`, `hosp_total_pct` | Total hospitalizations (count and %) |
| `hosp_under5_n`, `hosp_under5_pct` | Hospitalizations among children under 5 |
| `hosp_5_19_n`, `hosp_5_19_pct` | Hospitalizations among ages 5–19 |
| `hosp_20plus_n`, `hosp_20plus_pct` | Hospitalizations among adults 20+ |
| `hosp_unknown_n`, `hosp_unknown_pct` | Hospitalizations with unknown age |
Missing values are coded as NA.
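For downstream analysis, the CSV loads cleanly with pandas once the NA codes are mapped to missing values. A minimal sketch, with hypothetical sample rows (the values below are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical sample rows in the measles_structured.csv schema
# (values are invented for illustration).
sample = io.StringIO(
    "snapshot_date,update_date,total_cases,vax_two_mmr_pct\n"
    "2025-03-04,2025-03-04,222,3\n"
    "2025-03-11,2025-03-11,NA,NA\n"
)

# "NA" is already in pandas' default missing-value set; listing it
# explicitly documents this file's convention. With NA parsed as
# missing, the count/percent columns come through as numeric.
df = pd.read_csv(sample, na_values=["NA"], parse_dates=["snapshot_date"])
```

Reading the real file is the same call with `"measles_structured.csv"` in place of the `StringIO` buffer.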
- `measles_cdc_scraper.py` uses Playwright to render the CDC page in a headless Chromium browser and extract the visible text. This is necessary because the page uses JavaScript to render its data tables.
- Each scraped page is cached as a plain text file in `raw/` (named by timestamp, e.g., `raw/20260304022912.txt`). Pages that are already cached are skipped on subsequent runs.
- `parse_raw_to_csv.py` reads all files in `raw/` and extracts structured fields into `measles_structured.csv`.
- The Wayback Machine CDX API is used to discover all historical snapshots, deduplicated to one per CDC update week (CDC updates on Tuesdays).
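The per-week deduplication step can be illustrated with a small helper. This is a sketch, not the scraper's actual code: `dedupe_by_week` and the sample timestamps are hypothetical, and the real scraper may group by the CDC update date rather than ISO week.

```python
from datetime import datetime


def dedupe_by_week(timestamps):
    """Keep the earliest Wayback timestamp (YYYYMMDDhhmmss) per ISO week."""
    seen, kept = set(), []
    for ts in sorted(timestamps):
        # ISO (year, week) of the snapshot's date portion
        week = datetime.strptime(ts[:8], "%Y%m%d").isocalendar()[:2]
        if week not in seen:
            seen.add(week)
            kept.append(ts)
    return kept


# Two snapshots in the same week collapse to the earlier one
snapshots = ["20250304110000", "20250306090000", "20250311120000"]
print(dedupe_by_week(snapshots))  # ['20250304110000', '20250311120000']
```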
```
pip install playwright
playwright install chromium
```

```
# Scrape the live CDC page + backfill any missing Wayback Machine history
python measles_cdc_scraper.py

# Scrape the live CDC page only (used by the GitHub Action)
python measles_cdc_scraper.py --live

# Backfill Wayback Machine history only
python measles_cdc_scraper.py --history

# Re-parse all raw files into the structured CSV
python parse_raw_to_csv.py
```

A GitHub Action (`.github/workflows/weekly-scrape.yml`) runs every Wednesday at 10:00 UTC, the day after the CDC's Tuesday data update. It:
- Scrapes the live CDC page
- Parses all raw files into `measles_structured.csv`
- Commits and pushes any new data
The workflow can also be triggered manually from the Actions tab.
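The trigger section of such a workflow would look roughly like this (a sketch only; the actual `weekly-scrape.yml` may differ in naming and job steps):

```yaml
on:
  schedule:
    - cron: "0 10 * * 3"  # every Wednesday at 10:00 UTC
  workflow_dispatch:      # enables manual runs from the Actions tab
```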
```
measles_age_cdc_scraper/
  measles_cdc_scraper.py      # Main scraper (Playwright + Wayback Machine)
  parse_raw_to_csv.py         # Parser: raw text -> structured CSV
  measles_structured.csv      # Output: structured weekly data
  measles_cases_parser.html   # Browser-based parser tool (standalone)
  cdc_measles_urls.txt        # Tracked Wayback Machine snapshot URLs
  raw/                        # Cached scraped text (one .txt per snapshot)
  requirements.txt            # Python dependencies
  .github/workflows/          # GitHub Actions for weekly automation
```
All data originates from the CDC's Measles Cases and Outbreaks page. Historical data is retrieved via the Wayback Machine.