Automated pipeline for collecting and structuring weekly CDC measles surveillance data. The scraper captures data from the CDC Measles Cases and Outbreaks page — both the live site and historical snapshots via the Wayback Machine — and parses it into a structured CSV.
`measles_structured.csv` contains one row per CDC update week, with the following columns:
| Column | Description |
|---|---|
| `snapshot_date` | Date the page was captured (YYYY-MM-DD) |
| `update_date` | CDC's stated data update date |
| `total_cases` | Total confirmed measles cases |
| `age_under5_n`, `age_under5_pct` | Cases among children under 5 (count and %) |
| `age_5_19_n`, `age_5_19_pct` | Cases among ages 5–19 |
| `age_20plus_n`, `age_20plus_pct` | Cases among adults 20+ |
| `age_unknown_n`, `age_unknown_pct` | Cases with unknown age |
| `vax_unvax_or_unknown_pct` | % unvaccinated or with unknown vaccination status |
| `vax_one_mmr_pct` | % with one MMR dose |
| `vax_two_mmr_pct` | % with two MMR doses |
| `hosp_total_n`, `hosp_total_pct` | Total hospitalizations (count and %) |
| `hosp_under5_n`, `hosp_under5_pct` | Hospitalizations among children under 5 |
| `hosp_5_19_n`, `hosp_5_19_pct` | Hospitalizations among ages 5–19 |
| `hosp_20plus_n`, `hosp_20plus_pct` | Hospitalizations among adults 20+ |
| `hosp_unknown_n`, `hosp_unknown_pct` | Hospitalizations with unknown age |
Missing values are coded as NA.
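For downstream analysis, the CSV loads cleanly with pandas once the NA codes are mapped to missing values. A minimal sketch, with hypothetical sample rows (the values below are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical sample rows in the measles_structured.csv schema
# (values are invented for illustration).
sample = io.StringIO(
    "snapshot_date,update_date,total_cases,vax_two_mmr_pct\n"
    "2025-03-04,2025-03-04,222,3\n"
    "2025-03-11,2025-03-11,NA,NA\n"
)

# "NA" is already in pandas' default missing-value set; listing it
# explicitly documents this file's convention. With NA parsed as
# missing, the count/percent columns come through as numeric.
df = pd.read_csv(sample, na_values=["NA"], parse_dates=["snapshot_date"])
```

Reading the real file is the same call with `"measles_structured.csv"` in place of the `StringIO` buffer.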
- `measles_cdc_scraper.py` uses Playwright to render the CDC page in a headless Chromium browser and extract the visible text. This is necessary because the page uses JavaScript to render its data tables.
- Each scraped page is cached as a plain text file in `raw/` (named by timestamp, e.g., `raw/20260304022912.txt`). Pages that are already cached are skipped on subsequent runs.
- `parse_raw_to_csv.py` reads all files in `raw/` and extracts structured fields into `measles_structured.csv`.
- The Wayback Machine CDX API is used to discover all historical snapshots, deduplicated to one per CDC update week (CDC updates on Tuesdays).
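The per-week deduplication step can be illustrated with a small helper. This is a sketch, not the scraper's actual code: `dedupe_by_week` and the sample timestamps are hypothetical, and the real scraper may group by the CDC update date rather than ISO week.

```python
from datetime import datetime


def dedupe_by_week(timestamps):
    """Keep the earliest Wayback timestamp (YYYYMMDDhhmmss) per ISO week."""
    seen, kept = set(), []
    for ts in sorted(timestamps):
        # ISO (year, week) of the snapshot's date portion
        week = datetime.strptime(ts[:8], "%Y%m%d").isocalendar()[:2]
        if week not in seen:
            seen.add(week)
            kept.append(ts)
    return kept


# Two snapshots in the same week collapse to the earlier one
snapshots = ["20250304110000", "20250306090000", "20250311120000"]
print(dedupe_by_week(snapshots))  # ['20250304110000', '20250311120000']
```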
```
pip install playwright
playwright install chromium
```

```
# Scrape the live CDC page + backfill any missing Wayback Machine history
python measles_cdc_scraper.py

# Scrape the live CDC page only (used by the GitHub Action)
python measles_cdc_scraper.py --live

# Backfill Wayback Machine history only
python measles_cdc_scraper.py --history

# Re-parse all raw files into the structured CSV
python parse_raw_to_csv.py
```

A GitHub Action (`.github/workflows/weekly-scrape.yml`) runs every Wednesday at 10:00 UTC, the day after the CDC's Tuesday data update. It:
- Scrapes the live CDC page
- Parses all raw files into `measles_structured.csv`
- Commits and pushes any new data
The workflow can also be triggered manually from the Actions tab.
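The trigger section of such a workflow would look roughly like this (a sketch only; the actual `weekly-scrape.yml` may differ in naming and job steps):

```yaml
on:
  schedule:
    - cron: "0 10 * * 3"  # every Wednesday at 10:00 UTC
  workflow_dispatch:      # enables manual runs from the Actions tab
```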
```
measles_age_cdc_scraper/
  measles_cdc_scraper.py      # Main scraper (Playwright + Wayback Machine)
  parse_raw_to_csv.py         # Parser: raw text -> structured CSV
  measles_structured.csv      # Output: structured weekly data
  measles_cases_parser.html   # Browser-based parser tool (standalone)
  cdc_measles_urls.txt        # Tracked Wayback Machine snapshot URLs
  raw/                        # Cached scraped text (one .txt per snapshot)
  requirements.txt            # Python dependencies
  .github/workflows/          # GitHub Actions for weekly automation
```
All data originates from the CDC's Measles Cases and Outbreaks page. Historical data is retrieved via the Wayback Machine.