
Texas Workers' Compensation Healthcare Analytics Pipeline

End-to-end data pipeline that extracts Texas workers' compensation medical billing data from the Texas Open Data Portal, transforms it through a 3-layer dbt model architecture, and loads it into the OMOP Common Data Model v5.4 using DuckDB.



Overview

The Texas Department of Insurance, Division of Workers' Compensation (DWC) publishes medical billing data for injured employees as open data. This project builds an automated pipeline to:

  1. Extract 12 datasets (professional, institutional, pharmacy claims) via the Socrata API
  2. Enrich with clinical reference data from RxNav, VSAC/FHIR, and OMOP vocabularies
  3. Transform raw billing records into 20+ standardized OMOP CDM tables using dbt
  4. Analyze with interactive dashboards and Jupyter notebooks

What is OMOP CDM?

The OMOP Common Data Model is a universal standard for healthcare data. Just as USB-C lets any device connect to any charger, OMOP CDM lets healthcare data from any source be analyzed using the same tools and queries. Developed by the OHDSI community and used by 100+ institutions worldwide, it maps proprietary billing codes (ICD-9/10, NDC, HCPCS/CPT) into standardized clinical concepts (SNOMED, RxNorm).
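The translation from source codes to standard concepts runs through the vocabulary's "Maps to" relationships. A minimal sketch of that lookup, assuming the standard CONCEPT and CONCEPT_RELATIONSHIP tables from the OMOP v5.4 vocabulary spec are loaded:

```sql
-- Map a source ICD-10-CM code to its standard SNOMED concept.
-- Illustrative query; table and column names follow the OMOP v5.4 vocabulary spec.
SELECT
    src.concept_code   AS icd10_code,
    src.concept_name   AS icd10_name,
    std.concept_id     AS snomed_concept_id,
    std.concept_name   AS snomed_name
FROM concept AS src
JOIN concept_relationship AS rel
    ON rel.concept_id_1 = src.concept_id
   AND rel.relationship_id = 'Maps to'
JOIN concept AS std
    ON std.concept_id = rel.concept_id_2
   AND std.standard_concept = 'S'
WHERE src.vocabulary_id = 'ICD10CM'
  AND src.concept_code = 'S52.501A';  -- an example ICD-10-CM injury code
```

The same pattern handles NDC → RxNorm and HCPCS/CPT → standard procedure concepts by changing the source vocabulary_id.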


Architecture

data.texas.gov (Socrata API)
    │
    │  load_data.py ── dlt pipeline with rate limiting,
    │                   patient-cohort sampling
    ▼
┌─────────────────────────────────────────────┐
│  DuckDB (tx_workers_comp.db)                │
│                                             │
│  ┌─────────┐    dbt (48 SQL models)         │
│  │ raw.*   │───────────────────────────────► │
│  │ 12 tbls │    staging → intermediate →    │
│  └─────────┘    final                       │
│                          ┌──────────┐       │
│  Reference Data ────────►│ omop.*   │       │
│  • OMOP Vocabularies     │ 20+ tbls │       │
│  • RxClass (RxNav API)   └──────────┘       │
│  • VSAC (FHIR API)                          │
└─────────────────────────────────────────────┘
    │
    ▼
Streamlit Dashboard & Jupyter Notebooks

Data Sources

| Source | Description | Loader |
| --- | --- | --- |
| Texas Open Data Portal | 12 workers' comp medical billing datasets (professional, institutional, pharmacy — headers & details, current & historical) | load_data.py |
| RxNav / RxClass API | Drug classification data (EPC, ATC, MOA) for all RxNorm concepts | load_rxclass.py |
| VSAC FHIR API | Clinical value sets (SNOMED, ICD-10, LOINC, CPT groupings) | load_vsac.py |
| OMOP Vocabularies (Athena) | CONCEPT, CONCEPT_RELATIONSHIP, CONCEPT_ANCESTOR, DRUG_STRENGTH reference tables | load_vocab.py |

Claim Types

| Type | Form | Description |
| --- | --- | --- |
| Professional (SV1) | CMS-1500 | Doctor/provider office visits |
| Institutional (SV2) | CMS-1450/UB-04 | Hospital/facility bills |
| Pharmacy (SV4) | DWC Form-066 | Prescription drug bills |

OMOP CDM Tables Produced

The dbt transformation layer maps raw billing data into these standardized clinical tables:

Clinical: person, visit_occurrence, visit_detail, condition_occurrence, procedure_occurrence, drug_exposure, measurement, observation, device_exposure, specimen, death

Derived: drug_era, condition_era, dose_era, observation_period

Administrative: cost, payer_plan_period, care_site, location, provider

Relationships: fact_relationship (links each procedure_occurrence to the condition_occurrence it was performed for, using the CMS-1500 Box 24E diagnosis pointer on professional claim lines)
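The pointer resolution above can be sketched as a join between each procedure line's Box 24E value and the ordinal position of the diagnosis on the same bill. Column names here (bill_id, line_diag_pointer, diag_position) are hypothetical; the project's staging models may name them differently:

```sql
-- Link each professional procedure line to the diagnosis it was billed against.
-- CMS-1500 Box 24E holds a pointer (1, 2, ...) into the bill's diagnosis list.
SELECT
    proc.procedure_occurrence_id,
    diag.condition_occurrence_id
FROM stg_professional_lines     AS proc
JOIN stg_professional_diagnoses AS diag
    ON diag.bill_id       = proc.bill_id
   AND diag.diag_position = proc.line_diag_pointer;
```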

Workers'-comp specific enrichments

Several OMOP columns are populated with WC-specific semantics that a generic claims ETL would leave empty or guess at:

  • Injury date observation (observation_concept_id = 40771952 LOINC "Injury date") — one row per patient, carrying employee_date_of_injury from the raw headers. Used to anchor observation_period.observation_period_start_date via LEAST(injury_date, earliest_clinical_event) so every clinical event stays OMOP-compliant.
  • Employer identifier observation (observation_concept_id = 21492865 LOINC "Employer name [Identifier]") — stores each bill's employer_fein in value_as_string for employer-level cohort segmentation.
  • WC-correct cost math: paid_by_patient, paid_patient_copay, paid_patient_coinsurance, and paid_patient_deductible are all hardcoded to 0 (workers' comp is no-fault, so there is no patient responsibility). amount_allowed = total_charge - service_adjustment_amount surfaces ~$4M of contractual write-offs that were previously invisible.
  • Injury → first-treatment cohort analytics via the injury-date observation.
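The cost logic above reduces to a few fixed expressions. A sketch of the relevant SELECT list, with an illustrative source relation name (stg_service_lines) standing in for the actual staging model:

```sql
-- Workers' comp is no-fault: patient-responsibility fields are fixed at 0,
-- and the allowed amount is derived from the contractual adjustment.
SELECT
    total_charge                              AS total_charged,
    total_charge - service_adjustment_amount  AS amount_allowed,
    0 AS paid_by_patient,
    0 AS paid_patient_copay,
    0 AS paid_patient_coinsurance,
    0 AS paid_patient_deductible
FROM stg_service_lines;
```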

Quick Start

Prerequisites

  • Docker
  • A Socrata app token (register here)
  • (Optional) UMLS API key for VSAC/RxClass enrichment
  • (Required for OMOP) OMOP vocabulary files from Athena — see below

1. Clone and configure credentials

git clone https://github.com/saywurdson/txwc.git
cd txwc

# Copy the secrets template
cp .dlt/secrets.toml.example .dlt/secrets.toml

Edit .dlt/secrets.toml with your API keys:

[sources.txwc]
application_token = "your_socrata_app_token"

[sources.vsac]
api_key = "your_umls_api_key"

Note: .dlt/secrets.toml is gitignored and should never be committed. The database destination is configured in .dlt/config.toml (defaults to tx_workers_comp.db).

2. Build and run the Docker container

# Build the image
docker build -t txwc .

# Run interactively (mounts the project so data and config persist)
docker run -it -p 8501:8501 -v "$(pwd):/workspaces/txwc" -w /workspaces/txwc txwc bash

All commands below run inside the container.

3. Set up the project Python environment

This repo uses a uv-managed virtual environment at .venv/ for the pipeline, dbt, and dashboard dependencies. The heavyweight scientific stack (Jupyter, PySpark, xgboost) stays in the base conda env and is only used by exploratory notebooks.

# One-time: create the venv from pyproject.toml + uv.lock
uv sync

# Optional: install dbt's package dependencies (dbt_utils, etc.)
cd omop && uv run dbt deps && cd ..

Everything Python-related in this project runs via uv run <command> so it picks up the venv automatically.

4. Load data

# Start with a small patient sample for fast iteration (~2 min)
uv run python load_data.py --sample_patients 500 --complex

# Or do a full load (all datasets, incremental for historical — takes longer)
uv run python load_data.py

# Other options:
uv run python load_data.py --dataset professional --time_period current  # specific claim type
uv run python load_data.py --report_only                                 # database summary only

Load reference data (optional, needed for some dbt models and dashboard features):

uv run python load_rxclass.py   # Drug classifications from RxNav (public, no key needed)
uv run python load_vsac.py      # Clinical value sets from VSAC (requires UMLS API key)
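As a sense check of what load_rxclass.py pulls, the RxClass lookup for a single RxNorm concept looks roughly like this. The endpoint shape follows the public RxNav RxClass REST API; the exact endpoints and parameters the loader uses are assumptions here:

```python
from urllib.parse import urlencode

# RxClass REST API: drug classes for one RxNorm concept (rxcui).
# URL shape is illustrative; see the RxNav documentation for the full API.
BASE = "https://rxnav.nlm.nih.gov/REST/rxclass/class/byRxcui.json"

def rxclass_url(rxcui: str, rela_source: str = "ATC") -> str:
    """Build the RxClass lookup URL for a given RxNorm concept id."""
    return f"{BASE}?{urlencode({'rxcui': rxcui, 'relaSource': rela_source})}"

print(rxclass_url("7052"))  # 7052 is morphine's RxNorm concept id
```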

5. Download OMOP Vocabularies

The dbt models require standard vocabulary files to map billing codes (ICD-10, NDC, HCPCS/CPT) to OMOP concepts. These are not included in the repo due to licensing.

  1. Register at Athena (athena.ohdsi.org)
  2. Click Download and select at minimum these vocabularies:
    • SNOMED — standard clinical concepts
    • ICD10CM — diagnosis codes
    • ICD9CM — legacy diagnosis codes
    • RxNorm / RxNorm Extension — drug concepts
    • HCPCS / CPT4 — procedure codes (CPT4 requires a separate UMLS license)
    • NDC — pharmacy drug codes
    • LOINC — lab/observation codes
  3. Download and extract the zip file
  4. Copy the CSV files into vocab/:
    cp /path/to/athena_download/*.csv vocab/
  5. Load into DuckDB:
    uv run python load_vocab.py

The required files are: CONCEPT.csv, CONCEPT_RELATIONSHIP.csv, CONCEPT_ANCESTOR.csv, DRUG_STRENGTH.csv. The remaining files in the Athena download (CONCEPT_SYNONYM, VOCABULARY, RELATIONSHIP, DOMAIN, CONCEPT_CLASS) are not used by the dbt models but can be loaded for reference.
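Conceptually, load_vocab.py just bulk-reads these files into DuckDB. One caveat worth knowing: Athena ships tab-delimited files despite the .csv extension. A sketch for one table (DuckDB syntax; the loader's actual options may differ):

```sql
-- Load one Athena vocabulary file into DuckDB.
-- Athena files are tab-delimited and unquoted despite the .csv extension.
CREATE SCHEMA IF NOT EXISTS vocab;
CREATE OR REPLACE TABLE vocab.concept AS
SELECT * FROM read_csv('vocab/CONCEPT.csv',
                       delim='\t', header=true, quote='');
```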

6. Transform with dbt

cd omop
uv run dbt deps
uv run dbt build   # runs + tests in topological order (recommended)
# or separately:
uv run dbt run     # build all OMOP CDM tables
uv run dbt test    # run data quality tests

7. Launch Dashboard

uv run streamlit run dashboard.py
# Opens on http://localhost:8501

The dashboard has 7 tabs:

| Tab | What it shows |
| --- | --- |
| Overview | Patient counts, visit trends, demographics, seasonality |
| Injury Profile | Body region analysis, injury-to-treatment delay, treatment episode duration, PT vs imaging pathway |
| Condition Intelligence | Top diagnoses, ICD-9/10 transition, co-occurrence, patient complexity |
| Cost & Payments | Charges vs. payments, procedure cost efficiency, insurance carriers |
| Rx & Opioid Monitor | Opioid prescribing trends, escalation tracker, era durations |
| Provider Analytics | Caseload concentration, top providers, network distribution |
| Geography | Cities, states, facility concentration, patient-to-facility ratios |

Toggle "Show explanations" in the sidebar for plain-language descriptions of each chart.


Project Structure

txwc/
├── pyproject.toml         # uv-managed project dependencies (dbt, dlt, duckdb, streamlit, ...)
├── uv.lock                # Fully pinned dependency lockfile (committed for reproducibility)
├── .venv/                 # Project virtualenv (gitignored — created by `uv sync`)
│
├── load_data.py           # Socrata API ingestion via dlt pipeline
├── load_rxclass.py        # RxNav drug classification loader
├── load_vsac.py           # VSAC FHIR value set loader
├── load_vocab.py          # OMOP vocabulary CSV loader
├── dashboard.py           # Streamlit analytics dashboard
├── Dockerfile             # Container build
│
├── .dlt/                  # dlt configuration
│   ├── config.toml        # Destination & runtime settings (tracked)
│   ├── secrets.toml       # API keys & credentials (gitignored)
│   └── secrets.toml.example  # Credential template
│
├── omop/                  # dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml
│   ├── models/
│   │   ├── staging/       # Raw data extraction (ephemeral)
│   │   ├── intermediate/  # Type casting & concept mapping (ephemeral)
│   │   └── final/         # OMOP CDM tables (materialized)
│   ├── macros/            # Reusable SQL (concept mapping, ID derivation)
│   └── tests/
│
├── vocab/                 # OMOP vocabulary CSVs from Athena (not tracked)
│
└── notebooks/             # Analysis notebooks
    ├── top10_analysis.ipynb
    ├── pdc.ipynb
    └── add_ccsr_value_sets.ipynb

Python environment model

This project uses two separate Python environments (plus ephemeral environments for MCP servers) to keep dependency closures small and avoid conflicts:

| Env | Manager | Location | What lives here | How to invoke |
| --- | --- | --- | --- | --- |
| Project venv | uv | .venv/ (from pyproject.toml + uv.lock) | dbt-core, dbt-duckdb, duckdb, dlt, streamlit, plotly, pandas, tqdm, python-dotenv | uv run <command> |
| Conda base env | conda | /opt/conda | Jupyter, PySpark, xgboost, and other heavyweight scientific libs used only by exploratory notebooks | /opt/conda/bin/python (or via notebook kernel) |
| MCP servers | uvx / uv run | ephemeral per-call | dbt-mcp, mcp-server-motherduck, dlt-workspace-mcp | Configured in Claude Code MCP settings |

Rules of thumb:

  • All pipeline, dbt, and dashboard Python runs via uv run. The venv owns the data path.
  • Notebooks can stay on the conda kernel for the scientific stack, or install the venv as a Jupyter kernel (uv run python -m ipykernel install --user --name txwc --display-name "TXWC (venv)") to get the same dlt/duckdb/dbt as the pipeline.
  • Never pip install into /opt/conda for pipeline packages — that's how dependency pins collide across unrelated tools. Add deps with uv add <pkg> instead.

Key Engineering Features

  • dlt pipelines — All loaders use dlt for schema management, type handling, and idempotent loads into DuckDB
  • Patient-cohort sampling — Development mode that selects N patients (random or by complexity score) and fetches only their complete records across all claim types
  • Full schema enforcement — Post-load step queries Socrata metadata API to ensure all columns exist even when sampling small patient cohorts
  • Concept mapping — Custom dbt macros map ICD-9/10 → SNOMED, NDC → RxNorm, HCPCS/CPT → standard concepts via OMOP vocabulary lookups
  • Hash-based patient identity — Deterministic person_id derived from xxhash64 of demographics for deduplication across claim types
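The deterministic-identity idea can be sketched in pure Python. The pipeline uses xxhash64; here hashlib's SHA-256 truncated to 64 bits stands in so the sketch needs no third-party package, and the demographic fields shown are assumptions:

```python
import hashlib

def derive_person_id(*demographics: str) -> int:
    """Deterministic 64-bit id from normalized demographic fields.

    Stand-in for the pipeline's xxhash64: the same inputs always yield
    the same id, so one patient collapses to one person_id across
    professional, institutional, and pharmacy claims.
    """
    key = "|".join(field.strip().lower() for field in demographics)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Same demographics -> same id, regardless of claim type or load order.
a = derive_person_id("1980-04-02", "F", "75201")
b = derive_person_id("1980-04-02 ", "f", "75201")  # whitespace/case normalized away
assert a == b
```

Normalizing each field before hashing is the important design choice: without it, trivial formatting differences between claim feeds would split one patient into several person_ids.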

Technologies

| Category | Tools |
| --- | --- |
| Language | Python 3.11 |
| Data Loading | dlt with secrets.toml credential management |
| Database | DuckDB |
| Transformation | dbt (dbt-duckdb) |
| Data Model | OMOP CDM v5.4 |
| APIs | Socrata Open Data, VSAC FHIR, RxNav REST, UMLS |
| Visualization | Streamlit, Plotly, Jupyter |
| Infrastructure | Docker, VS Code Dev Containers |

License

This project is licensed under the MIT License — see the LICENSE file for details.

The source data is public-use open data published by the Texas Department of Insurance, Division of Workers' Compensation.
