End-to-end data pipeline that extracts Texas workers' compensation medical billing data from the Texas Open Data Portal, transforms it through a 3-layer dbt model architecture, and loads it into the OMOP Common Data Model v5.4 using DuckDB.
The Texas Department of Insurance, Division of Workers' Compensation (DWC) publishes medical billing data for injured employees as open data. This project builds an automated pipeline to:
- Extract 12 datasets (professional, institutional, pharmacy claims) via the Socrata API
- Enrich with clinical reference data from RxNav, VSAC/FHIR, and OMOP vocabularies
- Transform raw billing records into 20+ standardized OMOP CDM tables using dbt
- Analyze with interactive dashboards and Jupyter notebooks
The OMOP Common Data Model is a universal standard for healthcare data. Just as USB-C lets any device connect to any charger, OMOP CDM lets healthcare data from any source be analyzed using the same tools and queries. Developed by the OHDSI community and used by 100+ institutions worldwide, it maps proprietary billing codes (ICD-9/10, NDC, HCPCS/CPT) into standardized clinical concepts (SNOMED, RxNorm).
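As a toy illustration of what this standardization buys (the ICD-10-CM → SNOMED pair below mirrors a real OMOP "Maps to" relationship, but the in-memory dict is a stand-in for the actual CONCEPT / CONCEPT_RELATIONSHIP tables):

```python
# Minimal sketch of an OMOP "Maps to" lookup: a source billing code is
# rewritten as a standard concept. A dict stands in for the vocabulary
# tables; real lookups join CONCEPT and CONCEPT_RELATIONSHIP in SQL.
MAPS_TO = {
    ("ICD10CM", "M54.5"): ("SNOMED", "Low back pain"),
}

def to_standard(vocabulary, code):
    """Return the standard concept for a source code, if one is mapped."""
    return MAPS_TO.get((vocabulary, code), ("None", "Unmapped"))

print(to_standard("ICD10CM", "M54.5"))  # ('SNOMED', 'Low back pain')
```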
```
data.texas.gov (Socrata API)
        │
        │  load_data.py ── dlt pipeline with rate limiting,
        │                  patient-cohort sampling
        ▼
┌─────────────────────────────────────────────┐
│  DuckDB (tx_workers_comp.db)                │
│                                             │
│  ┌─────────┐   dbt (48 SQL models)          │
│  │  raw.*  │ ────────────────────────────►  │
│  │ 12 tbls │   staging → intermediate →     │
│  └─────────┘   final                        │
│                            ┌──────────┐     │
│  Reference Data ──────────►│  omop.*  │     │
│  • OMOP Vocabularies       │ 20+ tbls │     │
│  • RxClass (RxNav API)     └──────────┘     │
│  • VSAC (FHIR API)                          │
└─────────────────────────────────────────────┘
        │
        ▼
Streamlit Dashboard & Jupyter Notebooks
```
| Source | Description | Loader |
|---|---|---|
| Texas Open Data Portal | 12 workers' comp medical billing datasets (professional, institutional, pharmacy — headers & details, current & historical) | load_data.py |
| RxNav / RxClass API | Drug classification data (EPC, ATC, MOA) for all RxNorm concepts | load_rxclass.py |
| VSAC FHIR API | Clinical value sets (SNOMED, ICD-10, LOINC, CPT groupings) | load_vsac.py |
| OMOP Vocabularies (Athena) | CONCEPT, CONCEPT_RELATIONSHIP, CONCEPT_ANCESTOR, DRUG_STRENGTH reference tables | load_vocab.py |
| Type | Form | Description |
|---|---|---|
| Professional (SV1) | CMS-1500 | Doctor/provider office visits |
| Institutional (SV2) | CMS-1450/UB-04 | Hospital/facility bills |
| Pharmacy (SV4) | DWC Form-066 | Prescription drug bills |
The dbt transformation layer maps raw billing data into these standardized clinical tables:
Clinical: person, visit_occurrence, visit_detail, condition_occurrence, procedure_occurrence, drug_exposure, measurement, observation, device_exposure, specimen, death
Derived: drug_era, condition_era, dose_era, observation_period
Administrative: cost, payer_plan_period, care_site, location, provider
Relationships: fact_relationship (links each procedure_occurrence to the condition_occurrence it was performed for, using the CMS-1500 Box 24E diagnosis pointer on professional claim lines)
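The diagnosis-pointer resolution can be sketched like this (field names and codes are illustrative, not the project's actual schema):

```python
# Hypothetical sketch: how a CMS-1500 Box 24E diagnosis pointer on a
# professional claim line links each procedure to the diagnosis it was
# performed for, which is what fact_relationship records.

# A claim with header-level diagnoses and 1-based line-level pointers,
# as on the CMS-1500 form.
claim = {
    "diagnoses": ["M54.5", "S33.5XXA"],           # Box 21, positions A-B
    "lines": [
        {"procedure": "97110", "dx_pointer": 1},  # therapeutic exercise -> M54.5
        {"procedure": "72148", "dx_pointer": 2},  # lumbar MRI -> S33.5XXA
    ],
}

def link_lines_to_diagnoses(claim):
    """Resolve each line's Box 24E pointer to the diagnosis it references."""
    links = []
    for line in claim["lines"]:
        dx = claim["diagnoses"][line["dx_pointer"] - 1]
        links.append((line["procedure"], dx))
    return links

print(link_lines_to_diagnoses(claim))
# [('97110', 'M54.5'), ('72148', 'S33.5XXA')]
```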
Several OMOP columns are populated with WC-specific semantics that a generic claims ETL would leave empty or guess at:
- Injury date observation (`observation_concept_id = 40771952`, LOINC "Injury date") — one row per patient, carrying `employee_date_of_injury` from the raw headers. Used to anchor `observation_period.observation_period_start_date` via `LEAST(injury_date, earliest_clinical_event)` so every clinical event stays OMOP-compliant.
- Employer identifier observation (`observation_concept_id = 21492865`, LOINC "Employer name [Identifier]") — stores each bill's `employer_fein` in `value_as_string` for employer-level cohort segmentation.
- WC-correct cost math — `paid_by_patient`, `paid_patient_copay`, `paid_patient_coinsurance`, `paid_patient_deductible` are all hardcoded to `0` (workers' comp is no-fault; no patient responsibility). `amount_allowed = total_charge - service_adjustment_amount` surfaces ~$4M of contractual write-offs that were previously invisible.
- Injury → first-treatment cohort analytics via the injury-date observation.
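A minimal sketch of these cost rules, using column names from the OMOP cost table and made-up dollar amounts:

```python
# Illustrative sketch of the WC-specific cost semantics described above.
# Column names follow OMOP's cost table; the dollar values are invented.

def wc_cost_row(total_charge, service_adjustment_amount, total_paid):
    """Build an OMOP cost record with workers' comp semantics."""
    return {
        "total_charge": total_charge,
        "total_paid": total_paid,
        # Workers' comp is no-fault: the patient never owes anything.
        "paid_by_patient": 0,
        "paid_patient_copay": 0,
        "paid_patient_coinsurance": 0,
        "paid_patient_deductible": 0,
        # The contractual write-off becomes visible as charge minus adjustment.
        "amount_allowed": total_charge - service_adjustment_amount,
    }

row = wc_cost_row(total_charge=250.00, service_adjustment_amount=90.00, total_paid=160.00)
print(row["amount_allowed"])  # 160.0
```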
- Docker
- A Socrata app token (register here)
- (Optional) UMLS API key for VSAC/RxClass enrichment
- (Required for OMOP) OMOP vocabulary files from Athena — see below
```bash
git clone https://github.com/saywurdson/txwc.git
cd txwc

# Copy the secrets template
cp .dlt/secrets.toml.example .dlt/secrets.toml
```

Edit `.dlt/secrets.toml` with your API keys:
```toml
[sources.txwc]
application_token = "your_socrata_app_token"

[sources.vsac]
api_key = "your_umls_api_key"
```

Note: `.dlt/secrets.toml` is gitignored and should never be committed. The database destination is configured in `.dlt/config.toml` (defaults to `tx_workers_comp.db`).
```bash
# Build the image
docker build -t txwc .

# Run interactively (mounts the project so data and config persist)
docker run -it -p 8501:8501 -v "$(pwd):/workspaces/txwc" -w /workspaces/txwc txwc bash
```

All commands below run inside the container.
This repo uses a uv-managed virtual environment at .venv/ for the pipeline, dbt, and dashboard dependencies. The heavyweight scientific stack (Jupyter, PySpark, xgboost) stays in the base conda env and is only used by exploratory notebooks.
```bash
# One-time: create the venv from pyproject.toml + uv.lock
uv sync

# Optional: install dbt's package dependencies (dbt_utils, etc.)
cd omop && uv run dbt deps && cd ..
```

Everything Python-related in this project runs via `uv run <command>` so it picks up the venv automatically.
```bash
# Start with a small patient sample for fast iteration (~2 min)
uv run python load_data.py --sample_patients 500 --complex

# Or do a full load (all datasets, incremental for historical — takes longer)
uv run python load_data.py

# Other options:
uv run python load_data.py --dataset professional --time_period current  # specific claim type
uv run python load_data.py --sample_patients 500 --complex               # most complex patients
uv run python load_data.py --report_only                                 # database summary only
```

Load reference data (optional, needed for some dbt models and dashboard features):
```bash
uv run python load_rxclass.py  # Drug classifications from RxNav (public, no key needed)
uv run python load_vsac.py     # Clinical value sets from VSAC (requires UMLS API key)
```

The dbt models require standard vocabulary files to map billing codes (ICD-10, NDC, HCPCS/CPT) to OMOP concepts. These are not included in the repo due to licensing.
- Register at Athena (athena.ohdsi.org)
- Click Download and select at minimum these vocabularies:
- SNOMED — standard clinical concepts
- ICD10CM — diagnosis codes
- ICD9CM — legacy diagnosis codes
- RxNorm / RxNorm Extension — drug concepts
- HCPCS / CPT4 — procedure codes (CPT4 requires a separate UMLS license)
- NDC — pharmacy drug codes
- LOINC — lab/observation codes
- Download and extract the zip file
- Copy the CSV files into `vocab/`: `cp /path/to/athena_download/*.csv vocab/`
- Load into DuckDB: `uv run python load_vocab.py`
The required files are `CONCEPT.csv`, `CONCEPT_RELATIONSHIP.csv`, `CONCEPT_ANCESTOR.csv`, and `DRUG_STRENGTH.csv`. The remaining files in the Athena download (`CONCEPT_SYNONYM`, `VOCABULARY`, `RELATIONSHIP`, `DOMAIN`, `CONCEPT_CLASS`) are not used by the dbt models but can be loaded for reference.
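For reference, a hedged sketch of the kind of SQL a loader like `load_vocab.py` might generate (assumptions: Athena exports are tab-delimited despite the `.csv` extension, and DuckDB's `read_csv_auto` handles ingestion; the project's actual loader may differ):

```python
# Sketch: generate one CREATE TABLE AS statement per Athena vocabulary
# file. The file names match the required OMOP vocabulary files; the
# loading strategy itself is an assumption, not the project's code.

VOCAB_FILES = ["CONCEPT", "CONCEPT_RELATIONSHIP", "CONCEPT_ANCESTOR", "DRUG_STRENGTH"]

def build_load_sql(vocab_dir="vocab"):
    """Build DuckDB statements that bulk-load each vocabulary CSV."""
    statements = []
    for name in VOCAB_FILES:
        statements.append(
            f"CREATE OR REPLACE TABLE {name.lower()} AS "
            f"SELECT * FROM read_csv_auto('{vocab_dir}/{name}.csv', delim='\\t', quote='')"
        )
    return statements

for sql in build_load_sql():
    print(sql)
```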
```bash
cd omop
uv run dbt deps
uv run dbt build   # runs + tests in topological order (recommended)

# or separately:
uv run dbt run     # build all OMOP CDM tables
uv run dbt test    # run data quality tests
```

```bash
uv run streamlit run dashboard.py
# Opens on http://localhost:8501
```

The dashboard has 7 tabs:
| Tab | What it shows |
|---|---|
| Overview | Patient counts, visit trends, demographics, seasonality |
| Injury Profile | Body region analysis, injury-to-treatment delay, treatment episode duration, PT vs imaging pathway |
| Condition Intelligence | Top diagnoses, ICD-9/10 transition, co-occurrence, patient complexity |
| Cost & Payments | Charges vs. payments, procedure cost efficiency, insurance carriers |
| Rx & Opioid Monitor | Opioid prescribing trends, escalation tracker, era durations |
| Provider Analytics | Caseload concentration, top providers, network distribution |
| Geography | Cities, states, facility concentration, patient-to-facility ratios |
Toggle "Show explanations" in the sidebar for plain-language descriptions of each chart.
```
txwc/
├── pyproject.toml               # uv-managed project dependencies (dbt, dlt, duckdb, streamlit, ...)
├── uv.lock                      # Fully pinned dependency lockfile (committed for reproducibility)
├── .venv/                       # Project virtualenv (gitignored — created by `uv sync`)
│
├── load_data.py                 # Socrata API ingestion via dlt pipeline
├── load_rxclass.py              # RxNav drug classification loader
├── load_vsac.py                 # VSAC FHIR value set loader
├── load_vocab.py                # OMOP vocabulary CSV loader
├── dashboard.py                 # Streamlit analytics dashboard
├── Dockerfile                   # Container build
│
├── .dlt/                        # dlt configuration
│   ├── config.toml              # Destination & runtime settings (tracked)
│   ├── secrets.toml             # API keys & credentials (gitignored)
│   └── secrets.toml.example     # Credential template
│
├── omop/                        # dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml
│   ├── models/
│   │   ├── staging/             # Raw data extraction (ephemeral)
│   │   ├── intermediate/        # Type casting & concept mapping (ephemeral)
│   │   └── final/               # OMOP CDM tables (materialized)
│   ├── macros/                  # Reusable SQL (concept mapping, ID derivation)
│   └── tests/
│
├── vocab/                       # OMOP vocabulary CSVs from Athena (not tracked)
│
└── notebooks/                   # Analysis notebooks
    ├── top10_analysis.ipynb
    ├── pdc.ipynb
    └── add_ccsr_value_sets.ipynb
```
This project uses two separate Python environments to keep dependency closures small and avoid conflicts:
| Env | Manager | Location | What lives here | How to invoke |
|---|---|---|---|---|
| Project venv | uv | `.venv/` (from `pyproject.toml` + `uv.lock`) | dbt-core, dbt-duckdb, duckdb, dlt, streamlit, plotly, pandas, tqdm, python-dotenv | `uv run <command>` |
| Conda base env | conda | `/opt/conda` | Jupyter, PySpark, xgboost, and other heavyweight scientific libs used only by exploratory notebooks | `/opt/conda/bin/python` (or via notebook kernel) |
| MCP servers | `uvx` / `uv run` | ephemeral per-call | dbt-mcp, mcp-server-motherduck, dlt-workspace-mcp | Configured in Claude Code MCP settings |
Rules of thumb:
- All pipeline, dbt, and dashboard Python runs via `uv run`. The venv owns the data path.
- Notebooks can stay on the conda kernel for the scientific stack, or install the venv as a Jupyter kernel (`uv run python -m ipykernel install --user --name txwc --display-name "TXWC (venv)"`) to get the same `dlt`/`duckdb`/`dbt` as the pipeline.
- Never `pip install` into `/opt/conda` for pipeline packages — that's how dependency pins collide across unrelated tools. Add deps with `uv add <pkg>` instead.
- dlt pipelines — All loaders use dlt for schema management, type handling, and idempotent loads into DuckDB
- Patient-cohort sampling — Development mode that selects N patients (random or by complexity score) and fetches only their complete records across all claim types
- Full schema enforcement — Post-load step queries Socrata metadata API to ensure all columns exist even when sampling small patient cohorts
- Concept mapping — Custom dbt macros map ICD-9/10 → SNOMED, NDC → RxNorm, HCPCS/CPT → standard concepts via OMOP vocabulary lookups
- Hash-based patient identity — Deterministic `person_id` derived from `xxhash64` of demographics for deduplication across claim types
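The identity hashing above can be sketched as follows. The project uses `xxhash64`; `hashlib.sha256` truncated to 64 bits stands in here purely to keep the sketch free of third-party dependencies, and the field names are illustrative:

```python
import hashlib

def person_id_from_demographics(*fields):
    """Deterministic 64-bit person_id from normalized demographic fields.

    Identical demographics always hash to the same person_id, so the same
    patient deduplicates across professional, institutional, and pharmacy
    claims. (Stand-in: sha256 truncated to 64 bits instead of xxhash64.)
    """
    key = "|".join(str(f).strip().lower() for f in fields)
    digest = hashlib.sha256(key.encode()).digest()
    # Mask to 63 bits so the value fits a signed BIGINT surrogate key.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

a = person_id_from_demographics("1978-04", "M", "75201")
b = person_id_from_demographics("1978-04", "M", "75201")
assert a == b  # same demographics on different claim types -> same person_id
```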
| Category | Tools |
|---|---|
| Language | Python 3.11 |
| Data Loading | dlt with secrets.toml credential management |
| Database | DuckDB |
| Transformation | dbt (dbt-duckdb) |
| Data Model | OMOP CDM v5.4 |
| APIs | Socrata Open Data, VSAC FHIR, RxNav REST, UMLS |
| Visualization | Streamlit, Plotly, Jupyter |
| Infrastructure | Docker, VS Code Dev Containers |
This project is licensed under the MIT License — see the LICENSE file for details.
The source data is public-use open data published by the Texas Department of Insurance, Division of Workers' Compensation.