End-to-end data pipeline that extracts Texas workers' compensation medical billing data from the Texas Open Data Portal, transforms it through a 3-layer dbt model architecture, and loads it into the OMOP Common Data Model v5.4 using DuckDB.
The Texas Department of Insurance, Division of Workers' Compensation (DWC) publishes medical billing data for injured employees as open data. This project builds an automated pipeline to:
- Extract 12 datasets (professional, institutional, pharmacy claims) via the Socrata API
- Enrich with clinical reference data from RxNav, VSAC/FHIR, and OMOP vocabularies
- Transform raw billing records into 20+ standardized OMOP CDM tables using dbt
- Analyze with interactive dashboards and Jupyter notebooks
The OMOP Common Data Model is a universal standard for healthcare data. Just as USB-C lets any device connect to any charger, OMOP CDM lets healthcare data from any source be analyzed using the same tools and queries. Developed by the OHDSI community and used by 100+ institutions worldwide, it maps proprietary billing codes (ICD-9/10, NDC, HCPCS/CPT) into standardized clinical concepts (SNOMED, RxNorm).
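As a toy illustration of what this standardization buys (the ICD-10-CM → SNOMED pair below mirrors a real OMOP "Maps to" relationship, but the in-memory dict is a stand-in for the actual CONCEPT / CONCEPT_RELATIONSHIP tables):

```python
# Minimal sketch of an OMOP "Maps to" lookup: a source billing code is
# rewritten as a standard concept. A dict stands in for the vocabulary
# tables; real lookups join CONCEPT and CONCEPT_RELATIONSHIP in SQL.
MAPS_TO = {
    ("ICD10CM", "M54.5"): ("SNOMED", "Low back pain"),
}

def to_standard(vocabulary, code):
    """Return the standard concept for a source code, if one is mapped."""
    return MAPS_TO.get((vocabulary, code), ("None", "Unmapped"))

print(to_standard("ICD10CM", "M54.5"))  # ('SNOMED', 'Low back pain')
```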
```
data.texas.gov (Socrata API)
        │
        │  load_data.py ── dlt pipeline with rate limiting,
        │                  patient-cohort sampling
        ▼
┌─────────────────────────────────────────────┐
│  DuckDB (tx_workers_comp.db)                │
│                                             │
│  ┌─────────┐   dbt (48 SQL models)          │
│  │  raw.*  │ ────────────────────────────►  │
│  │ 12 tbls │   staging → intermediate →     │
│  └─────────┘   final                        │
│                            ┌──────────┐     │
│  Reference Data ──────────►│  omop.*  │     │
│  • OMOP Vocabularies       │ 20+ tbls │     │
│  • RxClass (RxNav API)     └──────────┘     │
│  • VSAC (FHIR API)                          │
└─────────────────────────────────────────────┘
        │
        ▼
Streamlit Dashboard & Jupyter Notebooks
```
| Source | Description | Loader |
|---|---|---|
| Texas Open Data Portal | 12 workers' comp medical billing datasets (professional, institutional, pharmacy — headers & details, current & historical) | load_data.py |
| RxNav / RxClass API | Drug classification data (EPC, ATC, MOA) for all RxNorm concepts | load_rxclass.py |
| VSAC FHIR API | Clinical value sets (SNOMED, ICD-10, LOINC, CPT groupings) | load_vsac.py |
| OMOP Vocabularies (Athena) | CONCEPT, CONCEPT_RELATIONSHIP, CONCEPT_ANCESTOR, DRUG_STRENGTH reference tables | load_vocab.py |
| Type | Form | Description |
|---|---|---|
| Professional (SV1) | CMS-1500 | Doctor/provider office visits |
| Institutional (SV2) | CMS-1450/UB-04 | Hospital/facility bills |
| Pharmacy (SV4) | DWC Form-066 | Prescription drug bills |
The dbt transformation layer maps raw billing data into these standardized clinical tables:
Clinical: person, visit_occurrence, visit_detail, condition_occurrence, procedure_occurrence, drug_exposure, measurement, observation, device_exposure, specimen, death
Derived: drug_era, condition_era, dose_era, observation_period
Administrative: cost, payer_plan_period, care_site, location, provider
Relationships: fact_relationship (links each procedure_occurrence to the condition_occurrence it was performed for, using the CMS-1500 Box 24E diagnosis pointer on professional claim lines)
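The diagnosis-pointer resolution can be sketched like this (field names and codes are illustrative, not the project's actual schema):

```python
# Hypothetical sketch: how a CMS-1500 Box 24E diagnosis pointer on a
# professional claim line links each procedure to the diagnosis it was
# performed for, which is what fact_relationship records.

# A claim with header-level diagnoses and 1-based line-level pointers,
# as on the CMS-1500 form.
claim = {
    "diagnoses": ["M54.5", "S33.5XXA"],           # Box 21, positions A-B
    "lines": [
        {"procedure": "97110", "dx_pointer": 1},  # therapeutic exercise -> M54.5
        {"procedure": "72148", "dx_pointer": 2},  # lumbar MRI -> S33.5XXA
    ],
}

def link_lines_to_diagnoses(claim):
    """Resolve each line's Box 24E pointer to the diagnosis it references."""
    links = []
    for line in claim["lines"]:
        dx = claim["diagnoses"][line["dx_pointer"] - 1]
        links.append((line["procedure"], dx))
    return links

print(link_lines_to_diagnoses(claim))
# [('97110', 'M54.5'), ('72148', 'S33.5XXA')]
```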
Several OMOP columns are populated with WC-specific semantics that a generic claims ETL would leave empty or guess at:
- Injury date observation (`observation_concept_id = 40771952`, LOINC "Injury date") — one row per patient, carrying `employee_date_of_injury` from the raw headers. Used to anchor `observation_period.observation_period_start_date` via `LEAST(injury_date, earliest_clinical_event)` so every clinical event stays OMOP-compliant.
- Employer identifier observation (`observation_concept_id = 21492865`, LOINC "Employer name [Identifier]") — stores each bill's `employer_fein` in `value_as_string` for employer-level cohort segmentation.
- WC-correct cost math — `paid_by_patient`, `paid_patient_copay`, `paid_patient_coinsurance`, `paid_patient_deductible` are all hardcoded to `0` (workers' comp is no-fault; no patient responsibility). `amount_allowed = total_charge - service_adjustment_amount` surfaces ~$4M of contractual write-offs that were previously invisible.
- Injury → first-treatment cohort analytics via the injury-date observation.
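A minimal sketch of these cost rules, using column names from the OMOP cost table and made-up dollar amounts:

```python
# Illustrative sketch of the WC-specific cost semantics described above.
# Column names follow OMOP's cost table; the dollar values are invented.

def wc_cost_row(total_charge, service_adjustment_amount, total_paid):
    """Build an OMOP cost record with workers' comp semantics."""
    return {
        "total_charge": total_charge,
        "total_paid": total_paid,
        # Workers' comp is no-fault: the patient never owes anything.
        "paid_by_patient": 0,
        "paid_patient_copay": 0,
        "paid_patient_coinsurance": 0,
        "paid_patient_deductible": 0,
        # The contractual write-off becomes visible as charge minus adjustment.
        "amount_allowed": total_charge - service_adjustment_amount,
    }

row = wc_cost_row(total_charge=250.00, service_adjustment_amount=90.00, total_paid=160.00)
print(row["amount_allowed"])  # 160.0
```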
- Docker
- A Socrata app token (register here)
- (Optional) UMLS API key for VSAC/RxClass enrichment
- (Required for OMOP) OMOP vocabulary files from Athena — see below
```bash
git clone https://github.com/saywurdson/txwc.git
cd txwc

# Copy the secrets template
cp .dlt/secrets.toml.example .dlt/secrets.toml
```

Edit `.dlt/secrets.toml` with your API keys:
```toml
[sources.txwc]
application_token = "your_socrata_app_token"

[sources.vsac]
api_key = "your_umls_api_key"
```

Note: `.dlt/secrets.toml` is gitignored and should never be committed. The database destination is configured in `.dlt/config.toml` (defaults to `tx_workers_comp.db`).
```bash
# Build the image
docker build -t txwc .

# Run interactively (mounts the project so data and config persist)
docker run -it -p 8501:8501 -v "$(pwd):/workspaces/txwc" -w /workspaces/txwc txwc bash
```

All commands below run inside the container.
This repo uses a uv-managed virtual environment at .venv/ for the pipeline, dbt, and dashboard dependencies. The heavyweight scientific stack (Jupyter, PySpark, xgboost) stays in the base conda env and is only used by exploratory notebooks.
```bash
# One-time: create the venv from pyproject.toml + uv.lock
uv sync

# Optional: install dbt's package dependencies (dbt_utils, etc.)
cd omop && uv run dbt deps && cd ..
```

Everything Python-related in this project runs via `uv run <command>` so it picks up the venv automatically.
```bash
# Start with a small patient sample for fast iteration (~2 min)
uv run python load_data.py --sample_patients 500 --complex

# Or do a full load (all datasets, incremental for historical — takes longer)
uv run python load_data.py

# Other options:
uv run python load_data.py --dataset professional --time_period current  # specific claim type
uv run python load_data.py --sample_patients 500 --complex               # most complex patients
uv run python load_data.py --report_only                                 # database summary only
```

Load reference data (optional, needed for some dbt models and dashboard features):
```bash
uv run python load_rxclass.py  # Drug classifications from RxNav (public, no key needed)
uv run python load_vsac.py     # Clinical value sets from VSAC (requires UMLS API key)
```

The dbt models require standard vocabulary files to map billing codes (ICD-10, NDC, HCPCS/CPT) to OMOP concepts. These are not included in the repo due to licensing.
- Register at Athena (athena.ohdsi.org)
- Click Download and select at minimum these vocabularies:
- SNOMED — standard clinical concepts
- ICD10CM — diagnosis codes
- ICD9CM — legacy diagnosis codes
- RxNorm / RxNorm Extension — drug concepts
- HCPCS / CPT4 — procedure codes (CPT4 requires a separate UMLS license)
- NDC — pharmacy drug codes
- LOINC — lab/observation codes
- Download and extract the zip file
- Copy the CSV files into `vocab/`: `cp /path/to/athena_download/*.csv vocab/`
- Load into DuckDB: `uv run python load_vocab.py`
The required files are `CONCEPT.csv`, `CONCEPT_RELATIONSHIP.csv`, `CONCEPT_ANCESTOR.csv`, and `DRUG_STRENGTH.csv`. The remaining files in the Athena download (`CONCEPT_SYNONYM`, `VOCABULARY`, `RELATIONSHIP`, `DOMAIN`, `CONCEPT_CLASS`) are not used by the dbt models but can be loaded for reference.
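For reference, a hedged sketch of the kind of SQL a loader like `load_vocab.py` might generate (assumptions: Athena exports are tab-delimited despite the `.csv` extension, and DuckDB's `read_csv_auto` handles ingestion; the project's actual loader may differ):

```python
# Sketch: generate one CREATE TABLE AS statement per Athena vocabulary
# file. The file names match the required OMOP vocabulary files; the
# loading strategy itself is an assumption, not the project's code.

VOCAB_FILES = ["CONCEPT", "CONCEPT_RELATIONSHIP", "CONCEPT_ANCESTOR", "DRUG_STRENGTH"]

def build_load_sql(vocab_dir="vocab"):
    """Build DuckDB statements that bulk-load each vocabulary CSV."""
    statements = []
    for name in VOCAB_FILES:
        statements.append(
            f"CREATE OR REPLACE TABLE {name.lower()} AS "
            f"SELECT * FROM read_csv_auto('{vocab_dir}/{name}.csv', delim='\\t', quote='')"
        )
    return statements

for sql in build_load_sql():
    print(sql)
```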
```bash
cd omop
uv run dbt deps
uv run dbt build   # runs + tests in topological order (recommended)

# or separately:
uv run dbt run     # build all OMOP CDM tables
uv run dbt test    # run data quality tests
```

```bash
uv run streamlit run dashboard.py
# Opens on http://localhost:8501
```

The dashboard has 7 tabs:
| Tab | What it shows |
|---|---|
| Overview | Patient counts, visit trends, demographics, seasonality |
| Injury Profile | Body region analysis, injury-to-treatment delay, treatment episode duration, PT vs imaging pathway |
| Condition Intelligence | Top diagnoses, ICD-9/10 transition, co-occurrence, patient complexity |
| Cost & Payments | Charges vs. payments, procedure cost efficiency, insurance carriers |
| Rx & Opioid Monitor | Opioid prescribing trends, escalation tracker, era durations |
| Provider Analytics | Caseload concentration, top providers, network distribution |
| Geography | Cities, states, facility concentration, patient-to-facility ratios |
Toggle "Show explanations" in the sidebar for plain-language descriptions of each chart.
```
txwc/
├── pyproject.toml               # uv-managed project dependencies (dbt, dlt, duckdb, streamlit, ...)
├── uv.lock                      # Fully pinned dependency lockfile (committed for reproducibility)
├── .venv/                       # Project virtualenv (gitignored — created by `uv sync`)
│
├── load_data.py                 # Socrata API ingestion via dlt pipeline
├── load_rxclass.py              # RxNav drug classification loader
├── load_vsac.py                 # VSAC FHIR value set loader
├── load_vocab.py                # OMOP vocabulary CSV loader
├── dashboard.py                 # Streamlit analytics dashboard
├── Dockerfile                   # Container build
│
├── .dlt/                        # dlt configuration
│   ├── config.toml              # Destination & runtime settings (tracked)
│   ├── secrets.toml             # API keys & credentials (gitignored)
│   └── secrets.toml.example     # Credential template
│
├── omop/                        # dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml
│   ├── models/
│   │   ├── staging/             # Raw data extraction (ephemeral)
│   │   ├── intermediate/        # Type casting & concept mapping (ephemeral)
│   │   └── final/               # OMOP CDM tables (materialized)
│   ├── macros/                  # Reusable SQL (concept mapping, ID derivation)
│   └── tests/
│
├── vocab/                       # OMOP vocabulary CSVs from Athena (not tracked)
│
└── notebooks/                   # Analysis notebooks
    ├── top10_analysis.ipynb
    ├── pdc.ipynb
    └── add_ccsr_value_sets.ipynb
```
This project uses two separate Python environments to keep dependency closures small and avoid conflicts:
| Env | Manager | Location | What lives here | How to invoke |
|---|---|---|---|---|
| Project venv | uv | `.venv/` (from `pyproject.toml` + `uv.lock`) | dbt-core, dbt-duckdb, duckdb, dlt, streamlit, plotly, pandas, tqdm, python-dotenv | `uv run <command>` |
| Conda base env | conda | `/opt/conda` | Jupyter, PySpark, xgboost, and other heavyweight scientific libs used only by exploratory notebooks | `/opt/conda/bin/python` (or via notebook kernel) |
| MCP servers | `uvx` / `uv run` | ephemeral per-call | dbt-mcp, mcp-server-motherduck, dlt-workspace-mcp | Configured in Claude Code MCP settings |
Rules of thumb:
- All pipeline, dbt, and dashboard Python runs via `uv run`. The venv owns the data path.
- Notebooks can stay on the conda kernel for the scientific stack, or install the venv as a Jupyter kernel (`uv run python -m ipykernel install --user --name txwc --display-name "TXWC (venv)"`) to get the same `dlt`/`duckdb`/`dbt` as the pipeline.
- Never `pip install` into `/opt/conda` for pipeline packages — that's how dependency pins collide across unrelated tools. Add deps with `uv add <pkg>` instead.
- dlt pipelines — All loaders use dlt for schema management, type handling, and idempotent loads into DuckDB
- Patient-cohort sampling — Development mode that selects N patients (random or by complexity score) and fetches only their complete records across all claim types
- Full schema enforcement — Post-load step queries Socrata metadata API to ensure all columns exist even when sampling small patient cohorts
- Concept mapping — Custom dbt macros map ICD-9/10 → SNOMED, NDC → RxNorm, HCPCS/CPT → standard concepts via OMOP vocabulary lookups
- Hash-based patient identity — Deterministic `person_id` derived from `xxhash64` of demographics for deduplication across claim types
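The identity hashing above can be sketched as follows. The project uses `xxhash64`; `hashlib.sha256` truncated to 64 bits stands in here purely to keep the sketch free of third-party dependencies, and the field names are illustrative:

```python
import hashlib

def person_id_from_demographics(*fields):
    """Deterministic 64-bit person_id from normalized demographic fields.

    Identical demographics always hash to the same person_id, so the same
    patient deduplicates across professional, institutional, and pharmacy
    claims. (Stand-in: sha256 truncated to 64 bits instead of xxhash64.)
    """
    key = "|".join(str(f).strip().lower() for f in fields)
    digest = hashlib.sha256(key.encode()).digest()
    # Mask to 63 bits so the value fits a signed BIGINT surrogate key.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

a = person_id_from_demographics("1978-04", "M", "75201")
b = person_id_from_demographics("1978-04", "M", "75201")
assert a == b  # same demographics on different claim types -> same person_id
```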
| Category | Tools |
|---|---|
| Language | Python 3.11 |
| Data Loading | dlt with secrets.toml credential management |
| Database | DuckDB |
| Transformation | dbt (dbt-duckdb) |
| Data Model | OMOP CDM v5.4 |
| APIs | Socrata Open Data, VSAC FHIR, RxNav REST, UMLS |
| Visualization | Streamlit, Plotly, Jupyter |
| Infrastructure | Docker, VS Code Dev Containers |
This project is licensed under the MIT License — see the LICENSE file for details.
The source data is public-use open data published by the Texas Department of Insurance, Division of Workers' Compensation.