Clinical Document Intelligence Platform

AI-powered FDA drug label intelligence system for regulatory analysis, formulary decision support, and competitive intelligence.

Built by Erick K. Yegon, PhD — Director-level Data Scientist & Epidemiologist with 17+ years in global health analytics.

The Problem

Pharmaceutical HEOR teams, formulary committees, and regulatory analysts spend 40+ hours per drug manually reviewing FDA labels, comparing safety profiles across drug classes, and tracking label changes. This work is repetitive, error-prone, and doesn't scale.

The Solution

A production-grade RAG (Retrieval-Augmented Generation) system that:

Ingests 70,000+ FDA drug labels from the openFDA API and clinical trial data from ClinicalTrials.gov
Indexes content with section-aware chunking that never splits safety-critical sections (Black Box Warnings, Contraindications)
Retrieves relevant information using hybrid search: dense vector retrieval → metadata filtering → section priority boosting → MMR diversification → cross-encoder re-ranking
Generates citation-grounded answers with source traceability back to specific label sections
Compares drug labels across products for competitive intelligence and formulary decisions
Guards against PHI exposure, prompt injection, unsupported clinical claims, and hallucination

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        FastAPI REST API                              │
│              /query    /compare    /drugs    /health                 │
├────────────┬────────────────────────────────────┬───────────────────┤
│  INPUT     │        RAG PIPELINE                │    OUTPUT         │
│ GUARDRAILS │                                    │   GUARDRAILS      │
│            │  ┌──────────────────────────────┐  │                   │
│ • PHI      │  │  Query Rewrite (Module 6)    │  │ • Confidence      │
│   Detection│  │  ↓                           │  │   threshold       │
│ • Injection│  │  Hybrid Retrieval (Module 7) │  │ • Citation        │
│   Defense  │  │  • Dense vector search       │  │   validation      │
│ • Schema   │  │  • Metadata filtering        │  │ • Unsupported     │
│   Validate │  │  • Section priority boost    │  │   claim detection │
│            │  │  • MMR diversification       │  │ • Clinical        │
│            │  │  • Cross-encoder re-rank     │  │   disclaimer      │
│            │  │  ↓                           │  │                   │
│            │  │  Context Assembly (Module 8)  │  │                   │
│            │  │  ↓                           │  │                   │
│            │  │  LLM Generation (Module 3)   │  │                   │
│            │  │  • Provider switching         │  │                   │
│            │  │  • Cost tracking             │  │                   │
│            │  │  • Automatic fallback        │  │                   │
│            │  │  ↓                           │  │                   │
│            │  │  Citation Extraction         │  │                   │
│            │  └──────────────────────────────┘  │                   │
├────────────┴────────────────────────────────────┴───────────────────┤
│                     DATA LAYER                                       │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────────┐ │
│  │ openFDA API  │  │ ClinicalTrial│  │ ChromaDB Vector Store      │ │
│  │ 70K+ labels  │  │ .gov API V2  │  │ Section-aware chunks       │ │
│  │              │  │ 400K+ trials │  │ Metadata: drug, section,   │ │
│  │              │  │              │  │ therapeutic area, safety    │ │
│  └──────────────┘  └──────────────┘  └────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│                    EVALUATION (Module 10)                             │
│  Faithfulness | Relevance | Citation Accuracy | Safety Completeness  │
│  Latency | Token Cost | Error Rate                                   │
└─────────────────────────────────────────────────────────────────────┘

Key Features

1. Clinical Q&A with Citations

Ask questions about any FDA-approved drug and receive answers grounded in official label data with section-level citations.

Query: "What are the contraindications for empagliflozin?"

Answer: Empagliflozin (JARDIANCE) is contraindicated in patients with severe 
renal impairment (eGFR < 30 mL/min/1.73 m²), end-stage renal disease, or 
patients on dialysis. It is also contraindicated in patients with known 
hypersensitivity to empagliflozin or any excipients.
[Source: JARDIANCE | Contraindications | label_abc123]

2. Cross-Drug Comparison

Compare safety profiles, dosing, and contraindications across drugs in a class.

Query: "Compare black box warnings for SGLT2 inhibitors"
Drugs: ["JARDIANCE", "FARXIGA", "INVOKANA"]

3. Safety-First Retrieval

Black Box Warnings and Contraindications are never split during chunking and receive priority boosting during retrieval, ensuring safety-critical information always surfaces first.

4. Multi-Provider LLM Routing

Automatic provider switching across OpenAI, Groq, and AWS Bedrock with per-query cost tracking and fallback logic.

5. Production Guardrails

Input: PHI detection (SSN, MRN, DOB patterns), prompt injection defense, query sanitization
Output: Confidence thresholds, citation validation, unsupported clinical claim detection, mandatory disclaimers

Tech Stack

Layer	Technology	Purpose
API	FastAPI	Production REST API with OpenAPI docs
Frontend	Streamlit	Interactive clinical intelligence dashboard
LLM	OpenAI GPT-4o / Groq Llama 3.1 / AWS Bedrock	Multi-provider with fallback
Vector DB	ChromaDB	Section-aware document storage
Embeddings	all-MiniLM-L6-v2 (local) / text-embedding-3-small	Cost-tiered embedding
Re-ranking	cross-encoder/ms-marco-MiniLM (optional)	Precision refinement
Prompts	Jinja2 + YAML	Production-grade prompt management
Guardrails	Pydantic + custom validators	Input/output safety
Evaluation	Custom RAGAS-inspired metrics	Domain-specific clinical eval
Data	openFDA API, ClinicalTrials.gov API V2	Real U.S. regulatory data
Cache	Redis (optional)	Response and embedding caching
Deployment	Docker + AWS ECS/Fargate	Cloud-native containerization

Quick Start

1. Clone and Setup

git clone https://github.com/erickyegon/clinical-doc-intelligence.git
cd clinical-doc-intelligence

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your API keys (Groq free tier works for development)

2. Download FDA Drug Labels

# Download labels for key therapeutic classes (diabetes, cardiovascular, oncology)
python scripts/seed_data.py

# Or download a specific drug
python scripts/seed_data.py --drug "Ozempic"

3. Index into Vector Store

python scripts/ingest.py

# Verify indexing
python scripts/ingest.py --stats

4. Start the API

uvicorn src.api.main:app --reload

# API docs at http://localhost:8000/docs

5. Launch the Interactive Frontend

# In a second terminal (keep the API running)
streamlit run app.py

# Opens at http://localhost:8501

5. Query

# Clinical Q&A
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the contraindications for Jardiance?"}'

# Drug Comparison
curl -X POST http://localhost:8000/compare \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Compare adverse reactions",
    "drug_names": ["JARDIANCE", "FARXIGA"]
  }'

Docker

docker-compose up        # Development with hot reload
docker-compose --profile production up  # Production build

Testing

pytest tests/ -v

# With coverage
pytest tests/ --cov=src --cov-report=term-missing

Evaluation

The platform includes a built-in evaluation framework with domain-specific metrics:

Metric	Description	Target
Retrieval Precision	Correct drugs in retrieved docs	> 0.85
Faithfulness	Answer grounded in retrieved context	> 0.90
Citation Accuracy	All citations valid and traceable	> 0.95
Safety Completeness	Safety info included when relevant	> 0.95
Latency (p95)	End-to-end response time	< 5s
Cost per Query	Average LLM cost	< $0.03

Project Structure

clinical-doc-intel/
├── app.py                       # Streamlit interactive frontend
├── .streamlit/config.toml       # Streamlit theme & config
├── .github/workflows/ci.yml     # GitHub Actions CI/CD pipeline
├── Makefile                     # One-command operations
├── config/
│   ├── settings.py              # Central configuration
│   └── prompts.yaml             # Jinja2 prompt templates
├── src/
│   ├── ingestion/
│   │   ├── fda_labels.py        # openFDA API client
│   │   ├── clinical_trials.py   # ClinicalTrials.gov V2 client
│   │   └── pipeline.py          # Ingestion orchestration
│   ├── processing/
│   │   ├── chunker.py           # Section-aware chunking
│   │   └── metadata.py          # Metadata enrichment
│   ├── retrieval/
│   │   ├── vector_store.py      # ChromaDB management
│   │   └── hybrid_search.py     # Multi-stage retrieval + re-ranking
│   ├── generation/
│   │   └── rag_chain.py         # RAG pipeline with citations
│   ├── orchestration/
│   │   └── model_router.py      # Multi-provider LLM routing
│   ├── guardrails/
│   │   └── validators.py        # Input/output safety validation
│   ├── evaluation/
│   │   └── evaluator.py         # RAG evaluation framework
│   └── api/
│       └── main.py              # FastAPI application
├── scripts/
│   ├── seed_data.py             # Download FDA label data
│   └── ingest.py                # Index data into vector store
├── tests/
│   └── test_platform.py         # Comprehensive test suite
├── docs/adr/
│   ├── 001-vector-store-selection.md
│   └── 002-llm-provider-strategy.md
├── deployment/
│   └── (AWS ECS task definitions, CI/CD configs)
├── Dockerfile                   # Multi-stage production build
├── docker-compose.yml           # Local dev + production profiles
├── requirements.txt
├── .env.example
└── README.md

Course Module Coverage

This project implements concepts from across the Full Stack Generative AI BootCamp:

Module	Topic	Implementation
1	Foundations of GenAI	Embeddings, vector space, similarity
3	API for Accessing LLMs	Multi-provider router (OpenAI/Groq/Bedrock)
5	LLM Hosting & API	FastAPI endpoint exposure
6	Prompt Engineering	YAML prompt library, Jinja2 templates, CoT
7	RAG Systems	Full pipeline: ingest → chunk → embed → retrieve → generate
8	Advanced RAG	Query rewriting, re-ranking, caching, multimodal
10	Evaluation	Faithfulness, relevance, safety metrics, cost tracking
11	Guardrails	PHI detection, injection defense, output validation
12	MCP	Standardized tool interfaces (extensible)
16	E2E Deployment	Docker, FastAPI, cloud-native architecture

Related Projects

Medicare RAF Pipeline — Risk adjustment factor prediction for Medicare Advantage (ATT −$391/member)
RWE Evidence Synthesis Agent — Multi-agent system for systematic literature review automation (coming soon)
HEDIS/STARS Quality Agent — Automated care gap identification for managed care (coming soon)

License

MIT

Contact

Erick K. Yegon, PhD
Director-Level Data Science & Epidemiology
📧 keyegonaws@gmail.com
🔗 LinkedIn | GitHub | ORCID

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Clinical Document Intelligence Platform

The Problem

The Solution

Architecture

Key Features

1. Clinical Q&A with Citations

2. Cross-Drug Comparison

3. Safety-First Retrieval

4. Multi-Provider LLM Routing

5. Production Guardrails

Tech Stack

Quick Start

1. Clone and Setup

2. Download FDA Drug Labels

3. Index into Vector Store

4. Start the API

5. Launch the Interactive Frontend

5. Query

Docker

Testing

Evaluation

Project Structure

Course Module Coverage

Related Projects

License

Contact

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
.streamlit		.streamlit
config		config
data		data
deployment		deployment
docs/adr		docs/adr
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
app.py		app.py
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Clinical Document Intelligence Platform

The Problem

The Solution

Architecture

Key Features

1. Clinical Q&A with Citations

2. Cross-Drug Comparison

3. Safety-First Retrieval

4. Multi-Provider LLM Routing

5. Production Guardrails

Tech Stack

Quick Start

1. Clone and Setup

2. Download FDA Drug Labels

3. Index into Vector Store

4. Start the API

5. Launch the Interactive Frontend

5. Query

Docker

Testing

Evaluation

Project Structure

Course Module Coverage

Related Projects

License

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages