Skip to content

VivaanGupta17/trial-predictor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TrialPredictor: ML-Driven Clinical Trial Outcome Prediction

Python 3.10+ License: MIT Code style: black Tests Data: ClinicalTrials.gov Models: XGBoost+TabNet


Overview

TrialPredictor is a machine learning system that estimates the probability of clinical trial success before a trial begins — combining compound properties, trial design choices, indication-specific history, and sponsor track records into a unified probabilistic model.

The system addresses a core challenge in pharmaceutical R&D: 90% of drug candidates that enter clinical trials fail, yet most portfolio decisions are still made with limited data and high subjectivity. By grounding go/no-go decisions in quantitative predictions, TrialPredictor enables:

  • Portfolio prioritization: Allocate R&D capital toward trials with highest predicted probability of success
  • Trial design optimization: Identify design parameters (enrollment size, endpoint type, duration) most predictive of success
  • Risk stratification: Quantify uncertainty to differentiate high-confidence from speculative programs
  • eNPV modeling: Integrate predictions into expected net present value calculations for asset valuation

The project covers the full pipeline: data collection from public sources (ClinicalTrials.gov, DrugBank, PubChem), feature engineering, model training, calibrated probability output, survival analysis for trial timelines, and a portfolio simulation engine that translates ML predictions into R&D value.


Clinical Motivation

Drug development is one of the most capital-intensive processes in any industry:

Stage Average Cost Duration Success Rate
Preclinical $1–5M 3–6 years
Phase I $10–30M 1–2 years ~60% → Phase II
Phase II $30–100M 2–3 years ~35% → Phase III
Phase III $100–500M 3–5 years ~55% → NDA/BLA
FDA Review ~$10M 1–2 years ~85% approval

Overall Phase I → Approval: ~12–14%

The Tufts Center for the Study of Drug Development estimates the fully-loaded cost of bringing a new drug to market at $2.6 billion, largely driven by the cost of failure. A predictive model that improves Phase II → Phase III transition rates by even 5 percentage points can generate hundreds of millions in preserved capital annually for a large pharma company.

Failure modes are not random. They cluster around:

  • Safety signals not predicted by preclinical data (~30% of failures)
  • Insufficient efficacy in broader patient populations (~55% of failures)
  • Trial design flaws (underpowering, wrong endpoint, enrollment failure) (~15%)
  • Commercial/strategic withdrawal (competitive landscape, changing priority)

TrialPredictor explicitly models each failure mode and generates interpretable features that clinical teams can act on.


Architecture

clinicaltrials.gov ─┐
DrugBank ──────────┤──▶ Feature Builder ──▶ Gradient Boosting ─┐
PubChem ───────────┤                   ──▶ Neural (TabNet)    ──┤──▶ Calibrated P(success)
FDA Drug Labels ───┘                   ──▶ Survival (CoxPH)  ──┘
                                                                │
                                              Portfolio Simulator ──▶ eNPV / Decision Analysis

Model Performance

Evaluated on held-out trials from 2018–2023 (train on 2000–2017), with temporal validation to prevent look-ahead leakage:

Binary Success/Failure Classification

Model AUROC AUPRC Brier Score Calibration ECE
XGBoost (tuned) 0.791 0.682 0.178 0.041
LightGBM 0.784 0.671 0.183 0.048
CatBoost 0.779 0.665 0.186 0.053
TabNet 0.763 0.651 0.195 0.062
Logistic Regression 0.711 0.598 0.214 0.087

Phase-Specific Performance (XGBoost)

Phase Transition AUROC N (test)
Phase I → II 0.734 1,842
Phase II → III 0.812 2,105
Phase III → Approval 0.778 892

Survival Analysis (Trial Timeline)

Model C-Index Mean Abs. Error (months)
DeepSurv 0.714 8.3
Cox PH 0.698 9.7
Kaplan-Meier (baseline) 0.500 14.2

Portfolio Simulation (eNPV)

Using model predictions to guide a simulated 20-asset portfolio vs. random selection over 1000 bootstrap runs:

Strategy Mean eNPV ($M) 95% CI Improvement vs. Random
Model-guided (top quartile) $847M [$612M, $1,091M] +38%
Random selection $614M [$401M, $826M] baseline
Industry benchmark (historical) $721M [$533M, $912M] +17%

Top Predictive Features (SHAP Analysis)

Rank Feature SHAP Importance Direction
1 Sponsor historical success rate (indication) 0.142 Positive
2 Lipinski violations 0.118 Negative
3 Phase II prior data available 0.097 Positive
4 Orphan drug designation 0.089 Positive
5 Enrollment size (log) 0.076 Positive (up to ~500)
6 Number of primary endpoints 0.071 Negative (>2 hurts)
7 Mechanism-of-action validation score 0.068 Positive
8 Indication competitive density 0.064 Negative
9 Trial duration (months) 0.058 Inverted-U
10 Molecular weight 0.052 Negative (>600 Da)

Data Sources

Source Access Content Update Frequency
ClinicalTrials.gov Free, public API Trial metadata, results, interventions Daily
AACT Database Free registration Full relational DB of ClinicalTrials.gov Monthly snapshots
DrugBank Academic license Drug properties, targets, mechanisms Quarterly
PubChem Free, public API Molecular structures, physicochemical properties Real-time
FDA Drug Approvals Free, public Approved drugs, indication, PDUFA dates Continuous

See docs/DATA_SOURCES.md for schema details and data quality notes.


Installation

# Clone repository
git clone https://github.com/yourusername/trial-predictor.git
cd trial-predictor

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .

Optional: AACT Database (PostgreSQL)

For full-scale training, download the AACT database:

# Download latest AACT snapshot (requires free registration)
# https://aact.ctti-clinicaltrials.org/snapshots
psql -d aact -f /path/to/aact_snapshot.dmp

Quick Start

# 1. Fetch trial data (uses ClinicalTrials.gov API by default)
python scripts/fetch_data.py --source api --phases 2 3 --output data/raw/

# 2. Build features
python scripts/fetch_data.py --build-features --input data/raw/ --output data/processed/

# 3. Train models
python scripts/train.py --config configs/trial_config.yaml --model xgboost

# 4. Evaluate
python scripts/evaluate.py --model-path models/xgboost_best.pkl --test-data data/processed/test.parquet

# 5. Run portfolio simulation
python scripts/analyze.py --mode portfolio --model-path models/xgboost_best.pkl

Project Structure

trial-predictor/
├── src/
│   ├── data/
│   │   ├── clinicaltrials_fetcher.py   # ClinicalTrials.gov API + AACT
│   │   ├── drugbank_loader.py          # DrugBank drug property extraction
│   │   └── feature_builder.py         # Feature engineering pipeline
│   ├── models/
│   │   ├── gradient_boosting.py        # XGBoost / LightGBM / CatBoost
│   │   ├── neural_trial.py             # TabNet with entity embeddings
│   │   └── survival_model.py          # DeepSurv / Cox PH
│   ├── evaluation/
│   │   ├── trial_metrics.py           # AUROC, calibration, decision analysis
│   │   └── portfolio_simulator.py     # eNPV / portfolio optimization
│   └── analysis/
│       ├── failure_analyzer.py        # Failure mode clustering
│       └── indication_profiler.py     # Therapeutic area analysis
├── configs/
│   └── trial_config.yaml             # Experiment configuration
├── scripts/
│   ├── fetch_data.py                 # Data collection entry point
│   ├── train.py                      # Model training entry point
│   ├── evaluate.py                   # Evaluation entry point
│   └── analyze.py                   # Analysis entry point
├── docs/
│   ├── DATA_SOURCES.md               # Data source documentation
│   └── PHARMA_CONTEXT.md             # Drug development pipeline context
├── tests/                            # Unit and integration tests
├── notebooks/                        # Exploratory analysis
├── requirements.txt
├── setup.py
└── README.md

Pharmaceutical Context

This project is built with a pharmaceutical R&D mindset, not just an ML mindset. Key design decisions:

Temporal validation: All models are validated on future trials to prevent look-ahead leakage — mimicking real deployment where you predict trials before they complete.

Calibration priority: In portfolio decisions, calibrated probabilities matter more than raw discrimination. A model that says "70% success" should be right 70% of the time. We enforce calibration via isotonic regression and Platt scaling.

Phase-specific models: Phase II and Phase III have fundamentally different failure modes. We train separate models per phase transition rather than forcing one model to generalize across phases.

Interpretability: SHAP values are computed for every prediction. Clinical decision-makers need to understand why a trial is predicted to succeed or fail — not just the score.

Regulatory awareness: The system is explicitly framed as a decision-support tool, not a clinical decision tool. All documentation reflects FDA guidance on the appropriate use of AI/ML in drug development.

See docs/PHARMA_CONTEXT.md for a full drug development pipeline overview.


Limitations

  • Label noise: "Completed" trials may still fail to achieve approval; outcome labels are proxy measures
  • Publication bias: Successful trials are more likely to publish results, biasing the training signal
  • Indication shifts: Novel indications (e.g., first-in-class mechanisms) have limited historical comparators
  • External validity: Models trained on publicly registered trials may not generalize to internal proprietary trials with different documentation standards
  • Regulatory changes: FDA guidance evolves; models may need retraining after major policy shifts

Citation

@software{trialpredictor2024,
  title  = {TrialPredictor: ML-Driven Clinical Trial Outcome Prediction},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/yourusername/trial-predictor}
}

License

MIT License — see LICENSE for details.

This project uses publicly available data from ClinicalTrials.gov (public domain) and DrugBank (academic license required for commercial use). DrugBank data must not be redistributed without a license from Wishart Lab.

About

ML-Driven Clinical Trial Outcome Prediction — predicting Phase II/III success from compound features, trial design, and sponsor history with portfolio NPV simulation

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors