TrialPredictor is a machine learning system that estimates the probability of clinical trial success before a trial begins — combining compound properties, trial design choices, indication-specific history, and sponsor track records into a unified probabilistic model.
The system addresses a core challenge in pharmaceutical R&D: 90% of drug candidates that enter clinical trials fail, yet most portfolio decisions are still made with limited data and high subjectivity. By grounding go/no-go decisions in quantitative predictions, TrialPredictor enables:
- Portfolio prioritization: Allocate R&D capital toward trials with highest predicted probability of success
- Trial design optimization: Identify design parameters (enrollment size, endpoint type, duration) most predictive of success
- Risk stratification: Quantify uncertainty to differentiate high-confidence from speculative programs
- eNPV modeling: Integrate predictions into expected net present value calculations for asset valuation
The project covers the full pipeline: data collection from public sources (ClinicalTrials.gov, DrugBank, PubChem), feature engineering, model training, calibrated probability output, survival analysis for trial timelines, and a portfolio simulation engine that translates ML predictions into R&D value.
Drug development is one of the most capital-intensive processes in any industry:
| Stage | Average Cost | Duration | Success Rate |
|---|---|---|---|
| Preclinical | $1–5M | 3–6 years | — |
| Phase I | $10–30M | 1–2 years | ~60% → Phase II |
| Phase II | $30–100M | 2–3 years | ~35% → Phase III |
| Phase III | $100–500M | 3–5 years | ~55% → NDA/BLA |
| FDA Review | ~$10M | 1–2 years | ~85% approval |
Overall Phase I → Approval: ~10–14%, depending on the source (the stage rates above compound to 0.60 × 0.35 × 0.55 × 0.85 ≈ 0.10)
The Tufts Center for the Study of Drug Development estimates the fully loaded cost of bringing a new drug to market at $2.6 billion, a figure driven largely by the cost of failures. A predictive model that improves the Phase II → Phase III transition rate by even 5 percentage points could preserve hundreds of millions of dollars in R&D capital annually for a large pharma company.
Failure modes are not random. They cluster around:
- Safety signals not predicted by preclinical data (~30% of failures)
- Insufficient efficacy in broader patient populations (~55% of failures)
- Trial design flaws (underpowering, wrong endpoint, enrollment failure) (~15%)
- Commercial/strategic withdrawal (competitive landscape, changing priorities); reported shares vary by source and overlap with the categories above
TrialPredictor explicitly models each failure mode and generates interpretable features that clinical teams can act on.
```
ClinicalTrials.gov ─┐
DrugBank ───────────┤                      ┌──▶ Gradient Boosting ─┐
PubChem ────────────┼──▶ Feature Builder ──┼──▶ Neural (TabNet) ───┼──▶ Calibrated P(success)
FDA Drug Labels ────┘                      └──▶ Survival (CoxPH) ──┘             │
                                                                                 ▼
                                                Portfolio Simulator ──▶ eNPV / Decision Analysis
```
Models are trained on trials from 2000–2017 and evaluated on held-out trials from 2018–2023, with temporal validation to prevent look-ahead leakage (a sketch of the ECE metric follows the table):
| Model | AUROC | AUPRC | Brier Score | Calibration ECE |
|---|---|---|---|---|
| XGBoost (tuned) | 0.791 | 0.682 | 0.178 | 0.041 |
| LightGBM | 0.784 | 0.671 | 0.183 | 0.048 |
| CatBoost | 0.779 | 0.665 | 0.186 | 0.053 |
| TabNet | 0.763 | 0.651 | 0.195 | 0.062 |
| Logistic Regression | 0.711 | 0.598 | 0.214 | 0.087 |
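The Calibration ECE column above is the expected calibration error: the sample-weighted gap between predicted probability and observed frequency across probability bins. A minimal sketch with equal-width bins (illustrative; the exact binning in `trial_metrics.py` may differ):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE: sample-weighted gap between predicted probability and observed frequency."""
    y_true, y_prob = np.asarray(y_true, dtype=float), np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bins[i], bins[i + 1]
        # Last bin is closed on the right so p = 1.0 is counted
        mask = (y_prob >= lo) & (y_prob < hi) if i < n_bins - 1 else (y_prob >= lo)
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

print(expected_calibration_error([0, 1, 1, 0, 1], [0.2, 0.9, 0.7, 0.1, 0.8]))
```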
Performance by phase transition:

| Phase Transition | AUROC | N (test) |
|---|---|---|
| Phase I → II | 0.734 | 1,842 |
| Phase II → III | 0.812 | 2,105 |
| Phase III → Approval | 0.778 | 892 |
Trial timeline prediction (survival models; a Cox PH sketch follows the table):

| Model | C-Index | Mean Abs. Error (months) |
|---|---|---|
| DeepSurv | 0.714 | 8.3 |
| Cox PH | 0.698 | 9.7 |
| Kaplan-Meier (baseline) | 0.500 | 14.2 |
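As an illustration of the survival baseline, a minimal Cox PH fit using the `lifelines` library on synthetic timeline data (column names are hypothetical; `survival_model.py` may be structured differently):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Toy trial-timeline data: duration in months, event = 1 if the trial reached its endpoint
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "duration_months": rng.exponential(24, 200),
    "event": rng.integers(0, 2, 200),
    "enrollment_log": rng.normal(5.5, 1.0, 200),
    "n_sites": rng.integers(1, 60, 200),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration_months", event_col="event")
print(cph.concordance_index_)  # training-set C-index
```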
Using model predictions to guide a simulated 20-asset portfolio vs. random selection over 1,000 bootstrap runs (a simplified eNPV sketch follows the table):
| Strategy | Mean eNPV ($M) | 95% CI | Improvement vs. Random |
|---|---|---|---|
| Model-guided (top quartile) | $847M | [$612M, $1,091M] | +38% |
| Random selection | $614M | [$401M, $826M] | baseline |
| Industry benchmark (historical) | $721M | [$533M, $912M] | +17% |
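The eNPV arithmetic itself is simple once calibrated probabilities exist: risk-adjust each stage's cash flow by the model's probability of reaching that stage, then discount. A single-asset sketch with illustrative numbers (not the `portfolio_simulator.py` API):

```python
# Hypothetical single-asset eNPV. All cash flows and probabilities are illustrative;
# in the real pipeline the stage probabilities come from the calibrated model.
stages = [  # (years_from_now, cash_flow_$M, P(reaching this stage))
    (0, -50, 1.00),    # Phase II cost, already committed
    (3, -250, 0.42),   # Phase III cost, paid only if Phase II succeeds
    (7, 1800, 0.25),   # payoff NPV if approved
]
discount_rate = 0.10

enpv = sum(p * cf / (1 + discount_rate) ** t for t, cf, p in stages)
print(f"eNPV: ${enpv:.0f}M")  # ≈ $102M with these inputs
```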
Top 10 features by SHAP importance:

| Rank | Feature | SHAP Importance | Direction |
|---|---|---|---|
| 1 | Sponsor historical success rate (indication) | 0.142 | Positive |
| 2 | Lipinski violations | 0.118 | Negative |
| 3 | Phase II prior data available | 0.097 | Positive |
| 4 | Orphan drug designation | 0.089 | Positive |
| 5 | Enrollment size (log) | 0.076 | Positive (up to ~500) |
| 6 | Number of primary endpoints | 0.071 | Negative (>2 hurts) |
| 7 | Mechanism-of-action validation score | 0.068 | Positive |
| 8 | Indication competitive density | 0.064 | Negative |
| 9 | Trial duration (months) | 0.058 | Inverted-U |
| 10 | Molecular weight | 0.052 | Negative (>600 Da) |
Data sources (a minimal fetch sketch follows the table):

| Source | Access | Content | Update Frequency |
|---|---|---|---|
| ClinicalTrials.gov | Free, public API | Trial metadata, results, interventions | Daily |
| AACT Database | Free registration | Full relational DB of ClinicalTrials.gov | Monthly snapshots |
| DrugBank | Academic license | Drug properties, targets, mechanisms | Quarterly |
| PubChem | Free, public API | Molecular structures, physicochemical properties | Real-time |
| FDA Drug Approvals | Free, public | Approved drugs, indication, PDUFA dates | Continuous |
See docs/DATA_SOURCES.md for schema details and data quality notes.
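For orientation, a minimal pull from the ClinicalTrials.gov v2 API using `requests` (parameter and field names should be checked against the current API documentation; `fetch_data.py` wraps this logic):

```python
import requests

# ClinicalTrials.gov API v2; condition query and page size are illustrative
resp = requests.get(
    "https://clinicaltrials.gov/api/v2/studies",
    params={"query.cond": "non-small cell lung cancer", "pageSize": 50},
    timeout=30,
)
resp.raise_for_status()
for study in resp.json().get("studies", []):
    ident = study["protocolSection"]["identificationModule"]
    print(ident["nctId"], ident.get("briefTitle", ""))
```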
```bash
# Clone repository
git clone https://github.com/yourusername/trial-predictor.git
cd trial-predictor

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .
```

For full-scale training, download the AACT database:
```bash
# Download latest AACT snapshot (requires free registration)
# https://aact.ctti-clinicaltrials.org/snapshots
# AACT snapshots are Postgres custom-format dumps, so restore with pg_restore
pg_restore --no-owner -d aact /path/to/aact_snapshot.dmp
```

```bash
# 1. Fetch trial data (uses ClinicalTrials.gov API by default)
python scripts/fetch_data.py --source api --phases 2 3 --output data/raw/

# 2. Build features
python scripts/fetch_data.py --build-features --input data/raw/ --output data/processed/

# 3. Train models
python scripts/train.py --config configs/trial_config.yaml --model xgboost

# 4. Evaluate
python scripts/evaluate.py --model-path models/xgboost_best.pkl --test-data data/processed/test.parquet

# 5. Run portfolio simulation
python scripts/analyze.py --mode portfolio --model-path models/xgboost_best.pkl
```

```
trial-predictor/
├── src/
│ ├── data/
│ │ ├── clinicaltrials_fetcher.py # ClinicalTrials.gov API + AACT
│ │ ├── drugbank_loader.py # DrugBank drug property extraction
│ │ └── feature_builder.py # Feature engineering pipeline
│ ├── models/
│ │ ├── gradient_boosting.py # XGBoost / LightGBM / CatBoost
│ │ ├── neural_trial.py # TabNet with entity embeddings
│ │ └── survival_model.py # DeepSurv / Cox PH
│ ├── evaluation/
│ │ ├── trial_metrics.py # AUROC, calibration, decision analysis
│ │ └── portfolio_simulator.py # eNPV / portfolio optimization
│ └── analysis/
│ ├── failure_analyzer.py # Failure mode clustering
│ └── indication_profiler.py # Therapeutic area analysis
├── configs/
│ └── trial_config.yaml # Experiment configuration
├── scripts/
│ ├── fetch_data.py # Data collection entry point
│ ├── train.py # Model training entry point
│ ├── evaluate.py # Evaluation entry point
│ └── analyze.py # Analysis entry point
├── docs/
│ ├── DATA_SOURCES.md # Data source documentation
│ └── PHARMA_CONTEXT.md # Drug development pipeline context
├── tests/ # Unit and integration tests
├── notebooks/ # Exploratory analysis
├── requirements.txt
├── setup.py
└── README.md
```
This project is built with a pharmaceutical R&D mindset, not just an ML mindset. Key design decisions:
Temporal validation: All models are validated on future trials to prevent look-ahead leakage — mimicking real deployment where you predict trials before they complete.
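A sketch of the idea on a toy frame (the `start_year` column name is hypothetical):

```python
import pandas as pd

# Toy frame; the real pipeline reads from data/processed/
trials = pd.DataFrame({
    "start_year": [2012, 2016, 2019, 2021],
    "succeeded": [1, 0, 1, 0],
})

# Fit strictly on the past, evaluate on later trials: no look-ahead
train = trials[trials["start_year"] <= 2017]
test = trials[trials["start_year"] >= 2018]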
Calibration priority: In portfolio decisions, calibrated probabilities matter more than raw discrimination. A model that says "70% success" should be right 70% of the time. We enforce calibration via isotonic regression and Platt scaling.
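A minimal scikit-learn sketch of the calibration step on synthetic data (the project's actual wrapper may differ; method="sigmoid" gives Platt scaling):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

# Synthetic stand-in for the trial feature matrix and success labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.integers(0, 2, size=500)

base = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
# Cross-validated isotonic calibration; swap method="sigmoid" for Platt scaling
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X, y)
p_success = calibrated.predict_proba(X)[:, 1]  # calibrated P(success)
```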
Phase-specific models: Phase II and Phase III have fundamentally different failure modes. We train separate models per phase transition rather than forcing one model to generalize across phases.
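Concretely, this can be as simple as keying models by transition. A toy sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Toy stand-in; real features come from feature_builder.py
rng = np.random.default_rng(0)
trials = pd.DataFrame({
    "transition": ["phase2_to_3"] * 100 + ["phase3_to_approval"] * 100,
    "enrollment_log": rng.normal(5.5, 1.0, 200),
    "n_endpoints": rng.integers(1, 4, 200),
    "succeeded": rng.integers(0, 2, 200),
})
FEATURES = ["enrollment_log", "n_endpoints"]

# One model per phase transition rather than a single pooled model
models = {}
for transition, subset in trials.groupby("transition"):
    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(subset[FEATURES], subset["succeeded"])
    models[transition] = model
```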
Interpretability: SHAP values are computed for every prediction. Clinical decision-makers need to understand why a trial is predicted to succeed or fail — not just the score.
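For tree models this is cheap via `shap.TreeExplainer`. A self-contained sketch on a toy model (feature names are hypothetical):

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Toy model standing in for the trained XGBoost predictor
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["sponsor_rate", "mol_weight", "enroll_log"])
y = rng.integers(0, 2, 200)
model = XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one contribution per feature per trial

# The first trial's explanation: which features push P(success) up or down
print(dict(zip(X.columns, shap_values[0])))
```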
Regulatory awareness: The system is explicitly framed as a decision-support tool, not a clinical decision tool. All documentation reflects FDA guidance on the appropriate use of AI/ML in drug development.
See docs/PHARMA_CONTEXT.md for a full drug development pipeline overview.
- Label noise: "Completed" trials may still fail to achieve approval; outcome labels are proxy measures
- Publication bias: Successful trials are more likely to publish results, biasing the training signal
- Indication shifts: Novel indications (e.g., first-in-class mechanisms) have limited historical comparators
- External validity: Models trained on publicly registered trials may not generalize to internal proprietary trials with different documentation standards
- Regulatory changes: FDA guidance evolves; models may need retraining after major policy shifts
```bibtex
@software{trialpredictor2024,
  title  = {TrialPredictor: ML-Driven Clinical Trial Outcome Prediction},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/yourusername/trial-predictor}
}
```

MIT License — see LICENSE for details.
This project uses publicly available data from ClinicalTrials.gov (public domain) and DrugBank (academic license required for commercial use). DrugBank data must not be redistributed without a license from Wishart Lab.