Clinical Study Report (CSR) LLM Extraction & Summarization
Python + Groq LLaMA 3.1 | End-to-End Data Extraction Pipeline
This project demonstrates how Large Language Models (LLMs) can extract structured information and generate clinically relevant summaries from CSR-like free text. It replicates real-world tasks used in clinical development, pharmacometrics, regulatory writing, and AI-assisted medical data workflows.
This is a portfolio project that showcases skills relevant to roles such as Data Scientist (AI/ML) and LLM Engineer at companies such as Certara, JAX, Genentech, and Pfizer.
What This Tool Does
Given an input text file (e.g., sample_csr.txt), the pipeline:
1. Extracts structured clinical trial fields
The LLM parses CSR narrative text into machine-friendly JSON:
trial_id
title
indication
phase
sample_size
arms_and_treatments
primary_endpoint
key_results
serious_adverse_events
sponsor
location
This mirrors real extraction tasks needed for:
Clinical trial registries
Pharmacovigilance
Medical writing automation
Data standardization for modeling (PK/PD, survival, etc.)
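The extraction step described above can be sketched as a field schema plus a prompt builder. This is a minimal illustration under stated assumptions, not the project's actual code: the names `FIELDS` and `build_extraction_prompt` are hypothetical, and the prompt wording is invented for the example. Only the field names come from the list in this README.

```python
# Hypothetical sketch: a field schema and a prompt builder for the extraction step.
# FIELDS mirrors the field list in this README; the prompt wording is an assumption.
FIELDS = (
    "trial_id",
    "title",
    "indication",
    "phase",
    "sample_size",
    "arms_and_treatments",
    "primary_endpoint",
    "key_results",
    "serious_adverse_events",
    "sponsor",
    "location",
)


def build_extraction_prompt(csr_text: str, fields: tuple[str, ...] = FIELDS) -> str:
    """Return a prompt asking the model for one JSON object with exactly these keys."""
    keys = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the clinical study report text.\n"
        "Respond with a single JSON object containing exactly these keys.\n"
        'Use "Not provided" for any field the text does not state.\n\n'
        f"Fields:\n{keys}\n\n"
        f"Report text:\n{csr_text}\n"
    )


if __name__ == "__main__":
    print(build_extraction_prompt("A Phase II study of Drug X ..."))
```

Keeping the schema in one place means the prompt and any later validation step can share the same list of keys.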
2. Generates dual-layer summaries
The system produces:
Plain-Language Summary: accessible to non-scientific readers such as patients and caregivers.
Technical Summary for Clinicians: uses clinical terminology and clinically meaningful endpoints (Mayo score, remission, AEs, etc.).
These summaries are useful for:
Study synopses
CSR-to-protocol automation
Patient engagement documents
Internal medical review
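One way to produce the two summary layers is to send the same text through two differently worded prompts. The sketch below assumes that approach. The names `SUMMARY_STYLES` and `summarize`, the dictionary keys, and the prompt wording are all hypothetical, and the LLM call is passed in as a plain function so the example runs without an API key.

```python
from typing import Callable

# Hypothetical prompt instructions, one per summary layer; the wording is an assumption.
SUMMARY_STYLES = {
    "plain_language": (
        "Summarize this clinical study report for patients and caregivers. "
        "Avoid jargon and explain any medical terms."
    ),
    "technical": (
        "Summarize this clinical study report for clinicians. "
        "Use clinical terminology and report endpoints and adverse events."
    ),
}


def summarize(csr_text: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Generate one summary per style by calling `complete` once per prompt."""
    summaries = {}
    for style, instruction in SUMMARY_STYLES.items():
        prompt = f"{instruction}\n\nReport text:\n{csr_text}\n"
        summaries[style] = complete(prompt).strip()
    return summaries


if __name__ == "__main__":
    # Stand-in for a real LLM call so the sketch runs offline.
    def fake_complete(prompt: str) -> str:
        return f"[summary for a {len(prompt)}-character prompt]"

    print(summarize("Drug X was compared with placebo ...", fake_complete))
```

Injecting `complete` keeps the summary logic independent of the provider, so a real client wrapper can be swapped in later.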
Example Output
{
  "structured": {
    "trial_id": "Not provided",
    "title": "Evaluating Drug X in Adults with Moderate to Severe Ulcerative Colitis",
    "indication": "Ulcerative colitis",
    "phase": "Phase II",
    "sample_size": "120 patients",
    "arms_and_treatments": "Drug X 100 mg once daily vs. placebo",
    "primary_endpoint": "Clinical remission at week 12 based on the Mayo score",
    "key_results": "45% remission with Drug X vs. 20% with placebo",
    "serious_adverse_events": "Mild to moderate headache and nausea",
    "sponsor": "Example Pharma",
    "location": "20 sites in the US and Europe"
  },
  "summaries": {
    "summary": "..."
  }
}
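Model replies often wrap the JSON in extra prose, so a post-processing step is usually needed before output like the above can be trusted. The following is a minimal sketch of such a step, assuming the reply contains one JSON object. The function name `parse_structured_reply` is hypothetical. The "Not provided" default follows the `trial_id` value in the example output.

```python
import json

# Field names taken from this README's field list.
EXPECTED_FIELDS = (
    "trial_id", "title", "indication", "phase", "sample_size",
    "arms_and_treatments", "primary_endpoint", "key_results",
    "serious_adverse_events", "sponsor", "location",
)


def parse_structured_reply(reply: str,
                           fields: tuple[str, ...] = EXPECTED_FIELDS,
                           missing: str = "Not provided") -> dict:
    """Pull the outermost JSON object out of a model reply and normalize its keys."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    data = json.loads(reply[start:end + 1])
    # Keep only the expected keys, in schema order, filling gaps with the default.
    return {name: data.get(name, missing) for name in fields}


if __name__ == "__main__":
    raw = 'Here is the result: {"title": "Drug X study", "phase": "Phase II"} Done.'
    print(json.dumps(parse_structured_reply(raw), indent=2))
```

Dropping unexpected keys and filling missing ones keeps the output shape stable even when the model drifts from the requested schema.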
Project Architecture
csr_project/
├── data/
│   └── sample_csr.txt
├── src/
│   ├── main.py                # CLI orchestrator
│   ├── llm_client.py          # Groq API client wrapper
│   ├── extract_structured.py  # JSON field extraction
│   ├── summarize_csr.py       # Summary generation
│   ├── prompts.py             # Prompt templates
│   ├── config.py              # Field schema
│   └── __init__.py
├── README.md
├── requirements.txt
├── .gitignore
└── .env                       # NOT committed
Tech Stack
Component | Details
Language | Python 3.12
LLM Provider | Groq (ultra-fast inference)
Model | llama-3.1-8b-instant
CLI | argparse
Environment | python-dotenv, virtualenv
This stack closely matches real AI/ML workflows used in biotech and pharma.
Installation
1. Clone the repo
git clone
cd csr_project
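2. Create a virtual environment and install dependencies
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
The activation command above is for Windows. On macOS or Linux, use: source .venv/bin/activate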
3. Set up your .env file
Create a file named .env in the project root:
GROQ_API_KEY=your-groq-key-here
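Reading the key at startup and failing early with a clear message tends to save debugging time. The sketch below shows one way to do that. The function name `get_groq_api_key` and the error text are hypothetical. `load_dotenv` is the python-dotenv call that reads `.env` into the process environment, and the import is guarded so the example still runs where the package is not installed.

```python
import os

try:
    # python-dotenv is listed in the Tech Stack; load_dotenv() reads .env into os.environ.
    from dotenv import load_dotenv
except ImportError:  # keeps this sketch runnable without the package
    load_dotenv = None


def get_groq_api_key(environ=None) -> str:
    """Return GROQ_API_KEY, or raise if it is unset or still the placeholder value."""
    env = os.environ if environ is None else environ
    key = env.get("GROQ_API_KEY", "").strip()
    if not key or key == "your-groq-key-here":
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env in the project root")
    return key


if __name__ == "__main__":
    if load_dotenv is not None:
        load_dotenv()
    try:
        print("key loaded, length:", len(get_groq_api_key()))
    except RuntimeError as err:
        print(err)
```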
4. Run the pipeline
python -m src.main --input data\sample_csr.txt
This prints the combined structured fields and summaries to the terminal.
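For readers curious how a command like this could be wired up, here is a minimal argparse sketch. It is an assumption about the shape of the orchestrator, not the contents of `src/main.py`: the function names `build_parser`, `run`, and `main` are hypothetical, and the extraction and summary steps are replaced by placeholders. The `--input` flag and the `structured` / `summaries` keys come from this README.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Define the command-line interface with a single required --input path."""
    parser = argparse.ArgumentParser(
        description="Extract structured fields and summaries from a CSR text file."
    )
    parser.add_argument("--input", required=True, type=Path,
                        help="path to the CSR text file")
    return parser


def run(csr_text: str, extract, summarize) -> str:
    """Combine both pipeline stages into one JSON document."""
    result = {
        "structured": extract(csr_text),
        "summaries": summarize(csr_text),
    }
    return json.dumps(result, indent=2, ensure_ascii=False)


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    csr_text = args.input.read_text(encoding="utf-8")
    # Placeholder stages; a real pipeline would call the LLM-backed functions here.
    print(run(
        csr_text,
        extract=lambda text: {"title": "Not provided"},
        summarize=lambda text: {"summary": "..."},
    ))
    return 0
```

Calling `main()` under an `if __name__ == "__main__":` guard would connect this to the command line. Passing the stages into `run` as functions keeps the orchestration testable without network access.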
Goals of This Project
Demonstrate applied LLM engineering on biomedical text
Show ability to build end-to-end data extraction pipelines
Implement prompt design, JSON post-processing, and error handling
Build a lightweight clinical NLP tool from scratch
Provide a real-world example aligned with clinical data workflows used in:
modeling & simulation
medical writing
regulatory submissions
pharmacovigilance
data standardization
Next Extensions (optional future improvements)
Add R/Shiny or Streamlit UI
Add PDF ingestion
Add automatic schema validation
Add vector search / RAG for large CSRs
Add evaluation metrics (BLEU, ROUGE-L, JSON correctness)
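As an illustration of what a "JSON correctness" metric might look like, the sketch below scores extracted fields against a hand-labeled reference. This is one possible reading of the idea, not a planned design: the function name `field_accuracy` and the normalization rule (lowercase, collapsed whitespace) are assumptions.

```python
def field_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields whose predicted value matches after normalization."""
    if not reference:
        raise ValueError("reference must contain at least one field")

    def norm(value) -> str:
        # Lowercase and collapse whitespace so trivial formatting differences do not count.
        return " ".join(str(value).lower().split())

    hits = sum(
        1
        for key, expected in reference.items()
        if key in predicted and norm(predicted[key]) == norm(expected)
    )
    return hits / len(reference)


if __name__ == "__main__":
    reference = {"phase": "Phase II", "sponsor": "Example Pharma"}
    predicted = {"phase": "phase  ii", "sponsor": "Another Sponsor"}
    print(field_accuracy(predicted, reference))
```

Exact matching is strict for free-text fields such as `key_results`. A text-overlap metric like ROUGE-L may suit those better.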
Contact
Author: Kibrom M. Alula
LinkedIn: https://www.linkedin.com/in/kibrom-m-alula/
GitHub: https://github.com/KayMan2025