Skip to content

KayMan2025/csr-llm-extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Python Model Status License

πŸ“„ Clinical Study Report (CSR) LLM Extraction & Summarization

Python + Groq LLaMA 3.1 | End-to-End Data Extraction Pipeline

This project demonstrates how Large Language Models (LLMs) can extract structured information and generate clinically relevant summaries from CSR-like free text. It replicates real-world tasks used in clinical development, pharmacometrics, regulatory writing, and AI-assisted medical data workflows.

This is a portfolio project created to showcase skills relevant to roles such as Data Scientist (AI/ML), LLM Engineer, and positions at companies like Certara, JAX, Genentech, Pfizer, etc.

πŸš€ What This Tool Does

Given an input text file (e.g., sample_csr.txt), the pipeline:

βœ”οΈ 1. Extracts structured clinical trial fields

The LLM parses CSR narrative text into machine-friendly JSON:

trial_id

title

indication

phase

sample_size

arms_and_treatments

primary_endpoint

key_results

serious_adverse_events

sponsor

location

This mirrors real extraction tasks needed for:

Clinical trial registries

Pharmacovigilance

Medical writing automation

Data standardization for modeling (PK/PD, survival, etc.)

βœ”οΈ 2. Generates dual-layer summaries

The system produces:

πŸ”Ή Plain-Language Summary

β€” Accessible to non-scientific readers, patients, caregivers.

πŸ”Ή Technical Summary for Clinicians

β€” Uses clinical terminology and meaningful endpoints (Mayo score, remission, AEs, etc.)

These summaries are useful for:

Study synopses

CSR-to-protocol automation

Patient engagement documents

Internal medical review

🧠 Example Output

{

Β  "structured": {

Β  "trial_id": "Not provided",

Β  "title": "Evaluating Drug X in Adults with Moderate to Severe Ulcerative Colitis",

Β  "indication": "Ulcerative colitis",

Β  "phase": "Phase II",

Β  "sample_size": "120 patients",

Β  "arms_and_treatments": "Drug X 100 mg once daily vs. placebo",

Β  "primary_endpoint": "Clinical remission at week 12 based on the Mayo score",

Β  "key_results": "45% remission with Drug X vs. 20% with placebo",

Β  "serious_adverse_events": "Mild to moderate headache and nausea",

Β  "sponsor": "Example Pharma",

Β  "location": "20 sites in the US and Europe"

Β  },

Β  "summaries": {

Β  "summary": "..."

Β  }

}

πŸ—οΈ Project Architecture

csr_project/

β”œβ”€ data/

β”‚ └─ sample_csr.txt

β”œβ”€ src/

β”‚ β”œβ”€ main.py # CLI orchestrator

β”‚ β”œβ”€ llm_client.py # Groq API client wrapper

β”‚ β”œβ”€ extract_structured.py # JSON field extraction

β”‚ β”œβ”€ summarize_csr.py # Summary generation

β”‚ β”œβ”€ prompts.py # Prompt templates

β”‚ β”œβ”€ config.py # Field schema

β”‚ └─ __init__.py

β”œβ”€ README.md

β”œβ”€ requirements.txt

β”œβ”€ .gitignore

└─ .env # <--- NOT committed

βš™οΈ Tech Stack

Component Details

Language Python 3.12

LLM Provider Groq (ultra-fast inference)

Model llama-3.1-8b-instant

CLI argparse

Environment python-dotenv, virtualenv

This stack closely matches real AI/ML workflows used in biotech and pharma.

πŸ”§ Installation

1. Clone the repo

git clone

cd csr_project

2. Create virtual environment

python -m venv .venv

.\.venv\Scripts\activate

pip install -r requirements.txt

3. Set up your .env file

Create a file named .env in the project root:

GROQ_API_KEY=your-groq-key-here

⚠️ This file is ignored by .gitignore to protect your secrets.

▢️ Run the Tool

python -m src.main --input data\sample_csr.txt

Produces combined structured + summarized output to the terminal.

🎯 Goals of This Project

Demonstrate applied LLM engineering on biomedical text

Show ability to build end-to-end data extraction pipelines

Implement prompt design, JSON post-processing, and error handling

Build a lightweight clinical NLP tool from scratch

Provide a real-world example aligned with clinical data workflows used in

modeling & simulation

medical writing

regulatory submissions

pharmacovigilance

data standardization

πŸ“Œ Next Extensions (optional future improvements)

Add R/Shiny or Streamlit UI

Add PDF ingestion

Add automatic schema validation

Add vector search / RAG for large CSRs

Add evaluation metrics (BLEU, RougeL, JSON correctness)

πŸ“« Contact

Author: Kibrom M. Alula

LinkedIn: https://www.linkedin.com/in/kibrom-m-alula/

GitHub: https://github.com/KayMan2025

About

End-to-end pipeline that extracts structured fields and generates summaries from CSR-like clinical text using Groq LLaMA 3.1

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages