GitHub - KayMan2025/csr-llm-extraction: End-to-end pipeline that extracts structured fields and generates summaries from CSR-like clinical text using Groq LLaMA 3.1

📄 Clinical Study Report (CSR) LLM Extraction & Summarization

Python + Groq LLaMA 3.1 | End-to-End Data Extraction Pipeline

This project demonstrates how Large Language Models (LLMs) can extract structured information and generate clinically relevant summaries from CSR-like free text. It replicates real-world tasks used in clinical development, pharmacometrics, regulatory writing, and AI-assisted medical data workflows.

This is a portfolio project created to showcase skills relevant to roles such as Data Scientist (AI/ML), LLM Engineer, and positions at companies like Certara, JAX, Genentech, Pfizer, etc.

🚀 What This Tool Does

Given an input text file (e.g., sample_csr.txt), the pipeline:

✔️ 1. Extracts structured clinical trial fields

The LLM parses CSR narrative text into machine-friendly JSON:

trial_id

title

indication

phase

sample_size

arms_and_treatments

primary_endpoint

key_results

serious_adverse_events

sponsor

location

This mirrors real extraction tasks needed for:

Clinical trial registries

Pharmacovigilance

Medical writing automation

Data standardization for modeling (PK/PD, survival, etc.)

✔️ 2. Generates dual-layer summaries

The system produces:

🔹 Plain-Language Summary

— Accessible to non-scientific readers, patients, caregivers.

🔹 Technical Summary for Clinicians

— Uses clinical terminology and meaningful endpoints (Mayo score, remission, AEs, etc.)

These summaries are useful for:

Study synopses

CSR-to-protocol automation

Patient engagement documents

Internal medical review

🧠 Example Output

{

"structured": {

"trial_id": "Not provided",

"title": "Evaluating Drug X in Adults with Moderate to Severe Ulcerative Colitis",

"indication": "Ulcerative colitis",

"phase": "Phase II",

"sample_size": "120 patients",

"arms_and_treatments": "Drug X 100 mg once daily vs. placebo",

"primary_endpoint": "Clinical remission at week 12 based on the Mayo score",

"key_results": "45% remission with Drug X vs. 20% with placebo",

"serious_adverse_events": "Mild to moderate headache and nausea",

"sponsor": "Example Pharma",

"location": "20 sites in the US and Europe"

},

"summaries": {

"summary": "..."

}

🏗️ Project Architecture

csr_project/

├─ data/

│ └─ sample_csr.txt

├─ src/

│ ├─ main.py # CLI orchestrator

│ ├─ llm_client.py # Groq API client wrapper

│ ├─ extract_structured.py # JSON field extraction

│ ├─ summarize_csr.py # Summary generation

│ ├─ prompts.py # Prompt templates

│ ├─ config.py # Field schema

│ └─ __init__.py

├─ README.md

├─ requirements.txt

├─ .gitignore

└─ .env # <--- NOT committed

⚙️ Tech Stack

Component Details

Language Python 3.12

LLM Provider Groq (ultra-fast inference)

Model llama-3.1-8b-instant

CLI argparse

Environment python-dotenv, virtualenv

This stack closely matches real AI/ML workflows used in biotech and pharma.

🔧 Installation

1. Clone the repo

git clone

cd csr_project

2. Create virtual environment

python -m venv .venv

.\.venv\Scripts\activate

pip install -r requirements.txt

3. Set up your .env file

Create a file named .env in the project root:

GROQ_API_KEY=your-groq-key-here

⚠️ This file is ignored by .gitignore to protect your secrets.

▶️ Run the Tool

python -m src.main --input data\sample_csr.txt

Produces combined structured + summarized output to the terminal.

🎯 Goals of This Project

Demonstrate applied LLM engineering on biomedical text

Show ability to build end-to-end data extraction pipelines

Implement prompt design, JSON post-processing, and error handling

Build a lightweight clinical NLP tool from scratch

Provide a real-world example aligned with clinical data workflows used in

modeling & simulation

medical writing

regulatory submissions

pharmacovigilance

data standardization

📌 Next Extensions (optional future improvements)

Add R/Shiny or Streamlit UI

Add PDF ingestion

Add automatic schema validation

Add vector search / RAG for large CSRs

Add evaluation metrics (BLEU, RougeL, JSON correctness)

📫 Contact

Author: Kibrom M. Alula

LinkedIn: https://www.linkedin.com/in/kibrom-m-alula/

GitHub: https://github.com/KayMan2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages