RAG-based medical assistant prototype that retrieves evidence from the Merck Manuals PDF to generate grounded clinical answers with evaluable outputs.
This project demonstrates retrieval-augmented generation (RAG) by grounding LLM responses in a large medical reference corpus (Merck Manuals) to reduce information overload and support faster clinical decision-making.
Healthcare professionals must make time-sensitive decisions while navigating an overwhelming volume of medical information. Reliably locating relevant, up-to-date clinical guidance is difficult under pressure, especially when the knowledge is spread across large manuals and research references.
This project builds a RAG-based AI assistant that enables clinicians to ask questions in natural language and receive answers grounded in authoritative medical content. The intent is decision support: improving access to information and standardizing references used during diagnostic and treatment planning.
- The system generally produces medically relevant, context-grounded responses when retrieval succeeds.
- Output quality variability is driven more by generation limits and evaluation instability than by fundamental retrieval failure.
- Truncated or incomplete answers indicate the need for a larger `max_tokens` budget and careful tuning of the retrieved context size.
- Automated self-scoring of groundedness/relevance was noisy, suggesting improvements are needed in prompt formatting, output parsing, and evaluation criteria.
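One low-cost way to reduce the scoring noise noted above is a tolerant, rule-based parser for the judge model's free-form output. The sketch below is illustrative (the function name and the 1-5 scale are assumptions, not taken from the notebook); it returns `None` rather than guessing when no in-range score can be found:

```python
import re


def parse_score(judge_output: str, lo: int = 1, hi: int = 5):
    """Extract an integer groundedness/relevance score from free-form judge text.

    Returns None when no in-range integer is found, so unparseable outputs
    can be flagged or re-run instead of silently polluting the metrics.
    """
    # Prefer an explicit "Score: N" / "rating = N" style statement.
    m = re.search(r"(?:score|rating)\s*[:=]?\s*(\d+)", judge_output, re.IGNORECASE)
    if not m:
        # Fall back to an "N/5" style fraction.
        m = re.search(rf"\b(\d+)\s*/\s*{hi}\b", judge_output)
    if not m:
        return None
    score = int(m.group(1))
    return score if lo <= score <= hi else None
```

Rejecting out-of-range values (e.g. a judge that answers "Score: 9" on a 1-5 scale) is deliberate: treating such outputs as missing data keeps one malformed response from skewing an averaged score.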
The solution follows a standard RAG pipeline:
- Document ingestion from a large PDF corpus (Merck Manuals).
- Chunking + embedding to create a searchable knowledge index.
- Retrieval (top-k) of the most relevant chunks for a given user query.
- Generation using an LLM constrained to the retrieved context.
- Evaluation via automated scoring plus qualitative review.
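The retrieval and generation stages of the pipeline above can be sketched end to end. This is a minimal stand-in, not the notebook's implementation: the bag-of-words "embedding" substitutes for a real encoder model, and all chunk sizes, function names, and the prompt wording are illustrative assumptions:

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes here are illustrative)."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the top-k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Constrain the LLM to answer only from the retrieved context."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

In the real system the `embed` step would call a sentence-encoder model and the chunks would live in a vector index, but the control flow (chunk, embed, rank by similarity, assemble a context-constrained prompt) is the same.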
- Stabilize evaluation by standardizing prompts, output parsing, and scoring rules before scaling.
- Introduce human-in-the-loop review for clinical validation and risk control.
- Improve completeness and consistency by tuning context size (k) and generation limits.
- Consider a separate evaluator model to reduce bias in groundedness/relevance scoring.
- Explore improved encoders/models for better retrieval precision and robustness.
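Standardizing the evaluation prompt (the first item above) can be as simple as pinning one template with a machine-parseable answer format. The template below is a hypothetical example of this direction, not the project's actual prompt:

```python
# Illustrative evaluation template: fixed wording plus a strict one-line
# answer format makes the judge's output easy to parse deterministically.
EVAL_TEMPLATE = """You are grading a medical question-answering system.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate groundedness (is every claim in the answer supported by the context?)
on a 1-5 scale. Respond with exactly one line in the form:
Score: <integer between 1 and 5>"""


def build_eval_prompt(question: str, context: str, answer: str) -> str:
    return EVAL_TEMPLATE.format(question=question, context=context, answer=answer)
```

Fixing the response format up front also makes it practical to route low or unparseable scores into the proposed human-in-the-loop review queue.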
The most stable decoding configuration tested:
- `k = 4` (retrieved context chunks)
- `max_tokens = 1024`
- `temperature = 0.1`
- `top_p = 0.9`
- `top_k = 40`
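These values can be kept together as a single configuration object. The sketch below assumes a Hugging Face Transformers-style `generate` interface, where the token budget is named `max_new_tokens` rather than `max_tokens`; adjust the names for other runtimes:

```python
# Most stable decoding configuration found during testing (values from above).
GEN_CONFIG = {
    "max_new_tokens": 1024,  # larger budget to avoid truncated answers
    "temperature": 0.1,      # low temperature for factual, grounded output
    "top_p": 0.9,
    "top_k": 40,
    "do_sample": True,       # sampling must be on for temperature/top_p to apply
}

RETRIEVAL_K = 4  # number of context chunks retrieved per query

# Assumed usage: outputs = model.generate(**inputs, **GEN_CONFIG)
```

Keeping `RETRIEVAL_K` separate from the decoding parameters reflects that context size and generation length are tuned jointly but consumed by different pipeline stages.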
This prototype is intended for decision support and information retrieval, not autonomous clinical diagnosis. Outputs require clinical judgment and verification, and the system should be deployed with appropriate safeguards.
Qwen_Full_Code_NLP_RAG_Project_Notebook.ipynb
End-to-end implementation of the RAG pipeline, including PDF ingestion, text chunking, embedding generation, vector-based retrieval, LLM response generation, and automated groundedness/relevance evaluation.
- Python
- NLP / Information Retrieval
- Embeddings + Vector Search
- LLM prompting (RAG)
- PDF ingestion and chunking pipeline