An end-to-end Retrieval-Augmented Generation (RAG) system that lets you upload documents (PDFs, code, text) and ask questions grounded in their content — no hallucinations, just answers backed by your own data.
Built with FastAPI · ChromaDB · sentence-transformers · Groq (LLaMA 3)
Large Language Models are powerful, but they suffer from two hard limitations: they hallucinate, and they have no access to your private or recent data.
RAG solves this by retrieving relevant context from your documents and injecting it directly into the prompt — so the model answers from your knowledge base, not from memory.
User Query → Embed → Vector Search → Top-K Chunks → Rerank → Prompt → LLM → Grounded Answer + Citations
- 📄 Upload documents — .pdf, .txt, .md, code files
- ✂️ Smart chunking with overlap for context continuity
- 🔎 Semantic search via sentence-transformer embeddings
- 🧠 Persistent vector store with ChromaDB
- 🤖 LLM generation via Groq (LLaMA 3) — low-latency inference
- 📌 Grounded answers with chunk citations [Chunk X]
- 🧾 Raw retrieved chunks returned alongside the answer
- 🔁 Retrieval reranking via keyword overlap scoring
- 🎯 Metadata filtering by document_id
- 📊 Multi-document support
- ⚡ Fast, async-ready API with FastAPI
| Layer | Technology |
|---|---|
| API | FastAPI + Uvicorn |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector Store | ChromaDB (persistent, local) |
| LLM | Groq — LLaMA 3 |
| Validation | Pydantic |
| Containerisation | Docker (optional) |
```
rag-doc-assistant/
├── 📁 backend/
│ ├── 📁 app/
│ │ ├── 📁 api/ # Routes
│ │ │ ├── 🐍 upload.py # POST /upload/ — ingest docs, chunk & embed
│ │ │ └── 🐍 ask.py # POST /ask/ — retrieve, rerank & generate
│ │ ├── 📁 services/ # AI Logic
│ │ │ ├── 🐍 chunker.py # Fixed-size overlapping text chunking
│ │ │ ├── 🐍 embedder.py # all-MiniLM-L6-v2 → 384-dim vectors
│ │ │ ├── 🐍 retriever.py # ChromaDB top-K cosine similarity search
│ │ │ ├── 🐍 reranker.py # Keyword overlap re-scoring
│ │ │ └── 🐍 generator.py # Groq / LLaMA 3 grounded generation
│ │ ├── 📁 core/ # DB
│ │ │ └── 🐍 vector_store.py # ChromaDB client, collections & upserts
│ │ ├── 📁 schemas/
│ │ └── 🐍 main.py # Entry — FastAPI app factory & router setup
│ ├── 📄 requirements.txt
│ └── 🐳 Dockerfile
├── 🗄️ chroma_db/ # Persistent vector store (mount as volume)
└── 📄 README.md
```
```bash
git clone https://github.com/ApplexX7/rag-doc-assistant.git
cd rag-doc-assistant/backend
pip install -r requirements.txt
export GROQ_API_KEY="your_api_key"
uvicorn app.main:app --reload
```

Interactive API docs: http://127.0.0.1:8000/docs
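Once the server is running, the API surface boils down to the two routes below. This is a minimal sketch with placeholder handler bodies and assumed field names, not the actual upload.py / ask.py logic:

```python
# Sketch: the two routes exposed by the service (handler bodies are placeholders).
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI(title="rag-doc-assistant")

class AskRequest(BaseModel):
    question: str
    top_k: int = 3
    document_id: str | None = None

@app.post("/upload/")
async def upload(file: UploadFile):
    # real route: extract text, chunk with overlap, embed, upsert into ChromaDB
    return {"document_id": "abc123", "filename": file.filename, "chunk_count": 6}

@app.post("/ask/")
async def ask(req: AskRequest):
    # real route: embed the query, retrieve top-K, rerank, generate a grounded answer
    return {"question": req.question, "answer": "...", "results": []}
```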
POST /upload/
Content-Type: multipart/form-data

Response:

```json
{
  "document_id": "abc123",
  "filename": "resume.pdf",
  "chunk_count": 6
}
```

POST /ask/
Content-Type: application/json

```json
{
  "question": "What backend technologies does he know?",
  "top_k": 3,
  "document_id": "abc123"
}
```

Response:

```json
{
  "question": "What backend technologies does he know?",
  "answer": "He has experience with Node.js, Fastify, and FastAPI... [Chunk 1]",
  "results": [
    { "chunk_id": 1, "text": "...", "score": 0.91 }
  ]
}
```

Fixed-size chunks with overlap ensure context continuity across boundaries. The tradeoff is precision (smaller chunks) vs. recall (larger chunks with more context).
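A minimal sketch of that fixed-size overlapping strategy; the function name and default sizes are illustrative assumptions, not the exact chunker.py implementation:

```python
# Sketch: fixed-size chunking with overlap (sizes are assumptions).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    step = chunk_size - overlap              # how far the window advances each step
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                    # skip whitespace-only tails
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break                            # last window already reached the end
    return chunks
```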
all-MiniLM-L6-v2 from sentence-transformers — lightweight (80MB), fast, and strong on semantic similarity tasks. No GPU required.
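For reference, a hedged sketch of how chunks would be embedded with this model; the normalize_embeddings flag is an assumption, not necessarily what embedder.py does:

```python
# Sketch: embedding chunks into 384-dim vectors with all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(chunks: list[str]) -> list[list[float]]:
    # encode() returns an array of shape (len(chunks), 384);
    # normalising makes cosine similarity a plain dot product (assumption).
    return model.encode(chunks, normalize_embeddings=True).tolist()
```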
ChromaDB runs locally with persistence out of the box. No external service needed — just mount chroma_db/ as a Docker volume to survive restarts.
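A minimal sketch of the persistent client, an upsert, and a filtered top-K query; the collection name and example IDs are assumptions:

```python
# Sketch: persistent ChromaDB store with document_id metadata filtering.
import chromadb

client = chromadb.PersistentClient(path="chroma_db")        # survives restarts
collection = client.get_or_create_collection("documents")   # collection name is an assumption

# Upsert chunk embeddings alongside their text and document-level metadata
collection.upsert(
    ids=["abc123-0", "abc123-1"],
    embeddings=[[0.1] * 384, [0.2] * 384],
    documents=["first chunk text", "second chunk text"],
    metadatas=[{"document_id": "abc123"}, {"document_id": "abc123"}],
)

# Top-K similarity search restricted to a single document
results = collection.query(
    query_embeddings=[[0.1] * 384],
    n_results=3,
    where={"document_id": "abc123"},
)
```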
Top-K semantic search gives high recall. The reranker then re-scores by keyword overlap to improve precision before passing chunks to the LLM.
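A sketch of what keyword-overlap re-scoring can look like; the blend weights are illustrative assumptions, not the exact reranker.py formula:

```python
# Sketch: re-score retrieved chunks by keyword overlap with the question.
def rerank(question: str, chunks: list[dict]) -> list[dict]:
    q_terms = set(question.lower().split())
    for chunk in chunks:
        c_terms = set(chunk["text"].lower().split())
        overlap = len(q_terms & c_terms) / max(len(q_terms), 1)
        # blend the original similarity score with keyword overlap (weights assumed)
        chunk["score"] = 0.7 * chunk["score"] + 0.3 * overlap
    return sorted(chunks, key=lambda c: c["score"], reverse=True)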
The prompt template explicitly instructs the model to cite sources as [Chunk X] and not to answer beyond the retrieved context — minimising hallucinations by design.
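A hedged sketch of that grounding pattern end to end, from prompt assembly to the Groq call; the template wording and model name are assumptions, not the exact generator.py code:

```python
# Sketch: grounded prompt construction and generation via the Groq SDK.
from groq import Groq

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(f"[Chunk {c['chunk_id']}] {c['text']}" for c in chunks)
    return (
        "Answer using ONLY the context below and cite the chunks you use as [Chunk X]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def generate(question: str, chunks: list[dict]) -> str:
    client = Groq()  # reads GROQ_API_KEY from the environment
    response = client.chat.completions.create(
        model="llama3-8b-8192",  # model name is an assumption
        messages=[{"role": "user", "content": build_prompt(question, chunks)}],
    )
    return response.choices[0].message.content
```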
- Retrieval quality is sensitive to chunk size tuning
- No hybrid search (BM25 + vector) — pure semantic only
- No conversation memory across turns
- Reranker is keyword-based rather than a learned model (e.g. a cross-encoder)
- No evaluation metrics or benchmarking yet
- Hybrid search — BM25 + dense embeddings
- Cross-encoder reranking
- Code-aware chunking (by function / class)
- Streaming responses (SSE)
- Multi-hop retrieval
- Query rewriting before embedding
- Context compression
- Hallucination detection layer
- Next.js frontend with chat UI
- File upload dashboard
- Source highlighting in answers
- Chunk visualisation
- Async ingestion pipeline with background workers
- Redis caching for repeated queries
- Postgres metadata layer
"I built a full RAG system from scratch — document ingestion, embedding, vector storage, semantic retrieval, reranking, and grounded LLM generation. I made deliberate tradeoffs around chunking strategy, embedding model size, and reranking approach, and the prompt architecture enforces citation-based answers to reduce hallucinations."
What this project demonstrates:
- AI system design and LLM integration
- Production-style backend engineering
- Data pipeline design
- Tradeoff thinking — latency vs. accuracy, precision vs. recall
- Practical knowledge of RAG patterns used in real AI products
Mohammed Hilali
- 🌐 Portfolio: applexx.me
- 🐙 GitHub: @ApplexX7
This is not just a demo — it's a production-style RAG system reflecting real-world AI engineering patterns used in modern startups and AI products.