An end-to-end Retrieval-Augmented Generation (RAG) system that lets you upload documents (PDFs, code, text) and ask questions grounded in their content — no hallucinations, just answers backed by your own data.
Built with FastAPI · ChromaDB · sentence-transformers · Groq (LLaMA 3)
Large Language Models are powerful, but they suffer from two hard limitations: they hallucinate, and they have no access to your private or recent data.
RAG solves this by retrieving relevant context from your documents and injecting it directly into the prompt — so the model answers from your knowledge base, not from memory.
User Query → Embed → Vector Search → Top-K Chunks → Rerank → Prompt → LLM → Grounded Answer + Citations
- 📄 Upload documents — .pdf, .txt, .md, code files
- ✂️ Smart chunking with overlap for context continuity
- 🔎 Semantic search via sentence-transformer embeddings
- 🧠 Persistent vector store with ChromaDB
- 🤖 LLM generation via Groq (LLaMA 3) — low-latency inference
- 📌 Grounded answers with chunk citations [Chunk X]
- 🧾 Raw retrieved chunks returned alongside the answer
- 🔁 Retrieval reranking via keyword overlap scoring
- 🎯 Metadata filtering by document_id
- 📊 Multi-document support
- ⚡ Fast, async-ready API with FastAPI
| Layer | Technology |
|---|---|
| API | FastAPI + Uvicorn |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector Store | ChromaDB (persistent, local) |
| LLM | Groq — LLaMA 3 |
| Validation | Pydantic |
| Containerisation | Docker (optional) |
```
rag-doc-assistant/
├── 📁 backend/
│ ├── 📁 app/
│ │ ├── 📁 api/ # Routes
│ │ │ ├── 🐍 upload.py # POST /upload/ — ingest docs, chunk & embed
│ │ │ └── 🐍 ask.py # POST /ask/ — retrieve, rerank & generate
│ │ ├── 📁 services/ # AI Logic
│ │ │ ├── 🐍 chunker.py # Fixed-size overlapping text chunking
│ │ │ ├── 🐍 embedder.py # all-MiniLM-L6-v2 → 384-dim vectors
│ │ │ ├── 🐍 retriever.py # ChromaDB top-K cosine similarity search
│ │ │ ├── 🐍 reranker.py # Keyword overlap re-scoring
│ │ │ └── 🐍 generator.py # Groq / LLaMA 3 grounded generation
│ │ ├── 📁 core/ # DB
│ │ │ └── 🐍 vector_store.py # ChromaDB client, collections & upserts
│ │ ├── 📁 schemas/
│ │ └── 🐍 main.py # Entry — FastAPI app factory & router setup
│ ├── 📄 requirements.txt
│ └── 🐳 Dockerfile
├── 🗄️ chroma_db/ # Persistent vector store (mount as volume)
└── 📄 README.md
```
```bash
git clone https://github.com/ApplexX7/rag-doc-assistant.git
cd rag-doc-assistant/backend
pip install -r requirements.txt
export GROQ_API_KEY="your_api_key"
uvicorn app.main:app --reload
```

Interactive API docs: http://127.0.0.1:8000/docs
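Once the server is running, the API surface boils down to the two routes below. This is a minimal sketch with placeholder handler bodies and assumed field names, not the actual upload.py / ask.py logic:

```python
# Sketch: the two routes exposed by the service (handler bodies are placeholders).
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI(title="rag-doc-assistant")

class AskRequest(BaseModel):
    question: str
    top_k: int = 3
    document_id: str | None = None

@app.post("/upload/")
async def upload(file: UploadFile):
    # real route: extract text, chunk with overlap, embed, upsert into ChromaDB
    return {"document_id": "abc123", "filename": file.filename, "chunk_count": 6}

@app.post("/ask/")
async def ask(req: AskRequest):
    # real route: embed the query, retrieve top-K, rerank, generate a grounded answer
    return {"question": req.question, "answer": "...", "results": []}
```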
POST /upload/
Content-Type: multipart/form-data

Response:

```json
{
  "document_id": "abc123",
  "filename": "resume.pdf",
  "chunk_count": 6
}
```

POST /ask/
Content-Type: application/json

```json
{
  "question": "What backend technologies does he know?",
  "top_k": 3,
  "document_id": "abc123"
}
```

Response:

```json
{
  "question": "What backend technologies does he know?",
  "answer": "He has experience with Node.js, Fastify, and FastAPI... [Chunk 1]",
  "results": [
    { "chunk_id": 1, "text": "...", "score": 0.91 }
  ]
}
```

Fixed-size chunks with overlap ensure context continuity across boundaries. The tradeoff is precision (smaller chunks) vs. recall (larger chunks with more context).
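A minimal sketch of that fixed-size overlapping strategy; the function name and default sizes are illustrative assumptions, not the exact chunker.py implementation:

```python
# Sketch: fixed-size chunking with overlap (sizes are assumptions).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    step = chunk_size - overlap              # how far the window advances each step
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                    # skip whitespace-only tails
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break                            # last window already reached the end
    return chunks
```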
all-MiniLM-L6-v2 from sentence-transformers — lightweight (80MB), fast, and strong on semantic similarity tasks. No GPU required.
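For reference, a hedged sketch of how chunks would be embedded with this model; the normalize_embeddings flag is an assumption, not necessarily what embedder.py does:

```python
# Sketch: embedding chunks into 384-dim vectors with all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(chunks: list[str]) -> list[list[float]]:
    # encode() returns an array of shape (len(chunks), 384);
    # normalising makes cosine similarity a plain dot product (assumption).
    return model.encode(chunks, normalize_embeddings=True).tolist()
```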
ChromaDB runs locally with persistence out of the box. No external service needed — just mount chroma_db/ as a Docker volume to survive restarts.
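A minimal sketch of the persistent client, an upsert, and a filtered top-K query; the collection name and example IDs are assumptions:

```python
# Sketch: persistent ChromaDB store with document_id metadata filtering.
import chromadb

client = chromadb.PersistentClient(path="chroma_db")        # survives restarts
collection = client.get_or_create_collection("documents")   # collection name is an assumption

# Upsert chunk embeddings alongside their text and document-level metadata
collection.upsert(
    ids=["abc123-0", "abc123-1"],
    embeddings=[[0.1] * 384, [0.2] * 384],
    documents=["first chunk text", "second chunk text"],
    metadatas=[{"document_id": "abc123"}, {"document_id": "abc123"}],
)

# Top-K similarity search restricted to a single document
results = collection.query(
    query_embeddings=[[0.1] * 384],
    n_results=3,
    where={"document_id": "abc123"},
)
```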
Top-K semantic search gives high recall. The reranker then re-scores by keyword overlap to improve precision before passing chunks to the LLM.
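A sketch of what keyword-overlap re-scoring can look like; the blend weights are illustrative assumptions, not the exact reranker.py formula:

```python
# Sketch: re-score retrieved chunks by keyword overlap with the question.
def rerank(question: str, chunks: list[dict]) -> list[dict]:
    q_terms = set(question.lower().split())
    for chunk in chunks:
        c_terms = set(chunk["text"].lower().split())
        overlap = len(q_terms & c_terms) / max(len(q_terms), 1)
        # blend the original similarity score with keyword overlap (weights assumed)
        chunk["score"] = 0.7 * chunk["score"] + 0.3 * overlap
    return sorted(chunks, key=lambda c: c["score"], reverse=True)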
The prompt template explicitly instructs the model to cite sources as [Chunk X] and not to answer beyond the retrieved context — minimising hallucinations by design.
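A hedged sketch of that grounding pattern end to end, from prompt assembly to the Groq call; the template wording and model name are assumptions, not the exact generator.py code:

```python
# Sketch: grounded prompt construction and generation via the Groq SDK.
from groq import Groq

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(f"[Chunk {c['chunk_id']}] {c['text']}" for c in chunks)
    return (
        "Answer using ONLY the context below and cite the chunks you use as [Chunk X]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def generate(question: str, chunks: list[dict]) -> str:
    client = Groq()  # reads GROQ_API_KEY from the environment
    response = client.chat.completions.create(
        model="llama3-8b-8192",  # model name is an assumption
        messages=[{"role": "user", "content": build_prompt(question, chunks)}],
    )
    return response.choices[0].message.content
```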
- Retrieval quality is sensitive to chunk size tuning
- No hybrid search (BM25 + vector) — pure semantic only
- No conversation memory across turns
- Reranker is keyword-based rather than a learned model (e.g. a cross-encoder)
- No evaluation metrics or benchmarking yet
- Hybrid search — BM25 + dense embeddings
- Cross-encoder reranking
- Code-aware chunking (by function / class)
- Streaming responses (SSE)
- Multi-hop retrieval
- Query rewriting before embedding
- Context compression
- Hallucination detection layer
- Next.js frontend with chat UI
- File upload dashboard
- Source highlighting in answers
- Chunk visualisation
- Async ingestion pipeline with background workers
- Redis caching for repeated queries
- Postgres metadata layer
"I built a full RAG system from scratch — document ingestion, embedding, vector storage, semantic retrieval, reranking, and grounded LLM generation. I made deliberate tradeoffs around chunking strategy, embedding model size, and reranking approach, and the prompt architecture enforces citation-based answers to reduce hallucinations."
What this project demonstrates:
- AI system design and LLM integration
- Production-style backend engineering
- Data pipeline design
- Tradeoff thinking — latency vs. accuracy, precision vs. recall
- Practical knowledge of RAG patterns used in real AI products
Mohammed Hilali
- 🌐 Portfolio: applexx.me
- 🐙 GitHub: @ApplexX7
This is not just a demo — it's a production-style RAG system reflecting real-world AI engineering patterns used in modern startups and AI products.