🚀 RAG Codebase Documentation Assistant

An end-to-end Retrieval-Augmented Generation (RAG) system that lets you upload documents (PDFs, code, text) and ask questions grounded in their content — no hallucinations, just answers backed by your own data.

Built with FastAPI · ChromaDB · sentence-transformers · Groq (LLaMA 3)


🧠 What Is RAG?

Large Language Models are powerful, but they suffer from two hard limitations: they hallucinate, and they have no access to your private or recent data.

RAG solves this by retrieving relevant context from your documents and injecting it directly into the prompt — so the model answers from your knowledge base, not from memory.

User Query  →  Embed  →  Vector Search  →  Top-K Chunks  →  Rerank  →  Prompt  →  LLM  →  Grounded Answer + Citations
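The prompt-injection step at the heart of this pipeline can be sketched in a few lines. The function below is illustrative only, not the project's actual code (that lives in generator.py):

```python
# Illustrative sketch of the prompt-injection step; the project's real
# template is in backend/app/services/generator.py.
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    # Number each chunk so the model can cite sources as [Chunk X]
    context = "\n\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below, citing sources as [Chunk X].\n"
        "If the context does not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```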

✨ Features

Core

  • 📄 Upload documents — .pdf, .txt, .md, code files
  • ✂️ Smart chunking with overlap for context continuity
  • 🔎 Semantic search via sentence-transformer embeddings
  • 🧠 Persistent vector store with ChromaDB
  • 🤖 LLM generation via Groq (LLaMA 3) — low-latency inference
  • 📌 Grounded answers with chunk citations [Chunk X]
  • 🧾 Raw retrieved chunks returned alongside the answer

Advanced

  • 🔁 Retrieval reranking via keyword overlap scoring
  • 🎯 Metadata filtering by document_id
  • 📊 Multi-document support
  • ⚡ Fast, async-ready API with FastAPI

🛠️ Tech Stack

Layer             Technology
API               FastAPI + Uvicorn
Embeddings        sentence-transformers (all-MiniLM-L6-v2)
Vector Store      ChromaDB (persistent, local)
LLM               Groq — LLaMA 3
Validation        Pydantic
Containerisation  Docker (optional)

📂 Project Structure

RAG-Codebase-Documentation-Assistant/
├── 📁 backend/
│   ├── 📁 app/
│   │   ├── 📁 api/                     # Routes
│   │   │   ├── 🐍 upload.py            # POST /upload/ — ingest docs, chunk & embed
│   │   │   └── 🐍 ask.py               # POST /ask/   — retrieve, rerank & generate
│   │   ├── 📁 services/                # AI Logic
│   │   │   ├── 🐍 chunker.py           # Fixed-size overlapping text chunking
│   │   │   ├── 🐍 embedder.py          # all-MiniLM-L6-v2 → 384-dim vectors
│   │   │   ├── 🐍 retriever.py         # ChromaDB top-K cosine similarity search
│   │   │   ├── 🐍 reranker.py          # Keyword overlap re-scoring
│   │   │   └── 🐍 generator.py         # Groq / LLaMA 3 grounded generation
│   │   ├── 📁 core/                    # DB
│   │   │   └── 🐍 vector_store.py      # ChromaDB client, collections & upserts
│   │   ├── 📁 schemas/
│   │   └── 🐍 main.py                  # Entry — FastAPI app factory & router setup
│   ├── 📄 requirements.txt
│   └── 🐳 Dockerfile
├── 🗄️  chroma_db/                      # Persistent vector store (mount as volume)
└── 📄 README.md

🚀 Getting Started

1. Clone the repo

git clone https://github.com/ApplexX7/RAG-Codebase-Documentation-Assistant.git
cd RAG-Codebase-Documentation-Assistant/backend

2. Install dependencies

pip install -r requirements.txt

3. Set environment variables

export GROQ_API_KEY="your_api_key"

4. Run the server

uvicorn app.main:app --reload

5. Open Swagger UI

http://127.0.0.1:8000/docs

🧪 API Usage

Upload a document

POST /upload/
Content-Type: multipart/form-data

Response

{
  "document_id": "abc123",
  "filename": "resume.pdf",
  "chunk_count": 6
}
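For example, from Python (requests is not a project dependency, and the multipart field name "file" is an assumption about the endpoint's signature):

```python
# Upload a document via the /upload/ endpoint. The multipart field
# name "file" is an assumption; check the Swagger UI for the real one.
import requests

with open("resume.pdf", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8000/upload/",
        files={"file": ("resume.pdf", f, "application/pdf")},
    )
print(resp.json())  # {"document_id": "abc123", "filename": "resume.pdf", "chunk_count": 6}
```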

Ask a question

POST /ask/
Content-Type: application/json
{
  "question": "What backend technologies does he know?",
  "top_k": 3,
  "document_id": "abc123"
}

Response

{
  "question": "What backend technologies does he know?",
  "answer": "He has experience with Node.js, Fastify, and FastAPI... [Chunk 1]",
  "results": [
    { "chunk_id": 1, "text": "...", "score": 0.91 }
  ]
}
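The same request from Python, using the request body documented above:

```python
# Query the /ask/ endpoint with the documented JSON body.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/ask/",
    json={
        "question": "What backend technologies does he know?",
        "top_k": 3,
        "document_id": "abc123",  # optional metadata filter
    },
)
print(resp.json()["answer"])
```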

🧠 Key Design Decisions

1. Chunking strategy

Fixed-size chunks with overlap ensure context continuity across boundaries. The tradeoff is precision (smaller chunks) vs. recall (larger chunks with more context).
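A minimal sketch of this strategy (the sizes below are illustrative; the project's actual defaults live in chunker.py):

```python
# Fixed-size chunking with overlap. chunk_size and overlap are
# illustrative values, not the project's real defaults.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    step = chunk_size - overlap  # each chunk re-reads the last `overlap` chars
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```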

2. Embeddings

all-MiniLM-L6-v2 from sentence-transformers — lightweight (80MB), fast, and strong on semantic similarity tasks. No GPU required.
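Usage is a few lines with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["a chunk of text", "another chunk"])
print(vectors.shape)  # (2, 384): 384-dim vectors, matching embedder.py
```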

3. Vector store

ChromaDB runs locally with persistence out of the box. No external service needed — just mount chroma_db/ as a Docker volume to survive restarts.
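A sketch of that setup (the path matches the repo layout; the collection name "documents" is an assumption):

```python
import chromadb

# Persistent client writing to the chroma_db/ directory shown in the
# project structure; the collection name is an assumption.
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("documents")

# Store one chunk with its embedding and document_id metadata
collection.upsert(
    ids=["abc123-0"],
    embeddings=[[0.1] * 384],  # placeholder 384-dim vector
    documents=["a chunk of text"],
    metadatas=[{"document_id": "abc123"}],
)

# Top-K search, optionally filtered to a single document
results = collection.query(
    query_embeddings=[[0.1] * 384],
    n_results=3,
    where={"document_id": "abc123"},
)
```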

4. Retrieval + reranking

Top-K semantic search gives high recall. The reranker then re-scores by keyword overlap to improve precision before passing chunks to the LLM.
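A sketch of keyword-overlap reranking (the exact scoring formula in reranker.py may differ):

```python
def rerank(question: str, chunks: list[str]) -> list[tuple[str, float]]:
    # Score each chunk by the fraction of query terms it contains,
    # then sort descending. The real formula in reranker.py may differ.
    q_terms = set(question.lower().split())
    scored = [
        (chunk, len(q_terms & set(chunk.lower().split())) / max(len(q_terms), 1))
        for chunk in chunks
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```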

5. Grounded generation

The prompt template explicitly instructs the model to cite sources as [Chunk X] and not to answer beyond the retrieved context — minimising hallucinations by design.
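A hedged sketch of that call with the Groq Python SDK (the model id is an assumption; the README only specifies LLaMA 3):

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def generate(prompt: str) -> str:
    # Model id is an assumption; any Groq-hosted LLaMA 3 variant works
    resp = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # favour deterministic, grounded answers
    )
    return resp.choices[0].message.content
```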


⚠️ Current Limitations

  • Retrieval quality is sensitive to chunk size tuning
  • No hybrid search (BM25 + vector) — pure semantic only
  • No conversation memory across turns
  • Reranker is keyword-based, not ML (cross-encoder)
  • No evaluation metrics or benchmarking yet

🔮 Roadmap

🔥 High Impact

  • Hybrid search — BM25 + dense embeddings
  • Cross-encoder reranking
  • Code-aware chunking (by function / class)
  • Streaming responses (SSE)

🧠 AI Enhancements

  • Multi-hop retrieval
  • Query rewriting before embedding
  • Context compression
  • Hallucination detection layer

🎨 Product / UX

  • Next.js frontend with chat UI
  • File upload dashboard
  • Source highlighting in answers
  • Chunk visualisation

⚙️ System Design

  • Async ingestion pipeline with background workers
  • Redis caching for repeated queries
  • Postgres metadata layer

💡 Interview Talking Points

"I built a full RAG system from scratch — document ingestion, embedding, vector storage, semantic retrieval, reranking, and grounded LLM generation. I made deliberate tradeoffs around chunking strategy, embedding model size, and reranking approach, and the prompt architecture enforces citation-based answers to reduce hallucinations."

What this project demonstrates:

  • AI system design and LLM integration
  • Production-style backend engineering
  • Data pipeline design
  • Tradeoff thinking — latency vs. accuracy, precision vs. recall
  • Practical knowledge of RAG patterns used in real AI products

📌 Author

Mohammed Hilali


This is not just a demo — it's a production-style RAG system reflecting real-world AI engineering patterns used in modern startups and AI products.
