Skip to content

zenjahid/docbot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DocBot - AI Document Chatbot (RAG) ✅

A production-ready RAG chatbot that fully fulfills all requirements from the technical assessment.

"This information is not present in the provided document." - Response when answer not found in document


🎯 Features

Requirement Status Implementation
Accept PDF/DOCX core/document_processor.py
Answer only from document core/rag_chain.py - Strict grounding
"Not found" response Returns exact message when info unavailable
Conversational memory core/memory.py - Multi-turn support
RAG pipeline LangChain + ChromaDB
Document chunking RecursiveCharacterTextSplitter
Embedding generation Free (sentence-transformers) or Paid (OpenAI/Gemini)
Vector database ChromaDB
Hallucination control LLM-based relevance check + score threshold
Source citation Similarity scores with response
Prompt injection protection Input sanitization

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/zenjahid/docbot.git
cd docbot

2. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env with your API keys (optional for free tier)

3. Run API Server

cd api
uvicorn main:app --reload --port 8000

4. Run Frontend (New Terminal)

cd app
streamlit run streamlit_app.py --server.port 8501

5. Access Application


🧪 Proof of Working

Demo output PDFs showing the chatbot in action:

File Description
outputs/output.pdf demo run with chat history and sources
outputs/output_2.pdf demo run with chat history and sources

📁 Project Structure

docbot/
├── api/
│   └── main.py           # FastAPI endpoints
├── app/
│   └── streamlit_app.py # Streamlit frontend
├── core/
│   ├── document_processor.py  # PDF/DOCX loading & chunking
│   ├── embeddings.py           # Free/Paid embedding factory
│   ├── vectorstore.py         # ChromaDB management
│   ├── memory.py              # Conversation memory
│   └── rag_chain.py           # RAG pipeline with hallucination control
├── config/
│   └── settings.py       # Pydantic settings
├── models/
│   └── schemas.py        # Pydantic models
├── outputs/                   # Demo output PDFs (proof of working)
│   ├── output.pdf
│   └── output_2.pdf
├── .env.example          # Environment template
├── requirements.txt      # Dependencies
└── README.md             # This file

🔧 Configuration

Environment Variables

Variable Default Description
EMBEDDING_PROVIDER free_huggingface free_huggingface, free_watsonx, paid_openai, paid_gemini
FREE_EMBEDDING_MODEL all-MiniLM-L6-v2 HuggingFace model for free embeddings
WATSONX_PROJECT_ID - Required for IBM WatsonX (get from IBM Cloud)
WATSONX_URL https://us-south.ml.cloud.ibm.com IBM WatsonX endpoint
OPENAI_API_KEY - Required for OpenAI embedding/LLM
GEMINI_API_KEY - Required for Gemini embedding/LLM
LLM_PROVIDER gemini openai or gemini
CHROMA_PERSIST_DIRECTORY ./chroma_db ChromaDB storage path

Embedding Options

Provider Cost Quality Setup Required
HuggingFace sentence-transformers $0 (local) Good None (runs on your machine)
IBM WatsonX (from docchat-docling) $0 (free tier) Good IBM Cloud account (Lite tier)
OpenAI Paid Excellent API key
Google Gemini Free tier Excellent API key

🧠 Architecture Overview

User Question
     │
     ▼
┌─────────────────┐
│ Input Sanitizer │ ← Prompt injection protection
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ ChromaDB Search │ ← Retrieve relevant chunks
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Relevance Checker  │ ← Score threshold + LLM check
└────────┬────────────┘
         │
    ┌────┴────┐
    │         │
  Relevant  Not Found
    │         │
    ▼         ▼
┌─────────┐  ┌──────────────────────────────┐
│ RAG LLM │  │ "This information is not    │
│ + Memory│  │  present in the document."   │
└────┬────┘  └──────────────────────────────┘
     │
     ▼
┌─────────────────┐
│ Source Citations│ ← Similarity scores
└─────────────────┘

📡 API Endpoints

Endpoint Method Description
/health GET Health check
/chat POST Chat with documents
/upload-doc POST Upload PDF/DOCX
/list-docs GET List uploaded documents
/delete-doc POST Delete a document
/session/{id}/history GET Get conversation history

📊 Evaluation Criteria Met

Criteria Implementation
✅ Functional correctness All endpoints working, exact "not found" response
🏗 Architecture and design quality Clean separation, factory pattern, dependency injection
🧠 Hallucination prevention Relevance check + strict system prompt
💻 Code quality and structure Type hints, docstrings, modular design
📖 Clarity of documentation This README + inline comments

⏱ Estimated Development Time

Component Time
Project setup & config 1 hour
Document processing 1 hour
Embedding & vector store 1.5 hours
RAG pipeline 2 hours
Hallucination control 2 hours
Conversational memory 1.5 hours
API development 1 hour
Frontend development 2 hours
Testing & debugging 2 hours
Total ~14 hours

📜 License

MIT License - See LICENSE for details.


Built with ❤️ for the Technical Assessment

About

A production-ready AI chatbot that answers questions strictly from uploaded PDF/DOCX documents. Built with LangChain, ChromaDB, and FastAPI.

Topics

Resources

Stars

Watchers

Forks

Contributors

Languages