DocBot - AI Document Chatbot (RAG) ✅

A production-ready retrieval-augmented generation (RAG) chatbot that meets every requirement of the technical assessment.

When the answer is not found in the document, the bot responds with exactly: "This information is not present in the provided document."

🎯 Features

Requirement	Status	Implementation
Accept PDF/DOCX	✅	`core/document_processor.py`
Answer only from document	✅	`core/rag_chain.py` - Strict grounding
"Not found" response	✅	Returns exact message when info unavailable
Conversational memory	✅	`core/memory.py` - Multi-turn support
RAG pipeline	✅	LangChain + ChromaDB
Document chunking	✅	RecursiveCharacterTextSplitter
Embedding generation	✅	Free (sentence-transformers) or Paid (OpenAI/Gemini)
Vector database	✅	ChromaDB
Hallucination control	✅	LLM-based relevance check + score threshold
Source citation	✅	Similarity scores with response
Prompt injection protection	✅	Input sanitization
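
To make the document chunking row concrete, here is a minimal sketch of overlapping chunking. It is a simplified fixed-size sliding window, not the `RecursiveCharacterTextSplitter` the project uses (which also tries to break on separators such as paragraphs and sentences). The function name `chunk_text` and the default sizes are assumptions for illustration only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: a stand-in for a real text splitter, shown only
    to illustrate why overlap keeps context across chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is made up purely of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk repeats the last character of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.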

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/zenjahid/docbot.git
cd docbot

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment

cp .env.example .env
# Edit .env with your API keys (optional for free tier)

4. Run API Server

cd api
uvicorn main:app --reload --port 8000

5. Run Frontend (New Terminal)

cd app
streamlit run streamlit_app.py --server.port 8501

6. Access Application

Streamlit UI: http://localhost:8501
Swagger Docs: http://localhost:8000/docs

🧪 Proof of Working

Demo output PDFs showing the chatbot in action:

File	Description
`outputs/output.pdf`	demo run with chat history and sources
`outputs/output_2.pdf`	demo run with chat history and sources

📁 Project Structure

docbot/
├── api/
│   └── main.py           # FastAPI endpoints
├── app/
│   └── streamlit_app.py  # Streamlit frontend
├── core/
│   ├── document_processor.py  # PDF/DOCX loading & chunking
│   ├── embeddings.py          # Free/Paid embedding factory
│   ├── vectorstore.py         # ChromaDB management
│   ├── memory.py              # Conversation memory
│   └── rag_chain.py           # RAG pipeline with hallucination control
├── config/
│   └── settings.py       # Pydantic settings
├── models/
│   └── schemas.py        # Pydantic models
├── outputs/                   # Demo output PDFs (proof of working)
│   ├── output.pdf
│   └── output_2.pdf
├── .env.example          # Environment template
├── requirements.txt      # Dependencies
└── README.md             # This file

🔧 Configuration

Environment Variables

Variable	Default	Description
`EMBEDDING_PROVIDER`	`free_huggingface`	`free_huggingface`, `free_watsonx`, `paid_openai`, `paid_gemini`
`FREE_EMBEDDING_MODEL`	`all-MiniLM-L6-v2`	HuggingFace model for free embeddings
`WATSONX_PROJECT_ID`	-	Required for IBM WatsonX (get from IBM Cloud)
`WATSONX_URL`	`https://us-south.ml.cloud.ibm.com`	IBM WatsonX endpoint
`OPENAI_API_KEY`	-	Required for OpenAI embedding/LLM
`GEMINI_API_KEY`	-	Required for Gemini embedding/LLM
`LLM_PROVIDER`	`gemini`	`openai` or `gemini`
`CHROMA_PERSIST_DIRECTORY`	`./chroma_db`	ChromaDB storage path
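
The table above can be read as a small settings object. The sketch below is not the project's `config/settings.py` (which uses Pydantic); it uses only the standard library so it runs without extra dependencies. The variable names and defaults come from the table, while the `Settings` class, `load_settings`, and `missing_keys` are hypothetical names, and the rules about which keys are needed are one reading of the "Required for" notes.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

EMBEDDING_PROVIDERS = ("free_huggingface", "free_watsonx", "paid_openai", "paid_gemini")
LLM_PROVIDERS = ("openai", "gemini")


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the "Default" column of the table.
    embedding_provider: str = "free_huggingface"
    free_embedding_model: str = "all-MiniLM-L6-v2"
    watsonx_project_id: Optional[str] = None
    watsonx_url: str = "https://us-south.ml.cloud.ibm.com"
    openai_api_key: Optional[str] = None
    gemini_api_key: Optional[str] = None
    llm_provider: str = "gemini"
    chroma_persist_directory: str = "./chroma_db"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    d = Settings()
    settings = Settings(
        embedding_provider=env.get("EMBEDDING_PROVIDER", d.embedding_provider),
        free_embedding_model=env.get("FREE_EMBEDDING_MODEL", d.free_embedding_model),
        watsonx_project_id=env.get("WATSONX_PROJECT_ID") or None,
        watsonx_url=env.get("WATSONX_URL", d.watsonx_url),
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        llm_provider=env.get("LLM_PROVIDER", d.llm_provider),
        chroma_persist_directory=env.get("CHROMA_PERSIST_DIRECTORY", d.chroma_persist_directory),
    )
    if settings.embedding_provider not in EMBEDDING_PROVIDERS:
        raise ValueError(f"EMBEDDING_PROVIDER must be one of {EMBEDDING_PROVIDERS}")
    if settings.llm_provider not in LLM_PROVIDERS:
        raise ValueError(f"LLM_PROVIDER must be one of {LLM_PROVIDERS}")
    return settings


def missing_keys(settings: Settings) -> List[str]:
    """List variables the table marks as required for the chosen providers.

    Assumption: a key is needed whenever its provider is selected for
    either embeddings or the LLM.
    """
    uses_openai = settings.embedding_provider == "paid_openai" or settings.llm_provider == "openai"
    uses_gemini = settings.embedding_provider == "paid_gemini" or settings.llm_provider == "gemini"
    missing: List[str] = []
    if settings.embedding_provider == "free_watsonx" and not settings.watsonx_project_id:
        missing.append("WATSONX_PROJECT_ID")
    if uses_openai and not settings.openai_api_key:
        missing.append("OPENAI_API_KEY")
    if uses_gemini and not settings.gemini_api_key:
        missing.append("GEMINI_API_KEY")
    return missing
```

Calling `missing_keys(load_settings())` at startup is one way to fail early with a readable message instead of an authentication error on the first request.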

Embedding Options

Provider	Cost	Quality	Setup Required
HuggingFace sentence-transformers	$0 (local)	Good	None (runs on your machine)
IBM WatsonX (from docchat-docling)	$0 (free tier)	Good	IBM Cloud account (Lite tier)
OpenAI	Paid	Excellent	API key
Google Gemini	Free tier	Excellent	API key

🧠 Architecture Overview

User Question
     │
     ▼
┌─────────────────┐
│ Input Sanitizer │ ← Prompt injection protection
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ ChromaDB Search │ ← Retrieve relevant chunks
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Relevance Checker   │ ← Score threshold + LLM check
└────────┬────────────┘
         │
    ┌────┴────┐
    │         │
  Relevant  Not Found
    │         │
    ▼         ▼
┌─────────┐  ┌──────────────────────────────────┐
│ RAG LLM │  │ "This information is not present │
│ + Memory│  │  in the provided document."      │
└────┬────┘  └──────────────────────────────────┘
     │
     ▼
┌─────────────────┐
│ Source Citations│ ← Similarity scores
└─────────────────┘
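
The flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration, not the code in `core/rag_chain.py`: retrieval, the LLM relevance check, and answer generation are passed in as plain callables, conversation memory is left out, and the function names, the injection patterns, and the `0.35` threshold are all assumptions. It also assumes scores where higher means more similar; if the vector store reports distances instead, the comparison would need to be inverted.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

NOT_FOUND = "This information is not present in the provided document."

# Hypothetical patterns; a real sanitizer would need a broader list.
_INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?(system|previous) (prompt|instructions)", re.IGNORECASE),
]

Hit = Tuple[str, float]  # (chunk text, similarity score)


@dataclass
class Answer:
    text: str
    sources: List[Hit]


def sanitize(question: str, max_len: int = 1000) -> str:
    """Drop control characters and known injection phrases, then truncate."""
    cleaned = "".join(ch for ch in question if ch.isprintable() or ch in "\n\t")
    for pattern in _INJECTION_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    return cleaned.strip()[:max_len]


def answer_question(
    question: str,
    retrieve: Callable[[str], List[Hit]],
    llm_is_relevant: Callable[[str, List[str]], bool],
    generate: Callable[[str, List[str]], str],
    min_score: float = 0.35,
) -> Answer:
    """Sanitize, retrieve, gate on relevance, then answer or fall back."""
    query = sanitize(question)
    if not query:
        return Answer(NOT_FOUND, [])

    # Gate 1: keep only chunks whose score clears the threshold.
    hits = [(chunk, score) for chunk, score in retrieve(query) if score >= min_score]
    chunks = [chunk for chunk, _ in hits]

    # Gate 2: ask the LLM whether the surviving chunks can answer the question.
    if not hits or not llm_is_relevant(query, chunks):
        return Answer(NOT_FOUND, [])

    # Only now is the generating LLM called; sources carry their scores.
    return Answer(generate(query, chunks), hits)
```

Placing both gates before generation means the answering model is never invoked when retrieval comes back empty or off-topic, so the fixed "not found" message is returned verbatim rather than paraphrased.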

📡 API Endpoints

Endpoint	Method	Description
`/health`	GET	Health check
`/chat`	POST	Chat with documents
`/upload-doc`	POST	Upload PDF/DOCX
`/list-docs`	GET	List uploaded documents
`/delete-doc`	POST	Delete a document
`/session/{id}/history`	GET	Get conversation history
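
As a rough illustration of calling the API from Python, the sketch below builds a `/chat` request with the standard library. The endpoint path and port come from this README; the JSON field names `question` and `session_id` are assumptions, so check the Swagger docs at `/docs` for the actual request schema before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_chat_request(question: str, session_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /chat.

    The payload field names are hypothetical; the real schema lives in
    models/schemas.py and is shown in the Swagger UI.
    """
    payload = {"question": question, "session_id": session_id}
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the body, assuming the server returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the API server from Quick Start to be running):
#   send(urllib.request.Request(f"{BASE_URL}/health"))
#   send(build_chat_request("What is the refund policy?", "demo-session"))
```

Reusing the same `session_id` across calls is what would let the server tie questions together for multi-turn memory, assuming sessions are keyed that way.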

📊 Evaluation Criteria Met

Criteria	Implementation
✅ Functional correctness	All endpoints working, exact "not found" response
🏗 Architecture and design quality	Clean separation, factory pattern, dependency injection
🧠 Hallucination prevention	Relevance check + strict system prompt
💻 Code quality and structure	Type hints, docstrings, modular design
📖 Clarity of documentation	This README + inline comments

⏱ Estimated Development Time

Component	Time
Project setup & config	1 hour
Document processing	1 hour
Embedding & vector store	1.5 hours
RAG pipeline	2 hours
Hallucination control	2 hours
Conversational memory	1.5 hours
API development	1 hour
Frontend development	2 hours
Testing & debugging	2 hours
Total	~14 hours

📜 License

MIT License - See LICENSE for details.

Built with ❤️ for the Technical Assessment
