A confidence-gated Retrieval-Augmented Generation system that prioritizes accuracy over completeness by refusing to answer when evidence is insufficient.
- Confidence Gating: Automatically refuses to answer questions when retrieved documents have low relevance scores
- Hybrid Retrieval: Combines semantic search (ChromaDB) with keyword matching (BM25)
- Cross-Encoder Reranking: Uses specialized reranking model for accurate document scoring
- Dual-Mode Operation: Toggle between verified mode (gating enabled) and naive mode (standard RAG)
- REST API: FastAPI backend with comprehensive logging
- Interactive UI: Streamlit dashboard for testing and visualization
- Python 3.11+
- LangChain: RAG framework and document processing
- ChromaDB: Vector database for semantic search
- BM25: Keyword-based sparse retrieval
- Google Gemini 1.5 Flash: LLM for answer generation (chosen for its free tier)
- FastAPI: REST API backend
- Streamlit: Interactive UI
- Sentence Transformers: Document embeddings and reranking
graph TD
A[User Query] --> B[FastAPI Server]
B --> C[Hybrid Retrieval]
C --> D[ChromaDB<br/>Semantic Search]
C --> E[BM25<br/>Keyword Search]
D --> F[Combine Documents<br/>10 from each, deduplicated]
E --> F
F --> G[Up to 20 Documents]
G --> H[Cross-Encoder Reranker<br/>ms-marco-MiniLM]
H --> I[Top 5 Ranked Documents]
I --> J{Confidence Gate<br/>Top Score >= Threshold?}
J -->|No| K[Refusal Response<br/>I do not have sufficient information]
J -->|Yes| L[Gemini 1.5 Flash<br/>Answer Generation]
L --> M[Answer + Citations + Sources]
K --> N[Return Response]
M --> N
Query Processing:
sequenceDiagram
participant User
participant API
participant Retriever
participant Reranker
participant Gate
participant LLM
User->>API: POST /query
API->>Retriever: Hybrid search (ChromaDB + BM25)
Retriever-->>API: Up to 20 documents
API->>Reranker: Score all documents
Reranker-->>API: Top 5 ranked with scores
API->>Gate: Check confidence
alt Score >= threshold
Gate->>LLM: Generate answer
LLM-->>API: Answer + citations
else Score < threshold
Gate-->>API: Refusal message
end
API-->>User: Response
1. Ingestion Pipeline (src/ingestion.py)
- Loads PDFs from data/raw/
- Chunks documents (1000 chars, 200 overlap)
- Generates embeddings (all-MiniLM-L6-v2)
- Stores in ChromaDB and BM25 index
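A minimal ingestion sketch along these lines, assuming a recent LangChain install; the actual src/ingestion.py may differ in structure and persistence details:

```python
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever

# Load every PDF under data/raw/
docs = PyPDFDirectoryLoader("data/raw/").load()

# Chunk into 1000-character pieces with 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed with all-MiniLM-L6-v2 and persist to ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="data/chroma")

# Build the keyword-side BM25 index over the same chunks (in memory here)
bm25 = BM25Retriever.from_documents(chunks)
```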
2. Hybrid Retrieval (src/retrieval.py)
- ChromaDB: Semantic similarity search (top 10)
- BM25: Keyword matching (top 10)
- Combined: Merges both sources, deduplicated (up to 20 documents)
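A sketch of the merge step, reusing the `vectorstore` and `bm25` objects from the ingestion sketch above; the deduplication key (page content) and merge order are assumptions, not the exact src/retrieval.py logic:

```python
def hybrid_retrieve(query: str, k: int = 10):
    """Combine semantic and keyword hits, dropping duplicate chunks."""
    semantic_hits = vectorstore.similarity_search(query, k=k)  # top 10 semantic
    bm25.k = k
    keyword_hits = bm25.invoke(query)                          # top 10 keyword

    merged, seen = [], set()
    for doc in semantic_hits + keyword_hits:
        if doc.page_content not in seen:                       # deduplicate
            seen.add(doc.page_content)
            merged.append(doc)
    return merged                                              # up to 2*k documents
```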
3. Reranking
- Model: cross-encoder/ms-marco-MiniLM-L-6-v2
- Scores all retrieved documents as query-document pairs
- Returns top 5 documents with normalized scores (0.0-1.0)
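A reranking sketch using sentence-transformers; the sigmoid mapping of raw cross-encoder logits to the 0.0-1.0 range is an assumption about how the project normalizes scores:

```python
from sentence_transformers import CrossEncoder
import numpy as np

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs, top_k: int = 5):
    pairs = [(query, d.page_content) for d in docs]   # query-document pairs
    raw = reranker.predict(pairs)                     # raw relevance logits
    scores = 1 / (1 + np.exp(-np.asarray(raw)))       # squash to 0.0-1.0 (assumed)
    order = np.argsort(scores)[::-1][:top_k]          # best first
    return [(docs[i], float(scores[i])) for i in order]
```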
4. Confidence Gate
- If top_score < threshold: REFUSE
- If top_score >= threshold: PROCEED to generation
- Default threshold: 0.25
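The gate itself reduces to a single comparison; a sketch (the refusal wording is taken from the architecture diagram above):

```python
DEFAULT_THRESHOLD = 0.25
REFUSAL_MESSAGE = "I do not have sufficient information to answer this question."

def apply_gate(ranked_docs, threshold: float = DEFAULT_THRESHOLD):
    """Refuse when the best rerank score falls below the threshold."""
    top_score = ranked_docs[0][1] if ranked_docs else 0.0
    if top_score < threshold:
        return {"refusal_triggered": True, "confidence_score": top_score,
                "answer": REFUSAL_MESSAGE}
    return {"refusal_triggered": False, "confidence_score": top_score}
```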
5. Answer Generator (src/generator.py)
- Model: Google Gemini 1.5 Flash Latest
- Temperature: 0 (deterministic)
- Citations required for all claims
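A generation sketch using langchain-google-genai; the prompt wording and citation format are illustrative, not the exact src/generator.py prompt:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Reads GOOGLE_API_KEY from the environment; temperature 0 for determinism
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash-latest", temperature=0)

def generate_answer(query: str, ranked_docs):
    # Number each chunk so the model can cite it as [n]
    context = "\n\n".join(
        f"[{i + 1}] {doc.page_content}" for i, (doc, _score) in enumerate(ranked_docs)
    )
    prompt = (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n] for every claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content
```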
- Python 3.11 or higher
- Google API key (or another provider's key, e.g. an OpenAI key; adjust the commands and environment variables accordingly)
- Create virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a .env file:
GOOGLE_API_KEY=your_google_api_key_here
- Place PDF files in the data/raw/ directory
- Run ingestion:
python src/ingestion.py
- Start the API server:
python src/main.py
API available at http://localhost:8000
- Launch the Streamlit UI:
streamlit run src/app.py
UI opens at http://localhost:8501
Health check endpoint.
curl http://localhost:8000/health
Response:
{
"status": "healthy"
}
Submit a query to the RAG system.
Request:
{
"query": "string",
"enable_gating": true,
"threshold_override": 0.25
}
Parameters:
- query (required): The question to answer
- enable_gating (optional, default: true): Enable confidence gating
- threshold_override (optional, default: 0.25): Confidence threshold (0.0-1.0)
Response:
{
"answer": "string",
"refusal_triggered": false,
"confidence_score": 0.85,
"sources": [...],
"mode": "verified"
}
Examples:
Verified mode (with gating):
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is Tata Motors revenue?",
"enable_gating": true,
"threshold_override": 0.25
}'
Naive mode (no gating):
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is Tata Motors revenue?",
"enable_gating": false
}'
Python example:
import requests
response = requests.post("http://localhost:8000/query", json={
"query": "What is the revenue?",
"enable_gating": True,
"threshold_override": 0.25
})
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence_score']:.2f}")
print(f"Refused: {result['refusal_triggered']}")Test system on 50 questions (25 answerable, 25 unanswerable):
python eval/evaluate.py
Generates:
- eval/results_verified.json: Results with gating enabled
- eval/results_naive.json: Results with gating disabled
python eval/analyze_results.py
Creates eval/report.md with precision, recall, and hallucination metrics.
python eval/tune_threshold.py
Tests thresholds [0.15, 0.20, 0.25, 0.30, 0.35, 0.40] and recommends the one with the best F1 score.
python eval/error_analysis.py
Creates docs/error_analysis.md with failure analysis and recommendations.
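For reference, the threshold sweep boils down to a few lines; the field names below (confidence_score, answerable) are assumptions about the results JSON, not the project's actual schema:

```python
import json

CANDIDATE_THRESHOLDS = [0.15, 0.20, 0.25, 0.30, 0.35, 0.40]

def f1_at_threshold(records, threshold):
    """Treat 'answered' (score >= threshold) as the positive class."""
    tp = sum(1 for r in records if r["confidence_score"] >= threshold and r["answerable"])
    fp = sum(1 for r in records if r["confidence_score"] >= threshold and not r["answerable"])
    fn = sum(1 for r in records if r["confidence_score"] < threshold and r["answerable"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

with open("eval/results_verified.json") as f:
    records = json.load(f)

best = max(CANDIDATE_THRESHOLDS, key=lambda t: f1_at_threshold(records, t))
print(f"Recommended threshold: {best} (F1={f1_at_threshold(records, best):.2f})")
```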
Based on golden dataset evaluation:
| Metric | Verified Mode | Naive Mode |
|---|---|---|
| Recall (Answerable) | ~90% | ~95% |
| Precision (Unanswerable) | ~85% | ~20% |
| Hallucination Rate | Low | High |
Key Finding: Verified mode significantly reduces hallucinations while maintaining high recall.
Latency:
- Refusal: 200-500ms (no LLM generation)
- Answer: 1-3 seconds (includes retrieval + reranking + LLM)
Resource Usage:
- ChromaDB: ~100MB per 1000 documents
- BM25: ~50MB per 1000 documents
- Cross-encoder model: ~400MB
- Embedding model: ~80MB
Why Hybrid Retrieval?
- Vector search captures semantic meaning (top 10 results)
- BM25 captures exact keywords (top 10 results)
- Combining both methods provides comprehensive coverage
- Deduplication ensures unique documents
Why Confidence Gating?
- Prevents hallucinations on low-quality retrievals
- Builds user trust through explicit uncertainty
- Better to refuse than provide wrong information
Why Cross-Encoder Reranking?
- More accurate than bi-encoder similarity
- Processes query-document pairs jointly
- Worth the ~200ms overhead for 20 documents
- Developed by Samuel Israel and Syed Farhan