
VeritasRAG: Hallucination-Resistant RAG System

A confidence-gated Retrieval-Augmented Generation system that prioritizes accuracy over completeness by refusing to answer when evidence is insufficient.

Key Features

  • Confidence Gating: Automatically refuses to answer questions when retrieved documents have low relevance scores
  • Hybrid Retrieval: Combines semantic search (ChromaDB) with keyword matching (BM25)
  • Cross-Encoder Reranking: Uses a specialized reranking model for accurate document scoring
  • Dual-Mode Operation: Toggle between verified mode (gating enabled) and naive mode (standard RAG)
  • REST API: FastAPI backend with comprehensive logging
  • Interactive UI: Streamlit dashboard for testing and visualization

Tech Stack

  • Python 3.11+
  • LangChain: RAG framework and document processing
  • ChromaDB: Vector database for semantic search
  • BM25: Keyword-based sparse retrieval
  • Google Gemini 1.5 Flash: LLM for answer generation (chosen for its free tier)
  • FastAPI: REST API backend
  • Streamlit: Interactive UI
  • Sentence Transformers: Document embeddings and reranking

Architecture

graph TD
    A[User Query] --> B[FastAPI Server]
    B --> C[Hybrid Retrieval]

    C --> D[ChromaDB<br/>Semantic Search]
    C --> E[BM25<br/>Keyword Search]

    D --> F[Combine Documents<br/>10 from each, deduplicated]
    E --> F

    F --> G[Up to 20 Documents]
    G --> H[Cross-Encoder Reranker<br/>ms-marco-MiniLM]

    H --> I[Top 5 Ranked Documents]
    I --> J{Confidence >= threshold?}

    J -->|No| K[Refusal Response<br/>I do not have sufficient information]
    J -->|Yes| L[Gemini 1.5 Flash<br/>Answer Generation]

    L --> M[Answer + Citations + Sources]
    K --> N[Return Response]
    M --> N

Data Flow

Query Processing:

sequenceDiagram
    participant User
    participant API
    participant Retriever
    participant Reranker
    participant Gate
    participant LLM

    User->>API: POST /query
    API->>Retriever: Hybrid search (ChromaDB + BM25)
    Retriever-->>API: Up to 20 documents
    API->>Reranker: Score all documents
    Reranker-->>API: Top 5 ranked with scores
    API->>Gate: Check confidence
    alt Score >= threshold
        Gate->>LLM: Generate answer
        LLM-->>API: Answer + citations
    else Score < threshold
        Gate-->>API: Refusal message
    end
    API-->>User: Response

Components

1. Ingestion Pipeline (src/ingestion.py)

  • Loads PDFs from data/raw/
  • Chunks documents (1000 chars, 200 overlap)
  • Generates embeddings (all-MiniLM-L6-v2)
  • Stores in ChromaDB and BM25 index
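
A minimal sketch of this flow, assuming current LangChain packages (the loader, splitter, and store imports shown here are illustrative and may differ from the repo's actual code):

from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

def ingest(raw_dir: str = "data/raw", persist_dir: str = "data/chroma"):
    # Load every PDF under data/raw/
    docs = []
    for pdf in Path(raw_dir).glob("*.pdf"):
        docs.extend(PyPDFLoader(str(pdf)).load())

    # Chunk into 1000-character pieces with 200-character overlap
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)

    # Embed with all-MiniLM-L6-v2 and persist to ChromaDB
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    Chroma.from_documents(chunks, embeddings, persist_directory=persist_dir)

    # The same chunks are also used to build the BM25 index
    return chunks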

2. Hybrid Retrieval (src/retrieval.py)

  • ChromaDB: Semantic similarity search (top 10)
  • BM25: Keyword matching (top 10)
  • Combined: Merges both sources, deduplicated (up to 20 documents)
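
A sketch of the merge-and-deduplicate step (names are illustrative; assumes the Chroma store from ingestion and a BM25 retriever configured to return 10 documents):

def hybrid_retrieve(query: str, vectorstore, bm25_retriever, k: int = 10):
    # Top 10 from semantic search plus top 10 from keyword search
    semantic_docs = vectorstore.similarity_search(query, k=k)
    keyword_docs = bm25_retriever.invoke(query)  # retriever assumed configured with k=10

    # Merge and deduplicate by content, yielding up to 20 unique documents
    merged, seen = [], set()
    for doc in semantic_docs + keyword_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged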

3. Reranking

  • Model: cross-encoder/ms-marco-MiniLM-L-6-v2
  • Scores all retrieved documents as query-document pairs
  • Returns top 5 documents with normalized scores (0.0-1.0)
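
Roughly how this step can be expressed with sentence-transformers (the sigmoid normalization below is an assumption about how raw scores are mapped to 0.0-1.0):

import math
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs, top_k: int = 5):
    # Score each (query, document) pair jointly with the cross-encoder
    scores = reranker.predict([(query, d.page_content) for d in docs])

    # Map raw logits into 0.0-1.0 with a sigmoid (assumed normalization)
    normalized = [1.0 / (1.0 + math.exp(-s)) for s in scores]

    ranked = sorted(zip(docs, normalized), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]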

4. Confidence Gate

  • If top_score < threshold: REFUSE
  • If top_score >= threshold: PROCEED to generation
  • Default threshold: 0.25
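
The gate itself is a single comparison against the top reranked score; a sketch (the refusal wording is illustrative):

DEFAULT_THRESHOLD = 0.25
REFUSAL_MESSAGE = "I do not have sufficient information to answer this question."

def should_answer(top_score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    # Proceed to generation only when the best reranked score clears the threshold
    return top_score >= threshold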

5. Answer Generator (src/generator.py)

  • Model: Google Gemini 1.5 Flash Latest
  • Temperature: 0 (deterministic)
  • Citations required for all claims
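
A sketch of the generation step using langchain-google-genai (the prompt wording is illustrative, not the repo's actual prompt):

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash-latest", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer only from the provided context and cite the numbered source of every claim. "
     "If the context is insufficient, say so."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

def generate_answer(question: str, docs) -> str:
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {d.page_content}" for i, d in enumerate(docs))
    return (prompt | llm).invoke({"context": context, "question": question}).content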

Installation

Prerequisites

  • Python 3.11 or higher
  • Google API key (or another provider's key, e.g. OpenAI; adjust the environment variables and commands accordingly)

Setup

  1. Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  2. Install dependencies:

pip install -r requirements.txt

  3. Create a .env file:

GOOGLE_API_KEY=your_google_api_key_here

  4. Place PDF files in the data/raw/ directory.

  5. Run ingestion:

python src/ingestion.py

Usage

Start API Server

python src/main.py

API available at http://localhost:8000

Launch Streamlit UI

streamlit run src/app.py

UI opens at http://localhost:8501

API Endpoints

GET /health

Health check endpoint.

curl http://localhost:8000/health

Response:

{
  "status": "healthy"
}

POST /query

Submit a query to the RAG system.

Request:

{
  "query": "string",
  "enable_gating": true,
  "threshold_override": 0.25
}

Parameters:

  • query (required): The question to answer
  • enable_gating (optional, default: true): Enable confidence gating
  • threshold_override (optional, default: 0.25): Confidence threshold (0.0-1.0)

Response:

{
  "answer": "string",
  "refusal_triggered": false,
  "confidence_score": 0.85,
  "sources": [...],
  "mode": "verified"
}

Examples:

Verified mode (with gating):

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is Tata Motors revenue?",
    "enable_gating": true,
    "threshold_override": 0.25
  }'

Naive mode (no gating):

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is Tata Motors revenue?",
    "enable_gating": false
  }'

Python example:

import requests

response = requests.post("http://localhost:8000/query", json={
    "query": "What is the revenue?",
    "enable_gating": True,
    "threshold_override": 0.25
})

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence_score']:.2f}")
print(f"Refused: {result['refusal_triggered']}")

Evaluation

Run Evaluation

Tests the system on 50 questions (25 answerable, 25 unanswerable):

python eval/evaluate.py

Generates:

  • eval/results_verified.json: Results with gating enabled
  • eval/results_naive.json: Results with gating disabled

Generate Metrics Report

python eval/analyze_results.py

Creates eval/report.md with precision, recall, and hallucination metrics.

Find Optimal Threshold

python eval/tune_threshold.py

Tests thresholds [0.15, 0.20, 0.25, 0.30, 0.35, 0.40] and recommends best F1 score.
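
A simplified sketch of what such a sweep can look like, assuming each evaluation record stores whether the question was answerable and the confidence score the system assigned (field names are illustrative):

THRESHOLDS = [0.15, 0.20, 0.25, 0.30, 0.35, 0.40]

def f1_at_threshold(records, threshold):
    # Treat "answered an answerable question" as a true positive and
    # "answered an unanswerable question" as a false positive
    tp = sum(1 for r in records if r["answerable"] and r["confidence_score"] >= threshold)
    fp = sum(1 for r in records if not r["answerable"] and r["confidence_score"] >= threshold)
    fn = sum(1 for r in records if r["answerable"] and r["confidence_score"] < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def best_threshold(records):
    return max(THRESHOLDS, key=lambda t: f1_at_threshold(records, t))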

Error Analysis

python eval/error_analysis.py

Creates docs/error_analysis.md with failure analysis and recommendations.

Evaluation Results

Based on the golden-dataset evaluation:

Metric                      Verified Mode    Naive Mode
Recall (Answerable)         ~90%             ~95%
Precision (Unanswerable)    ~85%             ~20%
Hallucination Rate          Low              High

Key Finding: Verified mode significantly reduces hallucinations while maintaining high recall.

Performance

Latency:

  • Refusal: 200-500ms (no LLM generation)
  • Answer: 1-3 seconds (includes retrieval + reranking + LLM)

Resource Usage:

  • ChromaDB: ~100MB per 1000 documents
  • BM25: ~50MB per 1000 documents
  • Cross-encoder model: ~400MB
  • Embedding model: ~80MB

Design Decisions

Why Hybrid Retrieval?

  • Vector search captures semantic meaning (top 10 results)
  • BM25 captures exact keywords (top 10 results)
  • Combining both methods provides comprehensive coverage
  • Deduplication ensures unique documents

Why Confidence Gating?

  • Prevents hallucinations on low-quality retrievals
  • Builds user trust through explicit uncertainty
  • Better to refuse than provide wrong information

Why Cross-Encoder Reranking?

  • More accurate than bi-encoder similarity
  • Processes query-document pairs jointly
  • Worth the ~200ms overhead for 20 documents

Developers

  • Samuel Israel and Syed Farhan
