A confidence-gated Retrieval-Augmented Generation system that prioritizes accuracy over completeness by refusing to answer when evidence is insufficient.
- Confidence Gating: Automatically refuses to answer questions when retrieved documents have low relevance scores
- Hybrid Retrieval: Combines semantic search (ChromaDB) with keyword matching (BM25)
- Cross-Encoder Reranking: Uses specialized reranking model for accurate document scoring
- Dual-Mode Operation: Toggle between verified mode (gating enabled) and naive mode (standard RAG)
- REST API: FastAPI backend with comprehensive logging
- Interactive UI: Streamlit dashboard for testing and visualization
- Python 3.11+
- LangChain: RAG framework and document processing
- ChromaDB: Vector database for semantic search
- BM25: Keyword-based sparse retrieval
- Google Gemini 1.5 Flash: LLM for answer generation (chosen for its free tier)
- FastAPI: REST API backend
- Streamlit: Interactive UI
- Sentence Transformers: Document embeddings and reranking
graph TD
A[User Query] --> B[FastAPI Server]
B --> C[Hybrid Retrieval]
C --> D[ChromaDB<br/>Semantic Search]
C --> E[BM25<br/>Keyword Search]
D --> F[Combine Documents<br/>10 from each, deduplicated]
E --> F
F --> G[Up to 20 Documents]
G --> H[Cross-Encoder Reranker<br/>ms-marco-MiniLM]
H --> I[Top 5 Ranked Documents]
I --> J{Confidence Gate<br/>Top Score >= Threshold?}
J -->|No| K[Refusal Response<br/>I do not have sufficient information]
J -->|Yes| L[Gemini 1.5 Flash<br/>Answer Generation]
L --> M[Answer + Citations + Sources]
K --> N[Return Response]
M --> N
Query Processing:
sequenceDiagram
participant User
participant API
participant Retriever
participant Reranker
participant Gate
participant LLM
User->>API: POST /query
API->>Retriever: Hybrid search (ChromaDB + BM25)
Retriever-->>API: Up to 20 documents
API->>Reranker: Score all documents
Reranker-->>API: Top 5 ranked with scores
API->>Gate: Check confidence
alt Score >= threshold
Gate->>LLM: Generate answer
LLM-->>API: Answer + citations
else Score < threshold
Gate-->>API: Refusal message
end
API-->>User: Response
1. Ingestion Pipeline (src/ingestion.py)
- Loads PDFs from data/raw/
- Chunks documents (1000 chars, 200 overlap)
- Generates embeddings (all-MiniLM-L6-v2)
- Stores in ChromaDB and BM25 index
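A minimal ingestion sketch along these lines, assuming a recent LangChain install; the actual src/ingestion.py may differ in structure and persistence details:

```python
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever

# Load every PDF under data/raw/
docs = PyPDFDirectoryLoader("data/raw/").load()

# Chunk into 1000-character pieces with 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed with all-MiniLM-L6-v2 and persist to ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="data/chroma")

# Build the keyword-side BM25 index over the same chunks (in memory here)
bm25 = BM25Retriever.from_documents(chunks)
```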
2. Hybrid Retrieval (src/retrieval.py)
- ChromaDB: Semantic similarity search (top 10)
- BM25: Keyword matching (top 10)
- Combined: Merges both sources, deduplicated (up to 20 documents)
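A sketch of the merge step, reusing the `vectorstore` and `bm25` objects from the ingestion sketch above; the deduplication key (page content) and merge order are assumptions, not the exact src/retrieval.py logic:

```python
def hybrid_retrieve(query: str, k: int = 10):
    """Combine semantic and keyword hits, dropping duplicate chunks."""
    semantic_hits = vectorstore.similarity_search(query, k=k)  # top 10 semantic
    bm25.k = k
    keyword_hits = bm25.invoke(query)                          # top 10 keyword

    merged, seen = [], set()
    for doc in semantic_hits + keyword_hits:
        if doc.page_content not in seen:                       # deduplicate
            seen.add(doc.page_content)
            merged.append(doc)
    return merged                                              # up to 2*k documents
```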
3. Reranking
- Model: cross-encoder/ms-marco-MiniLM-L-6-v2
- Scores all retrieved documents as query-document pairs
- Returns top 5 documents with normalized scores (0.0-1.0)
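A reranking sketch using sentence-transformers; the sigmoid mapping of raw cross-encoder logits to the 0.0-1.0 range is an assumption about how the project normalizes scores:

```python
from sentence_transformers import CrossEncoder
import numpy as np

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs, top_k: int = 5):
    pairs = [(query, d.page_content) for d in docs]   # query-document pairs
    raw = reranker.predict(pairs)                     # raw relevance logits
    scores = 1 / (1 + np.exp(-np.asarray(raw)))       # squash to 0.0-1.0 (assumed)
    order = np.argsort(scores)[::-1][:top_k]          # best first
    return [(docs[i], float(scores[i])) for i in order]
```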
4. Confidence Gate
- If top_score < threshold: REFUSE
- If top_score >= threshold: PROCEED to generation
- Default threshold: 0.25
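The gate itself reduces to a single comparison; a sketch (the refusal wording is taken from the architecture diagram above):

```python
DEFAULT_THRESHOLD = 0.25
REFUSAL_MESSAGE = "I do not have sufficient information to answer this question."

def apply_gate(ranked_docs, threshold: float = DEFAULT_THRESHOLD):
    """Refuse when the best rerank score falls below the threshold."""
    top_score = ranked_docs[0][1] if ranked_docs else 0.0
    if top_score < threshold:
        return {"refusal_triggered": True, "confidence_score": top_score,
                "answer": REFUSAL_MESSAGE}
    return {"refusal_triggered": False, "confidence_score": top_score}
```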
5. Answer Generator (src/generator.py)
- Model: Google Gemini 1.5 Flash Latest
- Temperature: 0 (deterministic)
- Citations required for all claims
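A generation sketch using langchain-google-genai; the prompt wording and citation format are illustrative, not the exact src/generator.py prompt:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Reads GOOGLE_API_KEY from the environment; temperature 0 for determinism
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash-latest", temperature=0)

def generate_answer(query: str, ranked_docs):
    # Number each chunk so the model can cite it as [n]
    context = "\n\n".join(
        f"[{i + 1}] {doc.page_content}" for i, (doc, _score) in enumerate(ranked_docs)
    )
    prompt = (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n] for every claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content
```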
- Python 3.11 or higher
- Google API key (or another provider's key, e.g. an OpenAI key; adjust the commands and environment variables accordingly)
- Create virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a .env file:
GOOGLE_API_KEY=your_google_api_key_here
- Place PDF files in the data/raw/ directory
- Run ingestion:
python src/ingestion.py
- Start the API server:
python src/main.py
API available at http://localhost:8000
- Launch the Streamlit UI:
streamlit run src/app.py
UI opens at http://localhost:8501
Health check endpoint.
curl http://localhost:8000/health
Response:
{
"status": "healthy"
}
Submit a query to the RAG system.
Request:
{
"query": "string",
"enable_gating": true,
"threshold_override": 0.25
}
Parameters:
- query (required): The question to answer
- enable_gating (optional, default: true): Enable confidence gating
- threshold_override (optional, default: 0.25): Confidence threshold (0.0-1.0)
Response:
{
"answer": "string",
"refusal_triggered": false,
"confidence_score": 0.85,
"sources": [...],
"mode": "verified"
}
Examples:
Verified mode (with gating):
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is Tata Motors revenue?",
"enable_gating": true,
"threshold_override": 0.25
}'
Naive mode (no gating):
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is Tata Motors revenue?",
"enable_gating": false
}'
Python example:
import requests
response = requests.post("http://localhost:8000/query", json={
"query": "What is the revenue?",
"enable_gating": True,
"threshold_override": 0.25
})
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence_score']:.2f}")
print(f"Refused: {result['refusal_triggered']}")Test system on 50 questions (25 answerable, 25 unanswerable):
python eval/evaluate.py
Generates:
- eval/results_verified.json: Results with gating enabled
- eval/results_naive.json: Results with gating disabled
python eval/analyze_results.py
Creates eval/report.md with precision, recall, and hallucination metrics.
python eval/tune_threshold.py
Tests thresholds [0.15, 0.20, 0.25, 0.30, 0.35, 0.40] and recommends the one with the best F1 score.
python eval/error_analysis.py
Creates docs/error_analysis.md with failure analysis and recommendations.
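For reference, the threshold sweep boils down to a few lines; the field names below (confidence_score, answerable) are assumptions about the results JSON, not the project's actual schema:

```python
import json

CANDIDATE_THRESHOLDS = [0.15, 0.20, 0.25, 0.30, 0.35, 0.40]

def f1_at_threshold(records, threshold):
    """Treat 'answered' (score >= threshold) as the positive class."""
    tp = sum(1 for r in records if r["confidence_score"] >= threshold and r["answerable"])
    fp = sum(1 for r in records if r["confidence_score"] >= threshold and not r["answerable"])
    fn = sum(1 for r in records if r["confidence_score"] < threshold and r["answerable"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

with open("eval/results_verified.json") as f:
    records = json.load(f)

best = max(CANDIDATE_THRESHOLDS, key=lambda t: f1_at_threshold(records, t))
print(f"Recommended threshold: {best} (F1={f1_at_threshold(records, best):.2f})")
```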
Based on golden dataset evaluation:
| Metric | Verified Mode | Naive Mode |
|---|---|---|
| Recall (Answerable) | ~90% | ~95% |
| Precision (Unanswerable) | ~85% | ~20% |
| Hallucination Rate | Low | High |
Key Finding: Verified mode significantly reduces hallucinations while maintaining high recall.
Latency:
- Refusal: 200-500ms (no LLM generation)
- Answer: 1-3 seconds (includes retrieval + reranking + LLM)
Resource Usage:
- ChromaDB: ~100MB per 1000 documents
- BM25: ~50MB per 1000 documents
- Cross-encoder model: ~400MB
- Embedding model: ~80MB
Why Hybrid Retrieval?
- Vector search captures semantic meaning (top 10 results)
- BM25 captures exact keywords (top 10 results)
- Combining both methods provides comprehensive coverage
- Deduplication ensures unique documents
Why Confidence Gating?
- Prevents hallucinations on low-quality retrievals
- Builds user trust through explicit uncertainty
- Better to refuse than provide wrong information
Why Cross-Encoder Reranking?
- More accurate than bi-encoder similarity
- Processes query-document pairs jointly
- Worth the ~200ms overhead for 20 documents
- Developed by Samuel Israel and Syed Farhan