This project presents an "Active Retrieval-Augmented Generation (RAG)" chat application built with Streamlit, designed to mitigate hallucination in Large Language Model (LLM) responses. Inspired by the methodology outlined in the research paper "Active Retrieval Augmented Generation", this implementation extends the RAG paradigm by introducing a dynamic, iterative self-correction mechanism.
The core idea is to enhance the factual accuracy and reliability of LLM-generated answers by integrating real-time validation against external knowledge sources. The application provides a transparent view into this "thinking process," allowing users to observe the AI's internal validation steps. Alongside the interactive chat, an accompanying evaluation script is included to quantitatively assess the effectiveness of this active hallucination mitigation strategy.
- Interactive Chat Interface: A responsive front-end built with Streamlit.
- Active RAG Pipeline: Implements a novel, multi-step iterative process for generating and validating LLM responses.
- Hallucination Mitigation: Reduces the generation of factually incorrect or unverified information by the LLM.
- Transparent Thinking Process: Users can expand a dedicated section within each AI response to view the detailed steps of sentence generation, keyword extraction, validation question formulation, external retrieval, and rechecking.
- External Knowledge Integration: Utilizes Google Custom Search Engine (CSE) for real-time information retrieval and factual validation.
- LLM Backbone: Powered by `gemini-2.0-flash-lite` for both content generation and validation tasks.
The heart of this project is a sophisticated, iterative RAG pipeline designed to build and validate answers sentence by sentence. This process involves multiple interactions with the LLM and an external search engine:
- Sentence-by-Sentence Generation: The LLM (`gemini-2.0-flash-lite`) is prompted to generate the answer to the user's question one sentence at a time, building upon the previously generated context.
- Keyword and Concept Extraction: From each newly generated sentence, the LLM identifies and extracts 2-5 main topics, keyphrases, or factual claims that require verification.
- Validation Question Generation: The LLM then formulates specific, relevant questions around these identified concepts. These questions are designed to be answerable by an external search and to help determine the factual correctness of the original generated sentence.
- External Retrieval (Google CSE): The generated validation questions are used to query a Google Custom Search Engine (CSE). The search results (snippets) provide external evidence.
- LLM-based Validation: Another instance of the LLM (`gemini-2.0-flash-lite`, potentially configured with a separate API key for a clear separation of roles) takes the validation question and the retrieved snippets. It then determines whether the original claim in the generated sentence is supported by the external evidence, providing a "Yes" or "No" and a brief explanation.
- Sentence Correction/Refinement: Finally, the LLM uses these validation results to "recheck" and, if necessary, correct the original generated sentence. If a claim is found to be unverified or incorrect, the sentence is revised to ensure factual accuracy before being added to the final answer.
This iterative process continues until the LLM indicates the answer is complete or a maximum number of sentences is reached.
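The loop above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper methods (`generate_sentence`, `extract_claims`, `make_question`, `validate`, `recheck`) are hypothetical stand-ins for the prompts sent to the generator and validator LLMs, and `search` stands in for the Google CSE call:

```python
def active_generate(question, llm, search, max_sentences=10):
    """Build an answer sentence by sentence, validating each one externally."""
    answer = []
    for _ in range(max_sentences):
        # 1. Generate the next sentence given the context built so far.
        sentence = llm.generate_sentence(question, answer)
        if sentence is None:  # the LLM signals the answer is complete
            break
        # 2. Extract the claims in the sentence that need verification.
        claims = llm.extract_claims(sentence)
        validations = []
        for claim in claims:
            # 3. Turn the claim into a search-answerable validation question.
            query = llm.make_question(claim)
            # 4. Retrieve external evidence (snippets) for that question.
            snippets = search(query)
            # 5. Ask the validator LLM whether the evidence supports the claim.
            validations.append(llm.validate(claim, snippets))
        # 6. Recheck/correct the sentence using the validation results.
        answer.append(llm.recheck(sentence, validations))
    return " ".join(answer)
```

With stubbed-out LLM and search callables, the function simply accumulates validated sentences until the generator signals completion.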
The effectiveness of this active RAG method was tested using a subset of the WikiHowNFQA dataset. The `wikiHow.py` script automates this evaluation, comparing the original LLM responses (before correction) and the actively corrected responses against reference answers using ROUGE-L and BERTScore metrics.
The evaluation, conducted on 1036 questions from the dataset (limited by time and API rate constraints), yielded the following insights:
- 27.89% of responses exhibited improvement in both ROUGE-L and BERTScore, highlighting a substantial positive impact on factual accuracy and semantic similarity.
- 25.90% of responses showed improvement in either ROUGE-L or BERTScore, demonstrating a partial improvement in quality.
- 30.98% of responses remained factually consistent, indicating the initial generation was already accurate and required no correction.
These results underscore the potential of active RAG methods in enhancing LLM reliability, while also pointing to the ongoing challenges in achieving perfect factual consistency.
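For intuition about how these comparisons are scored, ROUGE-L measures the longest common subsequence (LCS) of tokens between a candidate and a reference answer. The evaluation script uses established metric libraries; the following is just a minimal pure-Python version of the ROUGE-L F1 computation for illustration:

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS length over candidate/reference token counts."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table: dp[i][j] = LCS length of c[:i] and r[:j].
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            if ct == rt:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

An "improved" response is then simply one whose corrected answer scores higher against the reference than the original answer did.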
Follow these steps to set up and run the application locally.
- Python 3.8+
- `pip` (Python package installer)
- `git` (for cloning the repository)
- Google Cloud Project & API Keys:
  - You'll need at least two Google Cloud Projects with the Gemini API enabled. You can use just one, but you will exhaust the daily request quota quickly.
  - Generate two separate API keys for the Gemini API. These will be used for `model_generate` and `model_validate` respectively (e.g., `GEMINI_API_1` and `GEMINI_API_2`).
- Google Custom Search Engine (CSE) API Key & Search Engine ID:
- Create a Custom Search Engine (CSE) at Programmable Search Engine.
  - Obtain your Search Engine ID (CX) (`GOOGLE_CX_ID`).
  - Enable the Custom Search API in your Google Cloud Project.
  - Generate an API key for the Custom Search API (e.g., `GOOGLE_SEARCH_API`).
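For reference, the retrieval step talks to the Custom Search JSON API at `https://www.googleapis.com/customsearch/v1`. Below is a small sketch of building the request URL and pulling snippets out of a response; the function names are illustrative, and the actual network call is omitted:

```python
import json
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_cse_url(api_key, cx, query, num=5):
    """Build a Custom Search JSON API request URL for a validation question."""
    params = {"key": api_key, "cx": cx, "q": query, "num": num}
    return CSE_ENDPOINT + "?" + urlencode(params)

def extract_snippets(response_json):
    """Pull the result snippets out of a Custom Search API JSON response."""
    data = json.loads(response_json)
    return [item.get("snippet", "") for item in data.get("items", [])]
```

The returned snippets are what the validator LLM sees as external evidence for each claim.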
- Clone the Repository:

  ```bash
  git clone https://github.com/r1thk4/hallucination-mitigator
  cd hallucination-mitigator
  ```
- Create a Virtual Environment (Recommended):

  ```bash
  python -m venv venv
  ```
- Activate the Virtual Environment:
  - On macOS/Linux:

    ```bash
    source venv/bin/activate
    ```

  - On Windows:

    ```bash
    .\venv\Scripts\activate
    ```
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set Environment Variables: Create a `.env` file in the root directory of your project (same level as `app.py`) and add your API keys and CSE details:

  ```
  GEMINI_API_1="YOUR_GEMINI_API_KEY_FOR_GENERATION"
  GEMINI_API_2="YOUR_GEMINI_API_KEY_FOR_VALIDATION"
  GOOGLE_SEARCH_API="YOUR_GOOGLE_CUSTOM_SEARCH_API_KEY"
  GOOGLE_CX_ID="YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID"
  ```

  Important: Do not commit your `.env` file to GitHub. It's crucial for security and is typically ignored by Git (via `.gitignore`).
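A minimal sketch of how the app can read these variables at startup, failing fast if any is missing. This uses `os.environ` directly; the actual code may instead call `python-dotenv`'s `load_dotenv()` to populate the environment from the `.env` file first, and the `load_config` helper name is illustrative:

```python
import os

REQUIRED_KEYS = ("GEMINI_API_1", "GEMINI_API_2", "GOOGLE_SEARCH_API", "GOOGLE_CX_ID")

def load_config():
    """Read the required API keys from the environment, raising if any is absent."""
    missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {k: os.environ[k] for k in REQUIRED_KEYS}
```

Failing fast at startup gives a clearer error than a cryptic authentication failure mid-conversation.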
- Activate your virtual environment (if not already active).
- Navigate to the project's root directory (where `app.py` is located):

  ```bash
  cd your-github-repo-name
  ```

- Run the Streamlit app:

  ```bash
  streamlit run app.py
  ```

  This command will open the application in your default web browser.
To replicate the evaluation of the Active RAG pipeline:
- Activate your virtual environment.
- Navigate to the project's root directory (`your-github-repo-name/`).
- Execute the script as a Python module:

  ```bash
  python -m testingWikiHowNFQA.wikiHow
  ```

  This script will load a subset of the WikiHowNFQA dataset, run the `active_generate` pipeline for each question, and print evaluation metrics (ROUGE-L, BERTScore) comparing the original and corrected answers against the reference answers. Be aware of potential rate limits from API calls during extensive evaluation.