RAG Article Query Filter

A hybrid retrieval system for news articles that combines dense vector search and BM25 keyword matching, with a BERT-based toxicity gate on incoming queries.

Built as a self-contained ML engineering project to demonstrate systematic retrieval experimentation, modular Python packaging, and production-ready API design.

Architecture

Raw corpus (CSV)
      │
      ▼
┌──────────────────────┐   SentenceSplitter (256 tok)  ┌────────────────┐
│      src/data        │ ────────────────────────────► │  src/indexing  │
│  minimal_clean_text  │                               │  ChromaDB      │
└─────────────┘                                     │  bge-large-v1.5│
                                                    └───────┬────────┘
                                                            │
                        ┌───────────────────────────────────┤
                        │                                   │
               ┌────────▼────────┐                 ┌────────▼────────┐
               │  Vector Search  │                 │  BM25 Retriever │
               │  bge-large-v1.5 │                 │  word tokeniser │
               └────────┬────────┘                 └────────┬────────┘
                        │                                   │
                        └──────────────┬────────────────────┘
                                       │ Reciprocal Rank Fusion
                                       ▼
                              Ranked document UUIDs
                                       │
                              ┌────────▼────────┐
                              │ src/query_filter │
                              │  BERT toxicity  │
                              └────────┬────────┘
                                       │
                              ┌────────▼────────┐
                              │    src/api       │
                              │    FastAPI       │
                              └─────────────────┘

Results

Five retrieval strategies evaluated on 500 queries (seed 42) from a labelled set of 2 330. Full methodology and findings: results/retrieval_experiments.md

Experiment	F1	Recall	Precision
Baseline — bge-small, top_k=2	0.323	0.271	0.440
BM25 word-tokenised, top_k=5	0.396	0.460	0.373
bge-base vector, top_k=5	0.349	0.396	0.337
bge-base + BM25 hybrid (RRF)	0.403	0.440	0.401
bge-large + BM25 hybrid (RRF)	0.405	0.441	0.408

+25.5% F1 improvement over baseline. Key insight: on this entity-heavy news corpus, BM25 dominates standalone vector search; a fair RRF hybrid closes the precision gap.

Setup

# Install dependencies (uv recommended)
uv sync

# Or with pip
pip install -e .

Requirements: Python 3.11–3.12, ~5 GB disk (models + vector DB).

Usage

1. Prepare the corpus

prepare-data

Produces dataset/corpus_minimal_clean.csv (whitespace-normalised, casing preserved).

2. Run retrieval experiments

run-experiments

Builds ChromaDB indices for each embedding model (cached on re-runs) and prints the full results table. See results/retrieval_experiments.md for the pre-computed results.

3. Run the tests

# Install dev extras first (pytest + httpx)
uv sync --extra dev

pytest tests/

All tests run without a pre-built index or GPU — heavy dependencies (ChromaDB, BERT, embedding models) are mocked at the test boundary.

4. Start the API

uvicorn src.api:app --reload

Interactive docs at http://127.0.0.1:8000/docs.

Note: The API loads the v2_bge_large index at startup. Run run-experiments at least once before serving to ensure the index is built.

Endpoints

Method	Path	Description
`GET`	`/health`	Liveness check
`POST`	`/query?query=…`	Return ranked article UUIDs

The toxicity filter rejects harmful queries with HTTP 422 before retrieval runs.

Dataset

File	Description
`dataset/corpus.csv`	609 news articles (title, body, uuid, category, …)
`dataset/corpus_minimal_clean.csv`	Whitespace-normalised — produced by `prepare-data`
`dataset/queries.csv`	2 330 labelled queries with ground-truth article UUIDs

Tech stack

Component	Library
Vector indexing	LlamaIndex + ChromaDB
Dense embeddings	BAAI/bge-large-en-v1.5 via HuggingFace
Keyword retrieval	BM25 (rank_bm25) with word-level tokeniser
Rank fusion	Reciprocal Rank Fusion (Cormack et al., 2009)
Toxicity filter	JungleLee/bert-toxic-comment-classification
API	FastAPI
Linting	Ruff

Project layout

src/
├── config.py        centralised constants (paths, model names, tuning params)
├── data.py          corpus text cleaning + prepare-data entry point
├── indexing.py      ChromaDB index construction and loading
├── retrieval.py     RRF, hybrid retrieval, reranking utilities
├── evaluation.py    retrieval metrics (recall, precision, F1, hit rate, MRR)
├── query_filter.py  lazy-loaded BERT toxicity classifier
├── experiments.py   5-experiment runner + run-experiments entry point
└── api.py           FastAPI application

results/
└── retrieval_experiments.md   experiment report (methodology + results + findings)

tests/                         tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RAG Article Query Filter

Architecture

Results

Setup

Usage

1. Prepare the corpus

2. Run retrieval experiments

3. Run the tests

4. Start the API

Endpoints

Dataset

Tech stack

Project layout

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

RAG Article Query Filter

Architecture

Results

Setup

Usage

1. Prepare the corpus

2. Run retrieval experiments

3. Run the tests

4. Start the API

Endpoints

Dataset

Tech stack

Project layout