ragfallback prevents silent RAG failures across the full pipeline — from bad chunks at ingest, through retrieval outages at runtime, to invisible answer quality degradation in production.
| # | Real production failure | Module | Example |
|---|---|---|---|
| 1 | Query mismatch → silent empty results | AdaptiveRAGRetriever + QueryVariationsStrategy | uc6_adaptive_rag.py |
| 2 | Embedding model switch corrupts index dimensions | EmbeddingGuard | uc2_embedding_guard.py |
| 3 | Bad chunks (too short, mid-sentence) poison retrieval | ChunkQualityChecker | uc3_chunk_quality.py |
| 4 | Retrieved chunks overflow LLM context window | ContextWindowGuard | uc4_context_window.py |
| 5 | Keyword queries fail dense retrieval silently | SmartThresholdHybridRetriever | uc5_hybrid_failover.py |
| 6 | Primary retriever outage returns empty, no fallback | FailoverRetriever | uc5_hybrid_failover.py |
| 7 | Multi-step questions always fail single-shot RAG | MultiHopFallbackStrategy | uc6_multi_hop_demo.py |
| 8 | Index serves stale data after document updates | StaleIndexDetector | — |
| 9 | Answer quality invisible in production | RAGEvaluator | uc7_rag_evaluator.py |
| 10 | Cross-boundary answers lost between adjacent chunks | OverlappingContextStitcher | uc8_context_stitcher.py |
| 11 | Metric regression after model/embedder/chunker change | GoldenRunner + BaselineRegistry | examples/ci_regression_gate.py |
```bash
pip install ragfallback[chroma,huggingface,real-data]
```

```python
from datasets import load_dataset
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from ragfallback.diagnostics import ChunkQualityChecker, EmbeddingGuard, RetrievalHealthCheck
from ragfallback.evaluation import RAGEvaluator

# 1 — load 50 real Wikipedia passages (SQuAD, CC BY-SA 4.0)
ds = load_dataset("rajpurkar/squad", split="validation")
seen, docs, probes = set(), [], []
for row in ds:
    ctx = row["context"].strip()
    if ctx not in seen and len(seen) < 50:
        seen.add(ctx)
        docs.append(Document(page_content=ctx, metadata={"source": "squad"}))
    if row["answers"]["text"]:
        probes.append({"question": row["question"],
                       "ground_truth": row["answers"]["text"][0]})
print(f"Loaded {len(docs)} real passages, {len(probes)} Q&A pairs")

# 2 — check chunk quality before embedding
report = ChunkQualityChecker().check(docs)
print(report.summary())

# 3 — guard embedding dimensions before writing to any index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
EmbeddingGuard(expected_dim=384).validate(embeddings).raise_if_failed()

# 4 — build index and smoke-test retrieval with real Q&A probes
store = Chroma.from_documents(docs, embeddings, persist_directory="./my_index")
health = RetrievalHealthCheck(k=4).run_substring_probes(
    store,
    {p["question"]: p["ground_truth"][:50] for p in probes[:10]},
)
print(f"Retrieval hit rate: {health.hit_rate:.0%}")

# 5 — evaluate answer quality on a real question
question = probes[0]["question"]
retrieved = store.as_retriever(search_kwargs={"k": 4}).invoke(question)
answer = retrieved[0].page_content if retrieved else "Not found"
score = RAGEvaluator().evaluate(
    question, answer,
    [d.page_content for d in retrieved],
    ground_truth=probes[0]["ground_truth"],
)
print(score.report())
```

Expected output (actual numbers — run it yourself):

```text
Loaded 50 real passages, 2627 Q&A pairs
[PASS] chunks=50 | len min/avg/max=144/618/2095
Retrieval hit rate: 100%
========================================================
RAG evaluation
========================================================
Context precision : 100.00%
Faithfulness      : 95.00%
Answer relevance  : 40.00%
Recall (gold hit) : 100.00%
Overall           : 84.00%
Pass (>=70%)      : True
```
Most features work with no API key — chunk checking, embedding validation, hybrid retrieval, and evaluation all run locally.
LLM-dependent features (AdaptiveRAGRetriever, QueryVariationsStrategy, MultiHopFallbackStrategy) need a model. Copy `.env.example` to `.env` and fill in:
```bash
cp .env.example .env
```

```bash
# .env
MISTRAL_API_KEY=your_key_here
MISTRAL_MODEL=mistral-small-latest  # default, override if needed
```
Get a free Mistral key at console.mistral.ai. The library also supports any LangChain-compatible LLM — pass it directly via `AdaptiveRAGRetriever(llm=your_llm)`.
```text
Your documents
      │
      ▼
[ChunkQualityChecker]           ← bad splits, short/duplicate chunks
      │
      ▼
[EmbeddingGuard]                ← dimension / NaN / zero-vector checks before write
[EmbeddingQualityProbe]         ← domain mismatch heuristic (generic model on jargon)
[sanitize_documents]            ← JSON-safe metadata before any vector store write
      │
      ▼
Vector store (Chroma / FAISS / Qdrant / …)
      │
      ▼
[StaleIndexDetector]            ← SHA256 manifest: source files vs last build
      │
      ▼
[RetrievalHealthCheck]          ← labeled recall@k or quick substring smoke probes
      │
      ▼
[SmartThresholdHybridRetriever] ← threshold + optional BM25 fallback
[FailoverRetriever]             ← primary → fallback on exception or empty results
      │
      ▼
[ContextWindowGuard]            ← rank + trim chunks to token budget (8 model presets)
[OverlappingContextStitcher]    ← merge adjacent chunks from same source
      │
      ▼
[AdaptiveRAGRetriever]          ← QueryVariationsStrategy / MultiHopFallbackStrategy
      │
      ▼
[RAGEvaluator]                  ← recall@k, nDCG, faithfulness (heuristic + LLM judge)
```
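The fail-fast ordering above can be sketched as a chain of check stages in plain Python. These are hypothetical stand-ins, not the ragfallback API — each stage returns `(ok, detail)` and the first failure stops the pipeline before anything reaches the vector store:

```python
# Hypothetical stand-ins for the guard stages above — not the ragfallback API.
def check_chunks(chunks, min_chars=100):
    # mirror of the chunk-quality stage: reject too-short chunks
    bad = [c for c in chunks if len(c) < min_chars]
    return (not bad, f"{len(bad)} short chunks")

def check_dims(vectors, expected_dim=384):
    # mirror of the embedding-guard stage: reject wrong-dimension vectors
    wrong = [v for v in vectors if len(v) != expected_dim]
    return (not wrong, f"{len(wrong)} wrong-dimension vectors")

def run_pipeline(stages):
    # fail fast: stop at the first stage that reports a problem
    for name, check in stages:
        ok, detail = check()
        if not ok:
            return f"FAIL at {name}: {detail}"
    return "PASS"

chunks = ["a passage comfortably longer than the hundred-character minimum, " * 3]
vectors = [[0.0] * 384]
result = run_pipeline([
    ("chunk quality", lambda: check_chunks(chunks)),
    ("embedding dims", lambda: check_dims(vectors)),
])
print(result)  # → PASS
```

The library's checkers expose the same shape through their report objects (`has_issues`, `raise_if_failed()`), which is what makes them easy to chain before an index write.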
**ChunkQualityChecker** — detects too-short, too-long, mid-sentence, and duplicate chunks before embedding.

```python
from ragfallback.diagnostics import ChunkQualityChecker

report = ChunkQualityChecker(min_chars=100, max_chars=8000).check(docs)
if report.has_issues:
    fixed = ChunkQualityChecker().auto_fix(docs)
```

**EmbeddingGuard** — validates dimension, NaN, and zero-vectors before writing to any index.
```python
from ragfallback.diagnostics import EmbeddingGuard

guard = EmbeddingGuard(expected_dim=384)
guard.validate(embeddings_model).raise_if_failed()     # model-level check
guard.validate_raw_vectors(vectors).raise_if_failed()  # pre-computed vectors
```

**EmbeddingQualityProbe** — heuristic domain-fit check: if similarity scores are uniformly low, the model is likely a poor domain match.
```python
from ragfallback.diagnostics import EmbeddingQualityProbe

result = EmbeddingQualityProbe().run(embeddings, query="...", reference_snippets=[...])
if not result.ok:
    print(result.warnings)  # "consider domain-specific model"
```

**RetrievalHealthCheck** — labeled recall@k or quick substring smoke probes against a live vector store.
```python
from ragfallback.diagnostics import RetrievalHealthCheck

health = RetrievalHealthCheck(k=5)
report = health.run_substring_probes(vector_store, {"What is Python?": "high-level language"})
print(report.hit_rate, report.avg_latency_ms)
```

**StaleIndexDetector** — SHA256 manifest to catch when source files changed since the last index build.
```python
from ragfallback.diagnostics import StaleIndexDetector

det = StaleIndexDetector(manifest_path="./index_manifest.json")
det.record_paths(["./docs/policy.md"])         # record after build
report = det.check_paths(["./docs/policy.md"]) # check before serving
if report.has_stale:
    print(report.summary())
```

**ContextWindowGuard** — ranks and trims retrieved chunks to fit a token budget; 8 model presets included.
```python
from ragfallback.diagnostics import ContextWindowGuard

guard = ContextWindowGuard.from_model_name("gpt-4o")
selected, report = guard.select(query, retrieved_docs, embeddings)
```

**OverlappingContextStitcher** — merges consecutive chunks from the same source so cross-boundary answers aren't split.
```python
from ragfallback.diagnostics import OverlappingContextStitcher

merged = OverlappingContextStitcher().stitch(retrieved_docs)
```

**sanitize_documents** — normalizes list/dict/bytes metadata to JSON-safe scalars before any vector store write.
```python
from ragfallback.diagnostics import sanitize_documents

clean_docs = sanitize_documents(dirty_docs)  # safe for Chroma, Pinecone, Qdrant
```

**SmartThresholdHybridRetriever** — score-threshold gating with automatic BM25 fallback when dense scores are weak. Supports distance, similarity, and relative score modes.
```python
from ragfallback.retrieval import SmartThresholdHybridRetriever

retriever = SmartThresholdHybridRetriever.from_documents(
    docs, embeddings, dense_threshold=0.5, k=4
)  # pip install ragfallback[hybrid] for BM25
```

**FailoverRetriever** — if the primary retriever raises or returns fewer than `min_results` docs, automatically switches to a secondary.
```python
from ragfallback.retrieval import FailoverRetriever

retriever = FailoverRetriever(primary=chroma_retriever, fallback=faiss_retriever, min_results=1)
```

**AdaptiveRAGRetriever** — wraps a vector store with retry logic and pluggable fallback strategies. On each attempt it retrieves, scores confidence, and either returns the answer or tries the next strategy.
```python
from ragfallback import AdaptiveRAGRetriever
from ragfallback.strategies import QueryVariationsStrategy

retriever = AdaptiveRAGRetriever(
    vector_store=store,
    llm=llm,  # any LangChain LLM
    strategies=[QueryVariationsStrategy(num_variations=2)],
    confidence_threshold=0.7,
    max_attempts=3,
)
result = retriever.retrieve("What is the refund policy?")
print(result.answer, result.confidence, result.attempts_used)
```

Requires `MISTRAL_API_KEY` (or any LangChain-compatible LLM passed via `llm=`).
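As a rough mental model, the attempt loop can be sketched in plain Python. Everything here is a hypothetical stand-in, not the ragfallback internals — in the real retriever the LLM scores confidence and the strategies rewrite queries:

```python
# Hypothetical sketch of a confidence-gated retry loop — not the ragfallback
# internals. retrieve_fn returns (answer, confidence); each strategy rewrites
# the query for the next attempt.
def adaptive_retrieve(query, retrieve_fn, strategies, threshold=0.7, max_attempts=3):
    best_answer, best_conf = "", 0.0
    attempts = 0
    # attempt 1 uses the raw query, later attempts use the strategies in order
    for strategy in ([lambda q: q] + list(strategies))[:max_attempts]:
        attempts += 1
        answer, confidence = retrieve_fn(strategy(query))
        if confidence > best_conf:
            best_answer, best_conf = answer, confidence
        if confidence >= threshold:
            break  # confident enough — stop retrying
    return {"answer": best_answer, "confidence": best_conf, "attempts_used": attempts}

# Stand-in retriever: only the rewritten query clears the threshold.
def fake_retrieve(q):
    return ("refunds within 30 days", 0.9) if "refund policy" in q else ("", 0.2)

result = adaptive_retrieve("returns?", fake_retrieve,
                           strategies=[lambda q: q + " refund policy"])
print(result)  # → {'answer': 'refunds within 30 days', 'confidence': 0.9, 'attempts_used': 2}
```

The design point is that the best low-confidence answer is kept, so even when every attempt falls short of the threshold the caller still gets the strongest candidate plus its confidence.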
**aquery_with_fallback** — native async version of `query_with_fallback()`. It is a real coroutine built on LangChain's `ainvoke()`, not a thread-pool wrapper, and falls back to a thread pool automatically if the underlying LLM doesn't implement `ainvoke`.

```python
import asyncio

# async-native — LLM API calls overlap instead of serializing
# (inside an async function:)
result = await retriever.aquery_with_fallback("What is the refund policy?")
print(result.answer, result.confidence, result.attempts)

# or from sync code — works in FastAPI, GoldenRunner.run_async(), or any async context
asyncio.run(retriever.aquery_with_fallback("How do API tokens expire?"))
```

**QueryVariationsStrategy** — LLM rewrites the original query into N variations to broaden retrieval recall. Requires an LLM.
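The shape of the strategy is easy to see with a stand-in rewriter in place of the LLM (hypothetical sketch, not the library's implementation):

```python
# Hypothetical sketch: retrieve for the original query plus N rewrites,
# then deduplicate the union. A real rewrite_fn would call an LLM.
def rewrite_fn(query, n):
    # stand-in paraphrases; an LLM would produce genuine rewrites
    return [f"how does {query} work", f"{query} documentation"][:n]

def retrieve_with_variations(query, search_fn, num_variations=2):
    seen, results = set(), []
    for q in [query] + rewrite_fn(query, num_variations):
        for doc in search_fn(q):
            if doc not in seen:  # keep the first occurrence only
                seen.add(doc)
                results.append(doc)
    return results

corpus = ["token expiry is covered in the auth documentation", "billing overview"]
search_fn = lambda q: [d for d in corpus if any(w in d for w in q.lower().split())]
print(retrieve_with_variations("token expiry", search_fn))
# → ['token expiry is covered in the auth documentation']
```

Because the union is deduplicated in retrieval order, hits for the original query keep their rank and the variations only add documents the first pass missed.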
**MultiHopFallbackStrategy** — decomposes complex multi-step questions into sub-questions, retrieves each independently, then synthesises a final answer. Requires an LLM.

```python
from ragfallback.strategies import MultiHopFallbackStrategy

result = MultiHopFallbackStrategy(max_hops=3).run(question, retriever, llm)
print(result.final_answer, result.total_hops)
```

**CostTracker** — token cost ledger for a RAG session. Records spend per operation, enforces an optional budget ceiling, and surfaces a report at the end.
```python
from ragfallback import CostTracker

tracker = CostTracker(budget_usd=1.0)
tracker.record(model="mistral-small-latest", input_tokens=500, output_tokens=200)
print(tracker.get_report())  # total cost, budget remaining
```

**MetricsCollector** — records latency, success/failure counts, and confidence scores across retrieval attempts.
```python
from ragfallback import MetricsCollector

metrics = MetricsCollector()
# passed automatically to AdaptiveRAGRetriever; or record manually:
metrics.record_attempt(success=True, latency_ms=120, confidence=0.85)
print(metrics.get_stats())
```

**CacheMonitor** — wraps any LangChain retriever to track cache hit rate, per-category latency (hit vs miss), TTL-based expiry, and LRU eviction. Zero new dependencies — stdlib only. Supports both sync `invoke()` and async `ainvoke()`.
```python
from ragfallback.tracking import CacheMonitor

monitor = CacheMonitor(max_size=512, ttl_seconds=600)
cached_retriever = monitor.wrap_retriever(store.as_retriever(search_kwargs={"k": 4}))

# use cached_retriever exactly like any LangChain retriever
docs = cached_retriever.invoke("What is the refund policy?")
print(monitor.summary())
# → cache hit_rate=34.7% hits=26 misses=49 entries=49 evictions=0

stats = monitor.get_stats()
print(stats.hit_rate, stats.avg_hit_latency_ms, stats.avg_miss_latency_ms)
```

Pass the monitor to `GoldenRunner` to capture cache efficiency alongside RAGAS scores:
```python
import asyncio

from ragfallback.mlops import GoldenRunner, RagasHook
from ragfallback.tracking import CacheMonitor

monitor = CacheMonitor(max_size=256, ttl_seconds=300)
runner = GoldenRunner(
    retriever=retriever,
    ragas_hook=hook,
    dataset="examples/golden_qa.json",
    cache_monitor=monitor,
)
report = asyncio.run(runner.run_async())
print(report.cache_stats)
# → {"hit_rate": 0.347, "hits": 26, "misses": 49, "evictions": 0, ...}
```

**RAGEvaluator** — scores recall@k, nDCG, and faithfulness without external services. Optional LLM judge hook for higher accuracy.
```python
from ragfallback.evaluation import RAGEvaluator

ev = RAGEvaluator()
score = ev.evaluate(question, answer, context_docs, ground_truth="...")
print(score.overall_score, score.faithfulness_score, score.recall_at_k)
print(ev.batch_summary([score]))
```

| Example | Dataset | Command |
|---|---|---|
| UC-1: retrieval health | SQuAD Wikipedia | python examples/uc1_retrieval_health.py |
| UC-2: embedding guard | — (dimension check) | python examples/uc2_embedding_guard.py |
| UC-3: chunk quality | SQuAD Wikipedia | python examples/uc3_chunk_quality.py |
| UC-4: context window | sample KB | python examples/uc4_context_window.py |
| UC-5: hybrid + failover | FAISS + BM25 | python examples/uc5_hybrid_failover.py |
| UC-6: adaptive RAG | SQuAD Wikipedia (needs MISTRAL_API_KEY or Ollama) | python examples/uc6_adaptive_rag.py |
| UC-7: RAG evaluator | PubMedQA (MIT) — real medical Q&A | python examples/uc7_rag_evaluator.py |
| UC-8: context stitcher | ChromaDB + HR chunks | python examples/uc8_context_stitcher.py |
| UC-9: embedding probe | — (similarity check) | python examples/uc9_embedding_probe.py |
| UC-10: metadata sanitizer | ChromaDB dirty docs | python examples/uc10_metadata_sanitizer.py |
| End-to-end on SQuAD | SQuAD Wikipedia (CC BY-SA 4.0) | python examples/real_data_demo.py |
| Financial news RAG | nickmuchi/financial-classification (Apache 2.0) | python examples/financial_risk_analysis.py |
| Legal contract RAG | theatticusproject/cuad-qa (CC BY 4.0) | python examples/legal_document_analysis.py |
| Medical abstract RAG | qiaojin/PubMedQA (MIT) | python examples/medical_research_synthesis.py |
| MLOps: build golden dataset | SQuAD (CC BY-SA 4.0) + SciQ (CC BY-NC 3.0) | python examples/build_golden_dataset.py |
| MLOps: full demo | SQuAD golden set, zero API keys | python examples/mlops_demo.py |
| MLOps: CI regression gate | SQuAD golden set, committed baseline | python examples/ci_regression_gate.py |
`python examples/real_data_demo.py` runs every module on 200 real Wikipedia passages. The numbers below are printed by the script on every run — not made up.

```text
Passages indexed : 200 real Wikipedia passages
Q&A pairs : 10 570 (ground truth available)
ChunkQualityChecker : 1 violation (avg 662 chars/passage)
EmbeddingGuard : OK — dim 384 matches expected 384
RetrievalHealthCheck (20 real Q&A substring probes):
  Hit rate : 100.0%
  Avg latency: 25 ms per query
RAGEvaluator (10 real Q&A pairs, heuristic, no LLM judge):
  Pass rate : 2/10 (heuristic; rises with LLM judge)
  Avg recall@k : 100.0%
  Avg faithfulness : 79.5%
  Avg overall : 62.9%
```

Install: `pip install ragfallback[chroma,huggingface,real-data]`
Dataset: rajpurkar/squad — CC BY-SA 4.0
```bash
pip install ragfallback                       # core only
pip install ragfallback[chroma,huggingface]   # golden path (no API keys)
pip install ragfallback[faiss,huggingface]    # FAISS instead of Chroma
pip install ragfallback[hybrid]               # adds BM25 (rank_bm25)
pip install ragfallback[real-data]            # real dataset examples (HuggingFace datasets)
pip install ragfallback[mlops]                # MLOps eval layer (RAGAS + MLflow + Locust)
```

| Extra | Installs |
|---|---|
| chroma | chromadb |
| faiss | faiss-cpu |
| huggingface | sentence-transformers, huggingface-hub |
| hybrid | rank_bm25, langchain-community |
| real-data | datasets |
| openai | langchain-openai, openai |
| mlops | ragas, mlflow, locust, aiohttp |
```python
from ragfallback import AdaptiveRAGRetriever, QueryResult, CostTracker, MetricsCollector, CacheMonitor
from ragfallback.diagnostics import (
    ChunkQualityChecker, EmbeddingGuard, EmbeddingQualityProbe,
    RetrievalHealthCheck, StaleIndexDetector, ContextWindowGuard,
    OverlappingContextStitcher, sanitize_documents, sanitize_metadata,
)
from ragfallback.retrieval import SmartThresholdHybridRetriever, FailoverRetriever
from ragfallback.strategies import QueryVariationsStrategy, MultiHopFallbackStrategy
from ragfallback.evaluation import RAGEvaluator
from ragfallback.tracking import CacheMonitor, CacheStats
from ragfallback.mlops import (
    RagasHook, RagasReport,
    BaselineRegistry, RegressionError,
    GoldenRunner, GoldenReport,
    QuerySimulator, SimQuery,
    MLflowLogger,
    generate_locustfile,
)
```

ragfallback ships a complete MLOps evaluation layer for RAG pipelines. No API keys required — all metrics use local heuristics by default, with optional RAGAS + MLflow when installed.
```bash
pip install ragfallback[chroma,huggingface,real-data,mlops]
```

```python
import asyncio

from ragfallback.mlops import GoldenRunner, RagasHook, BaselineRegistry

# 1 — Build evaluation hook (heuristic by default; RAGAS when installed)
hook = RagasHook(llm=None, embeddings=embeddings)

# 2 — Run against 75 real SQuAD QA pairs
runner = GoldenRunner(
    retriever=retriever,  # AdaptiveRAGRetriever instance
    ragas_hook=hook,
    dataset="examples/golden_qa.json",
)
report = asyncio.run(runner.run_async())
print(f"Recall@3      : {report.recall_at_3:.3f}")
print(f"Faithfulness  : {report.ragas.faithfulness:.3f}")
print(f"Latency P95   : {report.latency_p95_ms:.0f}ms")
print(f"Fallback rate : {report.fallback_rate:.1%}")

# 3 — Regression gate: fails if any metric drops > 5% vs baseline
registry = BaselineRegistry("baselines.json")
registry.compare_or_fail(report, dataset="my_dataset")  # raises RegressionError if degraded
registry.update(report, dataset="my_dataset")           # save new baseline
```

```python
from ragfallback.mlops import QuerySimulator

sim = QuerySimulator()
queries = ["What is the refund policy?", "How do API rate limits work?"]

# 4 types: short_keyword, long_nl, ambiguous, out_of_domain
mixed = sim.simulate(queries)

# all 4 types for every query — for stress testing
unhappy = sim.simulate_unhappy_paths(queries)
```

```python
from ragfallback.mlops import generate_locustfile

generate_locustfile("locustfile.py", endpoint="http://localhost:8000")
# Run: locust -f locustfile.py --host http://localhost:8000 --users 50
```

The included workflow (the `mlops-regression-gate` job in `.github/workflows/test.yml`)
runs on every push to main:

- Pulls 75 SQuAD samples from HuggingFace (open data, CC BY-SA 4.0)
- Indexes them in ChromaDB using `all-MiniLM-L6-v2` (no API key)
- Runs `GoldenRunner` async — computes recall@3, recall@5, latency P95
- Calls `compare_or_fail()` against `examples/baselines.json` (committed)
- Fails the pipeline if any metric regresses more than 5%
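The comparison the gate performs amounts to a per-metric tolerance check. A plain-Python sketch of the 5% rule (hypothetical, not the BaselineRegistry internals):

```python
# Hypothetical sketch of a 5% regression gate — not the BaselineRegistry
# internals. For "higher is better" metrics, fail when the new value drops
# more than 5% below the committed baseline.
def gate_or_fail(baseline, current, tolerance=0.05):
    regressions = [
        f"{metric}: {base:.3f} -> {current[metric]:.3f}"
        for metric, base in baseline.items()
        if base > 0 and (base - current[metric]) / base > tolerance
    ]
    if regressions:
        raise RuntimeError("regression: " + "; ".join(regressions))

baseline = {"recall_at_3": 0.90, "recall_at_5": 0.95}
gate_or_fail(baseline, {"recall_at_3": 0.88, "recall_at_5": 0.94})  # ~2% drop: passes
try:
    gate_or_fail(baseline, {"recall_at_3": 0.80, "recall_at_5": 0.94})  # ~11% drop
except RuntimeError as err:
    print(err)  # → regression: recall_at_3: 0.900 -> 0.800
```

Using a relative tolerance rather than an absolute one means a 0.90 baseline tolerates noise down to 0.855, while a 0.50 baseline tolerates down to 0.475 — the gate scales with the metric.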
```bash
# Run the CI gate locally
python examples/build_golden_dataset.py   # one-time setup
python examples/ci_regression_gate.py     # exits 0 (pass) or 1 (fail)
```

See CONTRIBUTING.md. The quick version: run `pytest tests/unit/ -v` before any PR, follow Google-style docstrings, use `logging` not `print`, and update `__all__` in the subpackage `__init__.py`.
MIT License — see LICENSE.
Full version history in CHANGELOG.md.