ragfallback prevents silent RAG failures across the full pipeline — from bad chunks at ingest, through retrieval outages at runtime, to invisible answer quality degradation in production.
| # | Real production failure | Module | Example |
|---|---|---|---|
| 1 | Query mismatch → silent empty results | AdaptiveRAGRetriever + QueryVariationsStrategy | uc6_adaptive_rag.py |
| 2 | Embedding model switch corrupts index dimensions | EmbeddingGuard | uc2_embedding_guard.py |
| 3 | Bad chunks (too short, mid-sentence) poison retrieval | ChunkQualityChecker | uc3_chunk_quality.py |
| 4 | Retrieved chunks overflow LLM context window | ContextWindowGuard | uc4_context_window.py |
| 5 | Keyword queries fail dense retrieval silently | SmartThresholdHybridRetriever | uc5_hybrid_failover.py |
| 6 | Primary retriever outage returns empty, no fallback | FailoverRetriever | uc5_hybrid_failover.py |
| 7 | Multi-step questions always fail single-shot RAG | MultiHopFallbackStrategy | uc6_multi_hop_demo.py |
| 8 | Index serves stale data after document updates | StaleIndexDetector | — |
| 9 | Answer quality invisible in production | RAGEvaluator | uc7_rag_evaluator.py |
| 10 | Cross-boundary answers lost between adjacent chunks | OverlappingContextStitcher | uc8_context_stitcher.py |
| 11 | Metric regression after model/embedder/chunker change | GoldenRunner + BaselineRegistry | examples/ci_regression_gate.py |
```bash
pip install ragfallback[chroma,huggingface,real-data]
```

```python
from datasets import load_dataset
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from ragfallback.diagnostics import ChunkQualityChecker, EmbeddingGuard, RetrievalHealthCheck
from ragfallback.evaluation import RAGEvaluator

# 1 — load 50 real Wikipedia passages (SQuAD, CC BY-SA 4.0)
ds = load_dataset("rajpurkar/squad", split="validation")
seen, docs, probes = set(), [], []
for row in ds:
    ctx = row["context"].strip()
    if ctx not in seen and len(seen) < 50:
        seen.add(ctx)
        docs.append(Document(page_content=ctx, metadata={"source": "squad"}))
    if row["answers"]["text"]:
        probes.append({"question": row["question"],
                       "ground_truth": row["answers"]["text"][0]})
print(f"Loaded {len(docs)} real passages, {len(probes)} Q&A pairs")

# 2 — check chunk quality before embedding
report = ChunkQualityChecker().check(docs)
print(report.summary())

# 3 — guard embedding dimensions before writing to any index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
EmbeddingGuard(expected_dim=384).validate(embeddings).raise_if_failed()

# 4 — build index and smoke-test retrieval with real Q&A probes
store = Chroma.from_documents(docs, embeddings, persist_directory="./my_index")
health = RetrievalHealthCheck(k=4).run_substring_probes(
    store,
    {p["question"]: p["ground_truth"][:50] for p in probes[:10]},
)
print(f"Retrieval hit rate: {health.hit_rate:.0%}")

# 5 — evaluate answer quality on a real question
question = probes[0]["question"]
retrieved = store.as_retriever(search_kwargs={"k": 4}).invoke(question)
answer = retrieved[0].page_content if retrieved else "Not found"
score = RAGEvaluator().evaluate(
    question, answer,
    [d.page_content for d in retrieved],
    ground_truth=probes[0]["ground_truth"],
)
print(score.report())
```

Expected output (actual numbers — run it yourself):

```text
Loaded 50 real passages, 2627 Q&A pairs
[PASS] chunks=50 | len min/avg/max=144/618/2095
Retrieval hit rate: 100%
========================================================
RAG evaluation
========================================================
Context precision : 100.00%
Faithfulness      : 95.00%
Answer relevance  : 40.00%
Recall (gold hit) : 100.00%
Overall           : 84.00%
Pass (>=70%)      : True
```
Most features work with no API key — chunk checking, embedding validation, hybrid retrieval, and evaluation all run locally.
LLM-dependent features (AdaptiveRAGRetriever, QueryVariationsStrategy, MultiHopFallbackStrategy) need a model. Copy `.env.example` to `.env` and fill in:
```bash
cp .env.example .env
```

```bash
# .env
MISTRAL_API_KEY=your_key_here
MISTRAL_MODEL=mistral-small-latest  # default, override if needed
```
Get a free Mistral key at console.mistral.ai. The library also supports any LangChain-compatible LLM — pass it directly via `AdaptiveRAGRetriever(llm=your_llm)`.
```text
Your documents
      │
      ▼
[ChunkQualityChecker]           ← bad splits, short/duplicate chunks
      │
      ▼
[EmbeddingGuard]                ← dimension / NaN / zero-vector checks before write
[EmbeddingQualityProbe]         ← domain mismatch heuristic (generic model on jargon)
[sanitize_documents]            ← JSON-safe metadata before any vector store write
      │
      ▼
Vector store (Chroma / FAISS / Qdrant / …)
      │
      ▼
[StaleIndexDetector]            ← SHA256 manifest: source files vs last build
      │
      ▼
[RetrievalHealthCheck]          ← labeled recall@k or quick substring smoke probes
      │
      ▼
[SmartThresholdHybridRetriever] ← threshold + optional BM25 fallback
[FailoverRetriever]             ← primary → fallback on exception or empty results
      │
      ▼
[ContextWindowGuard]            ← rank + trim chunks to token budget (8 model presets)
[OverlappingContextStitcher]    ← merge adjacent chunks from same source
      │
      ▼
[AdaptiveRAGRetriever]          ← QueryVariationsStrategy / MultiHopFallbackStrategy
      │
      ▼
[RAGEvaluator]                  ← recall@k, nDCG, faithfulness (heuristic + LLM judge)
```
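The fail-fast ordering above can be sketched as a chain of check stages in plain Python. These are hypothetical stand-ins, not the ragfallback API — each stage returns `(ok, detail)` and the first failure stops the pipeline before anything reaches the vector store:

```python
# Hypothetical stand-ins for the guard stages above — not the ragfallback API.
def check_chunks(chunks, min_chars=100):
    # mirror of the chunk-quality stage: reject too-short chunks
    bad = [c for c in chunks if len(c) < min_chars]
    return (not bad, f"{len(bad)} short chunks")

def check_dims(vectors, expected_dim=384):
    # mirror of the embedding-guard stage: reject wrong-dimension vectors
    wrong = [v for v in vectors if len(v) != expected_dim]
    return (not wrong, f"{len(wrong)} wrong-dimension vectors")

def run_pipeline(stages):
    # fail fast: stop at the first stage that reports a problem
    for name, check in stages:
        ok, detail = check()
        if not ok:
            return f"FAIL at {name}: {detail}"
    return "PASS"

chunks = ["a passage comfortably longer than the hundred-character minimum, " * 3]
vectors = [[0.0] * 384]
result = run_pipeline([
    ("chunk quality", lambda: check_chunks(chunks)),
    ("embedding dims", lambda: check_dims(vectors)),
])
print(result)  # → PASS
```

The library's checkers expose the same shape through their report objects (`has_issues`, `raise_if_failed()`), which is what makes them easy to chain before an index write.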
**ChunkQualityChecker** — detects too-short, too-long, mid-sentence, and duplicate chunks before embedding.

```python
from ragfallback.diagnostics import ChunkQualityChecker

report = ChunkQualityChecker(min_chars=100, max_chars=8000).check(docs)
if report.has_issues:
    fixed = ChunkQualityChecker().auto_fix(docs)
```

**EmbeddingGuard** — validates dimension, NaN, and zero-vectors before writing to any index.
```python
from ragfallback.diagnostics import EmbeddingGuard

guard = EmbeddingGuard(expected_dim=384)
guard.validate(embeddings_model).raise_if_failed()     # model-level check
guard.validate_raw_vectors(vectors).raise_if_failed()  # pre-computed vectors
```

**EmbeddingQualityProbe** — heuristic domain-fit check: if similarity scores are uniformly low, the model is likely a poor domain match.
```python
from ragfallback.diagnostics import EmbeddingQualityProbe

result = EmbeddingQualityProbe().run(embeddings, query="...", reference_snippets=[...])
if not result.ok:
    print(result.warnings)  # "consider domain-specific model"
```

**RetrievalHealthCheck** — labeled recall@k or quick substring smoke probes against a live vector store.
```python
from ragfallback.diagnostics import RetrievalHealthCheck

health = RetrievalHealthCheck(k=5)
report = health.run_substring_probes(vector_store, {"What is Python?": "high-level language"})
print(report.hit_rate, report.avg_latency_ms)
```

**StaleIndexDetector** — SHA256 manifest to catch when source files changed since the last index build.
```python
from ragfallback.diagnostics import StaleIndexDetector

det = StaleIndexDetector(manifest_path="./index_manifest.json")
det.record_paths(["./docs/policy.md"])         # record after build
report = det.check_paths(["./docs/policy.md"]) # check before serving
if report.has_stale:
    print(report.summary())
```

**ContextWindowGuard** — ranks and trims retrieved chunks to fit a token budget; 8 model presets included.
```python
from ragfallback.diagnostics import ContextWindowGuard

guard = ContextWindowGuard.from_model_name("gpt-4o")
selected, report = guard.select(query, retrieved_docs, embeddings)
```

**OverlappingContextStitcher** — merges consecutive chunks from the same source so cross-boundary answers aren't split.
```python
from ragfallback.diagnostics import OverlappingContextStitcher

merged = OverlappingContextStitcher().stitch(retrieved_docs)
```

**sanitize_documents** — normalizes list/dict/bytes metadata to JSON-safe scalars before any vector store write.
```python
from ragfallback.diagnostics import sanitize_documents

clean_docs = sanitize_documents(dirty_docs)  # safe for Chroma, Pinecone, Qdrant
```

**SmartThresholdHybridRetriever** — score-threshold gating with automatic BM25 fallback when dense scores are weak. Supports distance, similarity, and relative score modes.
```python
from ragfallback.retrieval import SmartThresholdHybridRetriever

retriever = SmartThresholdHybridRetriever.from_documents(
    docs, embeddings, dense_threshold=0.5, k=4
)  # pip install ragfallback[hybrid] for BM25
```

**FailoverRetriever** — if the primary retriever raises or returns fewer than `min_results` docs, automatically switches to a secondary.
```python
from ragfallback.retrieval import FailoverRetriever

retriever = FailoverRetriever(primary=chroma_retriever, fallback=faiss_retriever, min_results=1)
```

**AdaptiveRAGRetriever** — wraps a vector store with retry logic and pluggable fallback strategies. On each attempt it retrieves, scores confidence, and either returns the answer or tries the next strategy.
```python
from ragfallback import AdaptiveRAGRetriever
from ragfallback.strategies import QueryVariationsStrategy

retriever = AdaptiveRAGRetriever(
    vector_store=store,
    llm=llm,  # any LangChain LLM
    strategies=[QueryVariationsStrategy(num_variations=2)],
    confidence_threshold=0.7,
    max_attempts=3,
)
result = retriever.retrieve("What is the refund policy?")
print(result.answer, result.confidence, result.attempts_used)
```

Requires `MISTRAL_API_KEY` (or any LangChain-compatible LLM passed via `llm=`).
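As a rough mental model, the attempt loop can be sketched in plain Python. Everything here is a hypothetical stand-in, not the ragfallback internals — in the real retriever the LLM scores confidence and the strategies rewrite queries:

```python
# Hypothetical sketch of a confidence-gated retry loop — not the ragfallback
# internals. retrieve_fn returns (answer, confidence); each strategy rewrites
# the query for the next attempt.
def adaptive_retrieve(query, retrieve_fn, strategies, threshold=0.7, max_attempts=3):
    best_answer, best_conf = "", 0.0
    attempts = 0
    # attempt 1 uses the raw query, later attempts use the strategies in order
    for strategy in ([lambda q: q] + list(strategies))[:max_attempts]:
        attempts += 1
        answer, confidence = retrieve_fn(strategy(query))
        if confidence > best_conf:
            best_answer, best_conf = answer, confidence
        if confidence >= threshold:
            break  # confident enough — stop retrying
    return {"answer": best_answer, "confidence": best_conf, "attempts_used": attempts}

# Stand-in retriever: only the rewritten query clears the threshold.
def fake_retrieve(q):
    return ("refunds within 30 days", 0.9) if "refund policy" in q else ("", 0.2)

result = adaptive_retrieve("returns?", fake_retrieve,
                           strategies=[lambda q: q + " refund policy"])
print(result)  # → {'answer': 'refunds within 30 days', 'confidence': 0.9, 'attempts_used': 2}
```

The design point is that the best low-confidence answer is kept, so even when every attempt falls short of the threshold the caller still gets the strongest candidate plus its confidence.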
**aquery_with_fallback** — native async version of `query_with_fallback()`. It is a real coroutine built on LangChain's `ainvoke()`, not a thread-pool wrapper, and falls back to a thread pool automatically if the underlying LLM doesn't implement `ainvoke`.

```python
import asyncio

# async-native — LLM API calls overlap instead of serializing
# (inside an async function:)
result = await retriever.aquery_with_fallback("What is the refund policy?")
print(result.answer, result.confidence, result.attempts)

# or from sync code — works in FastAPI, GoldenRunner.run_async(), or any async context
asyncio.run(retriever.aquery_with_fallback("How do API tokens expire?"))
```

**QueryVariationsStrategy** — LLM rewrites the original query into N variations to broaden retrieval recall. Requires an LLM.
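The shape of the strategy is easy to see with a stand-in rewriter in place of the LLM (hypothetical sketch, not the library's implementation):

```python
# Hypothetical sketch: retrieve for the original query plus N rewrites,
# then deduplicate the union. A real rewrite_fn would call an LLM.
def rewrite_fn(query, n):
    # stand-in paraphrases; an LLM would produce genuine rewrites
    return [f"how does {query} work", f"{query} documentation"][:n]

def retrieve_with_variations(query, search_fn, num_variations=2):
    seen, results = set(), []
    for q in [query] + rewrite_fn(query, num_variations):
        for doc in search_fn(q):
            if doc not in seen:  # keep the first occurrence only
                seen.add(doc)
                results.append(doc)
    return results

corpus = ["token expiry is covered in the auth documentation", "billing overview"]
search_fn = lambda q: [d for d in corpus if any(w in d for w in q.lower().split())]
print(retrieve_with_variations("token expiry", search_fn))
# → ['token expiry is covered in the auth documentation']
```

Because the union is deduplicated in retrieval order, hits for the original query keep their rank and the variations only add documents the first pass missed.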
**MultiHopFallbackStrategy** — decomposes complex multi-step questions into sub-questions, retrieves each independently, then synthesises a final answer. Requires an LLM.

```python
from ragfallback.strategies import MultiHopFallbackStrategy

result = MultiHopFallbackStrategy(max_hops=3).run(question, retriever, llm)
print(result.final_answer, result.total_hops)
```

**CostTracker** — token cost ledger for a RAG session. Records spend per operation, enforces an optional budget ceiling, and surfaces a report at the end.
```python
from ragfallback import CostTracker

tracker = CostTracker(budget_usd=1.0)
tracker.record(model="mistral-small-latest", input_tokens=500, output_tokens=200)
print(tracker.get_report())  # total cost, budget remaining
```

**MetricsCollector** — records latency, success/failure counts, and confidence scores across retrieval attempts.
```python
from ragfallback import MetricsCollector

metrics = MetricsCollector()
# passed automatically to AdaptiveRAGRetriever; or record manually:
metrics.record_attempt(success=True, latency_ms=120, confidence=0.85)
print(metrics.get_stats())
```

**CacheMonitor** — wraps any LangChain retriever to track cache hit rate, per-category latency (hit vs miss), TTL-based expiry, and LRU eviction. Zero new dependencies — stdlib only. Supports both sync `invoke()` and async `ainvoke()`.
```python
from ragfallback.tracking import CacheMonitor

monitor = CacheMonitor(max_size=512, ttl_seconds=600)
cached_retriever = monitor.wrap_retriever(store.as_retriever(search_kwargs={"k": 4}))

# use cached_retriever exactly like any LangChain retriever
docs = cached_retriever.invoke("What is the refund policy?")
print(monitor.summary())
# → cache hit_rate=34.7% hits=26 misses=49 entries=49 evictions=0

stats = monitor.get_stats()
print(stats.hit_rate, stats.avg_hit_latency_ms, stats.avg_miss_latency_ms)
```

Pass the monitor to `GoldenRunner` to capture cache efficiency alongside RAGAS scores:
```python
import asyncio

from ragfallback.mlops import GoldenRunner, RagasHook
from ragfallback.tracking import CacheMonitor

monitor = CacheMonitor(max_size=256, ttl_seconds=300)
runner = GoldenRunner(
    retriever=retriever,
    ragas_hook=hook,
    dataset="examples/golden_qa.json",
    cache_monitor=monitor,
)
report = asyncio.run(runner.run_async())
print(report.cache_stats)
# → {"hit_rate": 0.347, "hits": 26, "misses": 49, "evictions": 0, ...}
```

**RAGEvaluator** — scores recall@k, nDCG, and faithfulness without external services. Optional LLM judge hook for higher accuracy.
```python
from ragfallback.evaluation import RAGEvaluator

ev = RAGEvaluator()
score = ev.evaluate(question, answer, context_docs, ground_truth="...")
print(score.overall_score, score.faithfulness_score, score.recall_at_k)
print(ev.batch_summary([score]))
```

| Example | Dataset | Command |
|---|---|---|
| UC-1: retrieval health | SQuAD Wikipedia | python examples/uc1_retrieval_health.py |
| UC-2: embedding guard | — (dimension check) | python examples/uc2_embedding_guard.py |
| UC-3: chunk quality | SQuAD Wikipedia | python examples/uc3_chunk_quality.py |
| UC-4: context window | sample KB | python examples/uc4_context_window.py |
| UC-5: hybrid + failover | FAISS + BM25 | python examples/uc5_hybrid_failover.py |
| UC-6: adaptive RAG | SQuAD Wikipedia (needs MISTRAL_API_KEY or Ollama) | python examples/uc6_adaptive_rag.py |
| UC-7: RAG evaluator | PubMedQA (MIT) — real medical Q&A | python examples/uc7_rag_evaluator.py |
| UC-8: context stitcher | ChromaDB + HR chunks | python examples/uc8_context_stitcher.py |
| UC-9: embedding probe | — (similarity check) | python examples/uc9_embedding_probe.py |
| UC-10: metadata sanitizer | ChromaDB dirty docs | python examples/uc10_metadata_sanitizer.py |
| End-to-end on SQuAD | SQuAD Wikipedia (CC BY-SA 4.0) | python examples/real_data_demo.py |
| Financial news RAG | nickmuchi/financial-classification (Apache 2.0) | python examples/financial_risk_analysis.py |
| Legal contract RAG | theatticusproject/cuad-qa (CC BY 4.0) | python examples/legal_document_analysis.py |
| Medical abstract RAG | qiaojin/PubMedQA (MIT) | python examples/medical_research_synthesis.py |
| MLOps: build golden dataset | SQuAD (CC BY-SA 4.0) + SciQ (CC BY-NC 3.0) | python examples/build_golden_dataset.py |
| MLOps: full demo | SQuAD golden set, zero API keys | python examples/mlops_demo.py |
| MLOps: CI regression gate | SQuAD golden set, committed baseline | python examples/ci_regression_gate.py |
`python examples/real_data_demo.py` runs every module on 200 real Wikipedia passages. The numbers below are printed by the script on every run — not made up.

```text
Passages indexed : 200 real Wikipedia passages
Q&A pairs : 10 570 (ground truth available)
ChunkQualityChecker : 1 violation (avg 662 chars/passage)
EmbeddingGuard : OK — dim 384 matches expected 384
RetrievalHealthCheck (20 real Q&A substring probes):
  Hit rate : 100.0%
  Avg latency: 25 ms per query
RAGEvaluator (10 real Q&A pairs, heuristic, no LLM judge):
  Pass rate : 2/10 (heuristic; rises with LLM judge)
  Avg recall@k : 100.0%
  Avg faithfulness : 79.5%
  Avg overall : 62.9%
```

Install: `pip install ragfallback[chroma,huggingface,real-data]`
Dataset: rajpurkar/squad — CC BY-SA 4.0
```bash
pip install ragfallback                       # core only
pip install ragfallback[chroma,huggingface]   # golden path (no API keys)
pip install ragfallback[faiss,huggingface]    # FAISS instead of Chroma
pip install ragfallback[hybrid]               # adds BM25 (rank_bm25)
pip install ragfallback[real-data]            # real dataset examples (HuggingFace datasets)
pip install ragfallback[mlops]                # MLOps eval layer (RAGAS + MLflow + Locust)
```

| Extra | Installs |
|---|---|
| chroma | chromadb |
| faiss | faiss-cpu |
| huggingface | sentence-transformers, huggingface-hub |
| hybrid | rank_bm25, langchain-community |
| real-data | datasets |
| openai | langchain-openai, openai |
| mlops | ragas, mlflow, locust, aiohttp |
```python
from ragfallback import AdaptiveRAGRetriever, QueryResult, CostTracker, MetricsCollector, CacheMonitor
from ragfallback.diagnostics import (
    ChunkQualityChecker, EmbeddingGuard, EmbeddingQualityProbe,
    RetrievalHealthCheck, StaleIndexDetector, ContextWindowGuard,
    OverlappingContextStitcher, sanitize_documents, sanitize_metadata,
)
from ragfallback.retrieval import SmartThresholdHybridRetriever, FailoverRetriever
from ragfallback.strategies import QueryVariationsStrategy, MultiHopFallbackStrategy
from ragfallback.evaluation import RAGEvaluator
from ragfallback.tracking import CacheMonitor, CacheStats
from ragfallback.mlops import (
    RagasHook, RagasReport,
    BaselineRegistry, RegressionError,
    GoldenRunner, GoldenReport,
    QuerySimulator, SimQuery,
    MLflowLogger,
    generate_locustfile,
)
```

ragfallback ships a complete MLOps evaluation layer for RAG pipelines. No API keys required — all metrics use local heuristics by default, with optional RAGAS + MLflow when installed.
```bash
pip install ragfallback[chroma,huggingface,real-data,mlops]
```

```python
import asyncio

from ragfallback.mlops import GoldenRunner, RagasHook, BaselineRegistry

# 1 — Build evaluation hook (heuristic by default; RAGAS when installed)
hook = RagasHook(llm=None, embeddings=embeddings)

# 2 — Run against 75 real SQuAD QA pairs
runner = GoldenRunner(
    retriever=retriever,  # AdaptiveRAGRetriever instance
    ragas_hook=hook,
    dataset="examples/golden_qa.json",
)
report = asyncio.run(runner.run_async())
print(f"Recall@3      : {report.recall_at_3:.3f}")
print(f"Faithfulness  : {report.ragas.faithfulness:.3f}")
print(f"Latency P95   : {report.latency_p95_ms:.0f}ms")
print(f"Fallback rate : {report.fallback_rate:.1%}")

# 3 — Regression gate: fails if any metric drops > 5% vs baseline
registry = BaselineRegistry("baselines.json")
registry.compare_or_fail(report, dataset="my_dataset")  # raises RegressionError if degraded
registry.update(report, dataset="my_dataset")           # save new baseline
```

```python
from ragfallback.mlops import QuerySimulator

sim = QuerySimulator()
queries = ["What is the refund policy?", "How do API rate limits work?"]

# 4 types: short_keyword, long_nl, ambiguous, out_of_domain
mixed = sim.simulate(queries)

# all 4 types for every query — for stress testing
unhappy = sim.simulate_unhappy_paths(queries)
```

```python
from ragfallback.mlops import generate_locustfile

generate_locustfile("locustfile.py", endpoint="http://localhost:8000")
# Run: locust -f locustfile.py --host http://localhost:8000 --users 50
```

The included workflow (the `mlops-regression-gate` job in `.github/workflows/test.yml`)
runs on every push to main:

- Pulls 75 SQuAD samples from HuggingFace (open data, CC BY-SA 4.0)
- Indexes them in ChromaDB using `all-MiniLM-L6-v2` (no API key)
- Runs `GoldenRunner` async — computes recall@3, recall@5, latency P95
- Calls `compare_or_fail()` against `examples/baselines.json` (committed)
- Fails the pipeline if any metric regresses more than 5%
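The comparison the gate performs amounts to a per-metric tolerance check. A plain-Python sketch of the 5% rule (hypothetical, not the BaselineRegistry internals):

```python
# Hypothetical sketch of a 5% regression gate — not the BaselineRegistry
# internals. For "higher is better" metrics, fail when the new value drops
# more than 5% below the committed baseline.
def gate_or_fail(baseline, current, tolerance=0.05):
    regressions = [
        f"{metric}: {base:.3f} -> {current[metric]:.3f}"
        for metric, base in baseline.items()
        if base > 0 and (base - current[metric]) / base > tolerance
    ]
    if regressions:
        raise RuntimeError("regression: " + "; ".join(regressions))

baseline = {"recall_at_3": 0.90, "recall_at_5": 0.95}
gate_or_fail(baseline, {"recall_at_3": 0.88, "recall_at_5": 0.94})  # ~2% drop: passes
try:
    gate_or_fail(baseline, {"recall_at_3": 0.80, "recall_at_5": 0.94})  # ~11% drop
except RuntimeError as err:
    print(err)  # → regression: recall_at_3: 0.900 -> 0.800
```

Using a relative tolerance rather than an absolute one means a 0.90 baseline tolerates noise down to 0.855, while a 0.50 baseline tolerates down to 0.475 — the gate scales with the metric.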
```bash
# Run the CI gate locally
python examples/build_golden_dataset.py   # one-time setup
python examples/ci_regression_gate.py     # exits 0 (pass) or 1 (fail)
```

See CONTRIBUTING.md. The quick version: run `pytest tests/unit/ -v` before any PR, follow Google-style docstrings, use `logging` not `print`, and update `__all__` in the subpackage `__init__.py`.
MIT License — see LICENSE.
Full version history in CHANGELOG.md.