An end-to-end implementation of the Speculative Retrieval-Augmented Generation (Speculative RAG) pipeline, designed to quantify the trade-off between answer quality and inference latency on knowledge-intensive QA tasks.
Standard RAG architectures suffer from a critical bottleneck: processing long retrieved contexts incurs prohibitive latency and can introduce reasoning errors.
This project addresses both issues by implementing a Speculative RAG pipeline:
- Multi-Perspective Selection: Diversifying retrieved documents into distinct subsets.
- Batched Multi-Draft Prompting: Using a smaller, specialist Drafter model to generate multiple {answer, rationale} drafts in parallel.
- Verifier-Based Selection: Employing a larger Verifier model to select the best draft using log-probability-based confidence scoring.
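The multi-perspective selection step can be sketched with a tiny k-means over precomputed document embeddings, drawing one document per cluster into each subset so that every draft sees diverse evidence. This is a pure-NumPy sketch; the function names and the per-cluster sampling scheme are illustrative, not the project's actual code:

```python
import numpy as np

def kmeans(emb, m, iters=20, seed=0):
    """Tiny k-means: returns a cluster id for each document embedding."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), m, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center, then recompute centers.
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(m):
            if (labels == c).any():
                centers[c] = emb[labels == c].mean(0)
    return labels

def make_subsets(emb, m, seed=0):
    """Form diverse subsets: each subset takes one document from each cluster."""
    labels = kmeans(emb, m, seed=seed)
    rng = np.random.default_rng(seed)
    clusters = [rng.permutation(np.flatnonzero(labels == c)).tolist() for c in range(m)]
    depth = min((len(c) for c in clusters if c), default=0)
    return [[c[i] for c in clusters if i < len(c)] for i in range(depth)]
```

Each returned subset is a list of document indices spanning all clusters, which is what makes the drafts' evidence (and hence their rationales) diverse rather than redundant.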
- Speedup: $\ge 1.3\times$ p50 latency speedup at matched accuracy (within 1.0 EM of baseline).
- GPU Utilization: $\ge 60\%$ average SM utilization via continuous batching.
- Stretch Goals: Match or exceed the original paper's gains (e.g., $\ge +5.0$ EM or $\ge 1.8\times$ speedup at baseline EM).
- Subset Selection (Diversity): Retrieves the top-n chunks and forms m subsets using embedding-based k-means clustering or an MMR diversification heuristic. Embeddings are precomputed offline.
- Batched Parallel Drafting: Implements parallel drafting as batched multi-prompt generation through a single drafter engine using vLLM continuous batching (generating m drafts in one scheduling window).
- Verifier-Based Selection: Selects the best draft with the log-probability scoring function
  $\text{Score}_j = \log P_V(a_j \mid q, r_j) + \lambda \log P_D(r_j \mid q, S_j)$,
  where $q$ is the query, $a_j$ the draft answer, $r_j$ its rationale, and $S_j$ its document subset.
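The scoring step reduces to a weighted sum of two summed token log-probabilities per draft. A minimal sketch, assuming the drafter and verifier passes already exposed those log-probs (the `Draft` structure and function names here are illustrative, not the project's actual API):

```python
from dataclasses import dataclass

@dataclass
class Draft:
    answer: str
    rationale: str
    drafter_logp: float   # log P_D(r_j | q, S_j), summed over rationale tokens
    verifier_logp: float  # log P_V(a_j | q, r_j), summed over answer tokens

def select_best_draft(drafts, lam=1.0):
    """Pick the draft maximizing verifier_logp + lam * drafter_logp."""
    return max(drafts, key=lambda d: d.verifier_logp + lam * d.drafter_logp)

drafts = [
    Draft("Paris", "The capital of France is Paris.", drafter_logp=-2.0, verifier_logp=-0.5),
    Draft("Lyon",  "Lyon is a large French city.",    drafter_logp=-1.5, verifier_logp=-4.0),
]
best = select_best_draft(drafts, lam=0.5)
# Scores: -0.5 + 0.5*(-2.0) = -1.5 vs. -4.0 + 0.5*(-1.5) = -4.75, so "Paris" wins.
```

Because both terms are log-probabilities of already-generated text, this selection requires only one verifier forward pass per draft (teacher forcing), not fresh generation.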
- Hardware: Single NVIDIA A100 (40GB) on GCP.
- Inference Engine: vLLM (leveraging continuous batching and PagedAttention).
-
Models
-
Drafter: 7B class model (e.g., Mistral-7B).
-
Verifier: 7B-13B model (Stretch goal: Mixtral-8x7B).
-
-
Retrieval: FAISS vector store with offline precomputed embeddings (InBedder-Roberta, E5, or BGE).
-
Optimizations: INT8/NF4 quantization via
bitsandbytes, KV-cache limits. -
Profiling: Nsight Systems (
nsys) and PyTorch profiler.
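For the NF4 option, a typical `bitsandbytes` quantization config passed through `transformers` looks like the fragment below. The model name and the exact dtype choices are illustrative assumptions, not the project's pinned settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization config for the drafter (values are illustrative).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config
# )
```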
Evaluated on knowledge-intensive datasets such as TriviaQA or PubHealth. Metrics include Exact Match (EM), p50/p95 latency, and GPU utilization.
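The reported metrics are straightforward to compute; a stdlib-only sketch of SQuAD-style EM normalization and nearest-rank percentile latency (the helper names are illustrative):

```python
import re
import string

def normalize(s: str) -> str:
    """Lowercase, strip punctuation, articles, and extra whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, golds: list[str]) -> int:
    """1 if the normalized prediction matches any normalized gold answer."""
    return int(any(normalize(pred) == normalize(g) for g in golds))

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. percentile(xs, 50) for p50 latency."""
    xs = sorted(latencies_ms)
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]
```

Speedup is then simply the baseline's p50 divided by the speculative pipeline's p50 over the same query set.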
- Standard RAG: Concatenate the top-k retrieved chunks into one prompt and generate.
- CRAG-Inspired (Filter-then-Generate): Lightweight filtering/reranking of retrieved chunks followed by generation with a shorter context.
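The Standard RAG baseline's prompt assembly is just concatenation with a context budget. A minimal sketch, with an illustrative character budget and prompt template (not the baseline's exact template):

```python
def build_rag_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Concatenate top-k retrieved chunks into one prompt, truncating at a budget."""
    context, used = [], 0
    for i, chunk in enumerate(chunks):
        piece = f"[{i + 1}] {chunk}\n"
        if used + len(piece) > max_chars:
            break  # drop lower-ranked chunks once the context budget is exhausted
        context.append(piece)
        used += len(piece)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{''.join(context)}\n"
        f"Question: {question}\nAnswer:"
    )
```

This long single prompt is exactly what the speculative pipeline avoids: each drafter call sees only one small subset instead of all top-k chunks.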
```text
speculative-rag/
├── doc/
│   └── speculative-rag-iclr2025.pdf
├── standard-rag/       ← Standard RAG baseline (implemented)
│   ├── README.md       ← full setup & run instructions
│   ├── Makefile
│   ├── Dockerfile
│   ├── pyproject.toml
│   ├── infra/          ← Terraform: GCS, Artifact Registry, service account
│   └── src/rag/
└── speculative-rag/    ← Speculative RAG implementation (in progress)
```
Each subdirectory is an independent project with its own environment, Docker image, and GCP infrastructure.
| Component | Status | Notes |
|---|---|---|
| Standard RAG baseline | Complete | Vertex AI pipeline; see standard-rag/ |
| Speculative RAG | In progress | |
See standard-rag/README.md for the full setup guide including:
- GCP project configuration and API enablement
- GPU quota increase instructions (required — new projects default to 0 GPU quota)
- Terraform infra provisioning
- Docker build and push
- Running smoke tests and full evaluation on Vertex AI
```shell
cd standard-rag

# First-time setup
make gcp-enable-apis         # enable Vertex AI, Artifact Registry, GCS APIs
make infra-apply             # provision GCS bucket + Artifact Registry
make docker-push             # build and push the container image

# Run evaluation
make vertex-submit           # smoke test (100k passages, 500 examples)
make vertex-submit ENV=prod  # full eval (21M passages, 11k examples)
make fetch-results           # download results.json and print table
```

The project includes a live performance dashboard featuring:
- Live Latency Panel: p50/p95 latency, speedup ratios, tokens/sec, queries/sec.
- Subset/Diversity View: 2D projection/table showing evidence diversity across draft subsets.
- Verifier Scoring Breakdown: Visualization of the score components used for final selection.
- Rationale Inspection: Real-time viewing of document-grounded rationales.
See CONTRIBUTING.md for the team, branch workflow, commit style, and code standards.
See SECURITY.md for guidance on secret management, GCP service account scoping, and what to do if a credential is accidentally exposed.
This project is licensed under the MIT License — see the LICENSE file for details.
- Wang et al., Speculative RAG: Enhancing Retrieval Augmented Generation Through Drafting, ICLR 2025. (paper PDF)