Baseline:
- Naive RAG
- Flat chunking
- Default FAISS
- CPU inference
Optimizations:
- Semantic + temporal hybrid chunking
- Adaptive top-k retrieval
- HNSW tuning for recall/latency balance
- Quantized CPU inference (GGUF)
Results:
- p95 latency ↓ 73.6%
- cost/query ↓ 83.3%
- Recall preserved under noisy OCR
This pattern has held across:
- 12 → 120k documents
- Single-node → horizontally scaled CPU deployments