# 🚀 SmartSearch
**Fault-Aware Async Ingestion + Semantic Retrieval Backend**
A production-style backend that ingests documents asynchronously, generates embeddings, and serves semantic search and RAG — with explicit guarantees for idempotency, crash recovery, and job lifecycle correctness.
This is not just another RAG demo.
It is a correctness-first ingestion and retrieval system designed to answer:
- What if the worker crashes mid-processing?
- What if Kafka replays messages?
- What if the database goes down?
- What if duplicate requests arrive?
This system is built to handle those scenarios deterministically, following four principles:
- Correctness over convenience
- Explicit state over hidden progress
- Failure-aware design
- Deterministic recovery
## Architecture
End-to-end flow: API → Kafka → Async Workers → pgvector → Vector Search

### Components
- **API Service**
  - Accepts ingestion requests
  - Produces messages to Kafka
  - Exposes search + RAG endpoints
- **Kafka**
  - Decouples ingestion from processing
  - Enables retry and replay
- **Worker**
  - Consumes messages
  - Generates embeddings
  - Writes to Postgres
- **Postgres + pgvector**
  - Stores embeddings
  - Enables similarity search
## 🟢 Stage 0 — Synchronous ingestion (baseline)
The API directly:
- generates embeddings
- stores them in the DB
❌ Problems:
- slow
- no retry
- no failure recovery
## 🟡 Stage 1 — Asynchronous ingestion
- API produces message to Kafka
- Worker processes asynchronously
✔️ Benefits:
- decoupling
- retry support
- resilience
## 🟠 Stage 2 — Job lifecycle state machine
Each request has explicit states: `PENDING → PROCESSING → READY | FAILED`
✔️ Guarantees:
- no hidden progress
- observable system state
- debuggable failures
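The lifecycle above can be sketched as a small state machine. This is an illustrative sketch, not the project's actual code; names like `JobState` are assumptions:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the job lifecycle: forward-only edges make progress monotonic.
enum JobState {
    PENDING, PROCESSING, READY, FAILED;

    // Allowed next states; READY and FAILED are terminal.
    Set<JobState> next() {
        switch (this) {
            case PENDING:    return EnumSet.of(PROCESSING);
            case PROCESSING: return EnumSet.of(READY, FAILED);
            default:         return EnumSet.noneOf(JobState.class);
        }
    }

    JobState transitionTo(JobState target) {
        if (!next().contains(target)) {
            throw new IllegalStateException(this + " -> " + target + " is not a legal transition");
        }
        return target;
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        JobState s = JobState.PENDING;
        s = s.transitionTo(JobState.PROCESSING);
        s = s.transitionTo(JobState.READY);
        System.out.println(s); // prints READY
        try {
            s.transitionTo(JobState.PENDING); // backward transition is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected");
        }
    }
}
```

Rejecting backward transitions at the type/validation level is what makes "no hidden progress" checkable rather than a convention.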
## 🔵 Stage 3 — Idempotent ingestion
- Duplicate messages (Kafka replay / retries) are safe
- Enforced via:
  - unique constraints (e.g., docId + chunkId)
  - deterministic writes
✔️ Guarantees:
- no duplicate embeddings
- safe reprocessing
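A minimal sketch of the idempotency mechanism, using an in-memory map in place of the Postgres unique constraint on (docId, chunkId); all class and method names here are illustrative assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the (docId, chunkId) unique constraint in Postgres.
class ChunkStore {
    private final Map<String, float[]> chunks = new ConcurrentHashMap<>();

    // Deterministic key: the same chunk always maps to the same row.
    private static String key(String docId, int chunkId) {
        return docId + ":" + chunkId;
    }

    /** Returns true if a new row was written, false if the chunk already existed. */
    boolean upsert(String docId, int chunkId, float[] embedding) {
        return chunks.putIfAbsent(key(docId, chunkId), embedding) == null;
    }

    int size() { return chunks.size(); }
}

class IdempotencyDemo {
    public static void main(String[] args) {
        ChunkStore store = new ChunkStore();
        float[] emb = {0.1f, 0.2f};
        System.out.println(store.upsert("doc-1", 0, emb)); // true: first write
        System.out.println(store.upsert("doc-1", 0, emb)); // false: Kafka replay is a no-op
        System.out.println(store.size());                  // 1: no duplicate embeddings
    }
}
```

In Postgres the same effect comes from a unique index on the (docId, chunkId) pair combined with a conflict-ignoring insert, so replaying a message converges to the same final state.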
## 🔴 Stage 4 — Failure handling + retries
- Worker retries failed jobs (bounded attempts)
- After retries are exhausted:
  - the job is marked FAILED
  - the message is sent to a dead-letter queue (DLQ)
✔️ Guarantees:
- no infinite retry loops
- bad data is isolated
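The bounded-retry policy can be sketched as follows. This is a simplified illustration; `MAX_ATTEMPTS` and the class names are assumptions, and a real worker would add backoff between attempts:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Simplified bounded-retry loop with a dead-letter queue (illustrative).
class RetryingWorker {
    static final int MAX_ATTEMPTS = 3;            // assumed bound, not the project's real config
    final Deque<String> deadLetterQueue = new ArrayDeque<>();

    /** Returns "READY" on success, "FAILED" after retry exhaustion. */
    String process(String message, Consumer<String> handler) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                handler.accept(message);
                return "READY";
            } catch (RuntimeException e) {
                // log and retry; a production worker would back off here
            }
        }
        deadLetterQueue.add(message);             // isolate the poison message
        return "FAILED";
    }
}

class RetryDemo {
    public static void main(String[] args) {
        RetryingWorker w = new RetryingWorker();
        System.out.println(w.process("good", m -> {}));  // READY
        System.out.println(w.process("poison",
            m -> { throw new RuntimeException("bad payload"); })); // FAILED
        System.out.println(w.deadLetterQueue);           // [poison]
    }
}
```

The hard cap on attempts is what rules out infinite retry loops; the DLQ keeps the bad message available for inspection without blocking the partition.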
## 🟣 Stage 5 — Observability
The system exposes:
- `/api/system/pressure`, reporting counts of:
  - pending jobs
  - processing jobs
  - failed jobs
  - ready jobs
- Logs covering:
  - job transitions
  - end-to-end latency
✔️ Guarantees:
- system is observable
- failures are visible
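As a sketch, the pressure endpoint can be thought of as a per-state job count; the response shape here is an assumption, not the actual API contract:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative: counts jobs per lifecycle state, as /api/system/pressure might report.
class PressureDemo {
    enum State { PENDING, PROCESSING, READY, FAILED }

    static Map<State, Long> pressure(List<State> jobs) {
        return jobs.stream().collect(
            Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<State> jobs = List.of(
            State.PENDING, State.PENDING, State.PROCESSING, State.READY, State.FAILED);
        System.out.println(pressure(jobs).get(State.PENDING)); // 2
    }
}
```

A growing PENDING count relative to READY is the backlog signal the staleness metric below is designed to surface.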
## Guarantees
**Correctness**
- At-least-once ingestion (via Kafka)
- Idempotent processing (no duplicate chunks)
- Deterministic job lifecycle

**Failure Safety**
- A worker crash does not corrupt state
- Partial processing is recoverable
- Retries are safe

**Data Integrity**
- No duplicate embeddings
- No partial chunk visibility
- Consistent DB state after recovery
## Invariants
- **Lifecycle:** `requestId` transitions monotonically: PENDING → PROCESSING → READY | FAILED (no backward transitions)
- **Idempotency:** Reprocessing the same request does not change the final DB state.
- **Visibility:** A document is searchable iff its state == READY.
- **Failure isolation:** A FAILED job does not corrupt other documents.
- **Recovery:** After a crash/restart, system state = last committed DB state + replay-safe Kafka processing.
| Failure Scenario | Expected Behavior |
|---|---|
| Worker crash mid-processing | Job is retried and completes successfully |
| Worker crash after DB write | Reprocessing occurs but duplicates are prevented |
| Kafka broker restart | Processing resumes with no data loss |
| Postgres outage | Worker retries; job eventually READY or FAILED |
| Poison message | Retries exhausted → FAILED + DLQ |
| Duplicate request | No duplicate embeddings created |
## Observability stack
SmartSearch includes a full observability stack using Prometheus and Grafana to monitor system behavior, performance, and reliability.
The system exposes metrics via Spring Boot Actuator and Micrometer:
- HTTP request rate and latency
- Ingestion pipeline metrics:
  - received, succeeded, failed, retries, DLQ
- Processing age (staleness metric):
  - measures how long a job waits before being processed
  - helps detect backlog and async system pressure
- Database metrics:
  - Hikari connection pool (active, idle, total)
  - DB write throughput
Grafana dashboards track:
- Requests/sec
- Ingest success/failure/retry counts
- Latency and DLQ traffic
- Success vs failure rates
- Retry behavior
- Worker health
- Processing age (key signal for backlog detection)
- Connection pool usage
- DB latency signals
- Write throughput
Run `docker compose up -d`, then access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
## 🧪 Failure Proof (Reproducible Tests)
**T1 — Crash mid-processing**
- Submit document
- Kill worker during processing
✔️ Expected:
- Job resumes
- No duplicate chunks
**T2 — Crash after DB write**
- Kill worker after write but before commit acknowledgment
✔️ Expected:
- Reprocessing occurs
- No duplicates (idempotency holds)
**T3 — Kafka restart**
- Stop Kafka during ingestion
- Restart Kafka
✔️ Expected:
- Worker resumes
- No message loss
**T4 — Database outage**
- Stop Postgres
- Submit job
- Restart DB
✔️ Expected:
- Worker retries
- Job becomes READY or FAILED
**T5 — Poison message**
- Submit malformed document
✔️ Expected:
- Retries attempted
- Job marked FAILED
- Message sent to DLQ
## 🔍 API Endpoints (example)
**Ingestion**
```bash
POST /api/documents
```

**Search**
```bash
GET /api/search?q=...
```

**RAG**
```bash
POST /api/rag
```

**System Pressure**
```bash
GET /api/system/pressure
```

Most AI systems focus on:
- embeddings
- LLMs
- retrieval quality

This project focuses on what happens when the system breaks. That’s what differentiates:
- demos → production systems
- prototypes → infrastructure
## Design trade-offs
- **At-least-once vs exactly-once:** Chose at-least-once + idempotency for simplicity and robustness.
- **Kafka vs direct processing:** Kafka adds operational complexity but enables replay and durability.
- **Async vs sync ingestion:** Async improves resilience and scalability but introduces eventual consistency.
- **Postgres + pgvector vs dedicated vector DB:** Simpler stack and transactional guarantees, but not optimized for very large-scale vector search.
## Future work
- Exactly-once semantics (Kafka transactions)
- Distributed workers + partition-aware scaling
- Backpressure-aware scheduling
- Streaming ingestion
- BFT-style replicated ingestion pipeline (long-term vision)
This project sits at the intersection of:
- Distributed systems
- AI infrastructure
- Fault-tolerant pipelines

It is designed as a foundation layer for reliable AI systems.
This system reflects a broader focus on:
- fault tolerance
- correctness guarantees
- distributed system design
Similar to how storage engines ensure durability and consistency, this system ensures reliable AI data pipelines under failure.
