Retrieval-Augmented Generation (RAG) using the Endee vector database
- Problem Statement
- System Design
- How Endee Is Used
- Project Structure
- Setup and Installation
- Running the Application
- API Reference
- Running Tests
- Configuration
- Example Walkthrough
Knowledge workers spend enormous time searching through large document collections for specific answers.
Traditional keyword search fails to capture semantic meaning — a query for "climate solutions" won't match a passage about "carbon-reduction strategies" even though they mean the same thing.
DocuMind solves this with a RAG pipeline:
- Documents are chunked, embedded into dense vector representations, and stored in Endee for sub-millisecond similarity search.
- When a user asks a question, the most semantically relevant chunks are retrieved from Endee.
- An LLM uses only those retrieved chunks to generate a grounded, cited answer — eliminating hallucination.
┌───────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ Raw Text / File │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌───────────────────┐ ┌────────────┐ │
│ │ Text │ │ Sentence │ │ Endee │ │
│ │ Chunker │───▶│ Transformer │───▶│ Vector DB │ │
│ │ (512 char, │ │ (all-MiniLM-L6) │ │ (cosine, │ │
│ │ 64 ovlap) │ │ 384-dim vectors │ │ INT8) │ │
│ └─────────────┘ └───────────────────┘ └────────────┘ │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ User Question │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌────────────┐ ┌──────────────────┐ │
│ │ Embedding │ │ Endee │ │ LLM (GPT-4o) │ │
│ │ Model │───▶│ ANN Search│───▶│ + Context │ │
│ │ (same model │ │ top-k=5 │ │ → Grounded │ │
│ │ as ingest) │ │ chunks │ │ Answer │ │
│ └──────────────┘ └────────────┘ └──────────────────┘ │
└───────────────────────────────────────────────────────────────┘
| Decision | Choice | Reason |
|---|---|---|
| Embedding model | all-MiniLM-L6-v2 |
Fast, accurate, 384-dim — ideal balance for local inference |
| Similarity metric | Cosine | Normalised embeddings → cosine = dot product; direction > magnitude |
| Precision | INT8 | Endee INT8 quantisation gives ~4× memory savings with negligible accuracy loss |
| Chunk size | 512 chars / 64 overlap | Preserves sentence context; overlap prevents boundary misses |
| LLM | GPT-4o-mini (optional) | Low cost, strong instruction following; fallback to extractive mode |
Endee is the core vector storage and retrieval engine of DocuMind. Every interaction with the knowledge base goes through Endee.
from endee import Endee, Precision
client = Endee() # connects to localhost:8080
client.create_index(
name="documind_chunks",
dimension=384, # matches all-MiniLM-L6-v2 output
space_type="cosine", # semantic similarity metric
precision=Precision.INT8, # memory-efficient quantisation
)Each chunk is stored with its embedding and full metadata (text, filename, doc_id):
index = client.get_index("documind_chunks")
index.upsert([
{
"id": "abc123_c0001",
"vector": [0.12, -0.45, ...], # 384-dim float list
"meta": {
"doc_id": "abc123",
"filename": "climate_report.txt",
"text": "The Paris Agreement commits nations to…",
"chunk_idx": 1,
},
}
])The query is embedded with the same model and sent to Endee's ANN index:
query_vector = embedder.encode("What is the Paris Agreement?").tolist()
results = index.query(vector=query_vector, top_k=5)
# → returns list of {id, similarity, meta} sorted by cosine similarity- High-performance HNSW indexing — sub-millisecond ANN queries at scale
- Simple REST / SDK API — no complex configuration required
- Docker-ready — single
docker compose upto start - Up to 1B vectors on a single node — production-grade scalability
- INT8 precision support — 4× memory reduction with minimal quality loss
documind/
├── app.py # FastAPI REST API server
├── demo.py # CLI demo (no server required)
├── requirements.txt
├── Dockerfile
├── docker-compose.yml # Runs Endee + DocuMind together
├── src/
│ ├── __init__.py
│ ├── rag_engine.py # Core: chunking, embedding, Endee upsert/search
│ └── qa_pipeline.py # QA: context building + LLM generation
└── tests/
└── test_documind.py # Unit tests (chunking, ingestion, search, QA)
- Python 3.11+
- Docker + Docker Compose (for Endee)
- (Optional) OpenAI API key for generative answers
Required by evaluation rules: Star and fork the Endee repository before proceeding.
# After forking on GitHub:
git clone https://github.com/<your-username>/endee
cd endee
# Then clone DocuMind alongside it
git clone https://github.com/<your-username>/documind
cd documinddocker compose up endee -d
# Endee is now running at http://localhost:8080Verify:
curl http://localhost:8080/api/v1/index/list
# → {"indexes": []}python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txtcp .env.example .env
# Edit .env:
# OPENAI_API_KEY=sk-... ← enables generative mode
# OPENAI_MODEL=gpt-4o-mini
# ENDEE_HOST=http://localhost:8080No server needed. Ingests 3 sample documents and runs Q&A:
python demo.py
# Ask a custom question:
python demo.py --question "What are the benefits of renewable energy?"Sample output:
════════════════════════════════════════════════════════════
DocuMind — RAG Demo (powered by Endee Vector DB)
════════════════════════════════════════════════════════════
[1/3] Initialising RAG engine …
[2/3] Ingesting sample documents …
✓ climate_change.txt (4 chunks, doc_id=a1b2c3d4e5f6)
✓ machine_learning.txt (4 chunks, doc_id=f6e5d4c3b2a1)
✓ space_exploration.txt (4 chunks, doc_id=123456789abc)
[3/3] Running 5 Q&A queries …
────────────────────────────────────────────────────────────
Q: What caused climate change?
A: Climate change has been primarily driven by human activities since the 1800s,
especially the burning of fossil fuels such as coal, oil, and gas, which
release heat-trapping gases. [Source: climate_change.txt]
Mode: extractive
Sources:
• climate_change.txt (sim=0.9732) Climate change refers to long-term shifts…
uvicorn app:app --host 0.0.0.0 --port 8000 --reloadAPI docs available at: http://localhost:8000/docs
# Set your OpenAI key (optional)
export OPENAI_API_KEY=sk-...
docker compose up --buildBoth services start:
- Endee:
http://localhost:8080 - DocuMind API:
http://localhost:8000
Ingest raw text into the knowledge base.
{
"text": "The Great Wall of China is one of the greatest wonders…",
"filename": "great_wall.txt"
}Response:
{
"status": "ingested",
"doc_id": "a1b2c3d4e5f6",
"filename": "great_wall.txt",
"num_chunks": 3
}Upload a .txt file:
curl -X POST http://localhost:8000/ingest/file \
-F "file=@my_document.txt"Ask a question and get a grounded answer:
{
"question": "What is the Great Wall of China made of?",
"top_k": 5
}Response:
{
"question": "What is the Great Wall of China made of?",
"answer": "The Great Wall was built from stone, brick, tamped earth…",
"sources": [
{
"filename": "great_wall.txt",
"chunk_id": "a1b2c3_c0001",
"similarity": 0.9412,
"excerpt": "The Great Wall of China is one of…"
}
],
"mode": "generative"
}Raw semantic search (no LLM generation):
{
"query": "renewable energy sources",
"top_k": 3
}Service health check.
python -m pytest tests/ -vThe test suite covers:
- Text chunking (edge cases, overlap, whitespace)
- Document ingestion (chunk count, Endee upsert called)
- Semantic search (result parsing, empty results)
- QA pipeline (extractive mode, no-results graceful handling)
All tests use mocks for Endee and the embedding model — no live server required.
| Environment Variable | Default | Description |
|---|---|---|
ENDEE_HOST |
http://localhost:8080 |
Endee server URL |
ENDEE_AUTH_TOKEN |
"" |
Endee auth token (leave blank for open mode) |
OPENAI_API_KEY |
"" |
OpenAI key (enables generative mode) |
OPENAI_MODEL |
gpt-4o-mini |
OpenAI model to use |
from src.rag_engine import RAGEngine
from src.qa_pipeline import QAPipeline
# 1. Start Endee (docker compose up endee -d)
# 2. Initialise
rag = RAGEngine() # connects to Endee, creates index if needed
qa = QAPipeline(rag_engine=rag)
# 3. Ingest a document
with open("my_report.txt") as f:
rag.ingest_text(f.read(), filename="my_report.txt")
# 4. Ask questions
result = qa.ask("What were the main findings?")
print(result["answer"])
# → "The main findings indicate that… [Source: my_report.txt]"
---
## Evaluation Compliance
This project strictly follows the mandatory evaluation guidelines:
- ⭐ Starred the official Endee repository:
https://github.com/endee-io/endee
- 🍴 Forked the Endee repository to my personal GitHub account:
https://github.com/vishalkumar-swe/endee
- Implemented a complete RAG (Retrieval-Augmented Generation) system using Endee as the core vector database.
- Hosted the full project publicly on GitHub with complete setup and execution instructions.
All required steps for the project-based evaluation have been completed.
<img width="1915" height="958" alt="image" src="https://github.com/user-attachments/assets/37b4d77a-1622-43f0-ad06-d7095689277c" />Clone this project to your local machine:
git clone https://github.com/vishalkumar-swe/documind-rag-endee.git
cd documind-rag-endee
## 📄 License
MIT — see [LICENSE](LICENSE)