```
 _     ___  ____  _____ _  _______ _____ ____  _____ ____
| |   / _ \|  _ \| ____| |/ / ____| ____|  _ \| ____|  _ \
| |  | | | | |_) |  _| | ' /|  _| |  _| | |_) |  _| | |_) |
| |__| |_| |  _ <| |___| . \| |___| |___|  __/| |___|  _ <
|_____\___/|_| \_\_____|_|\_\_____|_____|_|   |_____|_| \_\
```
Ask anything of the archives of Middle-earth.
🔗 Live Demo → lorekeeper-ochre.vercel.app
A Hybrid GraphRAG question-answering system built over Tolkien's Middle-earth lore. Combines a Neo4j knowledge graph with ChromaDB vector search for retrieval, then uses Gemini to generate grounded, cited answers in natural language.
Ask any question about Tolkien's legendarium in plain English, for example:
- "Who forged the One Ring?"
- "What is the relationship between Aragorn and Isildur?"
- "What happened at the Battle of Helm's Deep?"
- "How are Frodo and Bilbo related?"
- "What are the Silmarils and why do they matter?"
The system retrieves facts from both a structured knowledge graph and semantic vector search, then synthesizes a cited, accurate answer using Gemini.
```
                 User Question
                      │
                      ▼
┌─────────────────────────────────────────────┐
│               Query Pipeline                │
│                                             │
│  ┌───────────────┐    ┌──────────────────┐  │
│  │    Graph      │    │     Vector       │  │
│  │   Retriever   │    │    Retriever     │  │
│  │               │    │                  │  │
│  │ • extract     │    │ • embed query    │  │
│  │   entities    │    │ • ChromaDB       │  │
│  │ • fuzzy match │    │   cosine search  │  │
│  │ • Neo4j       │    │ • top-5 chunks   │  │
│  │   traversal   │    │                  │  │
│  │ • triples     │    │                  │  │
│  └───────┬───────┘    └────────┬─────────┘  │
│          │                     │            │
│          └──────────┬──────────┘            │
│                     ▼                       │
│            Context Assembler                │
│       GRAPH FACTS + TEXT CONTEXT            │
│                     │                       │
│                     ▼                       │
│          Gemini Answer Generation           │
└─────────────────────┬───────────────────────┘
                      │
                      ▼
                 Cited Answer
```
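The flow above can be sketched end to end in a few lines. Everything in this sketch is illustrative: the function names, triple format, and stub return values are invented for the example and are not the project's actual API — the real retrievers call Neo4j and ChromaDB, and the real answer step calls Gemini.

```python
# Hypothetical sketch of the query pipeline: both retrievers always run,
# their outputs are merged into one context, and the LLM answers from it.

def graph_retrieve(question: str) -> list[str]:
    # Real system: extract entities, fuzzy-match against Neo4j node names,
    # traverse the neighborhood, and return subject-predicate-object triples.
    return ["(Sauron)-[FORGED]->(One Ring)"]

def vector_retrieve(question: str) -> list[str]:
    # Real system: embed the question, run a top-5 cosine search in ChromaDB.
    return ["Sauron secretly forged the One Ring in Mount Doom..."]

def assemble_context(triples: list[str], chunks: list[str]) -> str:
    # Both paths run unconditionally; there is no query classifier.
    return ("GRAPH FACTS:\n" + "\n".join(triples)
            + "\n\nTEXT CONTEXT:\n" + "\n".join(chunks))

def answer(question: str) -> str:
    context = assemble_context(graph_retrieve(question), vector_retrieve(question))
    # Real system: send context + question to Gemini with a grounding instruction.
    return f"{question}\n{context}"

print(answer("Who forged the One Ring?"))
```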
| Retrieval Type | Strength |
|---|---|
| Graph path | Explicit relationships: kinship, allegiance, geography, events |
| Vector path | Semantic similarity: handles typos, paraphrasing, vague questions |
| Always both | No query classifier: both paths run unconditionally on every question |
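The vector path's top-5 cosine ranking can be illustrated with a dependency-free sketch. ChromaDB and all-MiniLM-L6-v2 do this for real over 384-dimensional embeddings; the 3-dimensional vectors and chunk IDs below are made up purely for the example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5) -> list[str]:
    # ChromaDB performs this ranking internally; shown here explicitly.
    scored = sorted(chunk_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy 3-dim "embeddings", invented for illustration.
chunks = {
    "ring_forging": [0.9, 0.1, 0.0],
    "helms_deep":   [0.1, 0.9, 0.0],
    "shire_life":   [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], chunks, k=2))  # ring_forging ranks first
```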
Tested on 10 typed questions covering factual, relationship, event, location, artifact, character, and thematic queries:
| Metric | Result |
|---|---|
| Answer rate | 100% (10/10) |
| Graph hit rate | 100% |
| Vector hit rate | 100% |
| Both paths used | 100% |
| Average latency | ~9 seconds |
| Property | Value |
|---|---|
| Source | English Wikipedia (76 pages) |
| Coverage | LotR, The Hobbit, The Silmarillion, Unfinished Tales |
| Chunks | 1,115 overlapping text passages (500 tokens, 100 overlap) |
| Vectors | 1,115 embeddings (all-MiniLM-L6-v2, 384 dims) |
| Graph nodes | ~3,130 |
| Graph edges | 18 relationship types |
| Chunks extracted | 651 / 1,115 |
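The project uses LangChain's RecursiveCharacterTextSplitter for the 500-token windows with 100-token overlap; the dependency-free sliding-window sketch below illustrates the same idea, using whitespace tokens as a rough stand-in (the real splitter is recursive and respects paragraph and sentence boundaries).

```python
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Naive whitespace tokenisation; each window advances by (size - overlap)
    # so consecutive chunks share `overlap` tokens of context.
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

# 1,200 dummy tokens -> 3 overlapping chunks of up to 500 tokens each.
passages = chunk("word " * 1200, size=500, overlap=100)
```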
Node labels: Character · Location · Event · Artifact · Faction

Relationship types:

| Category | Predicates |
|---|---|
| Kinship | CHILD_OF · SIBLING_OF · SPOUSE_OF · HEIR_OF |
| Alliance & Enmity | ALLY_OF · ENEMY_OF · SERVANT_OF |
| Faction & Politics | MEMBER_OF · RULES_OVER |
| Craftsmanship | CREATED · FORGED_BY · WIELDED |
| Geography | BORN_IN · PART_OF · LOCATED_IN |
| Events | OCCURRED_AT · PARTICIPATED_IN · RESULTED_IN |
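In Neo4j, pulling an entity's neighborhood is a one- or two-hop Cypher MATCH; the in-memory sketch below shows the shape of that traversal, turning a matched entity into (subject, predicate, object) triples. The tiny adjacency list is invented sample data, not the project's graph.

```python
# Toy adjacency list keyed by canonical entity name; edge labels mirror the
# predicate vocabulary above. Invented sample data for illustration.
GRAPH = {
    "Aragorn": [("HEIR_OF", "Isildur"), ("MEMBER_OF", "Fellowship of the Ring")],
    "Isildur": [("CHILD_OF", "Elendil")],
}

def neighborhood_triples(entity: str, depth: int = 1) -> list[tuple[str, str, str]]:
    # Breadth-first expansion up to `depth` hops, emitting a triple per edge.
    triples, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for predicate, target in GRAPH.get(node, []):
                triples.append((node, predicate, target))
                next_frontier.append(target)
        frontier = next_frontier
    return triples

print(neighborhood_triples("Aragorn", depth=2))
```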
| Layer | Technology |
|---|---|
| Knowledge Graph | Neo4j AuraDB Free |
| Vector DB | ChromaDB (persistent) |
| Embeddings | all-MiniLM-L6-v2 (384 dims) |
| Extraction LLM | Groq (Llama 3.1 8B / Qwen3 32B) |
| Answer LLM | Google Gemini 2.5 Flash |
| Fuzzy Matching | RapidFuzz (threshold 85.0) |
| Backend | FastAPI + Uvicorn |
| Frontend | Next.js + Tailwind CSS |
| Backend Hosting | Railway |
| Frontend Hosting | Vercel |
| Scraping | httpx + BeautifulSoup4 |
| Chunking | LangChain RecursiveCharacterTextSplitter |
| Language | Python 3.11+ / TypeScript |
```
lore_analyser/                   # Backend (Python)
│
├── app.py                       # FastAPI REST API
├── main.py                      # CLI entry point
├── config.py                    # Centralised settings
├── requirements.txt
│
├── ingestion/
│   ├── scraper.py               # Wikipedia scraper
│   └── document_loader.py
│
├── chunking/
│   ├── text_cleaner.py
│   └── chunker.py
│
├── embeddings/
│   ├── embedder.py              # sentence-transformers wrapper
│   └── chroma_store.py          # ChromaDB interface
│
├── extraction/
│   ├── prompt_templates.py      # LLM extraction prompts
│   └── entity_extractor.py      # Groq entity/relation extractor
│
├── graph/
│   ├── neo4j_client.py          # Neo4j driver wrapper
│   ├── deduplicator.py          # RapidFuzz name canonicalisation
│   ├── graph_builder.py         # MERGE nodes/edges into Neo4j
│   └── graph_traversal.py       # Cypher neighborhood queries
│
├── retrieval/
│   ├── vector_retriever.py      # ChromaDB similarity search
│   ├── graph_retriever.py       # Entity extraction + fuzzy match
│   └── context_assembler.py     # Merges graph + vector context
│
├── pipeline/
│   ├── ingestion_pipeline.py
│   └── query_pipeline.py        # Full query orchestration
│
├── evaluation/
│   ├── eval_runner.py
│   ├── metrics.py
│   └── questions.json           # 10 typed eval questions
│
└── data/
    ├── chunks/chunks.json       # 1,115 text chunks
    ├── chroma_db/               # ChromaDB vector index
    └── evaluation/results.json  # Evaluation results
```
- Python 3.11+
- Node.js 18+
- Neo4j AuraDB Free → console.neo4j.io
- Google Gemini API key → aistudio.google.com
- Groq API key → console.groq.com
Backend setup:

```bash
git clone https://github.com/Erenjaegaaa/lore_analyser
cd lore_analyser
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux
pip install -r requirements.txt
```

Create `.env`:

```
GEMINI_API_KEY=your_gemini_key
GROQ_API_KEY=your_groq_key
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE=neo4j
CHROMA_PERSIST_PATH=./data/chroma_db
CHROMA_COLLECTION_NAME=lore_chunks
LOG_LEVEL=INFO
```

Run the API:

```bash
uvicorn app:app --reload --port 8000
```

Frontend setup:

```bash
git clone https://github.com/Erenjaegaaa/LORE_KEEPER_FRONTEND
cd LORE_KEEPER_FRONTEND
npm install
```

Create `.env.local`:

```
NEXT_PUBLIC_API_URL=http://localhost:8000
```

Run the dev server:

```bash
npm run dev
```

CLI usage:

```bash
# Single query
python main.py query "Who forged the One Ring?"

# Interactive session
python main.py interactive
```

Ingestion pipeline:

```bash
# Day 1: scrape and chunk
python run_day1.py

# Day 2: embed into ChromaDB
python run_day2.py

# Day 3: extract entities and build Neo4j graph
python run_day3.py --delay 2 --start 0 --limit 250

# Resume from a checkpoint
python run_day3.py --delay 2 --start 250
```

Evaluation:

```bash
python -m evaluation.eval_runner
python -m evaluation.metrics
```

| Component | Platform |
|---|---|
| Frontend | Vercel |
| Backend | Railway |
| Graph DB | Neo4j AuraDB Free |
| Vector DB | ChromaDB (bundled with backend) |
Built with Next.js and designed with a dark fantasy aesthetic.
Why hybrid retrieval? Pure vector search misses explicit relationships ("Who is Aragorn's father?"). Pure graph search misses semantic queries ("Tell me about the corruption of power in Tolkien"). Running both unconditionally gives the best of both worlds with no routing complexity or failure modes.
Why Wikipedia over fan wikis? Tolkien Gateway and lotr.fandom.com use Cloudflare protection that blocks scrapers. Wikipedia has comprehensive, well-structured coverage with the same MediaWiki HTML structure.
Why Groq for extraction? Gemini's free tier allows only 20 requests/day, far too slow for 1,115 chunks. Groq provides 500K tokens/day free across multiple models, making full-dataset extraction feasible at zero cost.
Why RapidFuzz deduplication? LLMs extract the same entity with slight name variations across chunks. Fuzzy matching at 85% threshold collapses variants like "Aragorn", "Aragorn son of Arathorn", and "Strider" to a single canonical node before writing to Neo4j.
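The canonicalisation step can be sketched with stdlib `difflib` as a stand-in for RapidFuzz. Note the caveats: `difflib`'s ratio differs from RapidFuzz's scorers, and a plain full-string ratio collapses spelling variants but will not merge true aliases like "Strider", which need partial matching or an alias table.

```python
from difflib import SequenceMatcher

def canonicalise(names: list[str], threshold: float = 0.85) -> dict[str, str]:
    # Map each extracted name to the first sufficiently similar canonical name;
    # otherwise promote it to a new canonical node.
    canon: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        best = max(
            canon,
            key=lambda c: SequenceMatcher(None, name.lower(), c.lower()).ratio(),
            default=None,
        )
        if best and SequenceMatcher(None, name.lower(), best.lower()).ratio() >= threshold:
            mapping[name] = best
        else:
            canon.append(name)
            mapping[name] = name
    return mapping

print(canonicalise(["Galadriel", "Galadriell", "Sauron"]))
# Spelling variants collapse; distinct names survive as their own nodes.
```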
Why no query classifier? Classifiers add latency and failure modes. Both retrieval paths are cheap enough to run unconditionally, and Gemini synthesizes whichever context is most relevant automatically.
- Graph coverage: 651 of 1,115 chunks (58%) extracted into the graph. The remaining chunks are reachable via vector search only.
- Graph noise: LLM extraction introduces roughly 10–15% incorrect relations. The vector path and Gemini's grounding instruction compensate in most cases.
- Latency: ~9–16 seconds per query, dominated by the Gemini API call and cloud Neo4j round trips.
- Real-world noise: Wikipedia's adaptation and reception sections introduce non-lore entities (directors, actors, scholars) into the graph.
- Complete graph extraction for all 1,115 chunks
- Add relation frequency filtering to reduce graph noise
- Implement Gemini streaming for lower perceived latency
- Add conversation memory for multi-turn Q&A
- Expand dataset to Unfinished Tales, The History of Middle-earth series
- Fine-tune a small model on the extraction task for better predicate adherence
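The proposed relation-frequency filter could look like the sketch below: keep a triple only if the extractor produced it from multiple chunks. The `min_support` value and the sample triples are assumptions for illustration, not part of the current codebase.

```python
from collections import Counter

def filter_relations(triples: list[tuple[str, str, str]], min_support: int = 2):
    # Triples asserted by only one chunk are more likely to be extraction noise.
    counts = Counter(triples)
    return [t for t, n in counts.items() if n >= min_support]

extracted = [
    ("Sauron", "CREATED", "One Ring"),   # attested by two chunks
    ("Sauron", "CREATED", "One Ring"),
    ("Frodo", "RULES_OVER", "Gondor"),   # hallucinated once
]
print(filter_relations(extracted))  # only the doubly-attested triple survives
```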
Saahil · maker / CSE student
GitHub: @Erenjaegaaa

"Not all those who wander are lost." — J.R.R. Tolkien