Erenjaegaaa/lore-keeper

 _     ___  ____  _____   _  _______ _____ ____  _____ ____  
| |   / _ \|  _ \| ____| | |/ / ____| ____|  _ \| ____|  _ \ 
| |  | | | | |_) |  _|   | ' /|  _| |  _| | |_) |  _| | |_) |
| |__| |_| |  _ <| |___  | . \| |___| |___|  __/| |___|  _ < 
|_____\___/|_| \_\_____| |_|\_\_____|_____|_|   |_____|_| \_\

Ask anything of the archives of Middle-earth.

🌐 Live Demo → lorekeeper-ochre.vercel.app

A Hybrid GraphRAG question-answering system built over Tolkien's Middle-earth lore. Combines a Neo4j knowledge graph with ChromaDB vector search for retrieval, then uses Gemini to generate grounded, cited answers in natural language.


✨ What it does

Ask any question about Tolkien's legendarium in plain English, such as:

"Who forged the One Ring?"
"What is the relationship between Aragorn and Isildur?"
"What happened at the Battle of Helm's Deep?"
"How are Frodo and Bilbo related?"
"What are the Silmarils and why do they matter?"

The system retrieves facts from both a structured knowledge graph and semantic vector search, then synthesizes a cited, accurate answer using Gemini.


🏗️ Architecture

User Question
      │
      ▼
┌─────────────────────────────────────────────┐
│              Query Pipeline                 │
│                                             │
│  ┌───────────────┐    ┌──────────────────┐  │
│  │ Graph         │    │ Vector           │  │
│  │ Retriever     │    │ Retriever        │  │
│  │               │    │                  │  │
│  │ → extract     │    │ → embed query    │  │
│  │   entities    │    │ → ChromaDB       │  │
│  │ → fuzzy match │    │   cosine search  │  │
│  │ → Neo4j       │    │ → top-5 chunks   │  │
│  │   traversal   │    │                  │  │
│  │ → triples     │    │                  │  │
│  └───────────────┘    └──────────────────┘  │
│          │                    │             │
│          └──────────┬─────────┘             │
│                     ▼                       │
│            Context Assembler                │
│       GRAPH FACTS + TEXT CONTEXT            │
│                     │                       │
│                     ▼                       │
│          Gemini Answer Generation           │
└─────────────────────────────────────────────┘
      │
      ▼
  Cited Answer
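
The vector path in the diagram ranks chunks by cosine similarity and keeps the top five. The following is a minimal, dependency-free sketch of that ranking step, not the project's code: in the real system ChromaDB performs the search, and the function names and toy 3-dimensional vectors here are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index; real embeddings have 384 dimensions.
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
print(top_k_chunks([1.0, 0.0, 0.0], toy_index, k=2))
```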

Why Hybrid?

| Retrieval Type | Strength |
|---|---|
| Graph path | Explicit relationships: kinship, allegiance, geography, events |
| Vector path | Semantic similarity: handles typos, paraphrasing, vague questions |
| Always both | No query classifier: both paths run unconditionally on every question |
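
Running both paths unconditionally and merging their output can be sketched as below. This is an illustrative outline under stated assumptions: the retrievers are stub functions, and `Context`, `assemble_prompt`, and `run_query` are hypothetical names, not the contents of `context_assembler.py` or `query_pipeline.py`. Only the two section labels, GRAPH FACTS and TEXT CONTEXT, come from the diagram above.

```python
from dataclasses import dataclass
from typing import Callable

Triple = tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Context:
    graph_facts: list[Triple]
    text_chunks: list[str]


def assemble_prompt(question: str, ctx: Context) -> str:
    """Lay out graph triples and text passages as two labelled sections."""
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in ctx.graph_facts) or "- (none)"
    chunks = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(ctx.text_chunks, start=1)
    ) or "(none)"
    return (
        f"GRAPH FACTS:\n{facts}\n\n"
        f"TEXT CONTEXT:\n{chunks}\n\n"
        f"QUESTION: {question}"
    )


def run_query(
    question: str,
    graph_retriever: Callable[[str], list[Triple]],
    vector_retriever: Callable[[str], list[str]],
) -> str:
    """Call both retrievers on every question; there is no routing step."""
    ctx = Context(
        graph_facts=graph_retriever(question),
        text_chunks=vector_retriever(question),
    )
    return assemble_prompt(question, ctx)


# Stub retrievers standing in for the Neo4j and ChromaDB lookups.
def fake_graph(question: str) -> list[Triple]:
    return [("One Ring", "FORGED_BY", "Sauron")]


def fake_vector(question: str) -> list[str]:
    return ["Sauron forged the One Ring in the fires of Mount Doom."]


print(run_query("Who forged the One Ring?", fake_graph, fake_vector))
```

The numbered passages give the answer model something to cite; an empty result on one path still yields a well-formed prompt.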

📊 Evaluation

Tested on 10 typed questions covering factual, relationship, event, location, artifact, character, and thematic queries:

| Metric | Result |
|---|---|
| Answer rate | 100% (10/10) |
| Graph hit rate | 100% |
| Vector hit rate | 100% |
| Both paths used | 100% |
| Average latency | ~9 seconds |
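
One way the metrics in this table could be computed from per-question records is sketched below. The `EvalRecord` fields and the `summarise` function are assumptions for illustration; the actual layout of `results.json` and the logic in `metrics.py` may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalRecord:
    question: str
    answered: bool      # the pipeline returned a non-empty answer
    graph_triples: int  # triples returned by the graph path
    vector_chunks: int  # chunks returned by the vector path
    latency_s: float    # end-to-end latency in seconds


def summarise(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate per-question records into percentage rates and mean latency."""
    n = len(records)
    if n == 0:
        raise ValueError("no evaluation records")

    def rate(predicate: Callable[[EvalRecord], bool]) -> float:
        return 100.0 * sum(1 for r in records if predicate(r)) / n

    return {
        "answer_rate": rate(lambda r: r.answered),
        "graph_hit_rate": rate(lambda r: r.graph_triples > 0),
        "vector_hit_rate": rate(lambda r: r.vector_chunks > 0),
        "both_paths_used": rate(lambda r: r.graph_triples > 0 and r.vector_chunks > 0),
        "avg_latency_s": sum(r.latency_s for r in records) / n,
    }
```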

🗂️ Dataset

| Property | Value |
|---|---|
| Source | English Wikipedia (76 pages) |
| Coverage | LotR, The Hobbit, The Silmarillion, Unfinished Tales |
| Chunks | 1,115 overlapping text passages (500 tokens, 100 overlap) |
| Vectors | 1,115 embeddings (all-MiniLM-L6-v2, 384 dims) |
| Graph nodes | ~3,130 |
| Graph edges | 18 relationship types |
| Chunks extracted | 651 / 1,115 |
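
The overlap scheme in the Chunks row (windows of 500 with 100 shared between neighbours) can be illustrated with a fixed sliding window. This is a simplified stand-in, not the project's chunker: the project uses LangChain's RecursiveCharacterTextSplitter, which also splits on separators, and the `chunk_tokens` function here is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 1,200 placeholder tokens.
tokens = [f"t{i}" for i in range(1200)]
for chunk in chunk_tokens(tokens):
    print(len(chunk), chunk[0], chunk[-1])
```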

Neo4j Knowledge Graph Schema

Node labels: Character · Location · Event · Artifact · Faction

Relationship types:

| Category | Predicates |
|---|---|
| Kinship | CHILD_OF · SIBLING_OF · SPOUSE_OF · HEIR_OF |
| Alliance & Enmity | ALLY_OF · ENEMY_OF · SERVANT_OF |
| Faction & Politics | MEMBER_OF · RULES_OVER |
| Craftsmanship | CREATED · FORGED_BY · WIELDED |
| Geography | BORN_IN · PART_OF · LOCATED_IN |
| Events | OCCURRED_AT · PARTICIPATED_IN · RESULTED_IN |
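
A schema like this lends itself to a whitelist check before anything is written to the graph. The sketch below builds a Cypher MERGE statement only for labels and predicates listed above. It is an assumption about how such a check could look, not the code in `graph_builder.py`; `build_merge` and the `$subject` / `$object` parameter names are hypothetical. Labels and relationship types are interpolated into the query text, which is why they are validated first, while entity names are left as query parameters.

```python
ALLOWED_LABELS = {"Character", "Location", "Event", "Artifact", "Faction"}

ALLOWED_PREDICATES = {
    "CHILD_OF", "SIBLING_OF", "SPOUSE_OF", "HEIR_OF",   # kinship
    "ALLY_OF", "ENEMY_OF", "SERVANT_OF",                # alliance and enmity
    "MEMBER_OF", "RULES_OVER",                          # faction and politics
    "CREATED", "FORGED_BY", "WIELDED",                  # craftsmanship
    "BORN_IN", "PART_OF", "LOCATED_IN",                 # geography
    "OCCURRED_AT", "PARTICIPATED_IN", "RESULTED_IN",    # events
}


def build_merge(subject_label: str, predicate: str, object_label: str) -> str:
    """Return a MERGE query for one triple, rejecting anything outside the schema."""
    for label in (subject_label, object_label):
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unknown node label: {label}")
    if predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"unknown predicate: {predicate}")
    return (
        f"MERGE (s:{subject_label} {{name: $subject}}) "
        f"MERGE (o:{object_label} {{name: $object}}) "
        f"MERGE (s)-[:{predicate}]->(o)"
    )


print(build_merge("Artifact", "FORGED_BY", "Character"))
```

Rejecting off-schema predicates at this point keeps a misbehaving extraction model from inventing new edge types.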

🛠️ Tech Stack

| Layer | Technology |
|---|---|
| Knowledge Graph | Neo4j AuraDB Free |
| Vector DB | ChromaDB (persistent) |
| Embeddings | all-MiniLM-L6-v2 (384 dims) |
| Extraction LLM | Groq (Llama 3.1 8B / Qwen3 32B) |
| Answer LLM | Google Gemini 2.5 Flash |
| Fuzzy Matching | RapidFuzz (threshold 85.0) |
| Backend | FastAPI + Uvicorn |
| Frontend | Next.js + Tailwind CSS |
| Backend Hosting | Railway |
| Frontend Hosting | Vercel |
| Scraping | httpx + BeautifulSoup4 |
| Chunking | LangChain RecursiveCharacterTextSplitter |
| Language | Python 3.11+ / TypeScript |

📁 Repository Structure

lore_analyser/                         ← Backend (Python)
│
├── app.py                             ← FastAPI REST API
├── main.py                            ← CLI entry point
├── config.py                          ← Centralised settings
├── requirements.txt
│
├── ingestion/
│   ├── scraper.py                     ← Wikipedia scraper
│   └── document_loader.py
│
├── chunking/
│   ├── text_cleaner.py
│   └── chunker.py
│
├── embeddings/
│   ├── embedder.py                    ← sentence-transformers wrapper
│   └── chroma_store.py                ← ChromaDB interface
│
├── extraction/
│   ├── prompt_templates.py            ← LLM extraction prompts
│   └── entity_extractor.py            ← Groq entity/relation extractor
│
├── graph/
│   ├── neo4j_client.py                ← Neo4j driver wrapper
│   ├── deduplicator.py                ← RapidFuzz name canonicalisation
│   ├── graph_builder.py               ← MERGE nodes/edges into Neo4j
│   └── graph_traversal.py             ← Cypher neighborhood queries
│
├── retrieval/
│   ├── vector_retriever.py            ← ChromaDB similarity search
│   ├── graph_retriever.py             ← Entity extraction + fuzzy match
│   └── context_assembler.py           ← Merges graph + vector context
│
├── pipeline/
│   ├── ingestion_pipeline.py
│   └── query_pipeline.py              ← Full query orchestration
│
├── evaluation/
│   ├── eval_runner.py
│   ├── metrics.py
│   └── questions.json                 ← 10 typed eval questions
│
└── data/
    ├── chunks/chunks.json             ← 1,115 text chunks
    ├── chroma_db/                     ← ChromaDB vector index
    └── evaluation/results.json        ← Evaluation results

🚀 Local Setup

Prerequisites

Backend

git clone https://github.com/Erenjaegaaa/lore_analyser
cd lore_analyser
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux
pip install -r requirements.txt

Create .env:

GEMINI_API_KEY=your_gemini_key
GROQ_API_KEY=your_groq_key
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE=neo4j
CHROMA_PERSIST_PATH=./data/chroma_db
CHROMA_COLLECTION_NAME=lore_chunks
LOG_LEVEL=INFO
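
A centralised settings module could read these variables roughly as follows. This is a sketch under assumptions: it is not the project's `config.py`, it expects the `.env` values to already be present in the process environment, and treating the last four variables as optional with the values above as defaults is a guess. `Settings`, `REQUIRED_KEYS`, and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed to be mandatory: credentials and the graph connection.
REQUIRED_KEYS = (
    "GEMINI_API_KEY",
    "GROQ_API_KEY",
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
)


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    groq_api_key: str
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    neo4j_database: str
    chroma_persist_path: str
    chroma_collection_name: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on missing keys."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        groq_api_key=env["GROQ_API_KEY"],
        neo4j_uri=env["NEO4J_URI"],
        neo4j_username=env["NEO4J_USERNAME"],
        neo4j_password=env["NEO4J_PASSWORD"],
        # Assumed defaults, copied from the example values above.
        neo4j_database=env.get("NEO4J_DATABASE", "neo4j"),
        chroma_persist_path=env.get("CHROMA_PERSIST_PATH", "./data/chroma_db"),
        chroma_collection_name=env.get("CHROMA_COLLECTION_NAME", "lore_chunks"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```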

Run:

uvicorn app:app --reload --port 8000

Frontend

git clone https://github.com/Erenjaegaaa/LORE_KEEPER_FRONTEND
cd LORE_KEEPER_FRONTEND
npm install

Create .env.local:

NEXT_PUBLIC_API_URL=http://localhost:8000

Run:

npm run dev

Open http://localhost:3000


💻 CLI Usage

# Single query
python main.py query "Who forged the One Ring?"

# Interactive session
python main.py interactive

🔄 Rebuilding the Knowledge Graph

# Day 1: scrape and chunk
python run_day1.py

# Day 2: embed into ChromaDB
python run_day2.py

# Day 3: extract entities and build Neo4j graph
python run_day3.py --delay 2 --start 0 --limit 250

# Resume from a checkpoint
python run_day3.py --delay 2 --start 250
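
The `--start`, `--limit`, and `--delay` flags suggest that extraction runs over a slice of the chunk list with a pause between requests, so a later run can pick up where the previous one stopped. The sketch below shows one way such a loop could behave. It is an assumption about the semantics, not the contents of `run_day3.py`; `process_batch` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, Sequence


def process_batch(
    chunks: Sequence[str],
    extract: Callable[[str], None],
    start: int = 0,
    limit: Optional[int] = None,
    delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `extract` on chunks[start:start+limit] and return the index to resume from."""
    end = len(chunks) if limit is None else min(start + limit, len(chunks))
    for index in range(start, end):
        extract(chunks[index])
        if delay > 0 and index < end - 1:
            sleep(delay)  # stay under the provider's rate limit
    return end


# Toy run: first four chunks, then resume from the returned checkpoint.
done: list[str] = []
chunks = [f"chunk {i}" for i in range(10)]
checkpoint = process_batch(chunks, done.append, start=0, limit=4)
process_batch(chunks, done.append, start=checkpoint)
print(len(done))
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.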

📋 Running Evaluation

python -m evaluation.eval_runner
python -m evaluation.metrics

🌐 Deployment

| Component | Platform |
|---|---|
| Frontend | Vercel |
| Backend | Railway |
| Graph DB | Neo4j AuraDB Free |
| Vector DB | ChromaDB (bundled with backend) |

🎨 Frontend

Built with Next.js, designed with a dark fantasy aesthetic. Features:


🧠 Design Decisions

Why hybrid retrieval? Pure vector search misses explicit relationships ("Who is Aragorn's father?"). Pure graph search misses semantic queries ("Tell me about the corruption of power in Tolkien"). Running both unconditionally covers both kinds of question without routing logic or the failure modes of a misrouted query.

Why Wikipedia over fan wikis? Tolkien Gateway and lotr.fandom.com use Cloudflare protection that blocks scrapers. Wikipedia has comprehensive, well-structured coverage with the same MediaWiki HTML structure.

Why Groq for extraction? Gemini free tier allows only 20 requests/day, far too slow for 1,115 chunks. Groq provides 500K tokens/day free across multiple models, making full-dataset extraction feasible at zero cost.

Why RapidFuzz deduplication? LLMs extract the same entity with slight name variations across chunks. Fuzzy matching at 85% threshold collapses variants like "Aragorn", "Aragorn son of Arathorn", and "Strider" to a single canonical node before writing to Neo4j.
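
The canonicalisation step can be illustrated with a greedy pass that maps each name to the first earlier name scoring at or above the threshold. To stay dependency-free, this sketch swaps in `difflib.SequenceMatcher` from the standard library in place of RapidFuzz, so its scores will not match RapidFuzz's scorers; `similarity` and `canonicalise` are hypothetical names, not the project's `deduplicator.py`. In this sketch, pure string similarity only links spelling variants, so an alias with no shared spelling, such as "Strider", stays separate and would need an explicit alias table.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (stdlib stand-in for a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonicalise(names: list[str], threshold: float = 85.0) -> dict[str, str]:
    """Map each name to the first previously seen name that is similar enough."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        match = next((c for c in canonical if similarity(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)  # first sighting becomes the canonical form
            match = name
        mapping[name] = match
    return mapping


print(canonicalise(["Aragorn", "Aragon", "Galadriel", "Galadriell", "Strider"]))
```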

Why no query classifier? Classifiers add latency and failure modes. Both retrieval paths are cheap enough to run unconditionally, and Gemini synthesizes whichever context is most relevant automatically.


⚠️ Known Limitations

  • Graph coverage: 651/1,115 chunks extracted into the graph (58%). Remaining chunks are accessible via vector search only.
  • Graph noise: LLM extraction introduces ~10–15% incorrect relations. The vector path and Gemini's grounding instruction compensate in most cases.
  • Latency: ~9–16 seconds per query due to Gemini API and cloud Neo4j round trips.
  • Real-world noise: Wikipedia's adaptation/reception sections introduce non-lore entities (directors, actors, scholars) into the graph.

🔮 Future Work

  • Complete graph extraction for all 1,115 chunks
  • Add relation frequency filtering to reduce graph noise
  • Implement Gemini streaming for lower perceived latency
  • Add conversation memory for multi-turn Q&A
  • Expand dataset to Unfinished Tales, The History of Middle-earth series
  • Fine-tune a small model on the extraction task for better predicate adherence
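
As an illustration of the relation frequency filtering item, one possible approach is to keep only triples that were extracted independently from at least a minimum number of distinct chunks. This is a sketch of an idea that is not implemented in the project; `filter_by_support`, the `min_chunks` parameter, and the chunk ids are hypothetical.

```python
from collections import defaultdict
from typing import Iterable

Triple = tuple[str, str, str]  # (subject, predicate, object)


def filter_by_support(
    extractions: Iterable[tuple[str, Triple]],
    min_chunks: int = 2,
) -> list[Triple]:
    """Keep triples seen in at least `min_chunks` distinct chunks, in first-seen order."""
    support: dict[Triple, set[str]] = defaultdict(set)
    for chunk_id, triple in extractions:
        support[triple].add(chunk_id)
    return [triple for triple, chunk_ids in support.items() if len(chunk_ids) >= min_chunks]


# Toy extractions as (chunk id, triple) pairs.
extractions = [
    ("c1", ("Frodo", "HEIR_OF", "Bilbo")),
    ("c2", ("Frodo", "HEIR_OF", "Bilbo")),
    ("c2", ("Frodo", "ENEMY_OF", "Bilbo")),  # a one-off, likely extraction noise
]
print(filter_by_support(extractions))
```

The trade-off is recall: a correct relation mentioned in only one chunk would also be dropped at `min_chunks=2`.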

👤 Author

Saahil · maker/CSE student
GitHub: @Erenjaegaaa


"Not all those who wander are lost." - J.R.R. Tolkien
