Transforms a directory of Markdown documents into a knowledge graph with hybrid (graph + text) search, optimized for agentic RAG applications.
Build a knowledge graph from Markdown documents with:
- Graph-based relationships (wiki-links, tags, keywords)
- TF-IDF keyword analysis for relevance scoring
- Vector-ready storage (LanceDB) for semantic search
- REST API for agentic RAG integration
| Component | Technology | Role |
|---|---|---|
| Language | Rust | High-performance indexing and processing |
| Database | LanceDB | Vector-ready, embedded DB for persistence and search |
| Markdown | pulldown-cmark | Parse Markdown to plain text |
| NLP | stop-words, rust-stemmers | Text preprocessing + TF-IDF keyword analysis |
| Web | axum + egui (WASM) | REST API + Web UI |
- `trunk` - WASM build tool
- `protoc` - Protocol buffer compiler
- Rust toolchain
```sh
# Build
cargo build --release

# Index a markdown directory
cargo run --release -- --dir ./docs --rebuild

# Run web server (port 3000)
cargo run --release -- --serve --port 3000
```

The system uses a hierarchical ID scheme for better indexing and graph traversal:
- Documents: Sequential `U32` (0, 1, 2, ...)
- Concepts: `U64 = (doc_id << 32) | concept_index` - encodes the parent document
- Tags/Keywords: Hash-based with kind prefix (deduplicated globally)
```rust
enum NodeId {
    U32(u32),          // Documents (up to ~4B)
    U64(u64),          // Concepts from U32 docs
    U128(u128),        // Deep concepts
    Chunked(Vec<u32>), // Very deep hierarchies
}
```

**documents** - Source content with embeddings
- `id` - Sequential document index (U32)
- `path`, `title` - File metadata
- `content` - Raw Markdown
- `text` - Plain text
- `wiki_links`, `tags` - Extracted relationships
- `keywords` - Stemmed tokens
- `embedding` - Vector (for future semantic search)
**nodes** - Graph entities
- `id` - Hierarchical NodeId (see above)
- `name` - Entity name
- `kind` - Document, Concept, Tag, Keyword
- `doc_id` - Source document (for concepts)
- `vector` - Entity embedding
**edges** - Relationships
- `source`, `target` - NodeId references
- `kind` - LinksTo, HasTag, HasKeyword
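The `(doc_id << 32) | concept_index` encoding described above can be sketched directly. The helper names below (`concept_id`, `parent_doc`) are illustrative, not the crate's actual API:

```rust
// Sketch of the hierarchical ID packing: the high 32 bits of a concept's
// U64 id hold the parent document's U32 id, the low 32 bits the concept's
// index within that document. Function names here are hypothetical.

/// Pack a document index and a per-document concept index into one U64 id.
fn concept_id(doc_id: u32, concept_index: u32) -> u64 {
    ((doc_id as u64) << 32) | concept_index as u64
}

/// Recover the parent document from a concept id.
fn parent_doc(id: u64) -> u32 {
    (id >> 32) as u32
}

fn main() {
    let id = concept_id(7, 3);
    assert_eq!(parent_doc(id), 7);
    assert_eq!(id & 0xFFFF_FFFF, 3); // low 32 bits hold the concept index
    println!("{id:#x}"); // prints 0x700000003
}
```

A nice property of this scheme is that graph traversal from a concept back to its source document needs no lookup, just a shift.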
- Scan - Recursively find `.md` files
- Parse - Extract wiki-links `[[Target]]`, `#tags`
- Extract Text - Convert Markdown to plain text
- Keyword Extraction - Tokenize → stop-word filter → stem → TF-IDF scoring
- Build Graph - Create nodes/edges from links, tags, keywords
- Persist - Save to LanceDB
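The Parse step can be sketched with plain std string scanning. The real indexer uses pulldown-cmark and is more thorough; treat this as an illustration only, with hypothetical function names:

```rust
// Simplified, std-only sketch of the "Parse" step: pull `[[Target]]`
// wiki-links and `#tag` tokens out of raw Markdown. A real parser should
// also skip code blocks and headings; this only shows the idea.

fn extract_wiki_links(md: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = md;
    while let Some(start) = rest.find("[[") {
        if let Some(end) = rest[start + 2..].find("]]") {
            links.push(rest[start + 2..start + 2 + end].to_string());
            rest = &rest[start + 2 + end + 2..];
        } else {
            break; // unclosed link: stop scanning
        }
    }
    links
}

fn extract_tags(md: &str) -> Vec<String> {
    md.split_whitespace()
        .filter_map(|w| w.strip_prefix('#'))
        // drop trailing punctuation like "#search." -> "search"
        .map(|t| t.trim_end_matches(|c: char| !c.is_alphanumeric()))
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

fn main() {
    let doc = "Links to [[Graph Theory]] and [[TF-IDF]]. Tagged #rust #search.";
    assert_eq!(extract_wiki_links(doc), ["Graph Theory", "TF-IDF"]);
    assert_eq!(extract_tags(doc), ["rust", "search"]);
}
```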
Keywords are scored using TF-IDF:
- TF: Term frequency within document
- IDF: `ln(N/df) + 1` (inverse document frequency)
- Keywords with score > 0.5 are included in the graph
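As a rough illustration of the scoring rule, `score = TF * (ln(N/df) + 1)`, assuming TF is the term's relative frequency within the document (the text does not pin this down, and the crate may weight terms differently):

```rust
// Sketch of the TF-IDF rule above. Assumption: TF = raw count / doc length;
// the actual implementation in crates/core may define TF differently.

fn tfidf(term_count: usize, doc_len: usize, n_docs: usize, df: usize) -> f64 {
    let tf = term_count as f64 / doc_len as f64;
    let idf = (n_docs as f64 / df as f64).ln() + 1.0;
    tf * idf
}

fn main() {
    // A term appearing 8 times in a 10-token doc, found in 1 of 100 docs:
    let score = tfidf(8, 10, 100, 1);
    assert!(score > 0.5); // 0.8 * (ln(100) + 1) ≈ 4.48 -> kept in the graph
    // A ubiquitous term barely present in this doc falls below the cutoff:
    assert!(tfidf(1, 500, 100, 100) < 0.5); // 0.002 * 1.0
    println!("score = {score:.2}");
}
```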
| Endpoint | Method | Description |
|---|---|---|
| `/api/graph` | GET | Full graph (nodes + edges) |
| `/api/doc/:id` | GET | Document content |
| `/api/search?q=<query>&limit=<n>` | GET | Full-text search |
| `/api/connected/:node_id` | GET | Direct neighbors |
| `/api/traverse` | POST | Multi-hop traversal |
```sh
# Search documents
curl "http://localhost:3000/api/search?q=rust&limit=5"

# Get connected nodes
curl "http://localhost:3000/api/connected/abc123"

# Multi-hop traversal
curl -X POST http://localhost:3000/api/traverse \
  -H "Content-Type: application/json" \
  -d '{"node_id": "abc123", "hops": 2}'
```

The graph structure supports hybrid search:
- Text Search - `/api/search` for keyword matching
- Graph Traversal - `/api/traverse` for related documents
- Connected Context - `/api/connected/:node_id` for immediate neighbors
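Conceptually, a multi-hop traversal like `/api/traverse` is a breadth-first search bounded by a hop limit. A minimal in-memory sketch (the server reads edges from LanceDB instead, and this is not its actual code):

```rust
// In-memory sketch of what a bounded multi-hop traversal computes:
// BFS over an undirected edge list, stopping after `hops` levels.

use std::collections::{HashMap, HashSet, VecDeque};

/// Return every node reachable from `start` within `hops` edges.
fn traverse(edges: &[(u64, u64)], start: u64, hops: usize) -> HashSet<u64> {
    // Build an adjacency list, treating edges as undirected for traversal.
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(s, t) in edges {
        adj.entry(s).or_default().push(t);
        adj.entry(t).or_default().push(s);
    }
    let mut seen = HashSet::from([start]);
    let mut queue = VecDeque::from([(start, 0usize)]);
    while let Some((node, depth)) = queue.pop_front() {
        if depth == hops {
            continue; // hop budget exhausted along this path
        }
        for &next in adj.get(&node).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}

fn main() {
    // doc 1 -> doc 2 -> doc 3; doc 1 -> tag 10
    let edges = [(1, 2), (2, 3), (1, 10)];
    assert_eq!(traverse(&edges, 1, 1), HashSet::from([1, 2, 10]));
    assert!(traverse(&edges, 1, 2).contains(&3)); // reached on the second hop
}
```

For RAG, the useful property is that a 2-hop traversal from a document surfaces documents sharing a tag or keyword, not just directly linked ones.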
Future enhancements:
- Add embeddings to the `embedding` field for vector similarity
- Use LanceDB's native vector search
- Implement re-ranking with graph structure
```text
graphodoc/
├── Cargo.toml              # Workspace config
├── crates/
│   ├── core/               # Data structures (Node, Edge, Doc)
│   │   └── src/
│   │       ├── lib.rs      # Core types
│   │       └── graph.rs    # Graph building + TF-IDF
│   ├── cli/                # Indexer + Server
│   │   └── src/
│   │       ├── main.rs     # CLI entry
│   │       ├── indexer.rs  # Document processing
│   │       └── server.rs   # REST API
│   └── web/                # WASM frontend (egui)
│       └── src/
│           └── lib.rs      # Graph visualization
└── mdkb_data/              # LanceDB storage (created on first run)
```
```sh
# Index documents with rebuild
cargo run -- --dir ./math --rebuild

# Start server on custom port
cargo run -- --serve --port 8080

# Index + serve in one command
cargo run -- --dir ./docs --serve
```

See the `doc/` directory for detailed documentation:
- `architecture.md` - High-level system design and components
- `knowledge_graph.md` - Graph storage, hierarchical addressing, vector embeddings
- `graph_search.md` - TF-IDF algorithm, keyword extraction, search APIs