Problem
Current foreign-key schema causes inefficiencies:
- Nuclear cleanup: Reprocessing deletes all chunks and embeddings, so work that was already done has to be recomputed
- No deduplication: Identical content across versions stored multiple times
- Feature blindness: Features encoded in strings, not queryable entities
- Storage waste: Patch/minor version updates recreate unchanged documentation
Solution: Graph-Based Architecture
Leverage SurrealDB's graph capabilities for content-addressable storage with automatic deduplication.
Core Design
Entities:
crate:name@version - Base crate (name + version only)
feature:derive - Reusable feature entity
build_<mti_id> - Specific crate+feature configuration
chunk_<blake3_hash> - Content-addressed chunk
Relationships:
build -> of -> crate (which crate)
build -> enables -> feature (which features)
build -> contains -> doc_chunk (with context metadata on edge)
doc_chunk -> embedded_by -> embedding
Benefits
- Automatic deduplication: Same content = same hash = same chunk
- Incremental processing: Only new content creates new chunks
- Feature analytics: Query which crates use specific features
- Storage efficiency: Estimated 50-90% savings for minor/patch updates
- Smart cleanup: Delete relationships, not chunks (orphan GC later)
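A minimal in-memory sketch of how the deduplication and cleanup benefits above could play out. `GraphStore` and its methods are hypothetical names, not part of the plan, and `DefaultHasher` from the standard library is only a dependency-free stand-in for blake3 (it is not a cryptographic hash).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Stand-in for blake3: DefaultHasher is NOT cryptographic and is used
/// here only so the sketch compiles without external crates.
fn toy_hash(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical in-memory model of chunks plus `contains` edges.
#[derive(Default)]
pub struct GraphStore {
    chunks: HashMap<String, String>,     // content hash -> content
    contains: HashSet<(String, String)>, // (build id, content hash)
}

impl GraphStore {
    /// Links `content` to `build`; returns true only if a new chunk was stored.
    pub fn persist_chunk(&mut self, build: &str, content: &str) -> bool {
        let hash = toy_hash(content);
        let is_new = !self.chunks.contains_key(&hash);
        if is_new {
            self.chunks.insert(hash.clone(), content.to_string());
        }
        self.contains.insert((build.to_string(), hash));
        is_new
    }

    /// Reprocessing cleanup: drops the build's edges and leaves chunks alone.
    pub fn clear_build(&mut self, build: &str) {
        self.contains.retain(|(b, _)| b.as_str() != build);
    }

    /// Deferred GC: removes chunks that no edge points to; returns the count.
    pub fn collect_orphans(&mut self) -> usize {
        let live: HashSet<&String> = self.contains.iter().map(|(_, h)| h).collect();
        let before = self.chunks.len();
        self.chunks.retain(|hash, _| live.contains(hash));
        before - self.chunks.len()
    }

    /// Number of distinct chunks currently stored.
    pub fn chunk_count(&self) -> usize {
        self.chunks.len()
    }
}
```

In this model, a `false` return from `persist_chunk` marks the case where downstream work such as embedding could be skipped, because the chunk already exists.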
Implementation Plan
1. Schema Changes
New Tables:
-- Core entities
DEFINE TABLE crate SCHEMAFULL; -- name, version
DEFINE TABLE feature SCHEMAFULL; -- name (unique)
DEFINE TABLE build SCHEMAFULL; -- build_id (mti), status, timestamps
-- Content-addressed chunks
DEFINE TABLE doc_chunk SCHEMAFULL;
DEFINE FIELD content_hash ON TABLE doc_chunk TYPE string; -- blake3 hash
DEFINE FIELD content ON TABLE doc_chunk TYPE string;
-- ... content-intrinsic fields
-- Relationships (edges with metadata)
DEFINE TABLE of SCHEMAFULL; -- build->crate
DEFINE TABLE enables SCHEMAFULL; -- build->feature
DEFINE TABLE contains SCHEMAFULL; -- build->doc_chunk (context metadata here)
DEFINE TABLE embedded_by SCHEMAFULL; -- doc_chunk->embedding
Context Separation:
- Content properties (token_count, content_type) → on doc_chunk
- Context properties (source_file, line numbers) → on contains edge
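The split above could be mirrored in Rust types roughly as follows. This is a sketch under assumptions: the struct names, the `start_line`/`end_line` fields, and the `chunks_for_build` helper are illustrative and do not come from the existing codebase.

```rust
/// Content-intrinsic properties: identical wherever the chunk appears.
#[derive(Debug, Clone, PartialEq)]
pub struct DocChunk {
    pub content_hash: String,
    pub content: String,
    pub token_count: usize,
    pub content_type: String,
}

/// Context properties: specific to one build's use of a chunk.
/// (`start_line` and `end_line` are assumed names for "line numbers".)
#[derive(Debug, Clone, PartialEq)]
pub struct ContainsEdge {
    pub build_id: String,
    pub content_hash: String,
    pub source_file: String,
    pub start_line: u32,
    pub end_line: u32,
}

/// Returns each chunk of `build_id` paired with its edge metadata.
pub fn chunks_for_build<'a>(
    build_id: &str,
    chunks: &'a [DocChunk],
    edges: &'a [ContainsEdge],
) -> Vec<(&'a DocChunk, &'a ContainsEdge)> {
    edges
        .iter()
        .filter(|e| e.build_id == build_id)
        .filter_map(|e| {
            chunks
                .iter()
                .find(|c| c.content_hash == e.content_hash)
                .map(|c| (c, e))
        })
        .collect()
}
```

Because the source location lives on the edge, the same chunk can be attached to two builds with different file paths or line ranges without being duplicated.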
2. Type System (src/types/ids.rs)
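use mti::prelude::*;
use serde::{Deserialize, Serialize};

/// Identifier for one crate+feature build, backed by an mti MagicTypeId.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BuildId(MagicTypeId);

impl BuildId {
    /// Generates a fresh UUIDv7-based ID with the `build` prefix.
    pub fn new() -> Self {
        Self("build".create_type_id::<V7>())
    }
}
// Example: build_01h455vb4pex5vsknk084sn02q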
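For places that receive a build ID as a plain string (logs, fixtures, test assertions), a shape check along these lines may be handy. It is a sketch only: `looks_like_build_id` is a hypothetical helper that uses the standard library rather than the mti API, and it validates the textual format (prefix plus 26-character lowercase base32 suffix), not that the suffix encodes a real UUIDv7.

```rust
/// Lowercase Crockford-style base32 alphabet (no i, l, o, u), as used by
/// TypeID suffixes.
const SUFFIX_ALPHABET: &str = "0123456789abcdefghjkmnpqrstvwxyz";

/// Hypothetical helper: checks the `build_<26-char suffix>` shape only.
pub fn looks_like_build_id(id: &str) -> bool {
    let Some(suffix) = id.strip_prefix("build_") else {
        return false;
    };
    suffix.len() == 26
        // 128 bits fit into 26 base32 characters only if the first is 0-7.
        && suffix.starts_with(|c: char| ('0'..='7').contains(&c))
        && suffix.chars().all(|c| SUFFIX_ALPHABET.contains(c))
}
```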
3. Content Hashing
/// Returns the lowercase hex BLAKE3 digest of `content` (64 characters).
pub fn hash_content(content: &str) -> String {
    blake3::hash(content.as_bytes()).to_hex().to_string()
}
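The digest can then be turned into the `chunk_<blake3_hash>` key listed under Entities. The helper below is a hypothetical sketch, not existing code. It relies on BLAKE3's default 32-byte output, which is 64 lowercase hex characters, and rejects anything else before building the key.

```rust
/// Hypothetical helper: builds a `chunk_<hash>` key from a hex digest.
/// Returns None unless the input is exactly 64 lowercase hex characters.
pub fn chunk_key(hash_hex: &str) -> Option<String> {
    let is_lower_hex = |b: u8| b.is_ascii_digit() || (b'a'..=b'f').contains(&b);
    if hash_hex.len() == 64 && hash_hex.bytes().all(is_lower_hex) {
        Some(format!("chunk_{hash_hex}"))
    } else {
        None
    }
}
```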
4. Handler Changes (src/actors/database.rs)
PersistCrate (lines 294-365):
// 1. Create/get crate entity
// 2. Create/get feature entities
// 3. Create build with mti ID
// 4. RELATE build->of->crate
// 5. RELATE build->enables->feature (for each)
// 6. Cleanup: DELETE contains WHERE in = $build_id (not chunks!)
PersistDocChunk (lines 1094-1279):
// 1. Hash content with blake3
// 2. Upsert chunk (UPDATE RETURN BEFORE, then INSERT if empty)
// 3. RELATE build->contains->chunk CONTENT { context_metadata }
QueryDocChunks (lines 497-551):
// Graph traversal: SELECT * FROM contains WHERE in = $build_id FETCH out
// (each row carries the edge's context metadata, with `out` expanded to the doc_chunk)
5. Message Changes (src/messages/)
- Update PersistCrate to create build entities
- Update PersistDocChunk to use BuildId and content hashing
- New QueryBuild for build-specific queries
- Update DocChunksQueryResponse to include edge metadata
6. Dependencies
Add to Cargo.toml:
blake3 = "1.5" # Fast content hashing
Analytics Queries Enabled
-- Popular features
SELECT name, count(<-enables<-build) AS uses FROM feature; -- names are unique, so no GROUP BY is needed
-- Chunk reuse statistics
SELECT count(<-contains<-build) as reuse, content_hash FROM doc_chunk;
-- Builds with feature combo
SELECT * FROM build WHERE ->enables->feature.name CONTAINSALL ["derive"];
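To make the intent of the feature queries concrete, here is a small in-memory equivalent over `(build_id, feature_name)` pairs standing in for `enables` edges. The function names and the tuple representation are assumptions for illustration; the real queries would run in SurrealDB.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Counts how many builds enable each feature ("popular features").
pub fn feature_usage<'a>(enables: &[(&str, &'a str)]) -> BTreeMap<&'a str, usize> {
    let mut uses = BTreeMap::new();
    for (_build, feature) in enables {
        *uses.entry(*feature).or_insert(0) += 1;
    }
    uses
}

/// Returns the builds that enable every feature in `wanted`
/// (the in-memory analogue of the CONTAINSALL filter).
pub fn builds_with_all<'a>(enables: &[(&'a str, &str)], wanted: &[&str]) -> BTreeSet<&'a str> {
    let mut by_build: BTreeMap<&'a str, BTreeSet<&str>> = BTreeMap::new();
    for (build, feature) in enables {
        by_build.entry(*build).or_default().insert(*feature);
    }
    by_build
        .into_iter()
        .filter(|(_, features)| wanted.iter().all(|w| features.contains(*w)))
        .map(|(build, _)| build)
        .collect()
}
```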
Migration Strategy
No migration is needed: this is a greenfield application built directly for production, so there is no existing schema or data to carry over.
Success Criteria
- ✅ Chunks content-addressed with blake3
- ✅ Builds use mti IDs (format: build_01h455...)
- ✅ Automatic deduplication working
- ✅ Features as queryable entities
- ✅ Graph queries functional
- ✅ Reprocessing only updates edges
- ✅ No placeholder or simplified code - full production-ready implementation only
- ✅ Any code orphaned from the refactor is removed
- ✅ All tests pass