Refactor Database to Graph-Based Content Deduplication #82

@rrrodzilla

Problem

The current foreign-key schema causes several inefficiencies:

  • Nuclear cleanup: Reprocessing a crate deletes all of its chunks and embeddings, so unchanged content is chunked and embedded again
  • No deduplication: Identical content is stored once per version instead of once overall
  • Feature blindness: Features are encoded in strings rather than modeled as queryable entities
  • Storage waste: Patch and minor version updates recreate documentation that has not changed

Solution: Graph-Based Architecture

Leverage SurrealDB's graph capabilities for content-addressable storage with automatic deduplication.

Core Design

Entities:

crate:name@version     - Base crate (name + version only)
feature:derive         - Reusable feature entity
build_<mti_id>         - Specific crate+feature configuration
chunk_<blake3_hash>    - Content-addressed chunk

Relationships:

build -> of -> crate           (which crate)
build -> enables -> feature    (which features)
build -> contains -> doc_chunk (with context metadata on edge)
doc_chunk -> embedded_by -> embedding

Benefits

  1. Automatic deduplication: Same content = same hash = same chunk
  2. Incremental processing: Only new content creates new chunks
  3. Feature analytics: Query which crates use specific features
  4. Storage efficiency: Estimated 50-90% savings for minor/patch updates
  5. Smart cleanup: Delete relationships, not chunks; orphaned chunks are garbage-collected later
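
The deduplication and incremental-processing benefits can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the proposed SurrealDB implementation: `ChunkStore`, `persist`, and `content_key` are hypothetical names, and the standard library's `DefaultHasher` stands in for blake3 so the example has no dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the blake3 hex digest (assumption: the real key is a blake3 hash).
pub fn content_key(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical in-memory model: chunks keyed by content, plus build -> chunk edges.
#[derive(Default)]
pub struct ChunkStore {
    chunks: HashMap<String, String>, // content key -> content
    contains: Vec<(String, String)>, // (build id, content key) edges
}

impl ChunkStore {
    /// Links `content` to `build_id`. Returns true only when a new chunk was
    /// stored, i.e. when downstream work such as embedding is actually needed.
    pub fn persist(&mut self, build_id: &str, content: &str) -> bool {
        let key = content_key(content);
        let is_new = !self.chunks.contains_key(&key);
        if is_new {
            self.chunks.insert(key.clone(), content.to_string());
        }
        self.contains.push((build_id.to_string(), key));
        is_new
    }

    pub fn chunk_count(&self) -> usize {
        self.chunks.len()
    }

    pub fn edge_count(&self) -> usize {
        self.contains.len()
    }
}
```

Persisting the same text from two builds adds two edges but only one chunk, and the second call returns `false`, which is the signal that no new embedding work is required.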

Implementation Plan

1. Schema Changes

New Tables:

-- Core entities
DEFINE TABLE crate SCHEMAFULL;     -- name, version
DEFINE TABLE feature SCHEMAFULL;   -- name (unique)
DEFINE TABLE build SCHEMAFULL;     -- build_id (mti), status, timestamps

-- Content-addressed chunks
DEFINE TABLE doc_chunk SCHEMAFULL;
DEFINE FIELD content_hash ON TABLE doc_chunk TYPE string;  -- blake3 hash
DEFINE FIELD content ON TABLE doc_chunk TYPE string;
-- ... content-intrinsic fields

-- Relationships (edges with metadata)
DEFINE TABLE of SCHEMAFULL;          -- build->crate
DEFINE TABLE enables SCHEMAFULL;     -- build->feature
DEFINE TABLE contains SCHEMAFULL;    -- build->doc_chunk (context metadata here)
DEFINE TABLE embedded_by SCHEMAFULL; -- doc_chunk->embedding

Context Separation:

  • Content properties (token_count, content_type) → on doc_chunk
  • Context properties (source_file, line numbers) → on contains edge
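
On the Rust side, the same split might look like the following pair of types. This is a sketch, not the final data model: the field names `token_count`, `content_type`, and `source_file` come from the lists above, while `start_line`, `end_line`, the concrete field types, and the `points_to` helper are assumptions made for illustration.

```rust
/// Content-intrinsic data: identical for every build that contains the chunk.
#[derive(Debug, Clone, PartialEq)]
pub struct DocChunk {
    pub content_hash: String,
    pub content: String,
    pub token_count: usize,
    pub content_type: String,
}

/// Build-specific context, carried on the `contains` edge rather than the chunk.
#[derive(Debug, Clone, PartialEq)]
pub struct ContainsEdge {
    pub build_id: String,
    pub content_hash: String,
    pub source_file: String,
    pub start_line: u32, // assumed representation of "line numbers"
    pub end_line: u32,
}

impl ContainsEdge {
    /// True when this edge refers to the given chunk.
    pub fn points_to(&self, chunk: &DocChunk) -> bool {
        self.content_hash == chunk.content_hash
    }
}
```

Two builds can then share one `DocChunk` while each keeps its own `ContainsEdge`, for example when the same doc comment moves a few lines between versions.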

2. Type System (src/types/ids.rs)

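use mti::prelude::*;
use serde::{Deserialize, Serialize};

/// Identifier for one crate+feature build, backed by a prefixed type ID.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BuildId(MagicTypeId);

impl BuildId {
    /// Generates a new UUIDv7-based ID with the `build` prefix.
    pub fn new() -> Self {
        Self("build".create_type_id::<V7>())
    }
}

impl Default for BuildId {
    fn default() -> Self {
        Self::new()
    }
}

// Example: build_01h455vb4pex5vsknk084sn02q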
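
For tests that only need to assert the shape of a generated ID, a dependency-free check along these lines may be enough. It is a sketch based on the TypeID text format (a prefix, an underscore, and a 26-character lowercase Crockford base32 suffix); `looks_like_build_id` is a hypothetical helper and not part of the `mti` API, which provides its own parsing.

```rust
/// Crockford base32 alphabet used by TypeID suffixes (no i, l, o, u).
const ALPHABET: &str = "0123456789abcdefghjkmnpqrstvwxyz";

/// Returns true if `id` looks like `build_<26-char suffix>`.
pub fn looks_like_build_id(id: &str) -> bool {
    let Some(suffix) = id.strip_prefix("build_") else {
        return false;
    };
    suffix.len() == 26
        && suffix.chars().all(|c| ALPHABET.contains(c))
        // 26 base32 chars hold 130 bits; a 128-bit UUID needs the first char <= '7'.
        && suffix.starts_with(|c: char| ('0'..='7').contains(&c))
}
```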

3. Content Hashing

/// Returns the blake3 digest of `content` as a 64-character lowercase hex string.
pub fn hash_content(content: &str) -> String {
    blake3::hash(content.as_bytes()).to_hex().to_string()
}
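
The digest can then be turned into the `chunk_<blake3_hash>` key from the Core Design section. The helper below is a hypothetical sketch (`chunk_key` is not an existing function); it assumes the default 32-byte blake3 output, which is 64 lowercase hex characters, and rejects anything else.

```rust
/// Builds the `chunk_<blake3_hash>` key from a hex digest.
/// Returns None unless the digest is 64 lowercase hex characters (a 32-byte hash).
pub fn chunk_key(hex_digest: &str) -> Option<String> {
    let valid = hex_digest.len() == 64
        && hex_digest
            .bytes()
            .all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    valid.then(|| format!("chunk_{hex_digest}"))
}
```

Validating the digest before it becomes a key guards against accidentally keying a chunk by raw content or by a truncated hash.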

4. Handler Changes (src/actors/database.rs)

PersistCrate (lines 294-365):

// 1. Create/get crate entity
// 2. Create/get feature entities
// 3. Create build with mti ID
// 4. RELATE build->of->crate
// 5. RELATE build->enables->feature (for each)
// 6. Cleanup: DELETE contains WHERE in = $build_id (not chunks!)
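
Step 6 and the deferred orphan GC can be modeled in memory as follows. This is a sketch of the intended semantics only, not database code: `Graph`, `clear_build_edges`, and `collect_orphans` are hypothetical names, and the GC policy shown (drop any chunk with no remaining `contains` edge) is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory stand-in for the doc_chunk table and `contains` edges.
#[derive(Default)]
pub struct Graph {
    pub chunks: HashMap<String, String>, // content hash -> content
    pub contains: Vec<(String, String)>, // (build id, content hash)
}

impl Graph {
    /// Step 6: drop only this build's `contains` edges; chunks stay in place.
    pub fn clear_build_edges(&mut self, build_id: &str) {
        self.contains.retain(|(b, _)| b != build_id);
    }

    /// Deferred GC: remove chunks with no remaining edge; returns how many were removed.
    pub fn collect_orphans(&mut self) -> usize {
        let live: HashSet<&str> = self.contains.iter().map(|(_, h)| h.as_str()).collect();
        let before = self.chunks.len();
        self.chunks.retain(|hash, _| live.contains(hash.as_str()));
        before - self.chunks.len()
    }
}
```

Because cleanup touches only edges, a chunk shared with another build survives reprocessing, and a chunk that the new build re-links before GC runs is never deleted at all.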

PersistDocChunk (lines 1094-1279):

// 1. Hash content with blake3
// 2. Upsert chunk (UPDATE RETURN BEFORE, then INSERT if empty)
// 3. RELATE build->contains->chunk CONTENT { context_metadata }

QueryDocChunks (lines 497-551):

// Graph traversal: SELECT *, ->contains.* FROM build->contains->doc_chunk

5. Message Changes (src/messages/)

  • Update PersistCrate to create build entities
  • Update PersistDocChunk to use BuildId and content hashing
  • New QueryBuild for build-specific queries
  • Update DocChunksQueryResponse to include edge metadata

6. Dependencies

Add to Cargo.toml:

blake3 = "1.5"  # Fast content hashing

Analytics Queries Enabled

-- Popular features
SELECT count(<-enables<-build) as uses, name FROM feature GROUP BY name;

-- Chunk reuse statistics
SELECT count(<-contains<-build) as reuse, content_hash FROM doc_chunk;

-- Builds with feature combo
SELECT * FROM build WHERE ->enables->feature.name CONTAINSALL ["derive"];
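
For unit tests that run without a database, the "popular features" aggregation can be mirrored over a plain list of `enables` edges. This is an illustrative sketch; `feature_usage` and the `(build, feature)` tuple representation are assumptions, not part of the planned API.

```rust
use std::collections::BTreeMap;

/// Counts how many builds enable each feature, given (build id, feature name) edges.
pub fn feature_usage(enables: &[(&str, &str)]) -> BTreeMap<String, usize> {
    let mut uses = BTreeMap::new();
    for (_build, feature) in enables {
        *uses.entry(feature.to_string()).or_insert(0) += 1;
    }
    uses
}
```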

Migration Strategy

No migration is needed: this is a greenfield app that is still being written for production.

Success Criteria

  • ✅ Chunks content-addressed with blake3
  • ✅ Builds use mti IDs (format: build_01h455...)
  • ✅ Automatic deduplication working
  • ✅ Features as queryable entities
  • ✅ Graph queries functional
  • ✅ Reprocessing only updates edges
  • ✅ No placeholder or simplified code - full production-ready implementation only
  • ✅ Any code orphaned from the refactor is removed
  • ✅ All tests pass
