Refactor Database to Graph-Based Content Deduplication

## Problem

The current foreign-key schema causes several inefficiencies:
- **Nuclear cleanup**: Reprocessing deletes all chunks and embeddings, wasting computation on content that has not changed
- **No deduplication**: Identical content across versions is stored multiple times
- **Feature blindness**: Features are encoded in strings rather than modeled as queryable entities
- **Storage waste**: Patch and minor version updates recreate unchanged documentation

## Solution: Graph-Based Architecture

Leverage SurrealDB's graph capabilities for content-addressable storage with automatic deduplication.

### Core Design

**Entities**:
```
crate:name@version     - Base crate (name + version only)
feature:derive         - Reusable feature entity
build_<mti_id>         - Specific crate+feature configuration
chunk_<blake3_hash>    - Content-addressed chunk
```

**Relationships**:
```
build -> of -> crate           (which crate)
build -> enables -> feature    (which features)
build -> contains -> doc_chunk (with context metadata on edge)
doc_chunk -> embedded_by -> embedding
```

### Benefits

1. **Automatic deduplication**: Same content = same hash = same chunk
2. **Incremental processing**: Only new content creates new chunks
3. **Feature analytics**: Query which crates use specific features
4. **Storage efficiency**: Estimated 50-90% savings for minor/patch updates
5. **Smart cleanup**: Delete relationships, not chunks (orphan GC later)

## Implementation Plan

### 1. Schema Changes

**New Tables**:
```sql
-- Core entities
DEFINE TABLE crate SCHEMAFULL;     -- name, version
DEFINE TABLE feature SCHEMAFULL;   -- name (unique)
DEFINE TABLE build SCHEMAFULL;     -- build_id (mti), status, timestamps

-- Content-addressed chunks
DEFINE TABLE doc_chunk SCHEMAFULL;
DEFINE FIELD content_hash ON TABLE doc_chunk TYPE string;  -- blake3 hash
DEFINE FIELD content ON TABLE doc_chunk TYPE string;
-- ... content-intrinsic fields

-- Relationships (edges with metadata)
DEFINE TABLE of SCHEMAFULL;       -- build->crate
DEFINE TABLE enables SCHEMAFULL;  -- build->feature
DEFINE TABLE contains SCHEMAFULL; -- build->chunk (context metadata here)
DEFINE TABLE embedded_by SCHEMAFULL; -- chunk->embedding
```

**Context Separation**:
- Content properties (token_count, content_type) → on `doc_chunk`
- Context properties (source_file, line numbers) → on `contains` edge
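
The split above could be modeled in Rust roughly as follows. This is only a sketch: the struct names, the `same_chunk` helper, and the `start_line`/`end_line` fields are assumptions made for illustration, while `content_hash`, `content`, `token_count`, `content_type`, and `source_file` come from this plan.

```rust
/// Content-intrinsic data: identical for every build that contains the chunk.
#[derive(Debug, Clone, PartialEq)]
pub struct DocChunk {
    pub content_hash: String, // blake3 hex digest of `content`
    pub content: String,
    pub token_count: usize,
    pub content_type: String,
}

/// Context data: where one particular build found the chunk.
/// Lives on the `contains` edge, so it can differ from build to build.
#[derive(Debug, Clone, PartialEq)]
pub struct ContainsEdge {
    pub build_id: String,     // edge source (`in`)
    pub content_hash: String, // edge target (`out`), identifies the chunk
    pub source_file: String,
    pub start_line: u32, // hypothetical field name for "line numbers"
    pub end_line: u32,   // hypothetical field name for "line numbers"
}

/// Two edges point at the same stored chunk when their hashes match,
/// even if the surrounding context (file, lines, build) differs.
pub fn same_chunk(a: &ContainsEdge, b: &ContainsEdge) -> bool {
    a.content_hash == b.content_hash
}
```

Keeping file paths and line numbers off the chunk is what lets two builds share one `doc_chunk` row even when the same text moved between files.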

### 2. Type System (src/types/ids.rs)

```rust
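use mti::prelude::*;
use serde::{Deserialize, Serialize};

/// Identifier for one crate+feature build, backed by a `build`-prefixed mti ID.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BuildId(MagicTypeId);

impl BuildId {
    /// Generates a fresh build ID from a UUIDv7-based type ID.
    pub fn new() -> Self {
        Self("build".create_type_id::<V7>())
    }
}

impl Default for BuildId {
    fn default() -> Self {
        Self::new()
    }
}

// Example: build_01h455vb4pex5vsknk084sn02q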
```

### 3. Content Hashing

```rust
/// Returns the blake3 digest of `content` as a lowercase hex string.
pub fn hash_content(content: &str) -> String {
    blake3::hash(content.as_bytes()).to_hex().to_string()
}
```

### 4. Handler Changes (src/actors/database.rs)

**PersistCrate** (lines 294-365):
```rust
// 1. Create/get crate entity
// 2. Create/get feature entities
// 3. Create build with mti ID
// 4. RELATE build->of->crate
// 5. RELATE build->enables->feature (for each)
// 6. Cleanup: DELETE contains WHERE in = $build_id (not chunks!)
```
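
Step 6 and the deferred orphan GC can be illustrated with an in-memory stand-in for the `doc_chunk` table and the `contains` edges. This is a minimal sketch, not the handler itself: `Graph`, `clear_build_edges`, and `collect_orphans` are hypothetical names, and the real implementation would issue the equivalent SurrealDB statements.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the `doc_chunk` table and `contains` edges.
#[derive(Default)]
pub struct Graph {
    /// content_hash -> content
    pub chunks: HashMap<String, String>,
    /// One (build_id, content_hash) pair per `contains` edge
    pub contains: HashSet<(String, String)>,
}

impl Graph {
    /// Reprocessing cleanup: drop a build's `contains` edges and leave
    /// the chunks in place. Returns the number of edges removed.
    pub fn clear_build_edges(&mut self, build_id: &str) -> usize {
        let before = self.contains.len();
        self.contains.retain(|(b, _)| b.as_str() != build_id);
        before - self.contains.len()
    }

    /// Deferred orphan GC: remove chunks that no build references any more.
    /// Returns the number of chunks removed.
    pub fn collect_orphans(&mut self) -> usize {
        let live: HashSet<&String> = self.contains.iter().map(|(_, h)| h).collect();
        let before = self.chunks.len();
        self.chunks.retain(|hash, _| live.contains(hash));
        before - self.chunks.len()
    }
}
```

Because cleanup touches only edges, a chunk shared with another build survives reprocessing untouched; only chunks left with no incoming edge are candidates for the later GC pass.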

**PersistDocChunk** (lines 1094-1279):
```rust
// 1. Hash content with blake3
// 2. Upsert chunk (UPDATE RETURN BEFORE, then INSERT if empty)
// 3. RELATE build->contains->chunk CONTENT { context_metadata }
```
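
The hash, upsert, relate sequence above can be sketched against an in-memory map. Two assumptions to note: `ChunkStore` and `persist_doc_chunk` are hypothetical names, and `stand_in_hash` uses the standard library's `DefaultHasher` purely so the sketch has no external dependencies. The plan itself calls for blake3, and `DefaultHasher` is not collision-resistant.

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for `hash_content`. NOT blake3 and NOT collision-resistant;
/// used only to keep this sketch free of external crates.
fn stand_in_hash(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

#[derive(Default)]
pub struct ChunkStore {
    /// content_hash -> content (the `doc_chunk` table)
    pub chunks: HashMap<String, String>,
    /// (build_id, content_hash, source_file), one per `contains` edge
    pub edges: Vec<(String, String, String)>,
}

impl ChunkStore {
    /// Stores the chunk only if its hash is new, then always records the edge.
    /// Returns true when a new chunk was created, false when one was reused.
    pub fn persist_doc_chunk(&mut self, build_id: &str, content: &str, source_file: &str) -> bool {
        let hash = stand_in_hash(content);
        let created = match self.chunks.entry(hash.clone()) {
            Entry::Occupied(_) => false,
            Entry::Vacant(slot) => {
                slot.insert(content.to_string());
                true
            }
        };
        self.edges
            .push((build_id.to_string(), hash, source_file.to_string()));
        created
    }
}
```

Persisting the same text from two builds yields one stored chunk and two edges, which is the deduplication behavior the handler is meant to provide.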

**QueryDocChunks** (lines 497-551):
```rust
// Graph traversal: SELECT *, ->contains.* FROM build->contains->doc_chunk
```

### 5. Message Changes (src/messages/)

- Update `PersistCrate` to create build entities
- Update `PersistDocChunk` to use BuildId and content hashing
- New `QueryBuild` for build-specific queries
- Update `DocChunksQueryResponse` to include edge metadata

### 6. Dependencies

Add to Cargo.toml:
```toml
blake3 = "1.5"  # Fast content hashing
```

## Analytics Queries Enabled

```sql
-- Popular features
SELECT count(<-enables<-build) as uses, name FROM feature GROUP BY name;

-- Chunk reuse statistics
SELECT count(<-contains<-build) as reuse, content_hash FROM doc_chunk;

-- Builds with feature combo
SELECT * FROM build WHERE ->enables->feature.name CONTAINSALL ["derive"];
```

## Migration Strategy

No migration is needed, since this is a greenfield app still being written for production.

## Success Criteria

- ✅ Chunks content-addressed with blake3
- ✅ Builds use mti IDs (format: `build_01h455...`)
- ✅ Automatic deduplication working
- ✅ Features as queryable entities
- ✅ Graph queries functional
- ✅ Reprocessing only updates edges
- ✅ No placeholder or simplified code - full production-ready implementation only
- ✅ Any code orphaned from the refactor is removed
- ✅ All tests pass

