feat: graph-aware relatedness queries (neighborhood overlap + Personalized PageRank)

## Summary

Add graph-structure-based relatedness queries so agents can ask "what symbols are similar to X?" without vector embeddings. The call graph topology itself encodes functional similarity — symbols that share callers/callees, sit in the same subgraph region, or belong to the same tightly connected cluster are structurally related.

This is a post-feature-strip addition: once embeddings are removed, these mechanisms replace semantic search with something **deterministic, zero-dependency, and always fresh**.

## Motivation

After the feature strip, `lore_search` only supports BM25/FTS5 text matching. That handles "find symbols named like X" but not "find symbols that play a similar role to X in the codebase." Graph-aware relatedness fills that gap using the existing `symbol_refs`, `type_refs`, `file_imports`, and `symbol_relationships` tables — no new infrastructure required.

## Required Mechanisms

All three mechanisms must be implemented. They serve complementary purposes at different granularities.

### 1. Neighborhood Overlap (pure SQL, no precomputation)

Two symbols are "similar" if they share callers or callees. Implementable as a single SQL query:

```sql
-- Symbols sharing the most callees with symbol :id
SELECT sr2.caller_id AS related_id, s.name, COUNT(*) AS shared_callees
  FROM symbol_refs sr1
  JOIN symbol_refs sr2 ON sr1.callee_id = sr2.callee_id
  JOIN symbols s ON s.id = sr2.caller_id
 WHERE sr1.caller_id = :id AND sr2.caller_id != :id
 GROUP BY sr2.caller_id
 ORDER BY shared_callees DESC
 LIMIT 20;
```

Same approach works in reverse (shared callers → symbols that serve as substitutes). Can also combine callee overlap, caller overlap, and type-ref overlap with configurable weights.

**Properties:**
- Zero new dependencies
- No precomputation or staleness
- Works on the live index
- Captures direct functional similarity

### 2. Personalized PageRank (in-process, ~100 LOC)

For a query symbol, run truncated random walks over the call graph. Symbols that accumulate high visit probability are "semantically near" in the graph — even across many hops.

```
Given query node q, damping α = 0.85:
  Run N random walks from q
  At each step: follow random outgoing edge (prob α), or teleport to q (prob 1-α)
  Count visits per node → similarity score
```

**Properties:**
- Captures transitive relationships that neighborhood overlap misses
- No external dependencies (runs over existing adjacency data in SQLite)
- Small in-process algorithm (~100 lines TypeScript)
- Approximation, but tunable precision (more walks = more accurate)

### 3. Community / Cluster Detection

Label propagation or connected-component analysis to identify functional modules — clusters of tightly interconnected symbols. "Related" = "in the same cluster."

Useful for higher-level queries like "what other code is part of the payment processing subsystem?" Could power a `lore_graph(kind="cluster")` mode.

**Properties:**
- Answers "what subsystem does X belong to?" — a question neither overlap nor PPR directly addresses
- Label propagation is simple to implement (~150 LOC) and runs in linear time
- No external dependencies
- Clusters can be cached and invalidated on index refresh

## Proposed API Surface

Option A — new mode on `lore_search`:
```json
{ "query_id": 42, "mode": "related", "limit": 20 }
```

Option B — new mode on `lore_graph`:
```json
{ "kind": "related", "source_id": 42, "algorithm": "overlap" | "pagerank" | "cluster" }
```

Option C — standalone tool `lore_related`:
```json
{ "symbol_id": 42, "algorithm": "overlap", "limit": 20 }
```

Leaning toward **Option B** since it's a graph query and fits naturally alongside `kind="call"`, `kind="import"`, etc.

## Implementation Plan

1. **Neighborhood overlap** as a new `kind="related"` on `lore_graph` (or `mode` on `lore_search`)
   - Add SQL query functions to `src/db/read-only.ts`
   - Wire into tool handler
   - Tests against fixture DBs
2. **Personalized PageRank** as a second algorithm option
   - Implement PPR walker in a new `src/resolution/pagerank.ts` (or similar)
   - Expose via `algorithm="pagerank"` parameter
   - Tests with known graph topologies
3. **Community / Cluster Detection** as a third algorithm option
   - Implement label propagation in `src/resolution/clusters.ts` (or similar)
   - Expose via `algorithm="cluster"` parameter
   - Cache cluster assignments, invalidate on index refresh
   - Tests with known cluster structures
4. **Benchmark** all three against the existing benchmark suite to measure whether they improve agent correctness on structural questions

## Non-Goals

- **Vector embeddings** — the whole point is that graph structure is the embedding. No `sqlite-vec`, no model downloads, no staleness.
- **Natural language understanding** — these mechanisms answer "what's structurally similar?" not "what does authentication mean?" BM25/FTS5 handles the text matching case.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: graph-aware relatedness queries (neighborhood overlap + Personalized PageRank) #255

Summary

Motivation

Required Mechanisms

1. Neighborhood Overlap (pure SQL, no precomputation)

2. Personalized PageRank (in-process, ~100 LOC)

3. Community / Cluster Detection

Proposed API Surface

Implementation Plan

Non-Goals

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat: graph-aware relatedness queries (neighborhood overlap + Personalized PageRank) #255

Description

Summary

Motivation

Required Mechanisms

1. Neighborhood Overlap (pure SQL, no precomputation)

2. Personalized PageRank (in-process, ~100 LOC)

3. Community / Cluster Detection

Proposed API Surface

Implementation Plan

Non-Goals

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions