Summary
Add graph-structure-based relatedness queries so agents can ask "what symbols are similar to X?" without vector embeddings. The call graph topology itself encodes functional similarity — symbols that share callers/callees, sit in the same subgraph region, or belong to the same tightly connected cluster are structurally related.
This is a post-feature-strip addition: once embeddings are removed, these mechanisms replace semantic search with something deterministic, zero-dependency, and always fresh.
Motivation
After the feature strip, lore_search only supports BM25/FTS5 text matching. That handles "find symbols named like X" but not "find symbols that play a similar role to X in the codebase." Graph-aware relatedness fills that gap using the existing symbol_refs, type_refs, file_imports, and symbol_relationships tables — no new infrastructure required.
Required Mechanisms
All three mechanisms must be implemented. They serve complementary purposes at different granularities.
1. Neighborhood Overlap (pure SQL, no precomputation)
Two symbols are "similar" if they share callers or callees. Implementable as a single SQL query:
-- Symbols sharing the most callees with symbol :id
SELECT sr2.caller_id AS related_id, s.name, COUNT(*) AS shared_callees
FROM symbol_refs sr1
JOIN symbol_refs sr2 ON sr1.callee_id = sr2.callee_id
JOIN symbols s ON s.id = sr2.caller_id
WHERE sr1.caller_id = :id AND sr2.caller_id != :id
GROUP BY sr2.caller_id
ORDER BY shared_callees DESC
LIMIT 20;
Same approach works in reverse (shared callers → symbols that serve as substitutes). Can also combine callee overlap, caller overlap, and type-ref overlap with configurable weights.
Properties:
- Zero new dependencies
- No precomputation or staleness
- Works on the live index
- Captures direct functional similarity
2. Personalized PageRank (in-process, ~100 LOC)
For a query symbol, run truncated random walks over the call graph. Symbols that accumulate high visit probability are "semantically near" in the graph — even across many hops.
Given query node q, damping α = 0.85:
Run N random walks from q
At each step: follow random outgoing edge (prob α), or teleport to q (prob 1-α)
Count visits per node → similarity score
Properties:
- Captures transitive relationships that neighborhood overlap misses
- No external dependencies (runs over existing adjacency data in SQLite)
- Small in-process algorithm (~100 lines TypeScript)
- Approximation, but tunable precision (more walks = more accurate)
3. Community / Cluster Detection
Label propagation or connected-component analysis to identify functional modules — clusters of tightly interconnected symbols. "Related" = "in the same cluster."
Useful for higher-level queries like "what other code is part of the payment processing subsystem?" Could power a lore_graph(kind="cluster") mode.
Properties:
- Answers "what subsystem does X belong to?" — a question neither overlap nor PPR directly addresses
- Label propagation is simple to implement (~150 LOC) and runs in linear time
- No external dependencies
- Clusters can be cached and invalidated on index refresh
Proposed API Surface
Option A — new mode on lore_search:
{ "query_id": 42, "mode": "related", "limit": 20 }
Option B — new mode on lore_graph:
{ "kind": "related", "source_id": 42, "algorithm": "overlap" | "pagerank" | "cluster" }
Option C — standalone tool lore_related:
{ "symbol_id": 42, "algorithm": "overlap", "limit": 20 }
Leaning toward Option B since it's a graph query and fits naturally alongside kind="call", kind="import", etc.
Implementation Plan
- Neighborhood overlap as a new
kind="related" on lore_graph (or mode on lore_search)
- Add SQL query functions to
src/db/read-only.ts
- Wire into tool handler
- Tests against fixture DBs
- Personalized PageRank as a second algorithm option
- Implement PPR walker in a new
src/resolution/pagerank.ts (or similar)
- Expose via
algorithm="pagerank" parameter
- Tests with known graph topologies
- Community / Cluster Detection as a third algorithm option
- Implement label propagation in
src/resolution/clusters.ts (or similar)
- Expose via
algorithm="cluster" parameter
- Cache cluster assignments, invalidate on index refresh
- Tests with known cluster structures
- Benchmark all three against the existing benchmark suite to measure whether they improve agent correctness on structural questions
Non-Goals
- Vector embeddings — the whole point is that graph structure is the embedding. No
sqlite-vec, no model downloads, no staleness.
- Natural language understanding — these mechanisms answer "what's structurally similar?" not "what does authentication mean?" BM25/FTS5 handles the text matching case.
Summary
Add graph-structure-based relatedness queries so agents can ask "what symbols are similar to X?" without vector embeddings. The call graph topology itself encodes functional similarity — symbols that share callers/callees, sit in the same subgraph region, or belong to the same tightly connected cluster are structurally related.
This is a post-feature-strip addition: once embeddings are removed, these mechanisms replace semantic search with something deterministic, zero-dependency, and always fresh.
Motivation
After the feature strip,
lore_searchonly supports BM25/FTS5 text matching. That handles "find symbols named like X" but not "find symbols that play a similar role to X in the codebase." Graph-aware relatedness fills that gap using the existingsymbol_refs,type_refs,file_imports, andsymbol_relationshipstables — no new infrastructure required.Required Mechanisms
All three mechanisms must be implemented. They serve complementary purposes at different granularities.
1. Neighborhood Overlap (pure SQL, no precomputation)
Two symbols are "similar" if they share callers or callees. Implementable as a single SQL query:
Same approach works in reverse (shared callers → symbols that serve as substitutes). Can also combine callee overlap, caller overlap, and type-ref overlap with configurable weights.
Properties:
2. Personalized PageRank (in-process, ~100 LOC)
For a query symbol, run truncated random walks over the call graph. Symbols that accumulate high visit probability are "semantically near" in the graph — even across many hops.
Properties:
3. Community / Cluster Detection
Label propagation or connected-component analysis to identify functional modules — clusters of tightly interconnected symbols. "Related" = "in the same cluster."
Useful for higher-level queries like "what other code is part of the payment processing subsystem?" Could power a
lore_graph(kind="cluster")mode.Properties:
Proposed API Surface
Option A — new mode on
lore_search:{ "query_id": 42, "mode": "related", "limit": 20 }Option B — new mode on
lore_graph:{ "kind": "related", "source_id": 42, "algorithm": "overlap" | "pagerank" | "cluster" }Option C — standalone tool
lore_related:{ "symbol_id": 42, "algorithm": "overlap", "limit": 20 }Leaning toward Option B since it's a graph query and fits naturally alongside
kind="call",kind="import", etc.Implementation Plan
kind="related"onlore_graph(ormodeonlore_search)src/db/read-only.tssrc/resolution/pagerank.ts(or similar)algorithm="pagerank"parametersrc/resolution/clusters.ts(or similar)algorithm="cluster"parameterNon-Goals
sqlite-vec, no model downloads, no staleness.