Skip to content

feat: graph-aware relatedness queries (neighborhood overlap + Personalized PageRank) #255

@jafreck

Description

@jafreck

Summary

Add graph-structure-based relatedness queries so agents can ask "what symbols are similar to X?" without vector embeddings. The call graph topology itself encodes functional similarity — symbols that share callers/callees, sit in the same subgraph region, or belong to the same tightly connected cluster are structurally related.

This is a post-feature-strip addition: once embeddings are removed, these mechanisms replace semantic search with something deterministic, zero-dependency, and always fresh.

Motivation

After the feature strip, lore_search only supports BM25/FTS5 text matching. That handles "find symbols named like X" but not "find symbols that play a similar role to X in the codebase." Graph-aware relatedness fills that gap using the existing symbol_refs, type_refs, file_imports, and symbol_relationships tables — no new infrastructure required.

Required Mechanisms

All three mechanisms must be implemented. They serve complementary purposes at different granularities.

1. Neighborhood Overlap (pure SQL, no precomputation)

Two symbols are "similar" if they share callers or callees. Implementable as a single SQL query:

-- Symbols sharing the most callees with symbol :id
SELECT sr2.caller_id AS related_id, s.name, COUNT(*) AS shared_callees
  FROM symbol_refs sr1
  JOIN symbol_refs sr2 ON sr1.callee_id = sr2.callee_id
  JOIN symbols s ON s.id = sr2.caller_id
 WHERE sr1.caller_id = :id AND sr2.caller_id != :id
 GROUP BY sr2.caller_id
 ORDER BY shared_callees DESC
 LIMIT 20;

Same approach works in reverse (shared callers → symbols that serve as substitutes). Can also combine callee overlap, caller overlap, and type-ref overlap with configurable weights.

Properties:

  • Zero new dependencies
  • No precomputation or staleness
  • Works on the live index
  • Captures direct functional similarity

2. Personalized PageRank (in-process, ~100 LOC)

For a query symbol, run truncated random walks over the call graph. Symbols that accumulate high visit probability are "semantically near" in the graph — even across many hops.

Given query node q, damping α = 0.85:
  Run N random walks from q
  At each step: follow random outgoing edge (prob α), or teleport to q (prob 1-α)
  Count visits per node → similarity score

Properties:

  • Captures transitive relationships that neighborhood overlap misses
  • No external dependencies (runs over existing adjacency data in SQLite)
  • Small in-process algorithm (~100 lines TypeScript)
  • Approximation, but tunable precision (more walks = more accurate)

3. Community / Cluster Detection

Label propagation or connected-component analysis to identify functional modules — clusters of tightly interconnected symbols. "Related" = "in the same cluster."

Useful for higher-level queries like "what other code is part of the payment processing subsystem?" Could power a lore_graph(kind="cluster") mode.

Properties:

  • Answers "what subsystem does X belong to?" — a question neither overlap nor PPR directly addresses
  • Label propagation is simple to implement (~150 LOC) and runs in linear time
  • No external dependencies
  • Clusters can be cached and invalidated on index refresh

Proposed API Surface

Option A — new mode on lore_search:

{ "query_id": 42, "mode": "related", "limit": 20 }

Option B — new mode on lore_graph:

{ "kind": "related", "source_id": 42, "algorithm": "overlap" | "pagerank" | "cluster" }

Option C — standalone tool lore_related:

{ "symbol_id": 42, "algorithm": "overlap", "limit": 20 }

Leaning toward Option B since it's a graph query and fits naturally alongside kind="call", kind="import", etc.

Implementation Plan

  1. Neighborhood overlap as a new kind="related" on lore_graph (or mode on lore_search)
    • Add SQL query functions to src/db/read-only.ts
    • Wire into tool handler
    • Tests against fixture DBs
  2. Personalized PageRank as a second algorithm option
    • Implement PPR walker in a new src/resolution/pagerank.ts (or similar)
    • Expose via algorithm="pagerank" parameter
    • Tests with known graph topologies
  3. Community / Cluster Detection as a third algorithm option
    • Implement label propagation in src/resolution/clusters.ts (or similar)
    • Expose via algorithm="cluster" parameter
    • Cache cluster assignments, invalidate on index refresh
    • Tests with known cluster structures
  4. Benchmark all three against the existing benchmark suite to measure whether they improve agent correctness on structural questions

Non-Goals

  • Vector embeddings — the whole point is that graph structure is the embedding. No sqlite-vec, no model downloads, no staleness.
  • Natural language understanding — these mechanisms answer "what's structurally similar?" not "what does authentication mean?" BM25/FTS5 handles the text matching case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions