Fix N+1 in GraphStoreMetrics caller lookups (CA-171)#44
Merged
repoReferenceCount and packageReferenceCount issued one SurrealDB websocket round-trip per caller via GetSymbol(callerID) inside a nested loop. On any non-trivial repo this stalled Living Wiki page generation indefinitely: the page goroutine ground through the per-call websocket queue while the errgroup semaphore (clamped to MaxConcurrency=1 by the upstream LLM capacity provider) blocked every subsequent page.

Replaces the inner GetSymbol-per-callerID with a single GetSymbolsByIDs batch fetch (helper already exists in internal/db/store.go:1875). The outer GetCallers-per-symbol N+1 remains and is tracked separately; that fix needs a new GetCallersByIDs store method.

Caught via pprof goroutine dump (gorillaws.Connection.Call [select]) during CA-169 deploy validation.

Refs CA-171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
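To illustrate the change in shape described above, here is a minimal sketch that contrasts a per-caller lookup loop with a single batch fetch. Everything in it is a hypothetical stand-in: `memStore`, the `Symbol` fields, the method signatures, and the "count callers outside the package" rule are assumptions for illustration, not the real store API or the real metric definition. The fake store only counts simulated round-trips.

```go
package main

import "fmt"

// Symbol is a hypothetical stand-in for the store's symbol record.
type Symbol struct {
	ID      string
	Package string
}

// memStore is an in-memory fake that counts simulated round-trips.
type memStore struct {
	symbols    map[string]Symbol
	roundTrips int
}

// GetSymbol models one round-trip per symbol ID.
func (m *memStore) GetSymbol(id string) (Symbol, bool) {
	m.roundTrips++
	s, ok := m.symbols[id]
	return s, ok
}

// GetSymbolsByIDs models one round-trip for the whole ID set.
// The real helper's signature may differ; this shape is assumed.
func (m *memStore) GetSymbolsByIDs(ids []string) map[string]Symbol {
	m.roundTrips++
	out := make(map[string]Symbol, len(ids))
	for _, id := range ids {
		if s, ok := m.symbols[id]; ok {
			out[id] = s
		}
	}
	return out
}

// countPerCaller is the N+1 shape: one lookup per caller ID.
func countPerCaller(m *memStore, pkg string, callerIDs []string) int {
	n := 0
	for _, id := range callerIDs {
		if s, ok := m.GetSymbol(id); ok && s.Package != pkg {
			n++
		}
	}
	return n
}

// countBatched resolves every caller in a single batch fetch.
// It assumes callerIDs holds distinct IDs; the map result would
// collapse duplicates.
func countBatched(m *memStore, pkg string, callerIDs []string) int {
	n := 0
	for _, s := range m.GetSymbolsByIDs(callerIDs) {
		if s.Package != pkg {
			n++
		}
	}
	return n
}

// newDemoStore builds a tiny fixture with three symbols.
func newDemoStore() *memStore {
	return &memStore{symbols: map[string]Symbol{
		"a": {ID: "a", Package: "p1"},
		"b": {ID: "b", Package: "p2"},
		"c": {ID: "c", Package: "p3"},
	}}
}

func main() {
	callers := []string{"a", "b", "c"}

	slow := newDemoStore()
	slowCount := countPerCaller(slow, "p1", callers)
	fmt.Println("per-caller:", slowCount, "round-trips:", slow.roundTrips)

	fast := newDemoStore()
	fastCount := countBatched(fast, "p1", callers)
	fmt.Println("batched:", fastCount, "round-trips:", fast.roundTrips)
}
```

Both functions return the same count on this fixture; only the number of simulated round-trips differs, and that difference is what matters when each round-trip is a sequential websocket call.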
Adds dev-only Go pprof endpoints under /debug/pprof/* gated by an env var (default false) so a goroutine dump can be captured against a hung job without rebuilding. Mounted before the rate limiter so a dump is not throttled. Compose forwards the env var so SOURCEBRIDGE_PPROF_ENABLED=true on the host enables it for local stacks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
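As a rough sketch of the gating described above, the following uses only the standard `net/http` and `net/http/pprof` packages. The router layout, the `newRouter` and `alwaysThrottled` helpers, and the use of `strconv.ParseBool` to read the env var are assumptions for illustration; the actual wiring in the repository may look different. The stub "rate limiter" rejects everything so that it is obvious which routes bypass it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"os"
	"strconv"
)

// pprofEnabled reads the opt-in flag; unset or unparsable means false.
// Parsing with strconv.ParseBool is an assumption of this sketch.
func pprofEnabled() bool {
	v, err := strconv.ParseBool(os.Getenv("SOURCEBRIDGE_PPROF_ENABLED"))
	return err == nil && v
}

// alwaysThrottled is a stub standing in for the rate-limited app routes.
// It rejects every request with 429.
func alwaysThrottled() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w, "rate limited", http.StatusTooManyRequests)
	})
}

// newRouter mounts the pprof handlers directly on the mux, outside the
// rate-limited handler, and only when enabled is true.
func newRouter(enabled bool, limited http.Handler) http.Handler {
	mux := http.NewServeMux()
	if enabled {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	}
	mux.Handle("/", limited)
	return mux
}

// statusFor issues an in-process GET and returns the response status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := newRouter(pprofEnabled(), alwaysThrottled())
	fmt.Println("goroutine dump status:",
		statusFor(h, "/debug/pprof/goroutine?debug=1"))
}
```

With the flag on, the goroutine profile is served by `pprof.Index` without passing through the throttled handler; with it off, the path falls through to the ordinary routes.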
…lEdges (CA-171)

Initial fix only batched the inner GetSymbol-per-callerID lookup. The outer GetCallers-per-symbol N+1 still stalled the same way (confirmed by a second pprof goroutine dump after the first deploy).

GraphStore already exposes GetCallEdges(repoID) which returns every caller→callee edge for the repo in a single query. Rewrites all four metric functions to use it:
- packageReferenceCount: filter edges by callee membership in pkg, then one GetSymbolsByIDs batch for callers.
- packageRelationCount: count edges whose callee is in pkg.
- repoReferenceCount: one GetCallEdges + one GetSymbolsByIDs batch.
- repoRelationCount: len(GetCallEdges(repoID)).

For a repo with N symbols and K avg callers per symbol, this collapses O(N*K + 1) sequential SurrealDB round-trips into 2-3 total.

Refs CA-171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
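The edge-based rewrite listed above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `memGraph`, `CallEdge`, the `Symbol` fields, the method signatures, and the rule that a "reference" is a distinct caller outside the package are all hypothetical. The fake store counts simulated round-trips so the 2-query shape is visible.

```go
package main

import "fmt"

// CallEdge and Symbol are hypothetical stand-ins for the store's types.
type CallEdge struct{ CallerID, CalleeID string }

type Symbol struct{ ID, Package string }

// memGraph is an in-memory fake that counts simulated round-trips.
type memGraph struct {
	edges      []CallEdge
	symbols    map[string]Symbol
	roundTrips int
}

// GetCallEdges models one query returning every edge in the repo.
func (g *memGraph) GetCallEdges(repoID string) []CallEdge {
	g.roundTrips++
	return g.edges
}

// GetSymbolsByIDs models one batch query for a set of symbol IDs.
func (g *memGraph) GetSymbolsByIDs(ids []string) map[string]Symbol {
	g.roundTrips++
	out := make(map[string]Symbol, len(ids))
	for _, id := range ids {
		if s, ok := g.symbols[id]; ok {
			out[id] = s
		}
	}
	return out
}

// repoRelationCount is the number of call edges in the repo.
func repoRelationCount(g *memGraph, repoID string) int {
	return len(g.GetCallEdges(repoID))
}

// packageRelationCount counts edges whose callee belongs to the package.
func packageRelationCount(g *memGraph, repoID string, pkgIDs map[string]bool) int {
	n := 0
	for _, e := range g.GetCallEdges(repoID) {
		if pkgIDs[e.CalleeID] {
			n++
		}
	}
	return n
}

// packageReferenceCount filters edges by callee membership, then resolves
// the distinct callers in one batch. Counting only callers outside the
// package is an assumption of this sketch.
func packageReferenceCount(g *memGraph, repoID, pkg string, pkgIDs map[string]bool) int {
	seen := map[string]bool{}
	var callerIDs []string
	for _, e := range g.GetCallEdges(repoID) {
		if pkgIDs[e.CalleeID] && !seen[e.CallerID] {
			seen[e.CallerID] = true
			callerIDs = append(callerIDs, e.CallerID)
		}
	}
	n := 0
	for _, s := range g.GetSymbolsByIDs(callerIDs) {
		if s.Package != pkg {
			n++
		}
	}
	return n
}

// newDemoGraph builds a small fixture: package "pkg" owns x and y.
func newDemoGraph() *memGraph {
	return &memGraph{
		edges: []CallEdge{{"a", "x"}, {"b", "x"}, {"c", "y"}, {"x", "y"}, {"a", "z"}},
		symbols: map[string]Symbol{
			"a": {"a", "p1"}, "b": {"b", "p2"}, "c": {"c", "p2"},
			"x": {"x", "pkg"}, "y": {"y", "pkg"}, "z": {"z", "p3"},
		},
	}
}

func main() {
	g := newDemoGraph()
	pkgIDs := map[string]bool{"x": true, "y": true}
	refs := packageReferenceCount(g, "repo", "pkg", pkgIDs)
	fmt.Println("package references:", refs, "round-trips:", g.roundTrips)
}
```

The number of round-trips in this sketch does not depend on how many symbols or callers the fixture contains, which is the property the commit message describes.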
Summary
Living Wiki page generation hung indefinitely on any non-trivial repo because GraphStoreMetrics.repoReferenceCount / packageReferenceCount issued one SurrealDB websocket round-trip per caller via GetSymbol(callerID) inside a nested loop: O(symbols × callers) sequential RPCs. Replaces the inner N with one GetSymbolsByIDs batch.

Discovered via pprof goroutine dump during CA-169 deploy validation:
- select-blocked in gorillaws.Connection.Call
- chan send-blocked on the orchestrator semaphore (clamped to 1 by the upstream-LLM capacity provider; that part working as designed)
- failed

The outer GetCallers-per-symbol N+1 remains; that fix needs a new GetCallersByIDs store method and is tracked separately.

Closes CA-171.
Test plan
- go test ./internal/livingwiki/orchestrator/ -count=1 green
- go build ./... clean

Also in this PR (separate commit)
chore(api): expose net/http/pprof behind SOURCEBRIDGE_PPROF_ENABLED, a dev-only goroutine dump endpoint that produced the diagnosis. Off by default; opt-in via env var.

🤖 Generated with Claude Code