Troubleshooting

jstuart0 edited this page Apr 28, 2026 · 2 revisions

API won't start: migrations fail

Symptom: The API crashes at startup with a SurrealDB error mentioning a migration.

Fix:

  1. Check that SurrealDB is reachable at SOURCEBRIDGE_STORAGE_SURREAL_URL.
  2. Verify SurrealDB is the correct version. The Docker Compose files use surrealdb/surrealdb:v2.2.1.
  3. If you are using embedded mode, ensure the data directory is writable and not corrupted.
  4. Check the API logs for the specific migration number that failed.

The migration runner runs at startup and skips already-applied migrations. If a migration partially applied, the safest recovery for a dev instance is to wipe the data directory and restart.
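For steps 1 and 2, one way to check reachability and version from a shell is SurrealDB's HTTP `/health` and `/version` endpoints. The sketch below assumes SOURCEBRIDGE_STORAGE_SURREAL_URL holds a ws://, wss://, http://, or https:// URL that may end in /rpc; the helper name `surreal_http_base` is hypothetical and not part of SourceBridge. It does not apply to embedded mode.

```shell
# Hypothetical helper: turn a SurrealDB connection URL into an HTTP base URL.
# Assumes a ws://, wss://, http://, or https:// URL, optionally ending in /rpc.
surreal_http_base() {
  url=$1
  case "$url" in
    ws://*)  url="http://${url#ws://}" ;;
    wss://*) url="https://${url#wss://}" ;;
  esac
  url=${url%/}      # drop one trailing slash, if any
  url=${url%/rpc}   # drop a trailing /rpc path, if any
  printf '%s\n' "$url"
}

# Example (run against a live instance):
#   base=$(surreal_http_base "$SOURCEBRIDGE_STORAGE_SURREAL_URL")
#   curl -fsS "$base/health" && echo "SurrealDB reachable"
#   curl -fsS "$base/version"   # compare with the version pinned in Docker Compose
```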

API starts but generation features fail

Symptom: Search and indexing work; cliff notes, code tours, or QA fail with errors.

Cause: The Python worker is not reachable.

Checks:

  1. Confirm the worker container/process is running.
  2. Confirm SOURCEBRIDGE_WORKER_ADDRESS matches the actual gRPC address (default localhost:50051).
  3. Confirm SOURCEBRIDGE_SECURITY_GRPC_AUTH_SECRET matches SOURCEBRIDGE_WORKER_GRPC_AUTH_SECRET.
  4. Check worker logs for startup errors (missing LLM config, DB connection failure).
  5. Query the API readiness probe with curl http://localhost:8080/readyz. It returns unhealthy if the worker probe failed.
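Checks 3 and 5 can be scripted. This is a minimal sketch, assuming both secrets are exported in the current shell; `secrets_match` is a hypothetical helper, not a SourceBridge command.

```shell
# Hypothetical helper: succeed only if both values are non-empty and identical.
secrets_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (assumes both variables are exported in this shell):
#   secrets_match "$SOURCEBRIDGE_SECURITY_GRPC_AUTH_SECRET" \
#                 "$SOURCEBRIDGE_WORKER_GRPC_AUTH_SECRET" \
#     || echo "gRPC auth secrets differ or are unset"
#
# Print the readiness response body followed by the HTTP status code:
#   curl -sS -w '\nHTTP %{http_code}\n' http://localhost:8080/readyz
```

Treating two empty values as a mismatch is deliberate: an unset secret on both sides would otherwise look like a pass.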

Split-brain agentic QA on rolling deploys

Symptom: Some API pods use the agentic QA loop; others fall back to single-shot. Inconsistent answer quality.

Cause: Under a rolling deploy, an API pod can probe the worker before it is ready, fail, and stay on the single-shot path for its lifetime.

Fix: The API's startup probe retries up to 6 times with a 5-second backoff (a 30-second window). Make sure the worker has a readiness probe configured. In Kubernetes, verify that the worker pod's readinessProbe is passing before the API pod marks itself ready.
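The retry behaviour described above can be approximated in an entrypoint or pre-start script to hold the API back until the worker answers. This is only an illustration of the pattern, not the API's actual probe code; `retry_probe` is a hypothetical helper, and the `nc -z` port check in the example is an assumed stand-in for a real gRPC health check.

```shell
# Hypothetical helper: run a command up to N times, sleeping between failures.
# Usage: retry_probe <attempts> <delay_seconds> <command> [args...]
retry_probe() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0            # probe succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"      # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                # every attempt failed
}

# Example: 6 attempts, 5 seconds apart, against the default worker address.
#   retry_probe 6 5 nc -z localhost 50051 || echo "worker not reachable"
```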

LLM timeouts

Symptom: Generation jobs fail with timeout errors in the worker logs.

Checks:

  1. Verify your LLM provider credentials are correct.
  2. For cloud providers: check rate limits and API status pages.
  3. For Ollama: verify the model is fully loaded (ollama list and ollama ps).
  4. The default timeout is 900 seconds (15 minutes). If local models are slower, increase SOURCEBRIDGE_LLM_TIMEOUT_SECONDS.
  5. For Qwen3 on Ollama: verify the /no_think directive is being respected. If the model returns empty content with stop_reason=length, it may be spending the whole budget inside a thinking block.
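As a sketch of step 4, the timeout can be raised by exporting the variable in the environment of the process that reads it. The value 1800 is only an example, and how the variable reaches that process (env file, Compose `environment:` entry, Kubernetes env) depends on your deployment.

```shell
# Example value only: raise the LLM timeout from the default 900 s to 30 minutes.
export SOURCEBRIDGE_LLM_TIMEOUT_SECONDS=1800

# Then, on the Ollama host, confirm the model is downloaded and currently loaded:
#   ollama list
#   ollama ps
```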

Indexing is slow

Checks:

  1. Increase SOURCEBRIDGE_INDEXING_MAX_CONCURRENCY (default 8). On a fast disk, 16–32 is reasonable.
  2. Verify SOURCEBRIDGE_INDEXING_MAX_FILE_SIZE_BYTES is not too low for your repo's large files.
  3. Confirm the ignore globs exclude large generated directories (node_modules, dist, vendor).
  4. For very large repos (100k+ files), expect indexing to take several minutes even with high concurrency.
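To sanity-check steps 2 and 3 before re-indexing, it can help to list the files that exceed the size limit outside the usual generated directories. The sketch below assumes the limit is a plain byte count; `files_over_limit` is a hypothetical helper, and the pruned directory names simply mirror the examples in step 3.

```shell
# Hypothetical helper: list files larger than <limit_bytes> under <dir>,
# skipping node_modules, dist, and vendor directories.
# Usage: files_over_limit <dir> <limit_bytes>
files_over_limit() {
  dir=$1
  limit=$2
  find "$dir" \
    \( -name node_modules -o -name dist -o -name vendor \) -prune \
    -o -type f -size +"${limit}c" -print
}

# Example:
#   files_over_limit . "$SOURCEBRIDGE_INDEXING_MAX_FILE_SIZE_BYTES"
```

If source files you care about show up in the output, the limit is probably too low for the repository; if only generated artifacts appear, extending the ignore globs is likely the better fix.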

Ollama model returns empty content

Symptom: Generation completes but returns empty or truncated output with Qwen models on Ollama.

Cause: When thinking is not disabled, Qwen3 MoE can spend its entire max_tokens budget inside a thinking block that is never emitted.

Fix: This is handled automatically in current builds via the /no_think directive. If you are on an older version, update to a current build.

Living wiki: enabled, but nothing appears in Confluence

Checks:

  1. Verify SOURCEBRIDGE_LIVING_WIKI_ENABLED=true and the feature is enabled for the specific repo.
  2. Check the job result in the admin activity feed — the failure category will indicate whether it is a credential error, network error, or partial-content failure.
  3. Verify Confluence credentials in /settings/living-wiki and use the Test connection button.
  4. Verify confluenceSite is set (added in migration 038 — check the settings page for a "Confluence site" field).
  5. If the test-connection button fails with an auth error, verify you are using the Confluence API token with the correct site URL format (your-org.atlassian.net/wiki).
  6. Check for ErrSinkNotImplemented in logs — this means you chose a stubbed sink (Backstage TechDocs, MkDocs, etc.).

Living wiki: pages are regenerating but human edits keep getting overwritten

Cause: The edit policy for the sink is set to promote_to_canonical instead of local_to_sink.

Fix: In the per-repo Settings → Living Wiki panel (Stage B / refinement), change the edit policy for the affected sink to local_to_sink (recommended for Confluence and Notion) or require_review_before_promote.

MCP clients cannot connect

Checks:

  1. Verify SOURCEBRIDGE_MCP_ENABLED=true.
  2. Confirm the client is sending Authorization: Bearer <token> on every request.
  3. For multi-replica deployments: confirm Redis is configured (SOURCEBRIDGE_STORAGE_REDIS_MODE=external) so sessions are shared across pods.
  4. For mcp-remote clients: verify the URL format is http://your-host:8080/mcp.
  5. Check whether the client is sandboxing network access (some Claude Desktop versions restrict outbound MCP calls).

VS Code extension shows "offline" or no lenses

Checks:

  1. Confirm sourcebridge.apiUrl matches the running server (including port).
  2. Run SourceBridge: Show Logs from the command palette — it will show connection errors and auth failures.
  3. Run SourceBridge: Sign In to refresh the token.
  4. Verify the workspace folder is indexed on the connected server.
  5. If lenses are missing on specific symbols: verify the symbol kind is supported (functions, methods, classes are indexed; anonymous lambdas may not be named).

Subsystem clustering tab is empty

Cause: Clustering runs as an async job after indexing. If the tab is empty, either the job has not completed or it found too few edges to cluster meaningfully.

Checks:

  1. Check the admin activity feed for a completed clustering job.
  2. The clustering job only runs if the call-graph SHA-256 changed since the last run.
  3. Very small repositories (fewer than ~20 symbols with call edges) may not produce clusters.
  4. Verify the subsystem_clustering capability is available on your edition (the oss edition includes it).

SurrealDB: stale rows after schema migration

Symptom: UPSERT failures mentioning NONE for fields that should have defaults.

Cause: SurrealDB DEFAULT only fires on row creation. Rows created before a migration that added new columns will have NONE for those fields.

Fix: This is handled by backfill migrations (e.g., migration 039 for living-wiki settings). If you see this on a custom table, run a manual backfill of the form UPDATE <table> SET <field> = <default> WHERE <field> IS NONE, substituting your own table name, field name, and default value.
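For a custom table, the backfill statement can be generated and reviewed before it is run. This is a minimal sketch: `backfill_query` is a hypothetical helper, the table and field names in the example are invented, and the default must be written as a valid SurrealQL literal.

```shell
# Hypothetical helper: print a SurrealQL backfill statement for one field.
# Usage: backfill_query <table> <field> <default_literal>
backfill_query() {
  printf 'UPDATE %s SET %s = %s WHERE %s IS NONE;\n' "$1" "$2" "$3" "$2"
}

# Example with invented names; prints:
#   UPDATE my_table SET enabled = false WHERE enabled IS NONE;
backfill_query my_table enabled false
```

Paste the printed statement into whichever SurrealDB client you normally use, after confirming that the namespace and database are the ones you intend to modify.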

Resetting a development instance

For dev instances, a full reset is often the fastest recovery:

# Docker Compose
docker compose down -v   # removes volumes including SurrealDB data
docker compose up -d

# Embedded mode (running sourcebridge serve directly)
rm -rf ./surrealdb-data
./sourcebridge serve

After restart, re-index your repositories from the web UI.
