Skip to content

fix(hnsw): skip corrupt nodes instead of crashing on decode errors#459

Open
kriszyp wants to merge 3 commits intomainfrom
fix/hnsw-corrupt-node-resilience
Open

fix(hnsw): skip corrupt nodes instead of crashing on decode errors#459
kriszyp wants to merge 3 commits intomainfrom
fix/hnsw-corrupt-node-resilience

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 2, 2026

Summary

  • Adds a private safeGetSync helper on HierarchicalNavigableSmallWorld that wraps indexStore.getSync in a try-catch, returning undefined (and logging a warning) instead of propagating the error when a node's stored bytes are corrupt.
  • Applies it to all 7 call sites that read node data by numeric key: getEntryPoint, the entry-point read in index(), the old-node neighbor removal loop, the connectionsToBeReplaced loop, addConnection, searchLayer, and checkSymmetry.
  • Fixes a pre-existing bug in checkSymmetry that logged the undefined result instead of the key.
  • Adds a unit test using a minimal mock store that throws "Data read, but end of buffer not reached 0" after the third neighbor access, asserting doesNotThrow on subsequent inserts.

Purpose

A single corrupt HNSW node in a customer's index (User app, confirmed raw bytes 01010101449100000e0000...) was causing indexStore.getSync to throw an unhandled exception in graph traversal. That uncaught exception triggered a worker uncaughtException → RocksDB napi_call_threadsafe_function on dead thread → SIGSEGV → full process crash. Every subsequent insert to an HNSW-indexed table re-hit the same corrupt node and crashed again, creating a crash loop.

The fix makes corrupt nodes non-fatal: they are skipped during traversal with a warning log. The graph remains operational (potentially with degraded routing near the corrupt node) until the index is rebuilt.

Areas requiring attention

  • Coverage completeness: there are exactly 7 places that call getSync with a numeric key. I believe all are now covered by safeGetSync, but please verify against the full index() method, especially the level-iteration loop around line 150–180, which was the hardest to reason about.
  • Swallowed corruption: safeGetSync silently continues when a node is unreadable. This means a corrupt node in the path from the entry point could cause the search to explore a smaller portion of the graph. The logged warning at warn level should surface this, but consider whether error level is more appropriate.
  • Unit test mock fidelity: the mock indexStore key dispatch (Symbol → entryPoint, number → node, JSON-stringified → level connection lists) mirrors the real store's key pattern. If that pattern changes, the test could silently stop exercising the corrupt path.
  • Two commits on branch: the first commit (047744bf8) has verbose per-site try-catch blocks; the second (4b9e22bab) refactors to the safeGetSync helper. Squash on merge is fine.

Test plan

  • npm test -- --grep 'does not crash when an index node decodes as corrupt' passes
  • Full HNSW test suite (vectorIndex.test.js) passes (3 integration tests + new unit test)
  • Deploy 5.0.10 patch to Serent nodes and confirm crash loop stops
  • After stabilisation, rebuild HNSW indexes for DocumentChunk, CompanyProfile, ScoreEvidence to evict the corrupt node

🤖 Generated with Claude Code

kriszyp and others added 3 commits May 2, 2026 15:42
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When an HNSW index node's data is corrupt (e.g., from an interrupted write
or format mismatch after a reclone), getSync throws a decode error that
previously propagated uncaught through searchLayer and index(), crashing
the worker via uncaughtException, which then triggered a RocksDB SIGSEGV
via napi_call_threadsafe_function on the dead thread.

Wrap all getSync calls that traverse the HNSW graph (entry point, neighbor
nodes in searchLayer, fromId/toId nodes during connection updates, and
neighbor nodes during old-node removal) in try-catch so corrupt nodes are
skipped rather than crashing the process.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ad paths

Adds a private `safeGetSync` helper that wraps `indexStore.getSync` in a
try-catch, returning `undefined` instead of throwing when a node's stored
data is corrupt. Applies it to all 7 call sites that read node data by
numeric key (including `getEntryPoint`, `searchLayer`, `checkSymmetry`,
and `addConnection`), so a single corrupt node causes a logged warning
rather than an uncaught exception.

Also fixes a pre-existing bug in `checkSymmetry` where the undefined value
(not the key) was being logged.

Adds a unit test using a minimal mock store that simulates corrupt reads
after the third neighbor access, verifying `doesNotThrow` on subsequent
inserts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp requested a review from a team as a code owner May 2, 2026 22:41
@@ -112,7 +112,7 @@ export class HierarchicalNavigableSmallWorld {
oldNode = { ...this.indexStore.getSync(nodeId, options) };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed safeGetSync call site — update path still crashes on corrupt nodes

nodeId is always numeric here (either from Atomics.add or from the safeKey lookup), so this is one of the numeric-key node reads the PR set out to protect. If the stored bytes for this node are corrupt, getSync throws before the spread runs, and the exception propagates through index() exactly as before the fix — defeating the crash-loop protection on the update (existingVector) path.

Suggested change
oldNode = { ...this.indexStore.getSync(nodeId, options) };
oldNode = { ...this.safeGetSync(nodeId, options) };

The PR description counts "exactly 7 places that call getSync with a numeric key"; this is the one that was missed.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 2, 2026

One blocker: resources/indexes/HierarchicalNavigableSmallWorld.ts line 112 has a raw this.indexStore.getSync(nodeId, options) in the existingVector (update) branch of index() — the one numeric-key node read not converted to safeGetSync. A corrupt node on the update path will still throw and re-enter the crash loop. See inline comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant