fix(hnsw): skip corrupt nodes instead of crashing on decode errors#459
fix(hnsw): skip corrupt nodes instead of crashing on decode errors#459
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When an HNSW index node's data is corrupt (e.g., from an interrupted write or format mismatch after a reclone), getSync throws a decode error that previously propagated uncaught through searchLayer and index(), crashing the worker via uncaughtException, which then triggered a RocksDB SIGSEGV via napi_call_threadsafe_function on the dead thread. Wrap all getSync calls that traverse the HNSW graph (entry point, neighbor nodes in searchLayer, fromId/toId nodes during connection updates, and neighbor nodes during old-node removal) in try-catch so corrupt nodes are skipped rather than crashing the process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ad paths Adds a private `safeGetSync` helper that wraps `indexStore.getSync` in a try-catch, returning `undefined` instead of throwing when a node's stored data is corrupt. Applies it to all 7 call sites that read node data by numeric key (including `getEntryPoint`, `searchLayer`, `checkSymmetry`, and `addConnection`), so a single corrupt node causes a logged warning rather than an uncaught exception. Also fixes a pre-existing bug in `checkSymmetry` where the undefined value (not the key) was being logged. Adds a unit test using a minimal mock store that simulates corrupt reads after the third neighbor access, verifying `doesNotThrow` on subsequent inserts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| @@ -112,7 +112,7 @@ export class HierarchicalNavigableSmallWorld { | |||
| oldNode = { ...this.indexStore.getSync(nodeId, options) }; | |||
There was a problem hiding this comment.
Missed safeGetSync call site — update path still crashes on corrupt nodes
nodeId is always numeric here (either from Atomics.add or from the safeKey lookup), so this is one of the numeric-key node reads the PR set out to protect. If the stored bytes for this node are corrupt, getSync throws before the spread runs, and the exception propagates through index() exactly as before the fix — defeating the crash-loop protection on the update (existingVector) path.
| oldNode = { ...this.indexStore.getSync(nodeId, options) }; | |
| oldNode = { ...this.safeGetSync(nodeId, options) }; |
The PR description counts "exactly 7 places that call getSync with a numeric key"; this is the one that was missed.
|
One blocker: |
Summary
safeGetSynchelper onHierarchicalNavigableSmallWorldthat wrapsindexStore.getSyncin a try-catch, returningundefined(and logging a warning) instead of propagating the error when a node's stored bytes are corrupt.getEntryPoint, the entry-point read inindex(), the old-node neighbor removal loop, theconnectionsToBeReplacedloop,addConnection,searchLayer, andcheckSymmetry.checkSymmetrythat logged theundefinedresult instead of the key."Data read, but end of buffer not reached 0"after the third neighbor access, assertingdoesNotThrowon subsequent inserts.Purpose
A single corrupt HNSW node in a customer's index (User app, confirmed raw bytes
01010101449100000e0000...) was causingindexStore.getSyncto throw an unhandled exception in graph traversal. That uncaught exception triggered a workeruncaughtException→ RocksDBnapi_call_threadsafe_functionon dead thread →SIGSEGV→ full process crash. Every subsequent insert to an HNSW-indexed table re-hit the same corrupt node and crashed again, creating a crash loop.The fix makes corrupt nodes non-fatal: they are skipped during traversal with a warning log. The graph remains operational (potentially with degraded routing near the corrupt node) until the index is rebuilt.
Areas requiring attention
getSyncwith a numeric key. I believe all are now covered bysafeGetSync, but please verify against the fullindex()method, especially the level-iteration loop around line 150–180, which was the hardest to reason about.safeGetSyncsilently continues when a node is unreadable. This means a corrupt node in the path from the entry point could cause the search to explore a smaller portion of the graph. The logged warning atwarnlevel should surface this, but consider whethererrorlevel is more appropriate.indexStorekey dispatch (Symbol → entryPoint, number → node, JSON-stringified → level connection lists) mirrors the real store's key pattern. If that pattern changes, the test could silently stop exercising the corrupt path.047744bf8) has verbose per-site try-catch blocks; the second (4b9e22bab) refactors to thesafeGetSynchelper. Squash on merge is fine.Test plan
npm test -- --grep 'does not crash when an index node decodes as corrupt'passesvectorIndex.test.js) passes (3 integration tests + new unit test)DocumentChunk,CompanyProfile,ScoreEvidenceto evict the corrupt node🤖 Generated with Claude Code