fix(hnsw): skip corrupt nodes instead of crashing on decode errors by kriszyp · Pull Request #459 · HarperFast/harper

kriszyp · 2026-05-02T22:41:40Z

Summary

Adds a private safeGetSync helper on HierarchicalNavigableSmallWorld that wraps indexStore.getSync in a try-catch, returning undefined (and logging a warning) instead of propagating the error when a node's stored bytes are corrupt.
Applies it to all 7 call sites that read node data by numeric key: getEntryPoint, the entry-point read in index(), the old-node neighbor removal loop, the connectionsToBeReplaced loop, addConnection, searchLayer, and checkSymmetry.
Fixes a pre-existing bug in checkSymmetry that logged the undefined result instead of the key.
Adds a unit test using a minimal mock store that throws "Data read, but end of buffer not reached 0" after the third neighbor access, asserting doesNotThrow on subsequent inserts.

Purpose

A single corrupt HNSW node in a customer's index (User app, confirmed raw bytes 01010101449100000e0000...) was causing indexStore.getSync to throw an unhandled exception in graph traversal. That uncaught exception triggered a worker uncaughtException → RocksDB napi_call_threadsafe_function on dead thread → SIGSEGV → full process crash. Every subsequent insert to an HNSW-indexed table re-hit the same corrupt node and crashed again, creating a crash loop.

The fix makes corrupt nodes non-fatal: they are skipped during traversal with a warning log. The graph remains operational (potentially with degraded routing near the corrupt node) until the index is rebuilt.

Areas requiring attention

Coverage completeness: there are exactly 7 places that call getSync with a numeric key. I believe all are now covered by safeGetSync, but please verify against the full index() method, especially the level-iteration loop around line 150–180, which was the hardest to reason about.
Swallowed corruption: safeGetSync silently continues when a node is unreadable. This means a corrupt node in the path from the entry point could cause the search to explore a smaller portion of the graph. The logged warning at warn level should surface this, but consider whether error level is more appropriate.
Unit test mock fidelity: the mock indexStore key dispatch (Symbol → entryPoint, number → node, JSON-stringified → level connection lists) mirrors the real store's key pattern. If that pattern changes, the test could silently stop exercising the corrupt path.
Two commits on branch: the first commit (047744bf8) has verbose per-site try-catch blocks; the second (4b9e22bab) refactors to the safeGetSync helper. Squash on merge is fine.

Test plan

npm test -- --grep 'does not crash when an index node decodes as corrupt' passes
Full HNSW test suite (vectorIndex.test.js) passes (3 integration tests + new unit test)
Deploy 5.0.10 patch to Serent nodes and confirm crash loop stops
After stabilisation, rebuild HNSW indexes for DocumentChunk, CompanyProfile, ScoreEvidence to evict the corrupt node

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When an HNSW index node's data is corrupt (e.g., from an interrupted write or format mismatch after a reclone), getSync throws a decode error that previously propagated uncaught through searchLayer and index(), crashing the worker via uncaughtException, which then triggered a RocksDB SIGSEGV via napi_call_threadsafe_function on the dead thread. Wrap all getSync calls that traverse the HNSW graph (entry point, neighbor nodes in searchLayer, fromId/toId nodes during connection updates, and neighbor nodes during old-node removal) in try-catch so corrupt nodes are skipped rather than crashing the process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ad paths Adds a private `safeGetSync` helper that wraps `indexStore.getSync` in a try-catch, returning `undefined` instead of throwing when a node's stored data is corrupt. Applies it to all 7 call sites that read node data by numeric key (including `getEntryPoint`, `searchLayer`, `checkSymmetry`, and `addConnection`), so a single corrupt node causes a logged warning rather than an uncaught exception. Also fixes a pre-existing bug in `checkSymmetry` where the undefined value (not the key) was being logged. Adds a unit test using a minimal mock store that simulates corrupt reads after the third neighbor access, verifying `doesNotThrow` on subsequent inserts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude · 2026-05-02T22:44:41Z

@@ -112,7 +112,7 @@ export class HierarchicalNavigableSmallWorld {
 			oldNode = { ...this.indexStore.getSync(nodeId, options) };


Missed safeGetSync call site — update path still crashes on corrupt nodes

nodeId is always numeric here (either from Atomics.add or from the safeKey lookup), so this is one of the numeric-key node reads the PR set out to protect. If the stored bytes for this node are corrupt, getSync throws before the spread runs, and the exception propagates through index() exactly as before the fix — defeating the crash-loop protection on the update (existingVector) path.

Suggested change

oldNode = { ...this.indexStore.getSync(nodeId, options) };

oldNode = { ...this.safeGetSync(nodeId, options) };

The PR description counts "exactly 7 places that call getSync with a numeric key"; this is the one that was missed.

claude · 2026-05-02T22:44:47Z

One blocker: resources/indexes/HierarchicalNavigableSmallWorld.ts line 112 has a raw this.indexStore.getSync(nodeId, options) in the existingVector (update) branch of index() — the one numeric-key node read not converted to safeGetSync. A corrupt node on the update path will still throw and re-enter the crash loop. See inline comment.

kriszyp and others added 3 commits May 2, 2026 15:42

add .gitattributes for npm-merge-driver

67d50fa

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kriszyp requested a review from a team as a code owner May 2, 2026 22:41

claude Bot reviewed May 2, 2026

View reviewed changes

kriszyp mentioned this pull request May 6, 2026

fix(blob): retry bytes() up to 3x when lock-free but file is incomplete #477

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hnsw): skip corrupt nodes instead of crashing on decode errors#459

fix(hnsw): skip corrupt nodes instead of crashing on decode errors#459
kriszyp wants to merge 3 commits intomainfrom
fix/hnsw-corrupt-node-resilience

kriszyp commented May 2, 2026 •

edited

Loading

Uh oh!

claude Bot May 2, 2026

Uh oh!

claude Bot commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -112,7 +112,7 @@ export class HierarchicalNavigableSmallWorld {
		oldNode = { ...this.indexStore.getSync(nodeId, options) };

	oldNode = { ...this.indexStore.getSync(nodeId, options) };
	oldNode = { ...this.safeGetSync(nodeId, options) };

Conversation

kriszyp commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Purpose

Areas requiring attention

Test plan

Uh oh!

claude Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kriszyp commented May 2, 2026 •

edited

Loading