fix(blob): retry bytes() up to 3x when lock-free but file is incomplete#477
Draft
fix(blob): retry bytes() up to 3x when lock-free but file is incomplete#477
Conversation
When a node crashes mid-blob-write, the in-memory LMDB lock state is cleared on restart. After recovery, tryLock returns true (no active writer) even though the blob file is only partially written. Previously, bytes() would immediately throw "Incomplete blob" in this state. This fix adds up to 3 retries with 100ms delay before giving up, allowing time for replication to re-send the blob and acquire the write lock. Covers the Serent-Canopy scenario where a crashed HNSW node left partial blob files that caused application-level errors until a manual 200ms retry was added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
|
Reviewed; no blockers found. |
Member
Author
|
I challenged Claude on this one, and after further research admitted it is not sure if this will really help. Probably will close. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds retry logic in
bytes()whentryLockreturnstruebut the blob file is still incomplete — a state that, given the locking invariant, can only occur after a process crash.The locking invariant
In
writeBlobWithStream, the lock is acquired before the file is created or any data written, and released only after both content and the final size header are fully written to disk. Therefore in normal operation, a reader always sees one of two consistent states:Lock-free + incomplete file can only happen after a process crash that cleared in-memory lock state while leaving a partial file on disk.
What the retry does and doesn't do
What it does: After detecting a crash-incomplete file, retries up to 3× at 100ms intervals (300ms total) before throwing, instead of throwing immediately.
What it does NOT do: Recover from the crash. After a crash, replication re-sends the blob data to a new local fileId — it never re-writes the old partial fileId. So all 3 retries read the same incomplete file and eventually throw
"Incomplete blob"regardless.The practical effect is a 300ms delay before the error surfaces instead of immediate failure. Serent-Canopy's 200ms application-level retry (which re-reads the LMDB record and gets the new fileId) is still required for actual recovery.
Is this worth keeping?
Arguments for: avoids immediate throw in edge cases where there might be a race not yet fully understood; gives a brief window if any future scenario warrants it.
Arguments against: 300ms latency on an error path that will always fail in the known crash scenario; the retry is misleading about recovery capability.
Leaving this open for discussion — I can drop the retry entirely and just leave the error throw unchanged if that's preferred.
Test
The test correctly simulates the crash state (file written then corrupted with DEFAULT_HEADER + partial content,
isSavingawaited to ensure the lock is free), and verifies both the rejection and the ≥290ms timing of 3 retries.🤖 Generated with Claude Code