Skip to content

fix(blob): retry bytes() up to 3x when lock-free but file is incomplete#477

Draft
kriszyp wants to merge 2 commits intomainfrom
fix/blob-incomplete-bytes-resilience
Draft

fix(blob): retry bytes() up to 3x when lock-free but file is incomplete#477
kriszyp wants to merge 2 commits intomainfrom
fix/blob-incomplete-bytes-resilience

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 6, 2026

Summary

Adds retry logic in bytes() when tryLock returns true but the blob file is still incomplete — a state that, given the locking invariant, can only occur after a process crash.

The locking invariant

In writeBlobWithStream, the lock is acquired before the file is created or any data written, and released only after both content and the final size header are fully written to disk. Therefore in normal operation, a reader always sees one of two consistent states:

  • Lock held → write in progress (reader waits via callback)
  • Lock free → file is complete (or doesn't exist)

Lock-free + incomplete file can only happen after a process crash that cleared in-memory lock state while leaving a partial file on disk.

What the retry does and doesn't do

What it does: After detecting a crash-incomplete file, retries up to 3× at 100ms intervals (300ms total) before throwing, instead of throwing immediately.

What it does NOT do: Recover from the crash. After a crash, replication re-sends the blob data to a new local fileId — it never re-writes the old partial fileId. So all 3 retries read the same incomplete file and eventually throw "Incomplete blob" regardless.

The practical effect is a 300ms delay before the error surfaces instead of immediate failure. Serent-Canopy's 200ms application-level retry (which re-reads the LMDB record and gets the new fileId) is still required for actual recovery.

Is this worth keeping?

Arguments for: avoids immediate throw in edge cases where there might be a race not yet fully understood; gives a brief window if any future scenario warrants it.

Arguments against: 300ms latency on an error path that will always fail in the known crash scenario; the retry is misleading about recovery capability.

Leaving this open for discussion — I can drop the retry entirely and just leave the error throw unchanged if that's preferred.

Test

The test correctly simulates the crash state (file written then corrupted with DEFAULT_HEADER + partial content, isSaving awaited to ensure the lock is free), and verifies both the rejection and the ≥290ms timing of 3 retries.

🤖 Generated with Claude Code

kriszyp and others added 2 commits May 5, 2026 17:59
When a node crashes mid-blob-write, the in-memory LMDB lock state is cleared
on restart. After recovery, tryLock returns true (no active writer) even though
the blob file is only partially written. Previously, bytes() would immediately
throw "Incomplete blob" in this state.

This fix adds up to 3 retries with 100ms delay before giving up, allowing time
for replication to re-send the blob and acquire the write lock. Covers the
Serent-Canopy scenario where a crashed HNSW node left partial blob files that
caused application-level errors until a manual 200ms retry was added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp requested review from a team as code owners May 6, 2026 00:00
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 6, 2026

Reviewed; no blockers found.

@kriszyp
Copy link
Copy Markdown
Member Author

kriszyp commented May 6, 2026

I challenged Claude on this one, and after further research admitted it is not sure if this will really help. Probably will close.

@kriszyp kriszyp marked this pull request as draft May 6, 2026 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant