fix(blob): retry bytes() up to 3x when lock-free but file is incomplete by kriszyp · Pull Request #477 · HarperFast/harper

kriszyp · 2026-05-06T00:00:18Z

Summary

Adds retry logic in bytes() when tryLock returns true but the blob file is still incomplete — a state that, given the locking invariant, can only occur after a process crash.

The locking invariant

In writeBlobWithStream, the lock is acquired before the file is created or any data written, and released only after both content and the final size header are fully written to disk. Therefore in normal operation, a reader always sees one of two consistent states:

Lock held → write in progress (reader waits via callback)
Lock free → file is complete (or doesn't exist)

Lock-free + incomplete file can only happen after a process crash that cleared in-memory lock state while leaving a partial file on disk.

What the retry does and doesn't do

What it does: After detecting a crash-incomplete file, retries up to 3× at 100ms intervals (300ms total) before throwing, instead of throwing immediately.

What it does NOT do: Recover from the crash. After a crash, replication re-sends the blob data to a new local fileId — it never re-writes the old partial fileId. So all 3 retries read the same incomplete file and eventually throw "Incomplete blob" regardless.

The practical effect is a 300ms delay before the error surfaces instead of immediate failure. Serent-Canopy's 200ms application-level retry (which re-reads the LMDB record and gets the new fileId) is still required for actual recovery.

Is this worth keeping?

Arguments for: avoids immediate throw in edge cases where there might be a race not yet fully understood; gives a brief window if any future scenario warrants it.

Arguments against: 300ms latency on an error path that will always fail in the known crash scenario; the retry is misleading about recovery capability.

Leaving this open for discussion — I can drop the retry entirely and just leave the error throw unchanged if that's preferred.

Test

The test correctly simulates the crash state (file written then corrupted with DEFAULT_HEADER + partial content, isSaving awaited to ensure the lock is free), and verifies both the rejection and the ≥290ms timing of 3 retries.

🤖 Generated with Claude Code

…ranch

When a node crashes mid-blob-write, the in-memory LMDB lock state is cleared on restart. After recovery, tryLock returns true (no active writer) even though the blob file is only partially written. Previously, bytes() would immediately throw "Incomplete blob" in this state. This fix adds up to 3 retries with 100ms delay before giving up, allowing time for replication to re-send the blob and acquire the write lock. Covers the Serent-Canopy scenario where a crashed HNSW node left partial blob files that caused application-level errors until a manual 200ms retry was added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude · 2026-05-06T00:03:25Z

Reviewed; no blockers found.

kriszyp · 2026-05-06T00:30:53Z

I challenged Claude on this one, and after further research admitted it is not sure if this will really help. Probably will close.

kriszyp and others added 2 commits May 5, 2026 17:59

Trigger integration tests after successful cherry-pick onto release b…

cf2f4f5

…ranch

kriszyp requested review from a team as code owners May 6, 2026 00:00

kriszyp marked this pull request as draft May 6, 2026 00:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(blob): retry bytes() up to 3x when lock-free but file is incomplete#477

fix(blob): retry bytes() up to 3x when lock-free but file is incomplete#477
kriszyp wants to merge 2 commits intomainfrom
fix/blob-incomplete-bytes-resilience

kriszyp commented May 6, 2026 •

edited

Loading

Uh oh!

claude Bot commented May 6, 2026

Uh oh!

kriszyp commented May 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kriszyp commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The locking invariant

What the retry does and doesn't do

Is this worth keeping?

Test

Uh oh!

claude Bot commented May 6, 2026

Uh oh!

kriszyp commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kriszyp commented May 6, 2026 •

edited

Loading

kriszyp commented May 6, 2026 •

edited

Loading