fix(restore): tighten follow crash-recovery txid validation (#1167)
dangra wants to merge 5 commits into benbjohnson:main
Conversation
corylanou
left a comment
Code Review (3-pass, 6 unique findings)
Reviewed the -exec flag implementation against the PR description and existing replicate -exec behavior. Found several lifecycle/concurrency issues that should be addressed before merge.
P1 = functional regression or data loss risk | P2 = edge case or robustness issue
Note: Finding outside this PR's diff
[P1] Crash-recovery TXID validated against wrong snapshot (replica.go:555)
The crash-recovery guard (from a prior commit on this branch) iterates all snapshot files but only keeps the last one (latestSnapshot). Since the iterator is sorted by MinTXID ascending (see NewFileInfoSliceIterator in the ltx package), latestSnapshot ends up being the newest snapshot, not the earliest. The check latestSnapshot.MinTXID > txid is meant to detect if history was pruned past the saved TXID, but comparing against the newest snapshot means a valid TXID within an older retained snapshot will be incorrectly rejected. Consider tracking both earliestSnapshot and latestSnapshot.
All findings addressed in 7f339c1. Summary:
Full test suite passes.
Force-pushed from 61c55a2 to 27e3748
corylanou
left a comment
Follow-up Review (3-pass, post-fix)
All 6 prior findings are confirmed fixed. Tests pass with -race. Nice work on the atomic.Pointer approach and the regression test for the snapshot bug.
Found 2 new issues below.
Force-pushed from 3d45757 to 6ef4ffc
Following up on the no-snapshot crash-recovery discussion: the current behavior is intentionally strict. We require the saved TXID itself to remain covered by retained history, not only txid+1. This is deliberate for crash-recovery safety and is now covered by explicit tests in this PR.
corylanou
left a comment
Looks good. The two issues from the prior review are both fixed correctly:
- Crash-recovery validation now compares against the earliest snapshot for pruning detection, not the newest
- When no snapshots exist, falls through to non-snapshot LTX range validation instead of silently accepting a stale TXID
The gap detection logic (txid+1 reachability) is a nice addition — prevents resuming from an invalid position where incremental files have been pruned.
Helper method extraction is clean and doesn't change behavior. Test coverage looks solid with regression tests for the key edge cases (ahead snapshot, ahead non-snapshot, uncovered saved TXID, no files available).
All tests pass with -race.
```diff
-	if latestSnapshot.MinTXID > txid {
-		return fmt.Errorf("cannot resume follow mode: saved TXID %s is behind the earliest snapshot (min TXID %s); replica history has been pruned -- delete %s and %s-txid to re-restore", txid, latestSnapshot.MinTXID, opt.OutputPath, opt.OutputPath)
+	if earliestSnapshot != nil {
+		if earliestSnapshot.MinTXID > txid {
```
I'm not sure this makes sense. MinTXID should always be 1 for snapshots.
```go
	return infos[len(infos)-1].MaxTXID, nil
}

func (r *Replica) validateCrashRecoveryTXID(ctx context.Context, txid ltx.TXID, outputPath string) error {
```
This can all be simplified a lot IMO:

- Incrementally restore from L1 starting from the first LTX file that includes `txid+1`
- Incrementally restore from L0 where L1 left off
- If `txid+1` doesn't exist in L1, perform a full restore

We shouldn't get into an error state just because `txid+1` has been dropped due to snapshot retention enforcement.
Ok, a few questions here:

- Should it check for gaps in L1 or L0 files, or is that not expected at all? ("Gaps" in the sense of LTX files not assembling into a contiguous range from `txid+1` to max TXID.)
- Are you asking to ignore levels past L1? If it can't restore from L1 or L0, do a full snapshot restore (L9), right?
- Can it assume the data is valid, and that if there is an LTX file that includes `txid+1`, then it will be resumable from that point? I.e., no need to check or validate further whether it reaches max TXID.
> Should it check for gaps in L1 or L0 files or that is not expected at all? "gaps" in the sense of ltx files not assembling into a contiguous range from txid+1 to max txid.

Gaps shouldn't occur at all. If there are gaps then there's a compaction issue. When compaction occurs, it always looks at the max of the lower level and then compacts between the MaxTXID of the lower level and the MaxTXID of the current level.

> are you asking to ignore levels past L1? if it can't restore from L1 or L0, do full snapshot restore (L9). right?

Basically yes. L0 files are short-lived and usually get deleted after 5 minutes or so. L1-L9 files all have the same lifetime, so if a TXID doesn't exist in L1 then it won't exist in anything higher.

The only reason the TXID wouldn't exist in L1+ is if the retention period has passed, so that's when you'd fall back on a full restore. A full restore builds a full restore plan across all levels by using the CalcRestorePlan() function.

> Can it assume the data is valid and if there is a ltx file that includes txid+1, then it will be resumable from that point? no need to check nor validate further whether it reaches max txid.

Yes, as long as txid+1 exists in L1 then you should be able to apply all the files from that point to the last file in L1. You can then apply all the files in L0 that are after L1's MaxTXID so you can pick up any transactions that were created since the last L1 compaction.
Summary
- Clean up `*.tmp` output files on failure and add coverage for output-exists and canceled-context paths
- Refactor `Restore()` control flow into explicit full/resume/v3 mode routing with a single `r.follow()` call site

Testing
- `go test ./...`