
fix(restore): tighten follow crash-recovery txid validation#1167

Open
dangra wants to merge 5 commits into benbjohnson:main from dangra:feat/restore-exec-flag

Conversation

@dangra

@dangra dangra commented Feb 25, 2026

Summary

  • fix follow crash-recovery validation to compare against the earliest available snapshot (while still checking ahead-of-latest separately)
  • keep strict validation when no snapshots are retained by scanning non-snapshot LTX ranges and rejecting stale/ahead/gap states
  • add regression coverage for crash-recovery edge branches, including ahead snapshot, ahead non-snapshot, uncovered saved TXID, and no-files-available cases
  • harden non-crash restore behavior by cleaning up *.tmp output files on failure and adding coverage for output-exists and canceled-context paths
  • refactor Restore() control flow into explicit full / resume / v3 mode routing with a single r.follow() call site

Testing

  • go test ./...


Collaborator

@corylanou corylanou left a comment

Code Review (3-pass, 6 unique findings)

Reviewed the -exec flag implementation against the PR description and existing replicate -exec behavior. Found several lifecycle/concurrency issues that should be addressed before merge.

P1 = functional regression or data loss risk | P2 = edge case or robustness issue


Note: Finding outside this PR's diff

[P1] Crash-recovery TXID validated against wrong snapshot (replica.go:555)

The crash-recovery guard (from a prior commit on this branch) iterates all snapshot files but only keeps the last one (latestSnapshot). Since the iterator is sorted by MinTXID ascending (see NewFileInfoSliceIterator in the ltx package), latestSnapshot ends up being the newest snapshot, not the earliest. The check latestSnapshot.MinTXID > txid is meant to detect if history was pruned past the saved TXID, but comparing against the newest snapshot means a valid TXID within an older retained snapshot will be incorrectly rejected. Consider tracking both earliestSnapshot and latestSnapshot.

Four comment threads on cmd/litestream/restore.go (outdated) and one on replica.go (outdated).
@dangra
Author

dangra commented Feb 25, 2026

All findings addressed in 7f339c1. Summary:

Finding → Fix:

  • P1: child orphaned on follow error → exec.CommandContext(cancelCtx, ...) with defer cancel() kills the child on any return path
  • P1: execErr never checked → follow() returns nil on ctx.Done, not context.Canceled; now return execErr whenever Restore returns nil or context.Canceled
  • P2: data race on cmd → atomic.Pointer[exec.Cmd], stored after Start()
  • P2: single signal forwarded → signal goroutine loops to forward multiple signals
  • P2: whitespace -exec panics → len(execArgs) == 0 guard before indexing
  • P1 (out-of-diff): prune check uses wrong snapshot → track earliestSnapshot and latestSnapshot separately; use earliestSnapshot.MinTXID for the pruning check, latestSnapshot.MaxTXID for the ahead check

Full test suite (go test -race ./...) passes after all fixes.

@dangra dangra force-pushed the feat/restore-exec-flag branch from 61c55a2 to 27e3748 on February 25, 2026 at 18:36
Collaborator

@corylanou corylanou left a comment

Follow-up Review (3-pass, post-fix)

All 6 prior findings are confirmed fixed. Tests pass with -race. Nice work on the atomic.Pointer approach and the regression test for the snapshot bug.

Found 2 new issues below.

Comment thread cmd/litestream/restore.go
Comment thread replica.go
@dangra dangra force-pushed the feat/restore-exec-flag branch from 3d45757 to 6ef4ffc on February 25, 2026 at 20:33
@dangra dangra changed the title from "feat(restore): add -exec flag to litestream restore -f" to "fix(restore): tighten follow crash-recovery txid validation" on Feb 25, 2026
@dangra
Author

dangra commented Feb 25, 2026

Following up on the no-snapshot crash-recovery discussion: the current behavior is intentionally strict.

We require the saved TXID itself to remain covered by retained history, not only txid+1. If saved TXID coverage is lost, we fail fast and require re-restore instead of attempting to continue from an inferred position.

This is deliberate for crash-recovery safety and is now covered by explicit tests in this PR:

  • TestReplica_Restore_Follow_UncoveredSavedTXID_NoSnapshots
  • TestReplica_Restore_Follow_AheadTXID_NoSnapshots
  • TestReplica_Restore_Follow_NoSnapshotsAndNoLTXFiles
  • TestReplica_Restore_Follow_StaleTXID_NoSnapshotsWithGap

Collaborator

@corylanou corylanou left a comment

Looks good. The two issues from the prior review are both fixed correctly:

  1. Crash-recovery validation now compares against the earliest snapshot for pruning detection, not the newest
  2. When no snapshots exist, falls through to non-snapshot LTX range validation instead of silently accepting a stale TXID

The gap detection logic (txid+1 reachability) is a nice addition — prevents resuming from an invalid position where incremental files have been pruned.

Helper method extraction is clean and doesn't change behavior. Test coverage looks solid with regression tests for the key edge cases (ahead snapshot, ahead non-snapshot, uncovered saved TXID, no files available).

All tests pass with -race.

Comment thread replica.go (Outdated)

```diff
-if latestSnapshot.MinTXID > txid {
-	return fmt.Errorf("cannot resume follow mode: saved TXID %s is behind the earliest snapshot (min TXID %s); replica history has been pruned -- delete %s and %s-txid to re-restore", txid, latestSnapshot.MinTXID, opt.OutputPath, opt.OutputPath)
+if earliestSnapshot != nil {
+	if earliestSnapshot.MinTXID > txid {
```
Owner

I'm not sure this makes sense. MinTXID should always be 1 for snapshots.

Author

Makes sense 🤦🏼‍♂️

Comment thread replica.go

```go
	return infos[len(infos)-1].MaxTXID, nil
}

func (r *Replica) validateCrashRecoveryTXID(ctx context.Context, txid ltx.TXID, outputPath string) error {
```
Owner

This can all be simplified a lot IMO:

  1. Incrementally restore from L1 starting from the first LTX file that includes txid+1
  2. Incrementally restore from L0 where L1 left off
  3. If txid+1 doesn't exist in L1, perform a full restore

We shouldn't get into an error state just because txid+1 has been dropped due to snapshot retention enforcement.

Author

Ok, a few questions here:

  1. Should it check for gaps in L1 or L0 files or that is not expected at all?
    "gaps" in the sense of ltx files not assembling into a contiguous range from txid+1 to max txid.

  2. are you asking to ignore levels past L1? if it can't restore from L1 or L0, do full snapshot restore (L9). right?

  3. Can it assume the data is valid and if there is a ltx file that includes txid+1, then it will be resumable from that point? no need to check nor validate further whether it reaches max txid.

Owner

> Should it check for gaps in L1 or L0 files or that is not expected at all? "gaps" in the sense of ltx files not assembling into a contiguous range from txid+1 to max txid.

Gaps shouldn't occur at all. If there are gaps then there's a compaction issue. When compaction occurs, it always looks at the max of the lower level and then compacts between the MaxTXID of the lower level and the MaxTXID of the current level.

> are you asking to ignore levels past L1? if it can't restore from L1 or L0, do full snapshot restore (L9). right?

Basically yes. L0 files are short lived and usually get deleted after 5 minutes or so. L1-L9 files all have the same lifetime so if a TXID doesn't exist in L1 then it won't exist in anything higher.

The only reason the TXID wouldn't exist in L1+ is if the retention period has passed so that's when you'd fall back on a full restore. A full restore builds a full restore plan across all levels by using the CalcRestorePlan() function.

> Can it assume the data is valid and if there is a ltx file that includes txid+1, then it will be resumable from that point? no need to check nor validate further whether it reaches max txid.

Yes, as long as txid+1 exists in L1 then you should be able to apply all the files from that point to the last file in L1. You can then apply all the files in L0 that are after L1's MaxTXID so you can pick up any transactions that were created since the last L1 compaction.
