fix(restore): tighten follow crash-recovery txid validation (#1167)
dangra wants to merge 5 commits into benbjohnson:main
Conversation
corylanou
left a comment
Code Review (3-pass, 6 unique findings)
Reviewed the -exec flag implementation against the PR description and existing replicate -exec behavior. Found several lifecycle/concurrency issues that should be addressed before merge.
P1 = functional regression or data loss risk | P2 = edge case or robustness issue
Note: Finding outside this PR's diff
[P1] Crash-recovery TXID validated against wrong snapshot (replica.go:555)
The crash-recovery guard (from a prior commit on this branch) iterates all snapshot files but only keeps the last one (latestSnapshot). Since the iterator is sorted by MinTXID ascending (see NewFileInfoSliceIterator in the ltx package), latestSnapshot ends up being the newest snapshot, not the earliest. The check latestSnapshot.MinTXID > txid is meant to detect if history was pruned past the saved TXID, but comparing against the newest snapshot means a valid TXID within an older retained snapshot will be incorrectly rejected. Consider tracking both earliestSnapshot and latestSnapshot.
All findings addressed in 7f339c1. Summary:
Full test suite passes.
Force-pushed from 61c55a2 to 27e3748
corylanou
left a comment
Follow-up Review (3-pass, post-fix)
All 6 prior findings are confirmed fixed. Tests pass with -race. Nice work on the atomic.Pointer approach and the regression test for the snapshot bug.
Found 2 new issues below.
Force-pushed from 3d45757 to 6ef4ffc
Following up on the no-snapshot crash-recovery discussion: the current behavior is intentionally strict. We require the saved TXID itself to remain covered by retained history, not only txid+1. This is deliberate for crash-recovery safety and is now covered by explicit tests in this PR.
corylanou
left a comment
Looks good. The two issues from the prior review are both fixed correctly:
- Crash-recovery validation now compares against the earliest snapshot for pruning detection, not the newest
- When no snapshots exist, falls through to non-snapshot LTX range validation instead of silently accepting a stale TXID
The gap detection logic (txid+1 reachability) is a nice addition — prevents resuming from an invalid position where incremental files have been pruned.
Helper method extraction is clean and doesn't change behavior. Test coverage looks solid with regression tests for the key edge cases (ahead snapshot, ahead non-snapshot, uncovered saved TXID, no files available).
All tests pass with -race.
```diff
-	if latestSnapshot.MinTXID > txid {
-		return fmt.Errorf("cannot resume follow mode: saved TXID %s is behind the earliest snapshot (min TXID %s); replica history has been pruned -- delete %s and %s-txid to re-restore", txid, latestSnapshot.MinTXID, opt.OutputPath, opt.OutputPath)
+	if earliestSnapshot != nil {
+		if earliestSnapshot.MinTXID > txid {
```
I'm not sure this makes sense. MinTXID should always be 1 for snapshots.
```go
	return infos[len(infos)-1].MaxTXID, nil
}

func (r *Replica) validateCrashRecoveryTXID(ctx context.Context, txid ltx.TXID, outputPath string) error {
```
This can all be simplified a lot IMO:

- Incrementally restore from L1 starting from the first LTX file that includes `txid+1`
- Incrementally restore from L0 where L1 left off
- If `txid+1` doesn't exist in L1, perform a full restore

We shouldn't get into an error state just because `txid+1` has been dropped due to snapshot retention enforcement.
Ok, a few questions here:

- Should it check for gaps in L1 or L0 files, or is that not expected at all? ("Gaps" in the sense of LTX files not assembling into a contiguous range from `txid+1` to max TXID.)
- Are you asking to ignore levels past L1? If it can't restore from L1 or L0, do a full snapshot restore (L9), right?
- Can it assume the data is valid, and that if there is an LTX file that includes `txid+1`, then it will be resumable from that point? I.e., no need to check or validate further whether it reaches max TXID.
> Should it check for gaps in L1 or L0 files or that is not expected at all? "gaps" in the sense of ltx files not assembling into a contiguous range from txid+1 to max txid.

Gaps shouldn't occur at all. If there are gaps then there's a compaction issue. When compaction occurs, it always looks at the max of the lower level and then compacts between the MaxTXID of the lower level and the MaxTXID of the current level.

> are you asking to ignore levels past L1? if it can't restore from L1 or L0, do full snapshot restore (L9). right?

Basically yes. L0 files are short-lived and usually get deleted after 5 minutes or so. L1-L9 files all have the same lifetime, so if a TXID doesn't exist in L1 then it won't exist in anything higher.

The only reason the TXID wouldn't exist in L1+ is if the retention period has passed, so that's when you'd fall back on a full restore. A full restore builds a full restore plan across all levels by using the CalcRestorePlan() function.

> Can it assume the data is valid and if there is a ltx file that includes txid+1, then it will be resumable from that point? no need to check nor validate further whether it reaches max txid.

Yes, as long as txid+1 exists in L1 then you should be able to apply all the files from that point to the last file in L1. You can then apply all the files in L0 that are after L1's MaxTXID so you can pick up any transactions that were created since the last L1 compaction.
Summary
- Clean up `*.tmp` output files on failure and add coverage for output-exists and canceled-context paths
- Refactor `Restore()` control flow into explicit full/resume/v3 mode routing with a single `r.follow()` call site

Testing
- `go test ./...`