refactor(db,replica): extract sync/checkpoint/monitor subsystem#1222
Implement Phases 1, 3, 4, and 6 of the incremental extraction proposed in issue #1221 to reduce structural coupling and prevent cascading fixes.

- Phase 1 - SyncResult: `verifyAndSync()` and `sync()` return a `syncResult` struct instead of mutating `db` fields as side effects. State is applied explicitly via `applySyncResult()`.
- Phase 3 - Checkpoint phases: Extract `checkpoint()` into `checkpointPreSync`, `checkpointExec`, and `checkpointPostSync`. Always snapshot in the post-checkpoint sync to capture pages moved by the application's auto-checkpoint during the `db.rtx` release window.
- Phase 4 - ReplicaScheduler: Extract `syncScheduler` from `monitor()`.
- Phase 6 - Restore: Move ~1200 lines to restore.go.
- Fix: Downgrade TRUNCATE to PASSIVE when pre-sync can't reach the WAL end.
- Fix: Force a snapshot on `PrevFrameMismatch` (WAL changed between verify and sync).
- Fix: Add snapshot page count validation in `writeLTXFromDB()`.
- Fix: Clear `syncedToWALEnd` before `execCheckpoint` releases `db.rtx`.
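The Phase 1 pattern above can be sketched as follows. This is a hypothetical, minimal model of the return-value-plus-explicit-apply shape described in the PR, not litestream's actual `db.go` code; the field names are borrowed from the description, but the struct layouts are illustrative only.

```go
package main

import "fmt"

// syncResult carries the state a sync pass computed, instead of that
// state being written into DB fields as a side effect.
type syncResult struct {
	lastSyncedWALOffset int64
	syncedToWALEnd      bool
}

// DB is a stand-in for the real database type.
type DB struct {
	lastSyncedWALOffset int64
	syncedToWALEnd      bool
}

// sync computes new state without touching the receiver.
func (db *DB) sync(walEnd int64) syncResult {
	return syncResult{lastSyncedWALOffset: walEnd, syncedToWALEnd: true}
}

// applySyncResult is the single visible call site where state changes.
func (db *DB) applySyncResult(r syncResult) {
	db.lastSyncedWALOffset = r.lastSyncedWALOffset
	db.syncedToWALEnd = r.syncedToWALEnd
}

func main() {
	db := &DB{}
	r := db.sync(4096)
	fmt.Println(db.lastSyncedWALOffset) // still 0: sync has no side effects
	db.applySyncResult(r)
	fmt.Println(db.lastSyncedWALOffset) // 4096 only after explicit application
}
```

The benefit is that a reader auditing state transitions only has to find calls to `applySyncResult()`, rather than every mutation buried inside the sync path.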
…gating

Fix L0 retention to check actual L1 TXID ranges instead of a single `maxL1TXID` threshold; L1 compaction can have gaps when an L0 wasn't uploaded before compaction listed files. Fix retention enforcement to run unless compaction actually failed: `ErrNoCompaction` and `ErrCompactionTooEarly` are not failures, and retention should still prune expired snapshots on quiet databases.
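The range-based retention check can be illustrated with a small sketch. This is not litestream's implementation; the types and function names are hypothetical, but it shows why comparing each L0 TXID against actual L1 ranges keeps files that a single max-TXID threshold would wrongly delete.

```go
package main

import "fmt"

// txidRange is one L1 file's covered transaction range (illustrative).
type txidRange struct{ min, max uint64 }

// coveredByL1 reports whether txid falls inside any L1 range.
func coveredByL1(txid uint64, l1 []txidRange) bool {
	for _, r := range l1 {
		if txid >= r.min && txid <= r.max {
			return true
		}
	}
	return false
}

// retainL0 keeps every L0 TXID not covered by an L1 range. A naive
// "delete everything <= maxL1TXID" rule would also delete L0s that fall
// in gaps between L1 files, losing data needed for restore.
func retainL0(l0 []uint64, l1 []txidRange) []uint64 {
	var keep []uint64
	for _, txid := range l0 {
		if !coveredByL1(txid, l1) {
			keep = append(keep, txid)
		}
	}
	return keep
}

func main() {
	// L1 compaction has a gap: [1,10] and [21,30] exist, but TXIDs
	// 11-20 were never compacted (their L0 uploaded late).
	l1 := []txidRange{{1, 10}, {21, 30}}
	l0 := []uint64{5, 15, 25}
	fmt.Println(retainL0(l0, l1)) // [15]: only the gap TXID is retained
}
```

With a single `maxL1TXID` of 30, the L0 at TXID 15 would have been deleted even though no L1 file contains it.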
Add unit tests for syncResult, checkpoint flow, WAL truncation handling, snapshot page integrity, and the restore-after-compaction pipeline. Update soak assertions: add "post-checkpoint snapshot" as an expected recovery reason, and increase the L0 oversized threshold to 50%.
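The snapshot page-integrity check mentioned above can be sketched as a pure function. The real validation lives in `writeLTXFromDB()`; this standalone version, with hypothetical names, shows the idea of verifying that every page 1..commit is present before the file is finalized, so gaps are caught at creation time rather than surfacing later as "nonsequential page numbers" restore failures.

```go
package main

import "fmt"

// validatePages verifies that every page number 1..commit was written.
// `written` maps page numbers to whether they were emitted (illustrative).
func validatePages(written map[uint32]bool, commit uint32) error {
	for pgno := uint32(1); pgno <= commit; pgno++ {
		if !written[pgno] {
			return fmt.Errorf("snapshot: page %d missing (commit=%d)", pgno, commit)
		}
	}
	return nil
}

func main() {
	ok := map[uint32]bool{1: true, 2: true, 3: true}
	fmt.Println(validatePages(ok, 3)) // <nil>

	gap := map[uint32]bool{1: true, 3: true} // page 2 never written
	fmt.Println(validatePages(gap, 3))       // error naming page 2
}
```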
Description
Refactor the sync/checkpoint/monitoring subsystem per #1221, and fix a pre-existing page loss bug uncovered during soak testing.
Refactoring (Phases 1, 3, 4, 6 of #1221)
Phase 1 — SyncResult return type (
db.go):verifyAndSync()andsync()return asyncResultstruct instead of mutatingdb.lastSyncedWALOffset,db.syncedToWALEnd,db.pos, anddb.maxLTXFileInfosas side effects. NewapplySyncResult()centralizes all state application in one visible call site.Phase 3 — Checkpoint phase extraction (
db.go): Extractcheckpoint()into named phase methods (checkpointPreSync,checkpointExec,checkpointPostSync), making the checkpoint protocol explicit.Phase 4 — ReplicaScheduler (
replica.go): ExtractsyncSchedulertype fromReplica.monitor(), separating scheduling/backoff from sync correctness and error recovery. ExtracthandleSyncError()for LTX error detection, auto-recovery, and rate-limited logging.monitor()shrinks from 114 lines to a 30-line coordination loop.Phase 6 — Restore separation (
replica.go→restore.go): Move ~1200 lines of restore-related code to its own file.replica.goshrinks from ~1775 to ~590 lines.Bug Fix: Page loss under sustained writes
Soak testing at 500 writes/sec revealed a pre-existing bug where pages existed in the database but were never captured in any L0 file, causing restore failures with "nonsequential page numbers" errors. The root cause:
During `execCheckpoint()`, litestream releases its read transaction (`db.rtx`) to allow the SQLite checkpoint to run. In this window, the application's SQLite auto-checkpoint (`wal_autocheckpoint`, default 1000 frames) can independently move WAL frames to the DB and trigger a WAL restart. Pages moved during this uncontrolled window were never captured in any L0 file because the post-checkpoint sync used flag-based heuristics (`syncedToWALEnd`) that had become stale.

Fix: The post-checkpoint sync now always creates a snapshot, reading ALL pages from the DB under the write lock. This is correct by construction: no flags, no detection heuristics, no edge cases. The snapshot captures every page regardless of what happened during the checkpoint window.
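The "correct by construction" argument can be shown with a toy model. This is purely illustrative (the types and names are hypothetical, not litestream's): while the read transaction is released, an auto-checkpoint may move pages the replicator never observes, so any flag-based sync can be stale, but a snapshot that reads every page directly from the DB cannot miss them.

```go
package main

import "fmt"

// db models a database whose replicator tracks which pages it has seen.
type db struct {
	pages   map[uint32]string // actual page contents in the database file
	tracked map[uint32]bool   // pages a flag-based heuristic knows about
}

// snapshot reads every page directly, ignoring all tracking state.
// Whatever happened during the checkpoint window, the result is complete.
func (d *db) snapshot() map[uint32]string {
	out := make(map[uint32]string, len(d.pages))
	for pgno, data := range d.pages {
		out[pgno] = data
	}
	return out
}

func main() {
	d := &db{
		pages:   map[uint32]string{1: "a", 2: "b"},
		tracked: map[uint32]bool{1: true, 2: true},
	}

	// Checkpoint window: the application's auto-checkpoint moves page 3
	// into the DB file without the replicator observing it.
	d.pages[3] = "c"

	snap := d.snapshot()
	fmt.Println(len(snap), snap[3]) // 3 c — the untracked page is captured
}
```

A heuristic that only re-synced when `tracked` disagreed with a flag would have to enumerate every way the window can change state; the unconditional snapshot sidesteps that enumeration entirely, at the cost of reading the full database once per checkpoint.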
Supporting fixes:
- Force snapshot on `PrevFrameMismatch`: when the WAL changed between `verify()` and `sync()`, force a snapshot to capture pages from old WAL frames.
- Clear `syncedToWALEnd` before checkpoint: prevents the stale flag from suppressing necessary snapshots after the checkpoint window.
- L0 retention checks actual L1 TXID ranges instead of a single `maxL1TXID` threshold, preventing deletion of L0 files in gaps between L1 files.
- `EnforceSnapshotRetention` only skips on actual compaction failures, not on `ErrNoCompaction`/`ErrCompactionTooEarly`.
- `writeLTXFromDB()` verifies all pages 1..commit are written, catching page gaps at creation time.

Phases deferred to follow-up PRs
Motivation and Context
Since January 2026, recurring bugs cluster around the same code paths in `db.go` and `replica.go`. The #1210 idle optimization required 7 follow-up commits because the coupling between change detection, notification, checkpoint, and verification was implicit. This refactor makes state transitions explicit and separates concerns. The page loss bug was pre-existing and reproduced 100% under sustained 500 writes/sec; it was uncovered by the soak testing done to validate the refactoring.

Fixes #1221
How Has This Been Tested?
`go test -race -count=1 ./...`: all packages pass

Types of changes
Checklist
- (`go fmt`, `go vet`)
- (`go test ./...`)