
refactor(db,replica): extract sync/checkpoint/monitor subsystem #1222

Draft
corylanou wants to merge 10 commits into main from
issue-1221-proposal-incremental-extraction-of-sync-checkpoint-monitorin

Conversation


@corylanou corylanou commented Mar 31, 2026

Description

Refactor the sync/checkpoint/monitoring subsystem per #1221, and fix a pre-existing page loss bug uncovered during soak testing.

Refactoring (Phases 1, 3, 4, 6 of #1221)

Phase 1 — SyncResult return type (db.go): verifyAndSync() and sync() return a syncResult struct instead of mutating db.lastSyncedWALOffset, db.syncedToWALEnd, db.pos, and db.maxLTXFileInfos as side effects. New applySyncResult() centralizes all state application in one visible call site.
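The Phase 1 shape can be sketched as follows. This is an illustrative assumption of the pattern the PR describes, not litestream's actual API: field names, signatures, and values here are hypothetical.

```go
package main

import "fmt"

// syncResult carries the state a sync computed, instead of that state
// being written into DB fields as a side effect. (Illustrative sketch.)
type syncResult struct {
	lastSyncedWALOffset int64
	syncedToWALEnd      bool
}

type DB struct {
	lastSyncedWALOffset int64
	syncedToWALEnd      bool
}

// sync returns what changed; it does not mutate db directly.
func (db *DB) sync() (syncResult, error) {
	// A real sync would read the WAL here; this stub returns fixed values.
	return syncResult{lastSyncedWALOffset: 4096, syncedToWALEnd: true}, nil
}

// applySyncResult is the single visible call site where sync state is applied.
func (db *DB) applySyncResult(res syncResult) {
	db.lastSyncedWALOffset = res.lastSyncedWALOffset
	db.syncedToWALEnd = res.syncedToWALEnd
}

func main() {
	db := &DB{}
	res, err := db.sync()
	if err == nil {
		db.applySyncResult(res)
	}
	fmt.Println(db.lastSyncedWALOffset, db.syncedToWALEnd) // 4096 true
}
```

The benefit is that a reader auditing state transitions only has to find calls to applySyncResult, rather than every field write scattered through the sync path.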

Phase 3 — Checkpoint phase extraction (db.go): Extract checkpoint() into named phase methods (checkpointPreSync, checkpointExec, checkpointPostSync), making the checkpoint protocol explicit.
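The Phase 3 decomposition might look like the sketch below: checkpoint() becomes a thin orchestrator over the three named phase methods. The method names follow the PR description; the bodies are placeholders, not the real implementation.

```go
package main

import "fmt"

// DB records phase execution order so the protocol is easy to verify.
type DB struct{ log []string }

func (db *DB) checkpointPreSync() error {
	db.log = append(db.log, "pre-sync") // sync WAL state before checkpointing
	return nil
}

func (db *DB) checkpointExec(mode string) error {
	db.log = append(db.log, "exec:"+mode) // would run PRAGMA wal_checkpoint(mode)
	return nil
}

func (db *DB) checkpointPostSync() error {
	db.log = append(db.log, "post-sync") // snapshot after the checkpoint window
	return nil
}

// checkpoint makes the protocol explicit: pre-sync, execute, post-sync.
func (db *DB) checkpoint(mode string) error {
	if err := db.checkpointPreSync(); err != nil {
		return err
	}
	if err := db.checkpointExec(mode); err != nil {
		return err
	}
	return db.checkpointPostSync()
}

func main() {
	db := &DB{}
	_ = db.checkpoint("PASSIVE")
	fmt.Println(db.log) // [pre-sync exec:PASSIVE post-sync]
}
```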

Phase 4 — ReplicaScheduler (replica.go): Extract syncScheduler type from Replica.monitor(), separating scheduling/backoff from sync correctness and error recovery. Extract handleSyncError() for LTX error detection, auto-recovery, and rate-limited logging. monitor() shrinks from 114 lines to a 30-line coordination loop.

Phase 6 — Restore separation (replica.go → restore.go): Move ~1200 lines of restore-related code to its own file. replica.go shrinks from ~1775 to ~590 lines.

Bug Fix: Page loss under sustained writes

Soak testing at 500 writes/sec revealed a pre-existing bug where pages existed in the database but were never captured in any L0 file, causing restore failures with "nonsequential page numbers" errors. The root cause:

During execCheckpoint(), litestream releases its read transaction (db.rtx) to allow the SQLite checkpoint to run. In this window, the application's SQLite auto-checkpoint (wal_autocheckpoint, default 1000 frames) can independently move WAL frames to the DB and trigger a WAL restart. Pages moved during this uncontrolled window were never captured in any L0 file because the post-checkpoint sync used flag-based heuristics (syncedToWALEnd) that had become stale.

Fix: The post-checkpoint sync now always creates a snapshot — reading ALL pages from the DB under the write lock. This is correct by construction: no flags, no detection heuristics, no edge cases. The snapshot captures every page regardless of what happened during the checkpoint window.

Supporting fixes:

  • TRUNCATE → PASSIVE downgrade: When the pre-sync can't reach WAL end (sustained writes), downgrade TRUNCATE to PASSIVE. PASSIVE copies frames without truncating, so uncaptured frames remain in the WAL for the next sync.
  • Snapshot on PrevFrameMismatch: When the WAL changes between verify() and sync(), force a snapshot to capture pages from old WAL frames.
  • Clear syncedToWALEnd before checkpoint: Prevents the stale flag from suppressing necessary snapshots after the checkpoint window.
  • L0 retention TXID range coverage: L0 retention now checks actual L1 TXID ranges instead of a single maxL1TXID threshold, preventing deletion of L0 files in gaps between L1 files.
  • Retention gating: EnforceSnapshotRetention only skips on actual compaction failures, not on ErrNoCompaction/ErrCompactionTooEarly.
  • Snapshot page count validation: writeLTXFromDB() verifies all pages 1..commit are written, catching page gaps at creation time.
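The last supporting fix, the snapshot page count check, can be sketched as below: after writing a snapshot, verify that the pages emitted are exactly the contiguous sequence 1..commit, so a gap is caught when the L0 file is created rather than surfacing later as a "nonsequential page numbers" restore failure. The function name and signature here are illustrative, not the actual writeLTXFromDB() internals.

```go
package main

import "fmt"

// validateSnapshotPages returns an error unless the written page numbers
// are exactly 1, 2, ..., commit with no gaps or extras.
func validateSnapshotPages(written []uint32, commit uint32) error {
	if uint32(len(written)) != commit {
		return fmt.Errorf("snapshot wrote %d pages, expected %d", len(written), commit)
	}
	for i, pgno := range written {
		if pgno != uint32(i)+1 {
			return fmt.Errorf("nonsequential page number %d at index %d", pgno, i)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateSnapshotPages([]uint32{1, 2, 3}, 3)) // <nil>
	fmt.Println(validateSnapshotPages([]uint32{1, 3}, 2))    // gap detected -> error
}
```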

Phases deferred to follow-up PRs

  • Phase 2 (SyncState/executor split): Lock scope narrowing needs careful concurrency analysis.
  • Phase 5 (VFSFile decomposition): 40+ fields, 30+ methods crossing multiple field groups.

Motivation and Context

Since January 2026, recurring bugs have clustered around the same code paths in db.go and replica.go. The #1210 idle optimization required 7 follow-up commits because the coupling between change detection, notification, checkpoint, and verification was implicit. This refactor makes state transitions explicit and separates concerns. The page loss bug was pre-existing and reproduced 100% of the time under sustained 500 writes/sec; it was uncovered by the soak testing done to validate the refactoring.

Fixes #1221

How Has This Been Tested?

  • go test -race -count=1 ./... — all packages pass
  • Pre-commit hooks pass (go-imports, go-vet, go-staticcheck)
  • Codex code review completed — 2 findings addressed (removed debug validation from restore hot path, fixed retention gating for quiet databases)
  • Soak tests pass across all three load profiles (2 consecutive clean runs):
    • low-volume (10 writes/sec): restoration OK, integrity OK, 0 violations
    • burst-volume (500 writes/sec bursts): restoration OK, integrity OK, 0 violations
    • high-volume (500 writes/sec sustained): restoration OK, integrity OK, 0 missing pages, 0 violations

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the code style of this project (go fmt, go vet)
  • I have tested my changes (go test ./...)
  • I have updated the documentation accordingly (if needed)


github-actions Bot commented Mar 31, 2026

PR Build Metrics

All clear — no issues detected

Check Status Summary
Binary size 35.97 MB (+8.0 KB / +0.02%)
Dependencies No changes
Vulnerabilities None detected
Go toolchain 1.25.8 (latest)
Module graph 1206 edges (0)

Binary Size

Size Change
Base (1678a9b) 35.96 MB
PR (76ccfba) 35.97 MB +8.0 KB (+0.02%)

Dependency Changes

No dependency changes.

govulncheck Output

=== Symbol Results ===

No vulnerabilities found.

Your code is affected by 0 vulnerabilities.
This scan also found 1 vulnerability in packages you import and 0
vulnerabilities in modules you require, but your code doesn't appear to call
these vulnerabilities.
Use '-show verbose' for more details.

Build Info

Metric Value
Build time 41s
Go version go1.25.8
Commit 76ccfba

History (10 previous)

Commit Updated Status Summary
2900e35 2026-04-04 18:00 UTC 35.97 MB (+8.0 KB / +0.02%)
3f951f7 2026-04-04 17:50 UTC 35.97 MB (+8.0 KB / +0.02%)
3446168 2026-04-03 21:18 UTC 35.97 MB (+8.0 KB / +0.02%)
b94d232 2026-04-03 19:32 UTC 35.97 MB (+8.0 KB / +0.02%)
675632a 2026-04-03 18:22 UTC 35.97 MB (+8.0 KB / +0.02%)
dc8d260 2026-04-03 17:59 UTC 35.97 MB (+8.0 KB / +0.02%)
32e6f5e 2026-04-02 18:27 UTC 35.97 MB (+16.0 KB / +0.04%)
110d872 2026-04-02 15:16 UTC 35.97 MB (+16.0 KB / +0.04%)
dd9a57d 2026-04-01 23:16 UTC 35.97 MB (+16.0 KB / +0.04%)
b5fc10d 2026-04-01 21:48 UTC 35.97 MB (+16.0 KB / +0.04%)

🤖 Updated on each push.

@corylanou corylanou force-pushed the issue-1221-proposal-incremental-extraction-of-sync-checkpoint-monitorin branch from 043fc37 to b42010c on April 2, 2026 15:14
Implement Phases 1, 3, 4, and 6 of the incremental extraction proposed
in issue #1221 to reduce structural coupling and prevent cascading fixes.

Phase 1 - SyncResult: verifyAndSync() and sync() return a syncResult
struct instead of mutating db fields as side effects. State is applied
explicitly via applySyncResult().

Phase 3 - Checkpoint phases: Extract checkpoint() into checkpointPreSync,
checkpointExec, checkpointPostSync. Always snapshot in post-checkpoint
sync to capture pages moved by the application's auto-checkpoint during
the db.rtx release window.

Phase 4 - ReplicaScheduler: Extract syncScheduler from monitor().

Phase 6 - Restore: Move ~1200 lines to restore.go.

Fix: Downgrade TRUNCATE to PASSIVE when pre-sync can't reach WAL end.
Fix: Force snapshot on PrevFrameMismatch (WAL changed between verify/sync).
Fix: Add snapshot page count validation in writeLTXFromDB().
Fix: Clear syncedToWALEnd before execCheckpoint releases db.rtx.
…gating

Fix L0 retention to check actual L1 TXID ranges instead of single
maxL1TXID threshold. L1 compaction can have gaps when an L0 wasn't
uploaded before compaction listed files.

Fix retention enforcement to run unless compaction actually failed.
ErrNoCompaction and ErrCompactionTooEarly are not failures — retention
should still prune expired snapshots on quiet databases.
Add unit tests for syncResult, checkpoint flow, WAL truncation handling,
snapshot page integrity, and restore-after-compaction pipeline.

Update soak assertions: add "post-checkpoint snapshot" as expected
recovery reason, increase L0 oversized threshold to 50%.
@corylanou corylanou force-pushed the issue-1221-proposal-incremental-extraction-of-sync-checkpoint-monitorin branch from b42010c to efcd720 on April 2, 2026 18:25
@corylanou corylanou marked this pull request as draft April 6, 2026 15:19


Development

Successfully merging this pull request may close these issues.

Proposal: Incremental extraction of sync/checkpoint/monitoring subsystem
