
refactor(db,replica): extract sync/checkpoint/monitor subsystem #1222

Draft
corylanou wants to merge 10 commits into main from
issue-1221-proposal-incremental-extraction-of-sync-checkpoint-monitorin

Conversation


@corylanou corylanou commented Mar 31, 2026

Description

Refactor the sync/checkpoint/monitoring subsystem per #1221, and fix a pre-existing page loss bug uncovered during soak testing.

Refactoring (Phases 1, 3, 4, 6 of #1221)

Phase 1 — SyncResult return type (db.go): verifyAndSync() and sync() return a syncResult struct instead of mutating db.lastSyncedWALOffset, db.syncedToWALEnd, db.pos, and db.maxLTXFileInfos as side effects. New applySyncResult() centralizes all state application in one visible call site.
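The Phase 1 shape can be sketched as follows. This is an illustrative assumption of the pattern the PR describes, not litestream's actual API: field names, signatures, and values here are hypothetical.

```go
package main

import "fmt"

// syncResult carries the state a sync computed, instead of that state
// being written into DB fields as a side effect. (Illustrative sketch.)
type syncResult struct {
	lastSyncedWALOffset int64
	syncedToWALEnd      bool
}

type DB struct {
	lastSyncedWALOffset int64
	syncedToWALEnd      bool
}

// sync returns what changed; it does not mutate db directly.
func (db *DB) sync() (syncResult, error) {
	// A real sync would read the WAL here; this stub returns fixed values.
	return syncResult{lastSyncedWALOffset: 4096, syncedToWALEnd: true}, nil
}

// applySyncResult is the single visible call site where sync state is applied.
func (db *DB) applySyncResult(res syncResult) {
	db.lastSyncedWALOffset = res.lastSyncedWALOffset
	db.syncedToWALEnd = res.syncedToWALEnd
}

func main() {
	db := &DB{}
	res, err := db.sync()
	if err == nil {
		db.applySyncResult(res)
	}
	fmt.Println(db.lastSyncedWALOffset, db.syncedToWALEnd) // 4096 true
}
```

The benefit is that a reader auditing state transitions only has to find calls to applySyncResult, rather than every field write scattered through the sync path.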

Phase 3 — Checkpoint phase extraction (db.go): Extract checkpoint() into named phase methods (checkpointPreSync, checkpointExec, checkpointPostSync), making the checkpoint protocol explicit.
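The Phase 3 decomposition might look like the sketch below: checkpoint() becomes a thin orchestrator over the three named phase methods. The method names follow the PR description; the bodies are placeholders, not the real implementation.

```go
package main

import "fmt"

// DB records phase execution order so the protocol is easy to verify.
type DB struct{ log []string }

func (db *DB) checkpointPreSync() error {
	db.log = append(db.log, "pre-sync") // sync WAL state before checkpointing
	return nil
}

func (db *DB) checkpointExec(mode string) error {
	db.log = append(db.log, "exec:"+mode) // would run PRAGMA wal_checkpoint(mode)
	return nil
}

func (db *DB) checkpointPostSync() error {
	db.log = append(db.log, "post-sync") // snapshot after the checkpoint window
	return nil
}

// checkpoint makes the protocol explicit: pre-sync, execute, post-sync.
func (db *DB) checkpoint(mode string) error {
	if err := db.checkpointPreSync(); err != nil {
		return err
	}
	if err := db.checkpointExec(mode); err != nil {
		return err
	}
	return db.checkpointPostSync()
}

func main() {
	db := &DB{}
	_ = db.checkpoint("PASSIVE")
	fmt.Println(db.log) // [pre-sync exec:PASSIVE post-sync]
}
```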

Phase 4 — ReplicaScheduler (replica.go): Extract syncScheduler type from Replica.monitor(), separating scheduling/backoff from sync correctness and error recovery. Extract handleSyncError() for LTX error detection, auto-recovery, and rate-limited logging. monitor() shrinks from 114 lines to a 30-line coordination loop.

Phase 6 — Restore separation (replica.go → restore.go): Move ~1200 lines of restore-related code to its own file. replica.go shrinks from ~1775 to ~590 lines.

Bug Fix: Page loss under sustained writes

Soak testing at 500 writes/sec revealed a pre-existing bug where pages existed in the database but were never captured in any L0 file, causing restore failures with "nonsequential page numbers" errors. The root cause:

During execCheckpoint(), litestream releases its read transaction (db.rtx) to allow the SQLite checkpoint to run. In this window, the application's SQLite auto-checkpoint (wal_autocheckpoint, default 1000 frames) can independently move WAL frames to the DB and trigger a WAL restart. Pages moved during this uncontrolled window were never captured in any L0 file because the post-checkpoint sync used flag-based heuristics (syncedToWALEnd) that had become stale.

Fix: The post-checkpoint sync now always creates a snapshot — reading ALL pages from the DB under the write lock. This is correct by construction: no flags, no detection heuristics, no edge cases. The snapshot captures every page regardless of what happened during the checkpoint window.

Supporting fixes:

  • TRUNCATE → PASSIVE downgrade: When the pre-sync can't reach WAL end (sustained writes), downgrade TRUNCATE to PASSIVE. PASSIVE copies frames without truncating, so uncaptured frames remain in the WAL for the next sync.
  • Snapshot on PrevFrameMismatch: When the WAL changes between verify() and sync(), force a snapshot to capture pages from old WAL frames.
  • Clear syncedToWALEnd before checkpoint: Prevents the stale flag from suppressing necessary snapshots after the checkpoint window.
  • L0 retention TXID range coverage: L0 retention now checks actual L1 TXID ranges instead of a single maxL1TXID threshold, preventing deletion of L0 files in gaps between L1 files.
  • Retention gating: EnforceSnapshotRetention only skips on actual compaction failures, not on ErrNoCompaction/ErrCompactionTooEarly.
  • Snapshot page count validation: writeLTXFromDB() verifies all pages 1..commit are written, catching page gaps at creation time.
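The last supporting fix, the snapshot page count check, can be sketched as below: after writing a snapshot, verify that the pages emitted are exactly the contiguous sequence 1..commit, so a gap is caught when the L0 file is created rather than surfacing later as a "nonsequential page numbers" restore failure. The function name and signature here are illustrative, not the actual writeLTXFromDB() internals.

```go
package main

import "fmt"

// validateSnapshotPages returns an error unless the written page numbers
// are exactly 1, 2, ..., commit with no gaps or extras.
func validateSnapshotPages(written []uint32, commit uint32) error {
	if uint32(len(written)) != commit {
		return fmt.Errorf("snapshot wrote %d pages, expected %d", len(written), commit)
	}
	for i, pgno := range written {
		if pgno != uint32(i)+1 {
			return fmt.Errorf("nonsequential page number %d at index %d", pgno, i)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateSnapshotPages([]uint32{1, 2, 3}, 3)) // <nil>
	fmt.Println(validateSnapshotPages([]uint32{1, 3}, 2))    // gap detected -> error
}
```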

Phases deferred to follow-up PRs

  • Phase 2 (SyncState/executor split): Lock scope narrowing needs careful concurrency analysis.
  • Phase 5 (VFSFile decomposition): 40+ fields, 30+ methods crossing multiple field groups.

Motivation and Context

Since January 2026, recurring bugs have clustered around the same code paths in db.go and replica.go. The #1210 idle optimization required 7 follow-up commits because the coupling between change detection, notification, checkpoint, and verification was implicit. This refactor makes state transitions explicit and separates concerns. The page loss bug was pre-existing and reproduced 100% of the time under sustained 500 writes/sec; it was uncovered by the soak testing done to validate the refactoring.

Fixes #1221

How Has This Been Tested?

  • go test -race -count=1 ./... — all packages pass
  • Pre-commit hooks pass (go-imports, go-vet, go-staticcheck)
  • Codex code review completed — 2 findings addressed (removed debug validation from restore hot path, fixed retention gating for quiet databases)
  • Soak tests pass across all three load profiles (2 consecutive clean runs):
    • low-volume (10 writes/sec): restoration OK, integrity OK, 0 violations
    • burst-volume (500 writes/sec bursts): restoration OK, integrity OK, 0 violations
    • high-volume (500 writes/sec sustained): restoration OK, integrity OK, 0 missing pages, 0 violations

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the code style of this project (go fmt, go vet)
  • I have tested my changes (go test ./...)
  • I have updated the documentation accordingly (if needed)


github-actions Bot commented Mar 31, 2026

PR Build Metrics

All clear — no issues detected

Check Status Summary
Binary size 35.97 MB (+8.0 KB / +0.02%)
Dependencies No changes
Vulnerabilities None detected
Go toolchain 1.25.8 (latest)
Module graph 1206 edges (0)

Binary Size

Size Change
Base (1678a9b) 35.96 MB
PR (76ccfba) 35.97 MB +8.0 KB (+0.02%)

Dependency Changes

No dependency changes.

govulncheck Output

=== Symbol Results ===

No vulnerabilities found.

Your code is affected by 0 vulnerabilities.
This scan also found 1 vulnerability in packages you import and 0
vulnerabilities in modules you require, but your code doesn't appear to call
these vulnerabilities.
Use '-show verbose' for more details.

Build Info

Metric Value
Build time 41s
Go version go1.25.8
Commit 76ccfba

History (10 previous)

Commit Updated Status Summary
2900e35 2026-04-04 18:00 UTC 35.97 MB (+8.0 KB / +0.02%)
3f951f7 2026-04-04 17:50 UTC 35.97 MB (+8.0 KB / +0.02%)
3446168 2026-04-03 21:18 UTC 35.97 MB (+8.0 KB / +0.02%)
b94d232 2026-04-03 19:32 UTC 35.97 MB (+8.0 KB / +0.02%)
675632a 2026-04-03 18:22 UTC 35.97 MB (+8.0 KB / +0.02%)
dc8d260 2026-04-03 17:59 UTC 35.97 MB (+8.0 KB / +0.02%)
32e6f5e 2026-04-02 18:27 UTC 35.97 MB (+16.0 KB / +0.04%)
110d872 2026-04-02 15:16 UTC 35.97 MB (+16.0 KB / +0.04%)
dd9a57d 2026-04-01 23:16 UTC 35.97 MB (+16.0 KB / +0.04%)
b5fc10d 2026-04-01 21:48 UTC 35.97 MB (+16.0 KB / +0.04%)

🤖 Updated on each push.

@corylanou corylanou force-pushed the issue-1221-proposal-incremental-extraction-of-sync-checkpoint-monitorin branch from 043fc37 to b42010c on April 2, 2026 15:14
Implement Phases 1, 3, 4, and 6 of the incremental extraction proposed
in issue #1221 to reduce structural coupling and prevent cascading fixes.

Phase 1 - SyncResult: verifyAndSync() and sync() return a syncResult
struct instead of mutating db fields as side effects. State is applied
explicitly via applySyncResult().

Phase 3 - Checkpoint phases: Extract checkpoint() into checkpointPreSync,
checkpointExec, checkpointPostSync. Always snapshot in post-checkpoint
sync to capture pages moved by the application's auto-checkpoint during
the db.rtx release window.

Phase 4 - ReplicaScheduler: Extract syncScheduler from monitor().

Phase 6 - Restore: Move ~1200 lines to restore.go.

Fix: Downgrade TRUNCATE to PASSIVE when pre-sync can't reach WAL end.
Fix: Force snapshot on PrevFrameMismatch (WAL changed between verify/sync).
Fix: Add snapshot page count validation in writeLTXFromDB().
Fix: Clear syncedToWALEnd before execCheckpoint releases db.rtx.
…gating

Fix L0 retention to check actual L1 TXID ranges instead of single
maxL1TXID threshold. L1 compaction can have gaps when an L0 wasn't
uploaded before compaction listed files.

Fix retention enforcement to run unless compaction actually failed.
ErrNoCompaction and ErrCompactionTooEarly are not failures — retention
should still prune expired snapshots on quiet databases.
Add unit tests for syncResult, checkpoint flow, WAL truncation handling,
snapshot page integrity, and restore-after-compaction pipeline.

Update soak assertions: add "post-checkpoint snapshot" as expected
recovery reason, increase L0 oversized threshold to 50%.
@corylanou corylanou force-pushed the issue-1221-proposal-incremental-extraction-of-sync-checkpoint-monitorin branch from b42010c to efcd720 on April 2, 2026 18:25
@corylanou corylanou marked this pull request as draft April 6, 2026 15:19


Development

Successfully merging this pull request may close these issues.

Proposal: Incremental extraction of sync/checkpoint/monitoring subsystem
