perf(db): reduce idle CPU, memory, and cloud API costs #1211
Conversation
PR Build Metrics: ✅ All clear — no issues detected
Dependency Changes: no dependency changes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commits: d21ccc1c90, ca50569457, b27b0864fd, 7909a1c764, d3e01cad50, 3c2b5ebfc0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Profiling Results: Idle CPU Optimization at 400 Databases

We profiled Litestream with 400 idle databases (800+ goroutines) to measure the impact of the optimizations in this PR and to identify the remaining bottlenecks.

Methodology

Used the new automated profiling test (TestIdleProfileSuite):

```
PROFILE_DB_COUNT=400 PROFILE_DURATION=60s PROFILE_OUTPUT_DIR=./profiles/pr1211 \
  go test -tags='profile,integration' -run TestIdleProfileSuite -timeout=5m -v ./tests/integration
```

CPU Profile Analysis (400 idle databases, 30s sample)

Before optimization (initial PR): 5.18% total idle CPU.
After optimization (final PR): 3.42% total idle CPU.
Result: 34% reduction in total idle CPU (5.18% → 3.42% at 400 databases).

What Changed
Idle Path Syscalls Per Tick
Runtime Metrics at 400 Databases (60s steady-state)
Goroutine Breakdown (400 DBs)
Remaining CPU Hotspot
Verification
Litestream consumes unnecessary resources on idle databases: the notify channel fires unconditionally every 1s, waking all replicas; Sync() performs ~15 syscalls per tick even when the WAL hasn't changed; and the monitor uses fixed 1s polling regardless of activity.

Changes:
- Gate the notify channel on the `synced` return value so replicas only wake when data was actually synced (eliminates spurious wakeups)
- Add WAL change detection: compare the WAL file size against lastSyncedWALOffset to skip the expensive verify+sync when idle
- Add adaptive idle backoff in db.monitor(): exponentially increase the polling interval from MonitorInterval to MaxIdleInterval (default 60s) when no changes are detected; reset immediately on activity
- Add Prometheus metrics: a litestream_sync_skipped_total counter and a litestream_db_idle gauge for observability
- Add a max-idle-interval config option to control the backoff ceiling
- Clear the WAL cache in ResetLocalState for correct fresh-snapshot behavior

Fixes #1210
Closes #992
Closes #1171
The adaptive idle backoff in monitor() delayed syncs during burst write patterns, causing oversized L0 files. Since the WAL change detection in Sync() already makes idle syncs cheap (a single os.Stat), the monitor now continues polling at MonitorInterval for low sync latency while Sync() skips the expensive work when idle.
The fast path that skips verifyAndSync() now also checks the WAL header salt values (bytes 16-24) in addition to the file size. This properly handles checkpoint-induced WAL resets where the content changes while the size stays constant (salt rotation after a RESTART/FULL checkpoint, followed by new writes that refill the file to the same byte length).

Changes:
- Add lastWALSalt1 and lastWALSalt2 fields to the DB struct
- Fast path reads the WAL header and compares salts before skipping
- Update the salt cache at the end of each Sync()
- Clear the salt cache in ResetLocalState()

Addresses PR review feedback.
1. Fix replica retry after upload failures (P1):
   - Add a needsRetry flag to Replica.monitor()
   - Skip the notify wait when retrying after an error
   - Ensures transient failures on idle DBs can still retry
2. Remove the unused max-idle-interval config (P3):
   - Remove the DefaultMaxIdleInterval constant
   - Remove MaxIdleInterval from the DB struct and DBConfig
   - Remove the wiring in NewDBFromConfig()
   - The option was a no-op; it can be re-added with the fsnotify work
Move replica notification to happen right after verifyAndSync() succeeds with synced=true, before checkpointIfNeeded() and Pos(), which can fail. This ensures replicas are notified about new TXIDs even when subsequent operations fail, preventing idle-database replicas from being left behind after transient post-sync errors.
…rtbeats

P1: Defer replica notification until all LTX creation completes
- Add a syncedDuringCurrentSync field to track LTX creation across Sync()
- verifyAndSync() sets this flag when an LTX file is created (incl. via checkpoint)
- Use defer to notify replicas at the end of Sync(), after the checkpoint
- Ensures replicas don't miss checkpoint-generated LTX files

P2: Keep heartbeat timestamps fresh during idle periods
- Add an IdleWakeupInterval = 1 minute constant
- The replica monitor wakes up periodically even without a notify
- Ensures Replica.Sync() is called, updating LastSuccessfulSyncAt
- Prevents health checks from failing on idle-but-healthy databases
P2: Add db.mu.Lock() around WAL cache mutations in ResetLocalState()
- Prevents a data race with Sync(), which reads/writes these fields under db.mu
- Critical for auto-recovery, which runs ResetLocalState() from Replica.monitor()

P3: Replace time.After() with a reusable time.Timer in the replica monitor
- Create a single idleTimer before the loop and defer Stop()
- Drain the timer channel when notify fires first
- Reset the timer after each select
- Reduces GC pressure with many replicas under sustained writes
The fast path that skips verifyAndSync() when the WAL is unchanged now runs a full verification at least once per minute. This ensures corrupted or missing LTX files are detected during idle periods rather than going unnoticed until the next write.

Changes:
- Add a FullVerifyInterval = 1 minute constant
- Add a lastFullVerifyTime field to track the last verification time
- Fast path checks whether a full verification is due before skipping
- Update the timestamp when verifyAndSync() runs
…ed monitors

Profile-guided optimizations based on 400-DB CPU profiling:

1. Combine WAL change detection into a single open+fstat+read (was stat+stat+open+read):
   - Skip ensureWALExists() when lastSyncedWALOffset > 0 (the WAL is known to exist)
   - Use f.Stat() on the already-open fd instead of a separate os.Stat()
   - Eliminates 2 syscalls per idle tick (os.Stat x2)
2. Add a jittered start delay to db.monitor():
   - Random delay in [0, MonitorInterval) before the first tick
   - Spreads 400 synchronized wakeups across the interval
   - Reduces scheduler contention from burst patterns
3. Add an automated profiling test (TestIdleProfileSuite):
   - Configurable DB count, duration, and output directory
   - Collects a CPU profile, heap profile, and goroutine dump
   - Optional runtime/trace capture
   - JSON metrics export for before/after comparison

Results at 400 idle databases (30s CPU profile):
- Total CPU: 5.18% → 3.42% (-34%)
- os.Stat: 0.30s → 0s (eliminated)
- ensureWALExists: 0.20s → 0s (eliminated)
- Scheduler: 0.37s → 0.29s (-22%)
Treat the fresh-database no-position state as a notify wait instead of a retry path, so the first real upload is not held behind backoff or SyncInterval. Keep the first DB monitor sync at MonitorInterval and apply jitter on the following wakeup, so startup latency does not regress while steady-state wakeups still dephase. Add regression coverage for both review findings and fix the soak-test helpers needed for local validation.
Force-pushed from 83e8e9a to d51580e
Summary
Reduces unnecessary CPU, memory, and cloud storage API usage on idle databases while preserving prompt startup replication behavior.
Fixes #1210
Closes #992
Closes #1171
Review Follow-up
Addressed the two review findings from this PR:
- Treat the fresh-database `no position, waiting for data` state as a notify-wait state instead of a retry state, so the first real upload is not delayed behind exponential backoff or an extra `SyncInterval`.
- Keep the first DB monitor sync at `MonitorInterval`; the phase shift is applied on the following wakeup.

Additional Validation Fixes
While running local soak coverage for this PR, I fixed two soak-harness issues so the documented validation paths could actually run:
- Fixed `criticalErrors` references in the soak tests
- Fixed `CreateSoakConfig()` for MinIO/S3-backed soak runs

Validation
- `go test ./...`
- `go test ./... -run 'TestDB_Monitor_FirstSyncDoesNotWaitExtraInterval|TestReplica_Monitor_FreshDBWaitsOnNotify|TestStore_Integration'`
- `SOAK_DEBUG=1 go test -v -tags='integration,soak' -run=TestComprehensiveSoak -test.short -timeout=1h ./tests/integration`
- `SOAK_AUTO_PURGE=yes SOAK_DEBUG=1 go test -v -tags='integration,soak,docker' -run=TestMinIOSoak -test.short -timeout=20m ./tests/integration`

Results
One pre-existing failure appeared: `wrong # of entries in index idx_load_test_timestamp`. That issue was surfaced during validation and is not addressed in this PR.