perf(db): reduce idle CPU, memory, and cloud API costs #1211

Open
corylanou wants to merge 11 commits into main from
issue-1210-perf-reduce-idle-cpu-memory-and-cloud-api-costs

Conversation

@corylanou
Collaborator

@corylanou corylanou commented Mar 25, 2026

Summary

Reduces unnecessary CPU, memory, and cloud storage API usage on idle databases while preserving prompt startup replication behavior.

Fixes #1210
Closes #992
Closes #1171

Review Follow-up

Addressed the two review findings from this PR:

  1. Fresh or reset databases now treat the "no position, waiting for data" condition as a notify-wait state instead of a retry state, so the first real upload is not delayed behind exponential backoff or an extra SyncInterval.
  2. Database monitor jitter no longer adds an extra full monitor interval before the first background sync. The first poll stays at MonitorInterval, and the phase shift is applied on the following wakeup.

Additional Validation Fixes

While running local soak coverage for this PR, I fixed two soak-harness issues so the documented validation paths could actually run:

  • undefined criticalErrors references in the soak tests
  • incorrect S3 config indentation in CreateSoakConfig() for MinIO/S3-backed soak runs

Validation

  • go test ./...
  • go test ./... -run 'TestDB_Monitor_FirstSyncDoesNotWaitExtraInterval|TestReplica_Monitor_FreshDBWaitsOnNotify|TestStore_Integration'
  • SOAK_DEBUG=1 go test -v -tags='integration,soak' -run=TestComprehensiveSoak -test.short -timeout=1h ./tests/integration
  • SOAK_AUTO_PURGE=yes SOAK_DEBUG=1 go test -v -tags='integration,soak,docker' -run=TestMinIOSoak -test.short -timeout=20m ./tests/integration

Results

  • Full Go test suite passed.
  • Short file-based soak passed, including restore and integrity validation.
  • Short MinIO soak now reaches real replication and restore, but the restored DB still fails its integrity check with "wrong # of entries in index idx_load_test_timestamp". That issue was surfaced during validation and is not addressed in this PR.

@github-actions

github-actions Bot commented Mar 25, 2026

PR Build Metrics

All clear — no issues detected

Check Status Summary

Binary size      35.96 MB (+4.0 KB / +0.01%)
Dependencies     No changes
Vulnerabilities  None detected
Go toolchain     1.25.8 (latest)
Module graph     1206 edges (0)

Binary Size

                Size       Change
Base (ad55ec1)  35.96 MB
PR (e38a37b)    35.96 MB   +4.0 KB (+0.01%)

Dependency Changes

No dependency changes.

govulncheck Output

=== Symbol Results ===

No vulnerabilities found.

Your code is affected by 0 vulnerabilities.
This scan also found 1 vulnerability in packages you import and 0
vulnerabilities in modules you require, but your code doesn't appear to call
these vulnerabilities.
Use '-show verbose' for more details.

Build Info

Metric Value
Build time 42s
Go version go1.25.8
Commit e38a37b

History (9 previous)

Commit Updated Status Summary
4fd483c 2026-03-27 20:54 UTC 35.96 MB (+4.0 KB / +0.01%)
1dea800 2026-03-26 14:47 UTC 35.96 MB (0.0 KB / 0.00%)
02ff86b 2026-03-26 13:25 UTC 35.96 MB (0.0 KB / 0.00%)
5f4fa20 2026-03-25 23:46 UTC 35.96 MB (0.0 KB / 0.00%)
a63c5a6 2026-03-25 23:25 UTC 35.96 MB (0.0 KB / 0.00%)
cf23063 2026-03-25 23:09 UTC 35.96 MB (0.0 KB / 0.00%)
aed3f51 2026-03-25 21:21 UTC 35.96 MB (0.0 KB / 0.00%)
b992874 2026-03-25 20:18 UTC 35.96 MB (0.0 KB / 0.00%)
5cf88b2 2026-03-25 19:22 UTC 35.96 MB (0.0 KB / 0.00%)

🤖 Updated on each push.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d21ccc1c90

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread db.go Outdated
@corylanou
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca50569457


Comment thread db.go Outdated
Comment thread cmd/litestream/main.go Outdated
@corylanou
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b27b0864fd


Comment thread db.go Outdated
@corylanou
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7909a1c764


Comment thread db.go Outdated
Comment thread db.go
@corylanou
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3e01cad50


Comment thread db.go
Comment thread replica.go Outdated
@corylanou
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c2b5ebfc0


Comment thread db.go
@corylanou
Collaborator Author

@codex review

@corylanou
Collaborator Author

Profiling Results: Idle CPU Optimization at 400 Databases

We profiled Litestream with 400 idle databases (800+ goroutines) to measure the impact of the optimizations in this PR and identify remaining bottlenecks.

Methodology

Used a new automated profiling test (TestIdleProfileSuite) that:

  • Creates N databases with file-based replicas, performs initial sync, then goes idle
  • Collects CPU profiles (runtime/pprof), heap profiles, goroutine dumps, and optionally runtime/trace
  • Captures runtime metrics (goroutine count, heap, GC cycles, stack) every 5 seconds
  • Exports everything to a configurable output directory for before/after comparison
PROFILE_DB_COUNT=400 PROFILE_DURATION=60s PROFILE_OUTPUT_DIR=./profiles/pr1211 \
  go test -tags='profile,integration' -run TestIdleProfileSuite -timeout=5m -v ./tests/integration/

CPU Profile Analysis (400 idle databases, 30s sample)

Before optimization (initial PR):

Duration: 30.09s, Total samples = 1.56s (5.18%)
  0.65s (41.67%)  os.Open          — opening WAL file to read header salts
  0.30s (19.23%)  os.Stat          — checking WAL file size (separate syscall)
  0.20s (12.82%)  ensureWALExists  — redundant stat before fast-path check
  0.37s (23.72%)  runtime.schedule — goroutine scheduling overhead

After optimization (final PR):

Duration: 30.09s, Total samples = 1.03s (3.42%)
  0.57s (55.34%)  os.Open          — single file open (combined with fstat)
  0.00s  (0.00%)  os.Stat          — eliminated (using f.Stat() on open fd)
  0.00s  (0.00%)  ensureWALExists  — eliminated (skipped when WAL known to exist)
  0.29s (28.16%)  runtime.schedule — reduced by jittered monitor start times

Result: 34% reduction in total idle CPU (5.18% → 3.42% at 400 databases).

What Changed

  1. Combined WAL change detection into single open+fstat+read — Previously the idle path did os.Stat() (size check) then os.Open() + io.ReadFull() (salt check) as separate operations. Now we open the file once and use f.Stat() on the already-open fd, eliminating one syscall per tick.

  2. Skip ensureWALExists() on idle path — After the first successful sync, lastSyncedWALOffset > 0 guarantees the WAL exists (we run with PERSIST_WAL). Skipping this eliminates another os.Stat() per tick.

  3. Jittered monitor start times — All 400 db.monitor() goroutines were creating tickers simultaneously in Store.Open(), causing synchronized burst wakeups every second. A random delay [0, MonitorInterval) before the first tick spreads wakeups across the interval, reducing scheduler contention by 22%.

Idle Path Syscalls Per Tick

                Before PR                                  After initial PR           After profile-guided opt
Syscalls/tick   ~15                                        6                          3
Operations      stat+stat+open+read+close+verify+sync+...  stat+stat+open+read+close  open+fstat+read+close

Runtime Metrics at 400 Databases (60s steady-state)

Metric Value
Goroutines 2003 (stable)
Heap 11-27MB (GC keeps it bounded)
Stack 8-21MB
GC cycles ~30/min
GC pause total 14-76ms/min

Goroutine Breakdown (400 DBs)

800  (40%)  database/sql.(*DB).connectionOpener  — SQL connection pools
400  (20%)  litestream.(*DB).monitor              — DB sync monitors
400  (20%)  litestream.(*Replica).monitor          — replica upload monitors
400  (20%)  database/sql.(*Tx).awaitDone          — long-running read txns

Remaining CPU Hotspot

os.Open() is now 55% of remaining idle CPU. Further reduction would require caching the WAL file descriptor across ticks (avoiding open/close each second). This is a more invasive change suitable for a follow-up PR.

Verification

  • All unit tests pass with -race flag
  • 400-DB profile test runs without errors or data races
  • All CI checks pass (build, lint, integration tests, LTX behavioral gate)

Litestream consumes unnecessary resources on idle databases: the notify
channel fires unconditionally every 1s waking all replicas, Sync()
performs ~15 syscalls per tick even when the WAL hasn't changed, and the
monitor uses fixed 1s polling regardless of activity.

Changes:
- Gate notify channel on `synced` return value so replicas only wake
  when data was actually synced (eliminates spurious wakeups)
- Add WAL change detection: compare WAL file size against
  lastSyncedWALOffset to skip expensive verify+sync when idle
- Add adaptive idle backoff in db.monitor(): exponentially increase
  polling interval from MonitorInterval to MaxIdleInterval (default 60s)
  when no changes detected, reset immediately on activity
- Add Prometheus metrics: litestream_sync_skipped_total counter and
  litestream_db_idle gauge for observability
- Add max-idle-interval config option to control backoff ceiling
- Clear WAL cache in ResetLocalState for correct fresh-snapshot behavior

Fixes #1210
Closes #992
Closes #1171

The adaptive idle backoff in monitor() delayed syncs during burst write
patterns, causing oversized L0 files. The WAL change detection in Sync()
already makes idle syncs cheap (1 os.Stat), so the monitor continues
polling at MonitorInterval for low sync latency while Sync() skips
expensive work when idle.

The fast-path that skips verifyAndSync() now also checks WAL header
salt values (bytes 16-24) in addition to file size. This properly
handles checkpoint-induced WAL resets where content changes while
size stays constant (salt rotation after RESTART/FULL checkpoint,
followed by new writes that refill to the same byte length).

Changes:
- Add lastWALSalt1, lastWALSalt2 fields to DB struct
- Fast-path reads WAL header and compares salts before skipping
- Update salt cache at end of each Sync()
- Clear salt cache in ResetLocalState()

Addresses PR review feedback.
1. Fix replica retry after upload failures (P1):
   - Add needsRetry flag to Replica.monitor()
   - Skip notify wait when retrying after error
   - Ensures transient failures on idle DBs can still retry

2. Remove unused max-idle-interval config (P3):
   - Remove DefaultMaxIdleInterval constant
   - Remove MaxIdleInterval from DB struct and DBConfig
   - Remove wiring in NewDBFromConfig()
   - Config was a no-op, can be re-added with fsnotify work

Move replica notification to happen right after verifyAndSync() succeeds
with synced=true, before checkpointIfNeeded() and Pos() which can fail.

This ensures replicas are notified about new TXIDs even when subsequent
operations fail, preventing idle database replicas from being left behind
after transient post-sync errors.
…rtbeats

P1: Defer replica notification until all LTX creation completes
- Add syncedDuringCurrentSync field to track LTX creation across Sync()
- verifyAndSync() sets this flag when LTX is created (incl. checkpoint)
- Use defer to notify replicas at end of Sync(), after checkpoint
- Ensures replicas don't miss checkpoint-generated LTX files

P2: Keep heartbeat timestamps fresh during idle periods
- Add IdleWakeupInterval = 1 minute constant
- Replica monitor wakes up periodically even without notify
- Ensures Replica.Sync() is called, updating LastSuccessfulSyncAt
- Prevents health checks from failing on idle-but-healthy databases

P2: Add db.mu.Lock() around WAL cache mutations in ResetLocalState()
- Prevents data race with Sync() which reads/writes these fields under db.mu
- Critical for auto-recovery which runs ResetLocalState() from Replica.monitor()

P3: Replace time.After() with reusable time.Timer in replica monitor
- Create single idleTimer before the loop, defer Stop()
- Drain timer channel when notify fires first
- Reset timer after each select
- Reduces GC pressure with many replicas under sustained writes

The fast path that skips verifyAndSync() when WAL is unchanged now
runs full verification at least once per minute. This ensures
corrupted/missing LTX files are detected during idle periods, rather
than going unnoticed until the next write.

Changes:
- Add FullVerifyInterval = 1 minute constant
- Add lastFullVerifyTime field to track verification time
- Fast path checks if full verification is due before skipping
- Update timestamp when verifyAndSync() runs
…ed monitors

Profile-guided optimizations based on 400-DB CPU profiling:

1. Combine WAL change detection into single open+fstat+read (was stat+stat+open+read):
   - Skip ensureWALExists() when lastSyncedWALOffset > 0 (WAL known to exist)
   - Use f.Stat() on already-open fd instead of separate os.Stat()
   - Eliminates 2 syscalls per idle tick (os.Stat x2)

2. Add jittered start delay to db.monitor():
   - Random delay [0, MonitorInterval) before first tick
   - Spreads 400 synchronized wakeups across the interval
   - Reduces scheduler contention from burst patterns

3. Add automated profiling test (TestIdleProfileSuite):
   - Configurable DB count, duration, output directory
   - Collects CPU profile, heap profile, goroutine dump
   - Optional runtime/trace capture
   - JSON metrics export for before/after comparison

Results at 400 idle databases (30s CPU profile):
  Total CPU:         5.18% → 3.42% (-34%)
  os.Stat:           0.30s → 0s (eliminated)
  ensureWALExists:   0.20s → 0s (eliminated)
  Scheduler:         0.37s → 0.29s (-22%)

Treat the fresh-database no-position state as a notify wait instead of a retry path so the first real upload is not held behind backoff or SyncInterval.

Keep the first DB monitor sync at MonitorInterval and apply jitter on the following wakeup so startup latency does not regress while steady-state wakeups still dephase.

Add regression coverage for both review findings and fix soak-test helpers needed for local validation.
@corylanou corylanou force-pushed the issue-1210-perf-reduce-idle-cpu-memory-and-cloud-api-costs branch from 83e8e9a to d51580e on March 27, 2026 at 20:52 UTC