feat(metrics): add metrics for local and replica LTX files #1011

alongwill wants to merge 1 commit into benbjohnson:main
Conversation
Add LTXStats types and methods to track LTX file counts and sizes both locally and in replicas. Update metrics during sync operations to expose file counts and sizes by compaction level.

| Metric                         | Labels                  | Description                       |
| ------------------------------ | ----------------------- | --------------------------------- |
| `litestream_local_ltx_files`   | db                      | Total LTX files in local metadata |
| `litestream_local_ltx_bytes`   | db                      | Total size of local LTX files     |
| `litestream_replica_ltx_files` | db, replica_type, level | LTX files by compaction level     |
| `litestream_replica_ltx_bytes` | db, replica_type, level | LTX bytes by compaction level     |

Assisted by Claude Opus 4.5.

Signed-off-by: Andy Longwill <andrew.longwill@siderolabs.com>
corylanou
left a comment
Hey @alongwill — thanks for this PR and sorry it took so long to get to! The feature is well-motivated and the metric design is clean. Great work on the detailed description and Grafana example.
I found a few issues ranging from a compile-breaking bug in the test to some performance concerns. See the inline comments for details. The main themes are:
- Syntax error in `replica_test.go` that will prevent compilation
- Performance — collecting stats on every sync (especially the replica side with remote LIST calls) could be expensive. Consider decoupling from the sync hot path
- Stale metrics — emptied compaction levels will retain old non-zero values
Looking forward to the next iteration!
```go
	}
	if result != "ok" {
		t.Fatalf("integrity check returned: %s", result)
	// TestReplica_LTXStats verifies that LTXStats returns correct file counts and sizes from replica.
```
Bug: This new test function is inserted inside `verifyRestoredDB` — the closing `}` for the `if result != "ok"` block and the function itself (lines 1666-1667 in the base) get pushed to the end of the new test. This won't compile.

The new `TestReplica_LTXStats` function needs to be placed after the closing `}` of `verifyRestoredDB`.
```go
	}
	db.walSizeGauge.Set(float64(newWALSize))

	// Update local LTX file metrics.
```
Performance: `updateLocalLTXMetrics()` does a full `os.ReadDir` + `file.Info()` walk of every LTX file across all level directories. Since `Sync()` fires on every WAL change, this could add noticeable I/O overhead for databases with many LTX files.

Consider either:

- Running this on a timer (e.g., every 30s) rather than every sync
- Implementing the Prometheus `Collector` interface so stats are computed only when the `/metrics` endpoint is scraped
```go
	r.db.RecordSuccessfulSync()

	// Update replica LTX metrics after successful sync.
	r.UpdateLTXMetrics(ctx)
```
Performance: `UpdateLTXMetrics` iterates levels 0 through 9, calling `r.Client.LTXFiles()` for each. For remote backends (S3, R2, etc.), that's up to 10 LIST API calls per sync per replica. This adds both latency to the sync path and cost.

Same suggestion as the local side — decouple from the sync hot path with a timer or Prometheus Collector.
```go
	for level := 0; level <= SnapshotLevel; level++ {
		itr, err := r.Client.LTXFiles(ctx, level, 0, false)
		if err != nil {
			// Skip levels that don't exist or have errors
```
Silently continuing on `LTXFiles` errors could mask real issues (permissions, network problems, etc.). Consider logging at debug level here, consistent with the `itr.Close()` error handling a few lines below:

```go
if err != nil {
	r.Logger().Debug("error listing LTX files", "level", level, "error", err)
	continue
}
```

```go
	}

	dbPath := r.db.Path()
	replicaType := r.Client.Type()
```
Stale metrics: This only sets gauges for levels that currently have files. If a level previously had files but is now empty (e.g., after compaction), the old non-zero gauge value will persist in Prometheus until process restart.

Consider resetting the gauge vecs for this db/replica before setting new values, or explicitly iterating all levels and setting empty ones to 0.
```go
// LTXStats holds aggregated statistics about LTX files.
type LTXStats struct {
	TotalFiles int64
```
Nit: `LTXStats` has `TotalFiles`/`TotalBytes` and `ByLevel`, but `LocalLTXStats()` only populates the totals and `Replica.LTXStats()` only populates `ByLevel`. The shared struct is a bit misleading — consider either populating all fields in both cases, or using separate types for local vs. replica stats.
Description
Add Prometheus metrics to monitor LTX file sizes locally and in replicas:

- `litestream_local_ltx_files`
- `litestream_local_ltx_bytes`
- `litestream_replica_ltx_files`
- `litestream_replica_ltx_bytes`

Assisted by Claude Opus 4.5.
Example prometheus metrics
Example Grafana dashboard using these metrics

☝️ This example involved making ~2MB of writes to a 13MB SQLite db. Litestream was configured with:
Motivation and Context
I implemented this to help debug #976
How Has This Been Tested?
I ran two instances in a test environment against two accounts with different compaction and snapshot settings. This was using R2 as a replica. Can confirm the metrics were generated and captured successfully and align with reality.
Types of changes
Checklist
- `go fmt`, `go vet`
- `go test ./...`