feat(metrics): add metrics for local and replica LTX files #1011

alongwill wants to merge 1 commit into benbjohnson:main
Conversation
Add LTXStats types and methods to track LTX file counts and sizes both locally and in replicas. Update metrics during sync operations to expose file counts and sizes by compaction level.

| Metric                         | Labels                  | Description                       |
| ------------------------------ | ----------------------- | --------------------------------- |
| `litestream_local_ltx_files`   | db                      | Total LTX files in local metadata |
| `litestream_local_ltx_bytes`   | db                      | Total size of local LTX files     |
| `litestream_replica_ltx_files` | db, replica_type, level | LTX files by compaction level     |
| `litestream_replica_ltx_bytes` | db, replica_type, level | LTX bytes by compaction level     |

Assisted by Claude Opus 4.5.

Signed-off-by: Andy Longwill <andrew.longwill@siderolabs.com>
corylanou
left a comment
Hey @alongwill — thanks for this PR and sorry it took so long to get to! The feature is well-motivated and the metric design is clean. Great work on the detailed description and Grafana example.
I found a few issues ranging from a compile-breaking bug in the test to some performance concerns. See the inline comments for details. The main themes are:
- Syntax error in `replica_test.go` that will prevent compilation
- Performance — collecting stats on every sync (especially the replica side with remote LIST calls) could be expensive. Consider decoupling from the sync hot path
- Stale metrics — emptied compaction levels will retain old non-zero values
Looking forward to the next iteration!
```go
	}
	if result != "ok" {
		t.Fatalf("integrity check returned: %s", result)
	// TestReplica_LTXStats verifies that LTXStats returns correct file counts and sizes from replica.
```
Bug: This new test function is inserted inside `verifyRestoredDB` — the closing `}` for the `if result != "ok"` block and the function itself (lines 1666-1667 in the base) get pushed to the end of the new test. This won't compile.

The new `TestReplica_LTXStats` function needs to be placed after the closing `}` of `verifyRestoredDB`.
```go
	}
	db.walSizeGauge.Set(float64(newWALSize))

	// Update local LTX file metrics.
```
Performance: `updateLocalLTXMetrics()` does a full `os.ReadDir` + `file.Info()` walk of every LTX file across all level directories. Since `Sync()` fires on every WAL change, this could add noticeable I/O overhead for databases with many LTX files.

Consider either:

- Running this on a timer (e.g., every 30s) rather than every sync
- Implementing the Prometheus `Collector` interface so stats are computed only when the `/metrics` endpoint is scraped
```go
	r.db.RecordSuccessfulSync()

	// Update replica LTX metrics after successful sync.
	r.UpdateLTXMetrics(ctx)
```
Performance: `UpdateLTXMetrics` iterates levels 0 through 9, calling `r.Client.LTXFiles()` for each. For remote backends (S3, R2, etc.), that's up to 10 LIST API calls per sync per replica. This adds both latency to the sync path and cost.

Same suggestion as the local side — decouple from the sync hot path with a timer or Prometheus Collector.
```go
	for level := 0; level <= SnapshotLevel; level++ {
		itr, err := r.Client.LTXFiles(ctx, level, 0, false)
		if err != nil {
			// Skip levels that don't exist or have errors
```
Silently continuing on `LTXFiles` errors could mask real issues (permissions, network problems, etc.). Consider logging at debug level here, consistent with the `itr.Close()` error handling a few lines below:

```go
if err != nil {
	r.Logger().Debug("error listing LTX files", "level", level, "error", err)
	continue
}
```

```go
	}

	dbPath := r.db.Path()
	replicaType := r.Client.Type()
```
Stale metrics: This only sets gauges for levels that currently have files. If a level previously had files but is now empty (e.g., after compaction), the old non-zero gauge value will persist in Prometheus until process restart.

Consider resetting the gauge vecs for this db/replica before setting new values, or explicitly iterating all levels and setting empty ones to 0.
```go
// LTXStats holds aggregated statistics about LTX files.
type LTXStats struct {
	TotalFiles int64
```
Nit: `LTXStats` has `TotalFiles`/`TotalBytes` and `ByLevel`, but `LocalLTXStats()` only populates the totals and `Replica.LTXStats()` only populates `ByLevel`. The shared struct is a bit misleading — consider either populating all fields in both cases, or using separate types for local vs. replica stats.
Description
Add Prometheus metrics to monitor LTX file sizes locally and in replicas:

- `litestream_local_ltx_files`
- `litestream_local_ltx_bytes`
- `litestream_replica_ltx_files`
- `litestream_replica_ltx_bytes`

Assisted by Claude Opus 4.5.
Example prometheus metrics
Example Grafana dashboard using these metrics

☝️ This example involved making ~2MB of writes to a 13MB SQLite db. Litestream was configured with:
Motivation and Context
I implemented this to help debug #976
How Has This Been Tested?
I ran two instances in a test environment against two accounts with different compaction and snapshot settings. This was using R2 as a replica. Can confirm the metrics were generated and captured successfully and align with reality.
Types of changes
Checklist
- `go fmt`, `go vet`
- `go test ./...`