
feat(metrics): add metrics for local and replica LTX files #1011

Open
alongwill wants to merge 1 commit into benbjohnson:main from alongwill:feat/add-ltx-file-metrics

Conversation

@alongwill alongwill commented Jan 16, 2026

Description

Add Prometheus metrics to monitor LTX file sizes locally and in replicas:

| Metric                         | Labels                  | Description                       |
| ------------------------------ | ----------------------- | --------------------------------- |
| `litestream_local_ltx_files`   | db                      | Total LTX files in local metadata |
| `litestream_local_ltx_bytes`   | db                      | Total size of local LTX files     |
| `litestream_replica_ltx_files` | db, replica_type, level | LTX files by compaction level     |
| `litestream_replica_ltx_bytes` | db, replica_type, level | LTX bytes by compaction level     |

Assisted by Claude Opus 4.5.

Example Prometheus metrics
# HELP litestream_local_ltx_bytes Total size of LTX files in local metadata directory
# TYPE litestream_local_ltx_bytes gauge
litestream_local_ltx_bytes{db="/data/fruits.db"} 1991
# HELP litestream_local_ltx_files Total number of LTX files in local metadata directory
# TYPE litestream_local_ltx_files gauge
litestream_local_ltx_files{db="/data/fruits.db"} 4
# HELP litestream_replica_ltx_bytes Size of LTX files by compaction level (replica)
# TYPE litestream_replica_ltx_bytes gauge
litestream_replica_ltx_bytes{db="/data/fruits.db",level="0",replica_type="s3"} 1991
litestream_replica_ltx_bytes{db="/data/fruits.db",level="1",replica_type="s3"} 2290
litestream_replica_ltx_bytes{db="/data/fruits.db",level="2",replica_type="s3"} 1035
litestream_replica_ltx_bytes{db="/data/fruits.db",level="3",replica_type="s3"} 1035
litestream_replica_ltx_bytes{db="/data/fruits.db",level="9",replica_type="s3"} 1470
# HELP litestream_replica_ltx_files Number of LTX files by compaction level (replica)
# TYPE litestream_replica_ltx_files gauge
litestream_replica_ltx_files{db="/data/fruits.db",level="0",replica_type="s3"} 4
litestream_replica_ltx_files{db="/data/fruits.db",level="1",replica_type="s3"} 5
litestream_replica_ltx_files{db="/data/fruits.db",level="2",replica_type="s3"} 2
litestream_replica_ltx_files{db="/data/fruits.db",level="3",replica_type="s3"} 2
litestream_replica_ltx_files{db="/data/fruits.db",level="9",replica_type="s3"} 2

Example Grafana dashboard using these metrics

☝️ This example involved making ~2MB of writes to a 13MB SQLite db. Litestream was configured with:

    # Compaction levels
    levels:
      - interval: 5m
      - interval: 30m
      - interval: 1h

    # Global snapshot settings
    snapshot:
      interval: 30m
      retention: 1h

Motivation and Context

  • Improve observability of LTX files stored locally and in replicas.
    • Locally: so we can size storage according to needs
    • Replica: so we can calculate storage costs
  • This information helps understand and optimise settings for compaction levels and snapshots

I implemented this to help debug #976

How Has This Been Tested?

I ran two instances in a test environment against two accounts with different compaction and snapshot settings, using R2 as the replica. I can confirm the metrics were generated and captured successfully and align with reality.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (would cause existing functionality to not work as expected)

Checklist

  • My code follows the code style of this project (go fmt, go vet)
  • I have tested my changes (go test ./...)
  • I have updated the documentation accordingly (if needed)

@alongwill alongwill marked this pull request as ready for review January 16, 2026 10:43
@alongwill alongwill force-pushed the feat/add-ltx-file-metrics branch 2 times, most recently from 3362147 to b06faa3 Compare January 16, 2026 11:02
@alongwill alongwill force-pushed the feat/add-ltx-file-metrics branch 2 times, most recently from 1481304 to 665d0d1 Compare January 26, 2026 18:06
Add LTXStats types and methods to track LTX file counts and sizes both locally and in replicas. Update metrics during sync operations to expose file counts and sizes by compaction level.

| Metric                         | Labels                  | Description |
| -------------------------------|-------------------------|------------- |
| `litestream_local_ltx_files`   | db                      | Total LTX files in local metadata |
| `litestream_local_ltx_bytes`   | db                      | Total size of local LTX files |
| `litestream_replica_ltx_files` | db, replica_type, level | LTX files by compaction level |
| `litestream_replica_ltx_bytes` | db, replica_type, level | LTX bytes by compaction level |

Assisted by Claude Opus 4.5.

Signed-off-by: Andy Longwill <andrew.longwill@siderolabs.com>
@alongwill alongwill force-pushed the feat/add-ltx-file-metrics branch from 665d0d1 to 1c546ea Compare February 3, 2026 20:13
Collaborator

@corylanou corylanou left a comment


Hey @alongwill — thanks for this PR and sorry it took so long to get to! The feature is well-motivated and the metric design is clean. Great work on the detailed description and Grafana example.

I found a few issues ranging from a compile-breaking bug in the test to some performance concerns. See the inline comments for details. The main themes are:

  1. Syntax error in replica_test.go that will prevent compilation
  2. Performance — collecting stats on every sync (especially the replica side with remote LIST calls) could be expensive. Consider decoupling from the sync hot path
  3. Stale metrics — emptied compaction levels will retain old non-zero values

Looking forward to the next iteration!

Comment thread replica_test.go
    }
    if result != "ok" {
        t.Fatalf("integrity check returned: %s", result)
    // TestReplica_LTXStats verifies that LTXStats returns correct file counts and sizes from replica.

Bug: This new test function is inserted inside verifyRestoredDB — the closing } for the if result != "ok" block and the function itself (lines 1666-1667 in the base) get pushed to the end of the new test. This won't compile.

The new TestReplica_LTXStats function needs to be placed after the closing } of verifyRestoredDB.

Comment thread db.go
    }
    db.walSizeGauge.Set(float64(newWALSize))

    // Update local LTX file metrics.

Performance: updateLocalLTXMetrics() does a full os.ReadDir + file.Info() walk of every LTX file across all level directories. Since Sync() fires on every WAL change, this could add noticeable I/O overhead for databases with many LTX files.

Consider either:

  • Running this on a timer (e.g., every 30s) rather than every sync
  • Implementing the Prometheus Collector interface so stats are computed only when the /metrics endpoint is scraped

Comment thread replica.go
    r.db.RecordSuccessfulSync()

    // Update replica LTX metrics after successful sync.
    r.UpdateLTXMetrics(ctx)

Performance: UpdateLTXMetrics iterates levels 0 through 9, calling r.Client.LTXFiles() for each. For remote backends (S3, R2, etc.), that's up to 10 LIST API calls per sync per replica. This adds both latency to the sync path and cost.

Same suggestion as the local side — decouple from the sync hot path with a timer or Prometheus Collector.

Comment thread replica.go
    for level := 0; level <= SnapshotLevel; level++ {
        itr, err := r.Client.LTXFiles(ctx, level, 0, false)
        if err != nil {
            // Skip levels that don't exist or have errors

Silently continuing on LTXFiles errors could mask real issues (permissions, network problems, etc.). Consider logging at debug level here, consistent with the itr.Close() error handling a few lines below:

    if err != nil {
        r.Logger().Debug("error listing LTX files", "level", level, "error", err)
        continue
    }

Comment thread replica.go
    }

    dbPath := r.db.Path()
    replicaType := r.Client.Type()

Stale metrics: This only sets gauges for levels that currently have files. If a level previously had files but is now empty (e.g., after compaction), the old non-zero gauge value will persist in Prometheus until process restart.

Consider resetting the gauge vecs for this db/replica before setting new values, or explicitly iterating all levels and setting empty ones to 0.

Comment thread db.go

    // LTXStats holds aggregated statistics about LTX files.
    type LTXStats struct {
        TotalFiles int64

Nit: LTXStats has TotalFiles/TotalBytes and ByLevel, but LocalLTXStats() only populates the totals and Replica.LTXStats() only populates ByLevel. The shared struct is a bit misleading — consider either populating all fields in both cases, or using separate types for local vs. replica stats.
