tsdb/agent: Checkpoint based on Series in Memory

### Proposal

The agent currently uses the same Checkpoint implementation as every other part of Prometheus: https://github.com/prometheus/prometheus/blob/61aa82865d9c8474393bbbdcd539c58f63ba514f/tsdb/wlog/checkpoint.go#L87-L90

In agent mode the checkpoint serves three purposes, plus a fourth that does not apply yet:
1. Populating the agent db `stripeSeries` with known series and their last sample timestamps on startup
2. Populating the series caches in queue_manager on startup
3. Pruning the series caches in queue_manager after a new checkpoint is created
4. _Not applicable for agent mode yet and might be dropped:_ tracking the most recent metadata for a series
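
Purposes 2 and 3 can be pictured with a small model of a series cache. This is only a sketch: `seriesCache`, its fields, and the flattened string labels are hypothetical stand-ins rather than the real queue_manager types, and the method names are used purely for illustration.

```go
package main

import "fmt"

// SeriesRef stands in for chunks.HeadSeriesRef (assumption for this sketch).
type SeriesRef uint64

// seriesCache is a hypothetical, simplified model of a series cache that is
// filled from checkpoint/segment records and pruned after a new checkpoint.
type seriesCache struct {
	labels  map[SeriesRef]string // ref -> label set, flattened to a string here
	segment map[SeriesRef]int    // ref -> index of the segment/checkpoint it was last read from
}

func newSeriesCache() *seriesCache {
	return &seriesCache{
		labels:  map[SeriesRef]string{},
		segment: map[SeriesRef]int{},
	}
}

// StoreSeries records series read from the segment or checkpoint with the given index.
func (c *seriesCache) StoreSeries(series map[SeriesRef]string, index int) {
	for ref, lbls := range series {
		c.labels[ref] = lbls
		c.segment[ref] = index
	}
}

// SeriesReset drops every series that was last seen before the given index,
// i.e. series that did not make it into the new checkpoint.
func (c *seriesCache) SeriesReset(index int) {
	for ref, seg := range c.segment {
		if seg < index {
			delete(c.segment, ref)
			delete(c.labels, ref)
		}
	}
}

func main() {
	c := newSeriesCache()
	c.StoreSeries(map[SeriesRef]string{1: `{job="a"}`, 2: `{job="b"}`}, 3)
	c.StoreSeries(map[SeriesRef]string{2: `{job="b"}`}, 5) // only ref 2 is in the new checkpoint
	c.SeriesReset(5)
	fmt.Println("series left in cache:", len(c.labels))
}
```

The only inputs this model needs are series refs, labels, and an index, which is the same small subset of checkpoint data listed above.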

This is a very small subset of what a checkpoint persists: series that exist in the WAL, samples above mint, float and regular histogram samples above mint, exemplars above mint, and the latest metadata. To create a checkpoint with all of these records we re-read the current checkpoint plus all segments. That is a lot of overhead, given that all the data the agent needs from the checkpoint is already in memory, split between `stripeSeries` and the deleted series in the agent db.
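
To make that overhead concrete, the existing approach can be modelled roughly as a filter pass over every record read back from disk. The sketch below is not the real `wlog` code: the types are simplified stand-ins, and `compact`, `keep`, and the `visited` counter are hypothetical names used only to show that the work scales with everything in the old checkpoint and segments, not with what is kept.

```go
package main

import "fmt"

// Simplified stand-ins for chunks.HeadSeriesRef, record.RefSeries and
// record.RefSample (assumptions for this sketch).
type SeriesRef uint64

type RefSeries struct {
	Ref    SeriesRef
	Labels string
}

type RefSample struct {
	Ref SeriesRef
	T   int64
}

// compact models the filtering done when a checkpoint is rebuilt from the
// previous checkpoint plus all segments: every record is decoded and visited,
// series failing keep are dropped, and samples below mint are dropped.
func compact(series []RefSeries, samples []RefSample, keep func(SeriesRef) bool, mint int64) ([]RefSeries, []RefSample, int) {
	visited := 0
	var outSeries []RefSeries
	for _, s := range series {
		visited++
		if keep(s.Ref) {
			outSeries = append(outSeries, s)
		}
	}
	var outSamples []RefSample
	for _, s := range samples {
		visited++
		if s.T >= mint {
			outSamples = append(outSamples, s)
		}
	}
	return outSeries, outSamples, visited
}

func main() {
	series := []RefSeries{{Ref: 1, Labels: `{job="a"}`}, {Ref: 2, Labels: `{job="b"}`}}
	samples := []RefSample{{Ref: 1, T: 5}, {Ref: 1, T: 15}, {Ref: 2, T: 25}}
	keep := func(ref SeriesRef) bool { return ref != 2 }

	s, smp, visited := compact(series, samples, keep, 10)
	fmt.Println("kept series:", len(s), "kept samples:", len(smp), "records visited:", visited)
}
```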

I propose we introduce another checkpoint implementation, which could look something like:
```go
type ActiveSeries interface {
    Ref() chunks.HeadSeriesRef
    Labels() labels.Labels
    LastSampleTimestamp() int64
}

// Checkpoint creates an unindexed checkpoint containing a record.RefSeries and a
// record.RefSample for each ActiveSeries, and a record.RefSeries for each recentlyDeleted series.
func Checkpoint(logger *slog.Logger, w *WL, seriesIter iter.Seq[ActiveSeries], recentlyDeleted []chunks.HeadSeriesRef)
```
It could be driven by the data we already hold in memory, which would:
1. Reduce the overhead of taking a checkpoint
2. Reduce the overhead of queue_manager reading a checkpoint as checkpoints will be smaller
3. Improve startup times/resource usage due to smaller checkpoint sizes

I did a [quick implementation](https://github.com/grafana/alloy/compare/main...kgeckhart/segment-tracking-new-checkpoint-and-replay#diff-f8404620b6d8190ba878e405a75ea5b9989163caa46892e98e05afb99ce3d519) of this in Grafana Alloy, where it shrank a 214MB checkpoint to 137MB (roughly a 36% reduction), with the following improvements to creating and loading a checkpoint:
```
              │ old-create.txt │           new-create.txt           │
              │     sec/op     │   sec/op     vs base               │
Checkpoint-11     3477.6m ± 7%   913.3m ± 6%  -73.74% (p=0.002 n=6)

              │ old-create.txt │            new-create.txt            │
              │      B/op      │     B/op       vs base               │
Checkpoint-11   2717.25Mi ± 0%   11.52Mi ± 11%  -99.58% (p=0.002 n=6)

              │ old-create.txt  │           new-create.txt           │
              │    allocs/op    │ allocs/op   vs base                │
Checkpoint-11   34087723.5 ± 0%   325.0 ± 1%  -100.00% (p=0.002 n=6)
```
```
                │ baseline-load.txt │           new-load.txt            │
                │      sec/op       │   sec/op    vs base               │
LoadLargeWAL-11          4.195 ± 2%   1.105 ± 5%  -73.67% (p=0.002 n=6)

                │ baseline-load.txt │            new-load.txt             │
                │       B/op        │     B/op      vs base               │
LoadLargeWAL-11        2.001Gi ± 1%   1.204Gi ± 0%  -39.83% (p=0.002 n=6)

                │ baseline-load.txt │            new-load.txt            │
                │     allocs/op     │  allocs/op   vs base               │
LoadLargeWAL-11         35.22M ± 0%   30.76M ± 0%  -12.66% (p=0.002 n=6)
```


For reference, the doc comment on the current `Checkpoint` implementation linked above:

	// Checkpoint creates a compacted checkpoint of segments in range [from, to] in the given WAL.
	// It includes the most recent checkpoint if it exists.
	// All series not satisfying keep, samples/tombstones/exemplars below mint and
	// metadata that are not the latest are dropped.

tsdb/agent: Checkpoint based on Series in Memory #17617
