From beb076ef5f0431ec20fc561c865bcb4887ea7ba9 Mon Sep 17 00:00:00 2001 From: Kyle Eckhart Date: Thu, 18 Dec 2025 15:01:42 -0500 Subject: [PATCH 1/6] Remote Write: Restart from checkpoint Signed-off-by: Kyle Eckhart --- ...00-remote-write-restart-from-checkpoint.md | 165 ++++++++++++++++++ 1 file changed, 165 insertions(+) create mode 100644 proposals/0000-remote-write-restart-from-checkpoint.md diff --git a/proposals/0000-remote-write-restart-from-checkpoint.md b/proposals/0000-remote-write-restart-from-checkpoint.md new file mode 100644 index 0000000..8c48c64 --- /dev/null +++ b/proposals/0000-remote-write-restart-from-checkpoint.md @@ -0,0 +1,165 @@ +## Remote Write: Restart from checkpoint + +* **Owners:** + * [@kgeckhart](https://github.com/kgeckhart) + +* **Implementation Status:** `Not implemented` + +* **Related Issues and PRs:** + +First issue on the matter https://github.com/prometheus/prometheus/issues/8809 spawned from https://github.com/prometheus/prometheus/pull/7710. + +Since then there have been a lot of discussion / attempts but nothing has been merged. See +* https://github.com/prometheus/prometheus/pull/8918 +* https://github.com/prometheus/prometheus/pull/9862 +* https://github.com/ptodev/prometheus/pull/1 + +I believe the issue for [tsdb/agent: Prevent unread segments from being truncated](https://github.com/prometheus/prometheus/issues/17616) would need to be completed but can stand on its own as it's purely for notifying when segments have been read. + +* **Other docs or links:** + +> This effort aims to have an agreed upon design with requirements for completing the work to allow remote write to restart data delivery from a checkpoint and not from `time.Now()` + +## Why + +Remote write is backed by a write-ahead-log (WAL) where all data is persisted before it is sent. +If a config is reloaded or prometheus/agent is restarted before flushing pending samples we will skip those samples. +Given we have a persistent WAL this behavior is unexpected by users and can cause a lot of confusion. + +### Pitfalls of the current solution + +As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still finds they have missing data on restart. + +## Goals + +1. Support resuming from a checkpoint for each configured `remote_write` destination. +2. Taking a checkpoint for a remote_write destination should not incur significant overhead. +3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new checkpoint entry. + * The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s + * These can be expected to change under normal circumstances and should not trigger a data loss scenario. +4. Guards need to be in place to protect against infinite WAL growth. +5. Stretch: Remote write supports at-least-once delivery of samples in the WAL. + * Note: This has appeared to be the largest challenge with any existing implementation as it can cause significant overhead. + +### Audience + +`remote_write` users. + +## Non-Goals + +* Creating a watcher that is capable of tracking offsets. +* Remote write supports exactly-once delivery + +## How + +Enabling replay will require changes across `remote.WriteStorage`, `remote.QueueManager`, and `wlog.Watcher`. Implementing https://github.com/prometheus/prometheus/issues/17616 will help as it will provide a hook to signal when segments have been fully read from the watcher through `remote.WriteStorage`. Anytime the `wlog.Watcher` [changes the current segment](https://github.com/prometheus/prometheus/blob/18efd9d629c467877ebe674bbc1edbba8abe54be/tsdb/wlog/watcher.go#L314) the following would happen, + +```mermaid +sequenceDiagram + wlog.Watcher->>wlog.Watcher: Run() + loop Until closed + wlog.Watcher->>wlog.Watcher: currentSegmentMetric.Set + wlog.Watcher->>remote.QueueManager: OnSegmentChange() + remote.QueueManager->>remote.WriteStorage: OnSegmentChange() + remote.WriteStorage->>remote.WriteStorage: Update currentSegment for Queue + remote.WriteStorage->>remote.WriteStorage: Are all queues at or ahead of currentSegment? + alt yes + remote.WriteStorage->>Subscriber: SegmentChange() + end + end +``` + +`remote.WriteStorage` would be tracking the currentSegment for each queue supporting. Since this isn't completed it's open to discussion but it seems like a reasonable chunk to start with that has values on its own. + +A basic replay to accomplishing all non-stretch goals would be as follows + +### Code flow + +Enabling replay will require changes across `remote.WriteStorage`, `remote.QueueManager`, and `wlog.Watcher`. Implementing https://github.com/prometheus/prometheus/issues/17616 will help as it will provide a hook to signal when segments have been fully read from the watcher through `remote.WriteStorage`. Anytime the `wlog.Watcher` [changes the current segment](https://github.com/prometheus/prometheus/blob/18efd9d629c467877ebe674bbc1edbba8abe54be/tsdb/wlog/watcher.go#L314) the following would happen, + +1. Adding another configurable timer to [`remote.WriteStorage.run()`](https://github.com/prometheus/prometheus/blob/f50ff0a40ad4ef24d9bb8e81a6546c8c994a924a/storage/remote/write.go#L114-L125) periodically persisting the current segments for each queue. +2. Ensure `remote.WriteStorage.Close()` will also attempt to write current segments +3. Read persisted queue segment positions in `remote.NewWriteStorage()` +4. Update `remote.WriteStorage.ApplyConfig` to provide the persisted current segment to `remote.NewQueueManager` +5. Update `remote.NewQueueManager` to provide a starting segment to `wlog.NewWatcher` +6. Update `wlog.NewWatcher.Run()` to start sending samples if a starting segment is configured +7. Walk through the `remote.QueueManager` send code to ensure duplicate data errors will not cause slow downs in data delivery (we have a high probability of sending duplicate data). + +This flow should be enough to to accomplish Goal 1: Support resuming from a checkpoint for each configured `remote_write` destination. + +The act of taking a checkpoint will require a lock to be held but given we do it on a schedule this will be infrequent enough that the implementation should safely accomplish Goal 2: Taking a checkpoint for a remote_write destination should not incur significant overhead (see testing for further info). + +### Checkpoint file format/location + +The segment checkpoint would be stored in the `remote.WriteStorage.dir` which would be next to the `/wal` directory. + +We only care about the queue hash and the current segment so a json encoded file seems reasonable for this. A key value format should make it easier to evolve over time vs a more basic delimited file. + +Solving for, Goal 3: Changing the `queue_configuration` for a `remote_write` destination should not result in a new checkpoint entry. + +This will be done via adding a specific toHash function for RemoteWriteConfig which zeros the QueueConfig before taking the hash. RemoteWriteConfig is managed as a pointer so we'll need to keep the value before, set to empty, and put the original value back but all is reasonably managed. We could look at identifying other "operational" fields which could be excluded from hashing for the same reasons. + +This will change existing queue hashes but I don't believe that to be a big problem and if it is we can do this hashing specifically for segment tracking only. It is proposed as the first task so we can reduce the amount of use cases which can trigger data loss. + +### Testing / Safety + +Goal 4: Guards need to be in place to protect against infinite WAL growth is capable of being accomplished through adjusting config defaults when replaying is enabled. We would require `remote_write.queue_config.sample_age_limit` be non-zero and would have a default of `2h`. + +I believe prombench is sufficient to prove Goal 2: Taking a checkpoint for a remote_write destination should not incur significant overhead. Open to further benchmarking ideas but given the components + time necessary for a proper test ensuring prombench is capable of covering this would be the most ideal. + +### Further reducing duplicated data sent + +Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking the lowest timestamp delivered via remote write to in the checkpoint it could reduce this number (lowest timestamp is required because the WAL supports out of order writes). At startup the tracked lowest timestamp would be used as marker for where to start writing data from within the checkpointed segment ideally reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment. + +### Goal 5: Stretch: Remote write supports at-least-once delivery of samples in the WAL. + +The amount of complexity in this goal is large, it is my opinion that our current state where all samples are lost is worse than implementing a replay which does not give us at-least-once delivery. I believe the proposed replay implementation would provide a good basis for an at-least-once solution. + +A solution would need to involve internals of `remote.QueueManager` as part of an `OnSegmentChange` pipeline. One option could be to implement the same pattern where `remote.WriteStorage` tracks the segment for each `remote.QueueManager`, in this case `remote.QueueManager` would track the segment of each shard and take responsibility for propagating the notification when all shards are at or beyond the segment. + +```mermaid +sequenceDiagram + wlog.Watcher->>wlog.Watcher: Run() + loop Until closed + wlog.Watcher->>wlog.Watcher: currentSegmentMetric.Set + wlog.Watcher->>remote.QueueManager: OnSegmentChange() + remote.QueueManager->>remote.QueueManager.shards: OnSegmentChange() + remote.QueueManager.shards->>remote.QueueManager.shards: Store new segment and current batchQueue depth + 1
(number of batches to send to clear the segment) + remote.QueueManager.shards->>remote.QueueManager.shards: Decrement depths on send
(we could have more than 1 segment enqueued) + remote.QueueManager.shards->>remote.QueueManager.shards: Depth is zero for a segment? + alt yes + remote.QueueManager.shards->>remote.QueueManager: Update currentSegment for shard id + end + remote.QueueManager->>remote.QueueManager: All segments at or ahead of currentSegment? + alt yes + remote.QueueManager->>remote.WriteStorage: OnSegmentChange() + end + remote.WriteStorage->>remote.WriteStorage: Update currentSegment for Queue + remote.WriteStorage->>remote.WriteStorage: Are all queues at or ahead of currentSegment? + alt yes + remote.WriteStorage->>Subscriber: SegmentChange() + end + end +``` + +I believe this bypasses resharding complexity, as a reshard triggers a purging of all queues clearing which would also clear any pending segment changes. The complexity will come from ensuring the overhead from added locking is low enough to keep remote write delivery rates relatively unchanged. + +## Alternatives + +1. `remote.QueueManager` should own syncing its own checkpoint (most early implementations took this approach). + * `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once. + * `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity. +2. The checkpoint should be synchronously updated when segments change. + * Introducing a bit of time between knowing that a segment changed to persisting it gives us more time to fully deliver the batch before we persist the change. + * Synchronously committing it makes the potential gap larger. + * If we assume a 15 second queue delay then syncing the checkpoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed. + * The trade-off being more unnecessary data being replayed on startup. + * After implementing a solution for at-least-once we can reassess how often we commit/if we should make it synchronous. + +## Action Plan + +The tasks to do in order to migrate to the new idea. + +* [ ] Adjust the queue hash function to exclude parameters often adjusted during normal operations (reduces the surface area where data can be lost). +* [ ] Implement the segment change notification pattern proposed in https://github.com/prometheus/prometheus/issues/17616. +* [ ] Add the functionality proposed in the How section (I think it can be accomplished in a single PR without being massive). From d4945e12da21c6c2392670a0086fc48a6b3f0da5 Mon Sep 17 00:00:00 2001 From: Kyle Eckhart Date: Thu, 18 Dec 2025 15:16:28 -0500 Subject: [PATCH 2/6] Add pr id to the proposal file Signed-off-by: Kyle Eckhart --- ...checkpoint.md => 0072-remote-write-restart-from-checkpoint.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename proposals/{0000-remote-write-restart-from-checkpoint.md => 0072-remote-write-restart-from-checkpoint.md} (100%) diff --git a/proposals/0000-remote-write-restart-from-checkpoint.md b/proposals/0072-remote-write-restart-from-checkpoint.md similarity index 100% rename from proposals/0000-remote-write-restart-from-checkpoint.md rename to proposals/0072-remote-write-restart-from-checkpoint.md From df97ae1c2fb62cf662414d9079872f53bdb6e770 Mon Sep 17 00:00:00 2001 From: Kyle Eckhart Date: Thu, 18 Dec 2025 15:29:09 -0500 Subject: [PATCH 3/6] Remove duplicated block Signed-off-by: Kyle Eckhart --- proposals/0072-remote-write-restart-from-checkpoint.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/proposals/0072-remote-write-restart-from-checkpoint.md b/proposals/0072-remote-write-restart-from-checkpoint.md index 8c48c64..3f2c1a7 100644 --- a/proposals/0072-remote-write-restart-from-checkpoint.md +++ b/proposals/0072-remote-write-restart-from-checkpoint.md @@ -75,8 +75,6 @@ A basic replay to accomplishing all non-stretch goals would be as follows ### Code flow -Enabling replay will require changes across `remote.WriteStorage`, `remote.QueueManager`, and `wlog.Watcher`. Implementing https://github.com/prometheus/prometheus/issues/17616 will help as it will provide a hook to signal when segments have been fully read from the watcher through `remote.WriteStorage`. Anytime the `wlog.Watcher` [changes the current segment](https://github.com/prometheus/prometheus/blob/18efd9d629c467877ebe674bbc1edbba8abe54be/tsdb/wlog/watcher.go#L314) the following would happen, - 1. Adding another configurable timer to [`remote.WriteStorage.run()`](https://github.com/prometheus/prometheus/blob/f50ff0a40ad4ef24d9bb8e81a6546c8c994a924a/storage/remote/write.go#L114-L125) periodically persisting the current segments for each queue. 2. Ensure `remote.WriteStorage.Close()` will also attempt to write current segments 3. Read persisted queue segment positions in `remote.NewWriteStorage()` From cdc82e6f3852839114a777c87d89f8eb828fb6ef Mon Sep 17 00:00:00 2001 From: Kyle Eckhart Date: Thu, 29 Jan 2026 14:11:06 -0500 Subject: [PATCH 4/6] Rename checkpoint -> savepoint to eliminate confusion with existing checkpoint Signed-off-by: Kyle Eckhart --- ...72-remote-write-restart-from-savepoint.md} | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) rename proposals/{0072-remote-write-restart-from-checkpoint.md => 0072-remote-write-restart-from-savepoint.md} (82%) diff --git a/proposals/0072-remote-write-restart-from-checkpoint.md b/proposals/0072-remote-write-restart-from-savepoint.md similarity index 82% rename from proposals/0072-remote-write-restart-from-checkpoint.md rename to proposals/0072-remote-write-restart-from-savepoint.md index 3f2c1a7..4c4909a 100644 --- a/proposals/0072-remote-write-restart-from-checkpoint.md +++ b/proposals/0072-remote-write-restart-from-savepoint.md @@ -1,4 +1,4 @@ -## Remote Write: Restart from checkpoint +## Remote Write: Restart from savepoint * **Owners:** * [@kgeckhart](https://github.com/kgeckhart) @@ -18,7 +18,7 @@ I believe the issue for [tsdb/agent: Prevent unread segments from being truncate * **Other docs or links:** -> This effort aims to have an agreed upon design with requirements for completing the work to allow remote write to restart data delivery from a checkpoint and not from `time.Now()` +> This effort aims to have an agreed upon design with requirements for completing the work to allow remote write to restart data delivery from a savepoint and not from `time.Now()` ## Why @@ -32,9 +32,9 @@ As mentioned in the why, this behavior is often confusing to users who know a WA ## Goals -1. Support resuming from a checkpoint for each configured `remote_write` destination. -2. Taking a checkpoint for a remote_write destination should not incur significant overhead. -3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new checkpoint entry. +1. Support resuming from a savepoint for each configured `remote_write` destination. +2. Taking a savepoint for a remote_write destination should not incur significant overhead. +3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry. * The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s * These can be expected to change under normal circumstances and should not trigger a data loss scenario. 4. Guards need to be in place to protect against infinite WAL growth. @@ -83,17 +83,17 @@ A basic replay to accomplishing all non-stretch goals would be as follows 6. Update `wlog.NewWatcher.Run()` to start sending samples if a starting segment is configured 7. Walk through the `remote.QueueManager` send code to ensure duplicate data errors will not cause slow downs in data delivery (we have a high probability of sending duplicate data). -This flow should be enough to to accomplish Goal 1: Support resuming from a checkpoint for each configured `remote_write` destination. +This flow should be enough to to accomplish Goal 1: Support resuming from a savepoint for each configured `remote_write` destination. -The act of taking a checkpoint will require a lock to be held but given we do it on a schedule this will be infrequent enough that the implementation should safely accomplish Goal 2: Taking a checkpoint for a remote_write destination should not incur significant overhead (see testing for further info). +The act of taking a savepoint will require a lock to be held but given we do it on a schedule this will be infrequent enough that the implementation should safely accomplish Goal 2: Taking a savepoint for a remote_write destination should not incur significant overhead (see testing for further info). -### Checkpoint file format/location +### Savepoint file format/location -The segment checkpoint would be stored in the `remote.WriteStorage.dir` which would be next to the `/wal` directory. +The savepoint would be stored in the `remote.WriteStorage.dir` which would be next to the `/wal` directory. We only care about the queue hash and the current segment so a json encoded file seems reasonable for this. A key value format should make it easier to evolve over time vs a more basic delimited file. -Solving for, Goal 3: Changing the `queue_configuration` for a `remote_write` destination should not result in a new checkpoint entry. +Solving for, Goal 3: Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry. This will be done via adding a specific toHash function for RemoteWriteConfig which zeros the QueueConfig before taking the hash. RemoteWriteConfig is managed as a pointer so we'll need to keep the value before, set to empty, and put the original value back but all is reasonably managed. We could look at identifying other "operational" fields which could be excluded from hashing for the same reasons. @@ -103,11 +103,11 @@ This will change existing queue hashes but I don't believe that to be a big prob Goal 4: Guards need to be in place to protect against infinite WAL growth is capable of being accomplished through adjusting config defaults when replaying is enabled. We would require `remote_write.queue_config.sample_age_limit` be non-zero and would have a default of `2h`. -I believe prombench is sufficient to prove Goal 2: Taking a checkpoint for a remote_write destination should not incur significant overhead. Open to further benchmarking ideas but given the components + time necessary for a proper test ensuring prombench is capable of covering this would be the most ideal. +I believe prombench is sufficient to prove Goal 2: Taking a savepoint for a remote_write destination should not incur significant overhead. Open to further benchmarking ideas but given the components + time necessary for a proper test ensuring prombench is capable of covering this would be the most ideal. ### Further reducing duplicated data sent -Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking the lowest timestamp delivered via remote write to in the checkpoint it could reduce this number (lowest timestamp is required because the WAL supports out of order writes). At startup the tracked lowest timestamp would be used as marker for where to start writing data from within the checkpointed segment ideally reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment. +Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking the lowest timestamp delivered via remote write to in the savepoint it could reduce this number (lowest timestamp is required because the WAL supports out of order writes). At startup the tracked lowest timestamp would be used as marker for where to start writing data, reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment. ### Goal 5: Stretch: Remote write supports at-least-once delivery of samples in the WAL. @@ -144,13 +144,13 @@ I believe this bypasses resharding complexity, as a reshard triggers a purging o ## Alternatives -1. `remote.QueueManager` should own syncing its own checkpoint (most early implementations took this approach). +1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach). * `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once. * `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity. -2. The checkpoint should be synchronously updated when segments change. +2. The savepoint should be synchronously updated when segments change. * Introducing a bit of time between knowing that a segment changed to persisting it gives us more time to fully deliver the batch before we persist the change. * Synchronously committing it makes the potential gap larger. - * If we assume a 15 second queue delay then syncing the checkpoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed. + * If we assume a 15 second queue delay then syncing the savepoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed. * The trade-off being more unnecessary data being replayed on startup. * After implementing a solution for at-least-once we can reassess how often we commit/if we should make it synchronous. From eb322333ab5361c79457f1f40f2d26982a833364 Mon Sep 17 00:00:00 2001 From: Kyle Eckhart Date: Tue, 24 Feb 2026 16:51:04 -0500 Subject: [PATCH 5/6] Remove feedback: typos and other small fixes, higher level implementation proposal, expand more on at-least-once, alternatives pros/cons Signed-off-by: Kyle Eckhart --- ...072-remote-write-restart-from-savepoint.md | 123 ++++++------------ 1 file changed, 42 insertions(+), 81 deletions(-) diff --git a/proposals/0072-remote-write-restart-from-savepoint.md b/proposals/0072-remote-write-restart-from-savepoint.md index 4c4909a..9ceb9b7 100644 --- a/proposals/0072-remote-write-restart-from-savepoint.md +++ b/proposals/0072-remote-write-restart-from-savepoint.md @@ -1,4 +1,4 @@ -## Remote Write: Restart from savepoint +## Remote Write: Restart from segment-based savepoint * **Owners:** * [@kgeckhart](https://github.com/kgeckhart) @@ -14,8 +14,6 @@ Since then there have been a lot of discussion / attempts but nothing has been m * https://github.com/prometheus/prometheus/pull/9862 * https://github.com/ptodev/prometheus/pull/1 -I believe the issue for [tsdb/agent: Prevent unread segments from being truncated](https://github.com/prometheus/prometheus/issues/17616) would need to be completed but can stand on its own as it's purely for notifying when segments have been read. - * **Other docs or links:** > This effort aims to have an agreed upon design with requirements for completing the work to allow remote write to restart data delivery from a savepoint and not from `time.Now()` @@ -26,16 +24,12 @@ Remote write is backed by a write-ahead-log (WAL) where all data is persisted be If a config is reloaded or prometheus/agent is restarted before flushing pending samples we will skip those samples. Given we have a persistent WAL this behavior is unexpected by users and can cause a lot of confusion. -### Pitfalls of the current solution - -As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still finds they have missing data on restart. - ## Goals -1. Support resuming from a savepoint for each configured `remote_write` destination. +1. Support resuming from a savepoint for each configured `remote_write` destination via an opt-in feature flag. 2. Taking a savepoint for a remote_write destination should not incur significant overhead. -3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry. - * The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s +3. Changing the `queue_configuration` for a `remote_write` destination should not result in losing a savepoint entry. + * The `queue_configuration` includes fields like min/max shards and other performance tuning parameters. * These can be expected to change under normal circumstances and should not trigger a data loss scenario. 4. Guards need to be in place to protect against infinite WAL growth. 5. Stretch: Remote write supports at-least-once delivery of samples in the WAL. @@ -47,45 +41,30 @@ As mentioned in the why, this behavior is often confusing to users who know a WA ## Non-Goals -* Creating a watcher that is capable of tracking offsets. +* Tracking position within a WAL segment (byte or record-level offsets). The savepoint tracks at segment granularity only. * Remote write supports exactly-once delivery ## How -Enabling replay will require changes across `remote.WriteStorage`, `remote.QueueManager`, and `wlog.Watcher`. Implementing https://github.com/prometheus/prometheus/issues/17616 will help as it will provide a hook to signal when segments have been fully read from the watcher through `remote.WriteStorage`. Anytime the `wlog.Watcher` [changes the current segment](https://github.com/prometheus/prometheus/blob/18efd9d629c467877ebe674bbc1edbba8abe54be/tsdb/wlog/watcher.go#L314) the following would happen, - -```mermaid -sequenceDiagram - wlog.Watcher->>wlog.Watcher: Run() - loop Until closed - wlog.Watcher->>wlog.Watcher: currentSegmentMetric.Set - wlog.Watcher->>remote.QueueManager: OnSegmentChange() - remote.QueueManager->>remote.WriteStorage: OnSegmentChange() - remote.WriteStorage->>remote.WriteStorage: Update currentSegment for Queue - remote.WriteStorage->>remote.WriteStorage: Are all queues at or ahead of currentSegment? - alt yes - remote.WriteStorage->>Subscriber: SegmentChange() - end - end -``` +A basic replay to accomplish all non-stretch goals would be as follows. -`remote.WriteStorage` would be tracking the currentSegment for each queue supporting. Since this isn't completed it's open to discussion but it seems like a reasonable chunk to start with that has values on its own. +### Implementation flow -A basic replay to accomplishing all non-stretch goals would be as follows +**On startup:** -### Code flow +1. Read the savepoint file and load the last saved segment number for each queue. +2. Pass the saved segment to the watcher for each queue so replay begins from that segment rather than the current WAL head. -1. Adding another configurable timer to [`remote.WriteStorage.run()`](https://github.com/prometheus/prometheus/blob/f50ff0a40ad4ef24d9bb8e81a6546c8c994a924a/storage/remote/write.go#L114-L125) periodically persisting the current segments for each queue. -2. Ensure `remote.WriteStorage.Close()` will also attempt to write current segments -3. Read persisted queue segment positions in `remote.NewWriteStorage()` -4. Update `remote.WriteStorage.ApplyConfig` to provide the persisted current segment to `remote.NewQueueManager` -5. Update `remote.NewQueueManager` to provide a starting segment to `wlog.NewWatcher` -6. Update `wlog.NewWatcher.Run()` to start sending samples if a starting segment is configured -7. Walk through the `remote.QueueManager` send code to ensure duplicate data errors will not cause slow downs in data delivery (we have a high probability of sending duplicate data). +**At runtime:** -This flow should be enough to to accomplish Goal 1: Support resuming from a savepoint for each configured `remote_write` destination. +3. On a configurable schedule, write the current segment for each queue to the savepoint file on disk. +4. On clean shutdown, also write the savepoint before exiting. -The act of taking a savepoint will require a lock to be held but given we do it on a schedule this will be infrequent enough that the implementation should safely accomplish Goal 2: Taking a savepoint for a remote_write destination should not incur significant overhead (see testing for further info). +**Duplicate handling:** + +5. Since replay starts at a segment boundary rather than an exact position within the segment, some already-delivered samples may be re-sent. The remote write destination must handle duplicate or out-of-order sample errors gracefully so these do not slow down delivery. The probability of duplicates on startup after replay is high due to redoing whole segments. + +This flow should be enough to accomplish Goals 1 and 2. The savepoint write requires a lock but given it happens on a schedule it will be infrequent enough to avoid significant overhead (see testing for further info). Ideally, the implementation could help solve [tsdb/agent: Prevent unread segments from being truncated](https://github.com/prometheus/prometheus/issues/17616) which would require the agent to be made aware when remote write has progressed passed a specific segment. ### Savepoint file format/location @@ -93,6 +72,15 @@ The savepoint would be stored in the `remote.WriteStorage.dir` which would be ne We only care about the queue hash and the current segment so a json encoded file seems reasonable for this. A key value format should make it easier to evolve over time vs a more basic delimited file. +Example savepoint file (keys are queue hashes, values are savepoint entries): + +```json +{ + "abc123def456": { "segment": 42 }, + "789xyz012abc": { "segment": 39 } +} +``` + Solving for, Goal 3: Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry. This will be done via adding a specific toHash function for RemoteWriteConfig which zeros the QueueConfig before taking the hash. RemoteWriteConfig is managed as a pointer so we'll need to keep the value before, set to empty, and put the original value back but all is reasonably managed. We could look at identifying other "operational" fields which could be excluded from hashing for the same reasons. @@ -105,54 +93,27 @@ Goal 4: Guards need to be in place to protect against infinite WAL growth is cap I believe prombench is sufficient to prove Goal 2: Taking a savepoint for a remote_write destination should not incur significant overhead. Open to further benchmarking ideas but given the components + time necessary for a proper test ensuring prombench is capable of covering this would be the most ideal. -### Further reducing duplicated data sent - -Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking the lowest timestamp delivered via remote write to in the savepoint it could reduce this number (lowest timestamp is required because the WAL supports out of order writes). At startup the tracked lowest timestamp would be used as marker for where to start writing data, reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment. - ### Goal 5: Stretch: Remote write supports at-least-once delivery of samples in the WAL. -The amount of complexity in this goal is large, it is my opinion that our current state where all samples are lost is worse than implementing a replay which does not give us at-least-once delivery. I believe the proposed replay implementation would provide a good basis for an at-least-once solution. - -A solution would need to involve internals of `remote.QueueManager` as part of an `OnSegmentChange` pipeline. One option could be to implement the same pattern where `remote.WriteStorage` tracks the segment for each `remote.QueueManager`, in this case `remote.QueueManager` would track the segment of each shard and take responsibility for propagating the notification when all shards are at or beyond the segment. - -```mermaid -sequenceDiagram - wlog.Watcher->>wlog.Watcher: Run() - loop Until closed - wlog.Watcher->>wlog.Watcher: currentSegmentMetric.Set - wlog.Watcher->>remote.QueueManager: OnSegmentChange() - remote.QueueManager->>remote.QueueManager.shards: OnSegmentChange() - remote.QueueManager.shards->>remote.QueueManager.shards: Store new segment and current batchQueue depth + 1
(number of batches to send to clear the segment) - remote.QueueManager.shards->>remote.QueueManager.shards: Decrement depths on send
(we could have more than 1 segment enqueued) - remote.QueueManager.shards->>remote.QueueManager.shards: Depth is zero for a segment? - alt yes - remote.QueueManager.shards->>remote.QueueManager: Update currentSegment for shard id - end - remote.QueueManager->>remote.QueueManager: All segments at or ahead of currentSegment? - alt yes - remote.QueueManager->>remote.WriteStorage: OnSegmentChange() - end - remote.WriteStorage->>remote.WriteStorage: Update currentSegment for Queue - remote.WriteStorage->>remote.WriteStorage: Are all queues at or ahead of currentSegment? - alt yes - remote.WriteStorage->>Subscriber: SegmentChange() - end - end -``` +The amount of complexity in this goal is large, it is my opinion that our current state where all samples are lost is worse than implementing a replay which does not give us at-least-once delivery. The basic segment replay has a gap: the savepoint advances when the watcher moves to a new segment, but the queue may not have finished sending all samples from the previous segment — a restart between the savepoint being written and the queue flushing that segment still loses those samples. I believe the proposed replay provides a good basis for closing this gap. -I believe this bypasses resharding complexity, as a reshard triggers a purging of all queues clearing which would also clear any pending segment changes. The complexity will come from ensuring the overhead from added locking is low enough to keep remote write delivery rates relatively unchanged. +An intermediate step would be to track the lowest timestamp successfully delivered in the savepoint. At startup, this timestamp would be used as a marker to skip already-delivered samples within the replayed segment, reducing duplicates. The lowest timestamp is required rather than the latest because the WAL supports out-of-order writes. At worst, replay still starts from the beginning of the segment. This doesn't help solve our at-least-once goal it helps reduce the amount of duplicated data sent on startup. + +A true at-least-once solution would require tracking the segments through the queue. Since each queue uses multiple parallel shards to send data to the remote destination we would need every shard to confirm it has finished delivering all samples from a segment before the savepoint advances for that segment. The potential for more blocking here is large and before attempting to solve this problem it would be best to tackle reported remote write contention (see https://github.com/prometheus/prometheus/issues/17277). ## Alternatives -1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach). - * `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once. - * `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity. -2. The savepoint should be synchronously updated when segments change. - * Introducing a bit of time between knowing that a segment changed to persisting it gives us more time to fully deliver the batch before we persist the change. - * Synchronously committing it makes the potential gap larger. - * If we assume a 15 second queue delay then syncing the savepoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed. - * The trade-off being more unnecessary data being replayed on startup. - * After implementing a solution for at-least-once we can reassess how often we commit/if we should make it synchronous. +1. **The queue owns syncing its own savepoint** (most early implementations took this approach). + * Pros: Savepoint logic lives close to the data being tracked. + * Cons: The queue already has significant responsibilities and will take on more for the at-least-once stretch goal. Centralizing savepoint persistence in the write storage layer keeps the queue focused. + +2. **The savepoint is synchronously updated when segments change during WAL watching.** + * Pros: Simpler implementation — no separate timer needed. Given the queue has limited depth, watcher segment tracking may be a sufficient persistence point. + * Cons: Synchronously committing on every segment change means the savepoint may advance before the queue has had time to deliver the batch, increasing the amount of data replayed on restart. A periodic delayed approach (e.g., persist every 30s with a ~15s queue delay) gives the queue more time to process a segment before committing. After implementing at-least-once this decision can be revisited. + +3. **A `SegmentTracker` component injected into the watcher owns savepoint persistence** (rather than write storage orchestrating it). + * Pros: Simpler for the basic replay — persistence stays close to where segment changes are observed, and reuses the work from https://github.com/prometheus/prometheus/issues/17616. + * Cons: This approach breaks down as requirements grow. Adding the lowest timestamp to the savepoint requires the savepoint to move up to the queue. Adding at-least-once requires segment tracking to consider when data was fully sent, which also moves up to the queue. The write storage approach is chosen to avoid migrating ownership multiple times. ## Action Plan From 6f5e380e0fa6e3d80df124db282e23b3076353a9 Mon Sep 17 00:00:00 2001 From: Kyle Eckhart Date: Tue, 24 Feb 2026 20:29:25 -0500 Subject: [PATCH 6/6] fmt Signed-off-by: Kyle Eckhart --- proposals/0072-remote-write-restart-from-savepoint.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proposals/0072-remote-write-restart-from-savepoint.md b/proposals/0072-remote-write-restart-from-savepoint.md index 9ceb9b7..ffea793 100644 --- a/proposals/0072-remote-write-restart-from-savepoint.md +++ b/proposals/0072-remote-write-restart-from-savepoint.md @@ -64,7 +64,7 @@ A basic replay to accomplish all non-stretch goals would be as follows. 5. Since replay starts at a segment boundary rather than an exact position within the segment, some already-delivered samples may be re-sent. The remote write destination must handle duplicate or out-of-order sample errors gracefully so these do not slow down delivery. The probability of duplicates on startup after replay is high due to redoing whole segments. -This flow should be enough to accomplish Goals 1 and 2. The savepoint write requires a lock but given it happens on a schedule it will be infrequent enough to avoid significant overhead (see testing for further info). Ideally, the implementation could help solve [tsdb/agent: Prevent unread segments from being truncated](https://github.com/prometheus/prometheus/issues/17616) which would require the agent to be made aware when remote write has progressed passed a specific segment. +This flow should be enough to accomplish Goals 1 and 2. The savepoint write requires a lock but given it happens on a schedule it will be infrequent enough to avoid significant overhead (see testing for further info). Ideally, the implementation could help solve [tsdb/agent: Prevent unread segments from being truncated](https://github.com/prometheus/prometheus/issues/17616) which would require the agent to be made aware when remote write has progressed passed a specific segment. ### Savepoint file format/location @@ -97,7 +97,7 @@ I believe prombench is sufficient to prove Goal 2: Taking a savepoint for a remo The amount of complexity in this goal is large, it is my opinion that our current state where all samples are lost is worse than implementing a replay which does not give us at-least-once delivery. The basic segment replay has a gap: the savepoint advances when the watcher moves to a new segment, but the queue may not have finished sending all samples from the previous segment — a restart between the savepoint being written and the queue flushing that segment still loses those samples. I believe the proposed replay provides a good basis for closing this gap. -An intermediate step would be to track the lowest timestamp successfully delivered in the savepoint. At startup, this timestamp would be used as a marker to skip already-delivered samples within the replayed segment, reducing duplicates. The lowest timestamp is required rather than the latest because the WAL supports out-of-order writes. At worst, replay still starts from the beginning of the segment. This doesn't help solve our at-least-once goal it helps reduce the amount of duplicated data sent on startup. +An intermediate step would be to track the lowest timestamp successfully delivered in the savepoint. At startup, this timestamp would be used as a marker to skip already-delivered samples within the replayed segment, reducing duplicates. The lowest timestamp is required rather than the latest because the WAL supports out-of-order writes. At worst, replay still starts from the beginning of the segment. This doesn't help solve our at-least-once goal it helps reduce the amount of duplicated data sent on startup. A true at-least-once solution would require tracking the segments through the queue. Since each queue uses multiple parallel shards to send data to the remote destination we would need every shard to confirm it has finished delivering all samples from a segment before the savepoint advances for that segment. The potential for more blocking here is large and before attempting to solve this problem it would be best to tackle reported remote write contention (see https://github.com/prometheus/prometheus/issues/17277).