proposal: TSDB Support for Start Timestamp (ST) by bwplotka · Pull Request #60 · prometheus/proposals

bwplotka · 2025-09-09T11:26:26Z

As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and the delta WG), we decided to create a formal proposal on what native CT/ST storage in Prometheus could look like and how to make it useful for (i.e. unblock) delta temporality.

bwplotka · 2025-09-12T04:34:25Z

FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion of this proposal's decisions. Thanks for this productive time!

Here are some notes:

- Bartek introducing proposal details.
- Fiona adding more context on reset hints proposal: TSDB Support for Start Timestamp (ST) #60 (comment)
- Fiona sharing suggestions for delta being a "mini-cumulative" story.
- General alignment on technical decisions.
- Fiona suggested we double-check whether the start/create : end time interval is inclusive (left as a TODO; OTel is inclusive on end time only).
- Fiona asked about differences between gauge and counter CT storage, especially given that gauges in some systems can have CT.
  - Bartek (now): There are none for now; mentioned that in the general decisions.
- On TSDB read (programmatic) interfaces:
  - Owen: Should we discuss failover algorithms here, i.e. what can you do? Don't take on too many invariants/assumptions; leave room for flexibility, e.g. lack of CTs.
  - Fiona: Is it worth adding any assumptions around CT semantics (e.g. on append)?
  - Owen: We should document our assumptions and evolve with reality.
  - Arthur: Fiona has a proposal with a lot of details.
  - Bartek: So far we didn't put ANY requirements on CT on write.
  - Bartek (now): I added a related section, # Proposed CT semantics and validation -- @ywwg could you help me explore what those restrictions could look like? And what if we do SHOULD or MUST on those?
- Arthur noticed that the delta feature is a "SHOULD" goal for the CT proposal; he aligned expectations around Grafana's interest in delivering delta support.
  - Bartek: Ack. Happy to move to MUST if it helps. I left SHOULD to stay open-minded about extreme cases where the solutions for cumulative CT and delta ST are better off being entirely different; it would be silly to push in a single inefficient direction in this proposal. I don't see this being the case now though.
- Outlining pros & cons for the CT -> ST renaming alternative:
  - Fiona/Ar/B: Just stick with one naming
  - Bartek/Fiona: No strong opinion at this point
  - Arthur: I'd vote for CT
  - Owen: There are more future users than previous users, I'd vote for changing to ST.
  - Bartek (now): I added one more argument to keep CT -- CT or ST naming is equally correct/incorrect in this context. Prometheus is a cumulative-first system, so choosing CT might be fair.

Also updated proposal today with some learnings. Finally proposed a single feature flag for this work (ct-storage).

Still lots of TODOs and anyone is welcome to help!

Signed-off-by: bwplotka <bwplotka@gmail.com>

ywwg

Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅 . I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocal.

ywwg · 2025-09-12T15:49:28Z

+
+TODO: Just a draft, to be discussed.
+TODO: There are questions around:
+* Should we do inclusive vs exclusive intervals?


I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.

Don't worry about Google; let's definitely find the best balanced choice for now, without config flags. If anything, it's baked into the data, so any read-side config would be hard to use (it would require deep user knowledge of their metrics, which is not very common).

What makes the most sense for Prometheus?

What's the use case where inclusive/exclusive matters for users? The most common usage will be increase/rate calculation, where using the range T-ST or T-(ST+epsilon) doesn't make any difference; even a 1ms shift won't make a practical difference.

Seems to me it only makes some difference if we implement PromQL by emulating a zero sample again, not if we implement it "directly". I don't know enough about the read side plans to judge.
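To put a number on the "1ms won't make a practical difference" point, here is a minimal sketch that computes a rate directly from a value and its start timestamp, once with the start taken as-is and once shifted by 1ms. The helper `rateSinceStart` and the sample values are illustrative assumptions, not anything from the Prometheus codebase.

```go
package main

import "fmt"

// rateSinceStart is a hypothetical helper: it returns the per-second rate of
// a value accumulated between startMs and tMs (both Unix milliseconds).
func rateSinceStart(value float64, startMs, tMs int64) float64 {
	return value / (float64(tMs-startMs) / 1000.0)
}

func main() {
	const st, t = int64(0), int64(60_000) // a 60s window
	asIs := rateSinceStart(120, st, t)
	shifted := rateSinceStart(120, st+1, t) // start moved by 1ms (ST+epsilon)
	fmt.Printf("as-is=%.6f shifted=%.6f relative diff=%.2e\n",
		asIs, shifted, (shifted-asIs)/asIs)
}
```

For a 60s window the relative difference is about one part in 60,000, which supports the argument that the choice only becomes visible for non-rate use cases.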

We've discussed this in the delta WG.

For the rate calculations indeed it doesn't matter. And that's the main use case.
For non rate calculations there are some questions:

What happens if we plot the series? Do we try to show the zero point? For delta metrics where the ST is equal to the previous sample's timestamp, it would lead to having two points at the same timestamp, which doesn't make much sense. Showing a sawtooth pattern with a 0 placed 1ms after the previous timestamp doesn't seem productive either. For a cumulative counter it may make some sense; indeed, I feel like that's something customers will ask for.

What happens if we do sum_over_time()? Again, for cumulative it's OK to add a 0, but for a delta metric, which value would we take: the previous value or 0? In this case exclusive makes more sense.

How about count()? This is less about exclusive/inclusive and more about whether the time pointed to by the start time counts as the start of the series.

I'm leaving this open, but we agreed to keep the semantics out of this proposal. See ### Proposed ST semantics and validation

Given that the majority of our delta data comes from OTel, I'd lean towards OTel semantics, so exclusive.

https://opentelemetry.io/docs/specs/otel/metrics/data-model/#:~:text=Delta%20temporality%20means%20that%20successive%20data%20points%20advance%20the%20starting%20timestamp.%20For%20example%2C%20from%20start%20time%20T0%2C%20delta%20data%20points%20cover%20time%20ranges%20(T0%2C%20T1%5D%2C%20(T1%2C%20T2%5D%2C%20(T2%2C%20T3%5D%2C%20and%20so%20on.
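The linked OTel text describes delta points as covering `(T0, T1]`, `(T1, T2]`, and so on: the start is exclusive and the end is inclusive. A minimal sketch of that membership test, using a hypothetical `coveredBy` helper, shows why a boundary timestamp belongs to exactly one point:

```go
package main

import "fmt"

// coveredBy is a hypothetical helper: it reports whether ts falls in the
// half-open interval (start, end], i.e. start exclusive, end inclusive.
func coveredBy(ts, start, end int64) bool {
	return ts > start && ts <= end
}

func main() {
	// Two consecutive delta points covering (0, 10] and (10, 20].
	fmt.Println(coveredBy(10, 0, 10))  // true: the boundary belongs to the first point
	fmt.Println(coveredBy(10, 10, 20)) // false: and not to the second
}
```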

dashpole · 2025-09-12T19:16:27Z

This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker.

bwplotka · 2025-10-06T08:29:35Z

Back from some PTO/leave, will try to address comments and finalize interface and TSDB piece soon

ArthurSens

I'm adding just a few style/correction comments. I'm midway through the document and still haven't reviewed the proposed interfaces.

Regarding the CT vs ST discussion, I see the point that we'll always have more new users than old ones, but I feel like the CT terminology is so ingrained in the ecosystem that even if we change it now, people will continue to call it CT. Of course this is based on "voices in my head" and there's no real confirmation that this is gonna happen in the future 😅

bwplotka

I addressed the first pass, thanks for the reviews!

Related to prometheus/proposals#60 Signed-off-by: bwplotka <bwplotka@gmail.com> Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

bwplotka · 2026-02-18T13:42:03Z

TODO: Propose path for remote read (see https://cloud-native.slack.com/archives/C08C6CMEUF6/p1771420614963339?thread_ts=1770901936.929979&cid=C08C6CMEUF6)

Incorporate feedback from reviewers: update delta support goal to MUST, add section explaining unknown start-time resets mapping to 0 at ingestion, clarify storage flexibility across metric types, fix broken markdown links, and correct multiple minor typos. Signed-off-by: bwplotka <bwplotka@gmail.com>

Signed-off-by: bwplotka <bwplotka@gmail.com>

bwplotka · 2026-05-14T10:59:43Z

Addressed all comments, plus it's already implemented 🎉

Waiting for formal approvals as discussed in Slack

ywwg

🚀

dashpole · 2026-05-14T15:35:52Z

+* The exact consumption semantics is still experimental thus we want to stay flexible and don't block future use cases (e.g. exact semantics of ST > T).
+* We can always add validation features opt-in.
+
+For unknown start-time reset points (e.g. OpenTelemetry cumulative points where `ST == T`), ingestion layers (such as OTLP receivers or Remote Write endpoints) are expected to map these to `0` (unknown ST) when appending to storage. This optimizes storage (0 takes exactly 1 bit in chunk/WAL formats) and simplifies detection without requiring timestamp matching.
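As a reading aid for the quoted paragraph, the mapping it describes could be sketched roughly as follows. This is only an illustration of the rule as quoted (which the rest of this thread goes on to debate); `normalizeST`, `unknownST` and the `isCumulative` parameter are hypothetical names, not the actual OTLP receiver or Remote Write code.

```go
package main

import "fmt"

// unknownST is the sentinel for "start timestamp unknown" (0 in the proposal text).
const unknownST int64 = 0

// normalizeST is a hypothetical ingestion-side helper. For cumulative points
// it maps ST == T (an OTel unknown start-time reset point) to unknownST.
// Delta points are left untouched, since deltas are allowed to have ST == T.
func normalizeST(st, t int64, isCumulative bool) int64 {
	if isCumulative && st == t {
		return unknownST
	}
	return st
}

func main() {
	fmt.Println(normalizeST(100, 100, true))  // cumulative, ST == T: stored as unknown
	fmt.Println(normalizeST(90, 100, true))   // cumulative, known start: kept
	fmt.Println(normalizeST(100, 100, false)) // delta, ST == T: kept
}
```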


What about the points that follow in the series? Won't those have wildly incorrect rates?

Yea, I thought we decided this and AI filled this - but it's good to unpack again. Let's stop and decide.

~~Benefit of keeping ST==T vs ST==0 separate is that you know if unknown ST is on cumulative vs delta. But is that separation needed? Do we want that?~~

~~Naive thinking... For cumulatives, it feels it should resolve to old value detection. For delta it should treat ST==T as ST is previous T, right? Unless we detect delta vs cumulatives some other way~~

EDIT: I read this as delta ST==T being converted to 0, but it's for cumulatives, which we kind of have to support as 0 anyway? Why not make those consistent?

cc @vpranckaitis @krajorama what do we assume on the current query path? I have to catch up with the latest updates there (:

Actually it's only for cumulatives here... so could be ok?

It's just weird that on scrape unknown ST is 0 and from OTLP it's ST==T. We need to handle 0 anyway.

Deltas are allowed to have ST=T so this section might be what we want?

So for context, the ST == T points come from this: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#cumulative-streams-inserting-true-reset-points.

Consider:

Point 1 has ST == T1. This seems OK to set start time to unknown.

Point 2 has ST == T1, so interval is [T1, T2]. This seems problematic because the interval is a valid time interval. But OTel sends the entire cumulative value with this point, so rates are way too high.

Ideally, we would somehow mark T1 as permanently unknown for the series, so that Point 2 would also get an unknown timestamp?
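To make the "rates are way too high" concern concrete, here is a small worked sketch with made-up numbers: a counter already at 1000 when first observed at T1, and at 1010 ten seconds later. Treating Point 2's `[T1, T2]` interval as if the whole value accumulated inside it gives a very different answer from comparing the two samples. Both helper functions are illustrative assumptions.

```go
package main

import "fmt"

// naiveRateFromST treats the whole cumulative value as if it had been
// accumulated between the start timestamp and the sample timestamp (ms).
func naiveRateFromST(value float64, stMs, tMs int64) float64 {
	return value / (float64(tMs-stMs) / 1000.0)
}

// pairRate computes the rate from the difference between two samples.
func pairRate(prev, cur float64, prevTMs, curTMs int64) float64 {
	return (cur - prev) / (float64(curTMs-prevTMs) / 1000.0)
}

func main() {
	const t1, t2 = int64(100_000), int64(110_000)
	// Point 1: value 1000, ST == T1. Point 2: value 1010, ST == T1.
	fmt.Println(naiveRateFromST(1010, t1, t2)) // 101 per second
	fmt.Println(pairRate(1000, 1010, t1, t2))  // 1 per second
}
```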

Assuming we store ST==T as is, there are two cases at query time and two ways we want to use the start time:

- the first sample is in range:
  - reset detection (works on pairs of samples): we check that ST is between sample 1 and sample 2, and we can also check whether it was unknown at the 1st sample, so we can avoid detecting the reset
  - extrapolation: no pair to check, ignored
- the first sample is not in range:
  - reset detection: we don't do it for the left side of the range
  - extrapolation: I think since the first sample wasn't in range, the zero point calculated from ST wouldn't be in range either, so not relevant?

Ah, ok, important context.

Sounds like @dashpole, you propose to optimize for another algorithm for unknown ST handling (true_reset_point vs the one we have now in Prometheus with st-synthesis, so subtract_initial_point that arguably fits well with the traditional rate logic).

true_reset_point never worked in Prometheus, does it make sense to optimize for it now vs recommending to not use it?

Can we even handle true_reset_point? Even in the "first sample is in range" case, what do we do if we see

(ST1, T2),(ST3,T3) aka "true reset point", (ST3,T4)

Between T2 and T3 we have zero idea if there was a reset or not, we just know the source didn't know either, no? am I missing sth?

What if some backend designed another one? Can we even keep up with different variations?

After all Prometheus is some backend and we could be opinionated in minimal semantics we need 🤔

My main point was that I don't think we can make the start timestamp of only the first point unknown without making all of the start timestamps after it also unknown.

I would suggest: If this case causes problems, reject points with ST == T. If it doesn't cause problems, just accept them as-is. I couldn't quite tell from @krajorama's comment above if it will actually cause issues or not.

I don't think we should introduce special handling for this case.

My main point was that I don't think we can make the start timestamp of only the first point unknown without making all of the start timestamps after it also unknown.

This isn't a problem for query side, and we still can recognize OTel's cumulative with unknown start time even if we modify just the first ST in the sequence.

The image above shows how start timestamps should behave in OTel for deltas and cumulative counters. If we compare datapoints T1 and T2 between delta and cumulative with unknown start time, we see that the only difference is the value of ST1. However, the treatment of these two should be different: there should be a reset between T1 and T2 for deltas, but no reset for cumulative with unknown start time. In order to decide, we check that ST2 == T1 to rule out normal cumulative, and then we check that ST1 == T1 to single out cumulative with unknown start time.

Datapoints T3 and beyond don't matter, because for deltas you can apply the same logic as in previous paragraph, and for cumulatives ST3 < T2.
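The two checks described above can be sketched as a small classifier over the first two datapoints. This is an illustration only: the `point` and `kind` types are hypothetical, and it assumes well-formed input as in the comparison above (ST2 <= T1), so a genuine reset with ST2 > T1 is out of scope here.

```go
package main

import "fmt"

// point holds a start timestamp and a sample timestamp.
type point struct {
	ST, T int64
}

type kind string

const (
	normalCumulative       kind = "cumulative"
	cumulativeUnknownStart kind = "cumulative with unknown start"
	delta                  kind = "delta"
)

// classify applies the two checks to the first two datapoints of a series.
func classify(p1, p2 point) kind {
	if p2.ST != p1.T {
		return normalCumulative // ST2 predates T1: ordinary cumulative
	}
	if p1.ST == p1.T {
		return cumulativeUnknownStart // ST1 == T1: no reset between T1 and T2
	}
	return delta // ST2 == T1 with a known ST1: reset between T1 and T2
}

func main() {
	fmt.Println(classify(point{ST: 5, T: 10}, point{ST: 5, T: 20}))
	fmt.Println(classify(point{ST: 10, T: 10}, point{ST: 10, T: 20}))
	fmt.Println(classify(point{ST: 5, T: 10}, point{ST: 10, T: 20}))
}
```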

Now what if we decided to replace ST1 == T1 with some other value, let's say 0, but keep all the other ST values intact? Then you would still check ST2 == T1 to rule out normal cumulative, and then you would check ST1 == 0 to single out cumulative with unknown start time. You could even come up with some other magic number for unknowns, and then you could check ST1 == math.MinInt64 or whatever to single out cumulative with unknown start time case.

Or to explain this in a different way: to know whether a datapoint X belongs to a cumulative series with unknown start time, you look at whether there's a datapoint at time ST_X, and check if that datapoint has unknown start time. This works in reverse: if a datapoint Y has unknown start time, then any datapoint that has ST = T_Y also belongs to the same cumulative series with unknown start time. Thus you can change only the datapoint Y, and the change can be traced back from any other datapoint in this cumulative series, as if the change had propagated to the whole cumulative series.
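A rough sketch of that trace-back, assuming the first point's ST has been replaced by a sentinel (0 here) and that samples can be looked up by timestamp. The `sample` type, the `byT` map and `inUnknownStartSeries` are hypothetical and only illustrate the lookup, not how the query engine is implemented.

```go
package main

import "fmt"

// unknownST is the assumed sentinel for an unknown start timestamp.
const unknownST int64 = 0

type sample struct {
	ST, T int64
}

// inUnknownStartSeries reports whether x belongs to a cumulative series with
// unknown start time: either x itself carries the sentinel, or the datapoint
// found at time x.ST carries it.
func inUnknownStartSeries(x sample, byT map[int64]sample) bool {
	if x.ST == unknownST {
		return true
	}
	anchor, ok := byT[x.ST]
	return ok && anchor.ST == unknownST
}

func main() {
	// Cumulative with unknown start: only the first point (Y) was rewritten to 0.
	unknown := map[int64]sample{
		100: {ST: unknownST, T: 100},
		110: {ST: 100, T: 110},
		120: {ST: 100, T: 120},
	}
	fmt.Println(inUnknownStartSeries(unknown[120], unknown)) // true

	// Delta series: each ST points at the previous T, which has a known ST.
	deltas := map[int64]sample{
		100: {ST: 90, T: 100},
		110: {ST: 100, T: 110},
	}
	fmt.Println(inUnknownStartSeries(deltas[110], deltas)) // false
}
```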

Now, should we commit to doing such replacement? In my opinion, if we don't come up with a very good reason why it would be useful to treat ST==T case as a reset, then we shouldn't do that. Otherwise, we're introducing complexity for little benefit.

krajorama

Looking good, I think the unknown start time is the sticking point.

krajorama · 2026-05-15T09:17:30Z

+to optimize XOR chunk around joint DoD encoding for both timestamps and value change semantics for common scenarios.
+
+[XOR2 encoding details](https://github.com/prometheus/prometheus/blob/e793b26713cc7052c7558ae6ceffaa66c2a5b39f/tsdb/docs/format/chunks.md#xor2-chunk-data)
+


Missing reference to work on native histograms storage. prometheus/prometheus#18609

bwplotka changed the title ~~proposal[PROM-60]: Prometheus CT Storage~~ proposal: Prometheus CT Storage Sep 9, 2025

bwplotka added the proposal label Sep 9, 2025

bwplotka force-pushed the ctstorage branch 4 times, most recently from 4f9eb07 to 03c37e3 Compare September 10, 2025 14:04

bwplotka changed the title ~~proposal: Prometheus CT Storage~~ proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way) Sep 10, 2025

bwplotka changed the title ~~proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way)~~ proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025

bwplotka mentioned this pull request Sep 10, 2025

prw: Remote Write 2.0 CT per Sample/Histogram prometheus/prometheus#17036

Closed

bwplotka force-pushed the ctstorage branch from 03c37e3 to 7df3d64 Compare September 10, 2025 14:11

bwplotka changed the title ~~proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way)~~ proposal: TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025

bwplotka force-pushed the ctstorage branch 4 times, most recently from 033d077 to 704dee5 Compare September 11, 2025 13:23

bwplotka marked this pull request as ready for review September 11, 2025 13:23

fionaliao reviewed Sep 11, 2025

Comment thread proposals/0060-ct-storage.md Outdated

bwplotka force-pushed the ctstorage branch from 704dee5 to 9dee382 Compare September 12, 2025 04:34

proposal[PROM-60]: Prometheus CT Storage

7a00542

Signed-off-by: bwplotka <bwplotka@gmail.com>

bwplotka force-pushed the ctstorage branch from 9dee382 to 7a00542 Compare September 12, 2025 07:54

bwplotka requested a review from dashpole September 12, 2025 07:55

ywwg reviewed Sep 12, 2025

fionaliao mentioned this pull request Sep 12, 2025

Proposal: OTEL delta temporality support #48

Open

dashpole reviewed Sep 12, 2025

bwplotka commented Oct 1, 2025

ArthurSens reviewed Oct 7, 2025

bwplotka commented Oct 15, 2025

ywwg added this to Delta Temporality Dec 4, 2025

github-project-automation Bot moved this to Backlog in Delta Temporality Dec 4, 2025

ywwg moved this from Backlog to In progress in Delta Temporality Dec 4, 2025

krajorama reviewed Jan 7, 2026

krajorama added a commit to prometheus/prometheus that referenced this pull request Jan 8, 2026

feat(chunk,ct): add chunk format that supports start timestamp

e5f68f5

Related to prometheus/proposals#60 Signed-off-by: bwplotka <bwplotka@gmail.com>

bwplotka mentioned this pull request Jan 12, 2026

tsdb(wal): st-per-sample initial code and benchmarks prometheus/prometheus#17671

Merged

bwplotka mentioned this pull request Jan 28, 2026

feat(tsdb/chunkenc): add float chunk format with start timestamp support prometheus/prometheus#17909

Merged

eriksywu mentioned this pull request Feb 17, 2026

Enable optional OpenMetrics exposition format and _created metrics nginx/nginx-prometheus-exporter#1252

Open

eriksywu mentioned this pull request Feb 25, 2026

[prometheus] instrument metrics with createdtimestamps google/cadvisor#3846

Open

ywwg and others added 3 commits March 6, 2026 10:44

Update PROM-60 with current development progress

cefa68b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Owen Williams <owen.williams@grafana.com>

fix ascii art

96f0d34

Signed-off-by: Owen Williams <owen.williams@grafana.com>

Merge pull request #75 from ywwg/owilliams/ctstorage-updates

415b134

Update PROM-60 with current development progress

vpranckaitis mentioned this pull request Mar 18, 2026

proposal: Use Start Timestamp in rate-like functions for deltas #77

Open

vpranckaitis reviewed Mar 19, 2026

carrieedwards mentioned this pull request Apr 29, 2026

Add histogram chunk encoding with Start Timestamp support prometheus/prometheus#18609

Open

ywwg and others added 5 commits May 11, 2026 15:21

add some links

d30cdbe

Signed-off-by: Owen Williams <owen.williams@grafana.com>

Merge pull request #82 from ywwg/ctstorage

5e1e0fa

add some links

rename file

d02732b

Signed-off-by: bwplotka <bwplotka@gmail.com>

added more context on validation decision

a930a1e

Signed-off-by: bwplotka <bwplotka@gmail.com>

bwplotka force-pushed the ctstorage branch from 813b5be to a930a1e Compare May 14, 2026 10:56

vpranckaitis approved these changes May 14, 2026

ywwg approved these changes May 14, 2026

dashpole approved these changes May 14, 2026

Apply suggestions from code review

2eb9a04

Co-authored-by: David Ashpole <dashpole@google.com> Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

krajorama reviewed May 15, 2026

Apply suggestions from code review

7e56950

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
