Skip to content

proposal: TSDB Support for Start Timestamp (ST)#60

Open
bwplotka wants to merge 12 commits into
mainfrom
ctstorage
Open

proposal: TSDB Support for Start Timestamp (ST)#60
bwplotka wants to merge 12 commits into
mainfrom
ctstorage

Conversation

@bwplotka
Copy link
Copy Markdown
Member

@bwplotka bwplotka commented Sep 9, 2025

As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and delta WG) we decided to create a formal proposal on how CT/ST native Prometheus storage could look like and how to make it useful (unblock) delta temporality.

@bwplotka bwplotka changed the title proposal[PROM-60]: Prometheus CT Storage proposal: Prometheus CT Storage Sep 9, 2025
@bwplotka bwplotka force-pushed the ctstorage branch 4 times, most recently from 4f9eb07 to 03c37e3 Compare September 10, 2025 14:04
@bwplotka bwplotka changed the title proposal: Prometheus CT Storage proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way) Sep 10, 2025
@bwplotka bwplotka changed the title proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way) proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025
@bwplotka bwplotka changed the title proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way) proposal: TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025
@bwplotka bwplotka force-pushed the ctstorage branch 4 times, most recently from 033d077 to 704dee5 Compare September 11, 2025 13:23
@bwplotka bwplotka marked this pull request as ready for review September 11, 2025 13:23
Comment thread proposals/0060-ct-storage.md Outdated
@bwplotka
Copy link
Copy Markdown
Member Author

bwplotka commented Sep 12, 2025

FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion around this proposal decisions. Thanks for this productive time!

Here are some notes:

  • Bartek introducing proposal details.
  • Fiona adding more context on reset hints proposal: TSDB Support for Start Timestamp (ST) #60 (comment)
  • Fiona sharing suggestions for delta being a "mini-cumulative" story.
  • General alignment on technical decisions.
  • Fiona suggested we double check if start/create : end time is inclusive (left TODO, Otel is inclusive on end time only).
  • Fiona asked around gauge vs counter CT storages differences, especially given gauges in some systems can have CT.
    • Bartek (now): There are none for now, mentioned that in general decisions.
  • On TSDB read (programmatic) interfaces:
    • Owen: Should we discuss here failover algorithms, what can you do? Don't take too much invariants/assumptions, leave room for flexibility e.g lack of CTs
    • Fiona: Is it worth adding any assumptions around CT semantics (e.g. on append)
    • Owen: We should document our assumption, and evolve with reality
    • Arthur: Fiona has a lot of details proposal.
    • Bartek: So far we didn't put ANY requirements on CT on write
    • Bartek (now): I added related section # Proposed CT semantics and validation -- @ywwg could you help me explore how those restriction could look like? And what if we do SHOULD or MUST on those?
  • Artur noticed delta feature is a "SHOULD" goal for CT proposal, he aligned expectations around Grafana interest to deliver delta support.
    • Bartek: Ack. Happy to move to MUST if it helps. I left should to be open minded for extreme cases when solution to cumulative CT and delta ST are better to be entirely different, it would silly to push in single inefficient direction in this proposal. I don't see this being a case now though.
  • Outlining pros & cons for CT -> ST renaming alternative:
    • Fiona/Ar/B: Just stick with one naming
    • Bartek/Fiona: No strong opinion at this point
    • Arthur: I'd vote for CT
    • Owen: There are more future users than previous users, I'd vote for changing to ST.
    • Bartek (now): I added one more argument to keep CT -- CT or ST naming is equally correct/incorrect in this context - Prometheus is cumulative-first system so choosing CT might be fair.

Also updated proposal today with some learnings. Finally proposed a single feature flag for this work (ct-storage).

Still lots of TODOs and anyone is welcome to help!

Signed-off-by: bwplotka <bwplotka@gmail.com>
Copy link
Copy Markdown
Member

@ywwg ywwg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅 . I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocable.

Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md
Comment thread proposals/0060-ct-storage.md Outdated

TODO: Just a draft, to be discussed.
TODO: There are questions around:
* Should we do inclusive vs exclusive intervals?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.

Copy link
Copy Markdown
Member Author

@bwplotka bwplotka Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry about Google, and definitely let's find the best balanced choice for now, without config flags - if anything it's baked in data, so I any read config would be hard to use (too deep user knowledge needed about their metrics, not very common)

What makes the most sense for Prometheus?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case where inclusive/exclusive matters for users? The most common usage will be increase/rate calculation, where using the range T-ST or T-(ST+epsilon) doesn't make any difference, or even 1ms won't make practical difference.

Seems to me it only makes some difference if we implement promql by emulating zero sample again, not if we implement it "directly". I don't know enough about the read side plans to judge.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed this in the delta WG.

For the rate calculations indeed it doesn't matter. And that's the main use case.
For non rate calculations there are some questions:

What happens if we plot the series? Do we try to show zero point ? For delta metrics where the ST is equal to previous sample's timestamp it would lead to having two points at the same timestamp, that doesn't make much sense. Showing a saw tooth pattern with 0 1ms after the previous timestamp doesn't seem production either. For a cumulative counter it may make some sense, indeed I feel like that's something customers will ask for it ?

What happens if we do sum_over_time() ? Again for cumulative, it's ok to add a 0, but for delta metric, which value would we take ? the previous value or 0 ? In this case exclusive makes more sense.

How about count()? This is less about exclusive/inclusive, more about whether the time pointed out by the start time counts as the start of the series?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
@dashpole
Copy link
Copy Markdown
Contributor

This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker.

Comment thread proposals/0060-ct-storage.md Outdated
@bwplotka
Copy link
Copy Markdown
Member Author

bwplotka commented Oct 6, 2025

Back from some PTO/leave, will try to address comments and finalize interface and TSDB piece soon

Copy link
Copy Markdown
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding just a few stylish/correction comments. I'm midway through the document and still haven't reviewed the proposed interfaces.

Regarding the CT vs ST discussion, I see the point that we'll always have more new users than old ones, but I feel like the CT terminology is so ingrained in the ecosystem that even if we change it now, people will continue to call it CT. Of course this is based on "voices in my head" and there's no real confirmation that this is gonna happen in the future 😅

Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Copy link
Copy Markdown
Member Author

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed first pass, thanks for reviews!

@ywwg ywwg moved this from Backlog to In progress in Delta Temporality Dec 4, 2025
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated

TODO: Just a draft, to be discussed.
TODO: There are questions around:
* Should we do inclusive vs exclusive intervals?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case where inclusive/exclusive matters for users? The most common usage will be increase/rate calculation, where using the range T-ST or T-(ST+epsilon) doesn't make any difference, or even 1ms won't make practical difference.

Seems to me it only makes some difference if we implement promql by emulating zero sample again, not if we implement it "directly". I don't know enough about the read side plans to judge.

krajorama added a commit to prometheus/prometheus that referenced this pull request Jan 8, 2026
Related to prometheus/proposals#60

Signed-off-by: bwplotka <bwplotka@gmail.com>
krajorama added a commit to prometheus/prometheus that referenced this pull request Jan 8, 2026
Related to prometheus/proposals#60

Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@bwplotka
Copy link
Copy Markdown
Member Author

ywwg and others added 3 commits March 6, 2026 10:44
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Update PROM-60 with current development progress
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
Comment thread proposals/0060-ct-storage.md Outdated
ywwg and others added 5 commits May 11, 2026 15:21
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Incorporate feedback from reviewers: update delta support goal to MUST,
add section explaining unknown start-time resets mapping to 0 at
ingestion, clarify storage flexibility across metric types, fix broken
markdown links, and correct multiple minor typos.

Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka
Copy link
Copy Markdown
Member Author

Addressed all comments, plus it's already implemented 🎉

Waiting for formal approvals as discussed in Slack

Copy link
Copy Markdown
Member

@ywwg ywwg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Comment thread proposals/0060-st-storage.md Outdated
Comment thread proposals/0060-st-storage.md Outdated
* The exact consumption semantics is still experimental thus we want to stay flexible and don't block future use cases (e.g. exact semantics of ST > T).
* We can always add validation features opt-in.

For unknown start-time reset points (e.g. OpenTelemetry cumulative points where `ST == T`), ingestion layers (such as OTLP receivers or Remote Write endpoints) are expected to map these to `0` (unknown ST) when appending to storage. This optimizes storage (0 takes exactly 1 bit in chunk/WAL formats) and simplifies detection without requiring timestamp matching.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the points that follow in the series? Won't those have wildly incorrect rates?

Copy link
Copy Markdown
Member Author

@bwplotka bwplotka May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I thought we decided this and AI filled this - but it's good to unpack again. Let's stop and decide.

Benefit of keeping ST==T vs ST==0 separate is that you know if unknown ST is on cumulative vs delta. But is that separation needed? Do we want that?

Naive thinking... For cumulatives, it feels it should resolve to old value detection. For delta it should treat ST==T as ST is previous T, right? Unless we detect delta vs cumulatives some other way

EDIT: I read this as delta ST==T is converted to 0, but it's for cumulatives which we kind of have to support as 0 anyway? Why not making those consistent?

cc @vpranckaitis @krajorama what we assume on current query path? I have to catch with the latest updates there (:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's only for cumulatives here... so could be ok?

It's just weird that on scrape unknown ST is 0 and from OTLP it's ST==T. We need to handle 0 anyway.

Deltas are allowed to have ST=T so this section might be what we want?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for context, the ST == T points come from this: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#cumulative-streams-inserting-true-reset-points.

Consider:

  • Point 1 Has ST == T1. This seems OK to set start time to unknown.
  • Point 2 has ST == T1, so interval is [T1, T2]. This seems problematic because the interval is a valid time interval. But OTel sends the entire cumulative value with this point, so rates are way too high.

Ideally, we would somehow mark T1 as permanently unknown for the series, so that Point 2 would also get an unknown timestamp?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we store ST==T as is, there are two cases at query and two ways we want to use the start time:

  • the first sample is in range:
    • reset detection (works on pairs of samples): we check that ST is between sample 2 and 1 and we can also check if it was unknow at the 1st sample, so we can avoid detecting the reset
    • extrapolation: no pair to check, ignored
  • the first sample is not in range
    • reset detection: we don't do it for the left side of the range
    • extrapolation: I think since the first sample wasn't in range, the zero point calculated from ST wouldn't be in range either??? so not relevant?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, important context.

Sounds like @dashpole, you propose to optimize for another algorithm for unknown ST handling (true_reset_point vs the one we have now in Prometheus with st-synthesis, so subtract_initial_point that arguably fits well with the traditional rate logic).

  1. true_reset_point never worked in Prometheus, does it make sense to optimize for it now vs recommending to not use it?
  2. Can we even handle true_reset_point? the first sample is in range case even.. what do we do if we see
(ST1, T2),(ST3,T3) aka "true reset point", (ST3,T4)

Between T2 and T3 we have zero idea if there was a reset or not, we just know the source didn't know either, no? am I missing sth?

  1. What if we some backend designed another one? Can we even keep up with different variations?

After all Prometheus is some backend and we could be opinionated in minimal semantics we need 🤔

Copy link
Copy Markdown
Contributor

@dashpole dashpole May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point was that I don't think we can't make the start timestamp of only the first point unknown without making all of the start timestamps after it also unknown.

I would suggest: If this case causes problems, reject points with ST == T. If it doesn't cause problems, just accept them as-is. I couldn't quite tell from @krajorama's comment above if it will actually cause issues or not.

I don't think we should introduce special handling for this case.

Copy link
Copy Markdown

@vpranckaitis vpranckaitis May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point was that I don't think we can't make the start timestamp of only the first point unknown without making all of the start timestamps after it also unknown.

This isn't a problem for query side, and we still can recognize OTel's cumulative with unknown start time even if we modify just the first ST in the sequence.

image

The image above shows how start timestamps should behave in OTel for deltas and cumulative counters. If we would compare datapoints T1 and T2 between delta and cumulative with unknown start time, we would see that the only difference is the value of ST1. However, the treatment of these two should be different: there should be a reset between T1 and T2 for deltas, but no reset for cumulative with unknown start time. In order to decide, we check that ST2 == T1 to rule normal cumulative, and then we check that ST1 == T1 to single out cumulative with unknown start time.

Datapoints T3 and beyond don't matter, because for deltas you can apply the same logic as in previous paragraph, and for cumulatives ST3 < T2.

Now what if we decided to replace ST1 == T1 with some other value, let's say 0, but keep all the other ST values intact? Then you would still check ST2 == T1 to rule out normal cumulative, and then you would check ST1 == 0 to single out cumulative with unknown start time. You could even come up with some other magic number for unknowns, and then you could check ST1 == math.MinInt64 or whatever to single out cumulative with unknown start time case.

Or to explain this in a different way: to known whether a datapoint X belongs to a cumulative series with unknown start time, you look whether there's a datapoint at time ST_X, and check if it that datapoint has unknown start time. This works in reverse: if a datapoint Y has unknown start time, then any datapoint that has ST = T_Y also belong to the same cumulative series with unknown start time. Thus you can change only the datapoint Y, and the change could be traced back from any other datapoint in this cumulative series, as if the change has propagated to the whole cumulative series.


Now, should we commit to doing such replacement? In my opinion, if we don't come up with a very good reason why it would be useful to treat ST==T case as a reset, then we shouldn't do that. Otherwise, we're introducing complexity for little benefit.

Co-authored-by: David Ashpole <dashpole@google.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Copy link
Copy Markdown
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I think the unknown start time is the sticking point.

Comment thread proposals/0060-st-storage.md Outdated
Comment thread proposals/0060-st-storage.md Outdated
* The exact consumption semantics is still experimental thus we want to stay flexible and don't block future use cases (e.g. exact semantics of ST > T).
* We can always add validation features opt-in.

For unknown start-time reset points (e.g. OpenTelemetry cumulative points where `ST == T`), ingestion layers (such as OTLP receivers or Remote Write endpoints) are expected to map these to `0` (unknown ST) when appending to storage. This optimizes storage (0 takes exactly 1 bit in chunk/WAL formats) and simplifies detection without requiring timestamp matching.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we store ST==T as is, there are two cases at query and two ways we want to use the start time:

  • the first sample is in range:
    • reset detection (works on pairs of samples): we check that ST is between sample 2 and 1 and we can also check if it was unknow at the 1st sample, so we can avoid detecting the reset
    • extrapolation: no pair to check, ignored
  • the first sample is not in range
    • reset detection: we don't do it for the left side of the range
    • extrapolation: I think since the first sample wasn't in range, the zero point calculated from ST wouldn't be in range either??? so not relevant?

to optimize XOR chunk around joint DoD encoding for both timestamps and value change semantics for common scenarios.

[XOR2 encoding details](https://github.com/prometheus/prometheus/blob/e793b26713cc7052c7558ae6ceffaa66c2a5b39f/tsdb/docs/format/chunks.md#xor2-chunk-data)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing reference to work on native histograms storage. prometheus/prometheus#18609

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

8 participants