Conversation
4f9eb07 to
03c37e3
Compare
033d077 to
704dee5
Compare
|
FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion around this proposal decisions. Thanks for this productive time! Here are some notes:
Also updated proposal today with some learnings. Finally proposed a single feature flag for this work ( Still lots of TODOs and anyone is welcome to help! |
Signed-off-by: bwplotka <bwplotka@gmail.com>
ywwg
left a comment
There was a problem hiding this comment.
Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅 . I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocable.
|
|
||
| TODO: Just a draft, to be discussed. | ||
| TODO: There are questions around: | ||
| * Should we do inclusive vs exclusive intervals? |
There was a problem hiding this comment.
I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.
There was a problem hiding this comment.
Don't worry about Google, and definitely let's find the best balanced choice for now, without config flags - if anything it's baked in data, so I any read config would be hard to use (too deep user knowledge needed about their metrics, not very common)
What makes the most sense for Prometheus?
There was a problem hiding this comment.
What's the use case where inclusive/exclusive matters for users? The most common usage will be increase/rate calculation, where using the range T-ST or T-(ST+epsilon) doesn't make any difference, or even 1ms won't make practical difference.
Seems to me it only makes some difference if we implement promql by emulating zero sample again, not if we implement it "directly". I don't know enough about the read side plans to judge.
There was a problem hiding this comment.
We've discussed this in the delta WG.
For the rate calculations indeed it doesn't matter. And that's the main use case.
For non rate calculations there are some questions:
What happens if we plot the series? Do we try to show zero point ? For delta metrics where the ST is equal to previous sample's timestamp it would lead to having two points at the same timestamp, that doesn't make much sense. Showing a saw tooth pattern with 0 1ms after the previous timestamp doesn't seem production either. For a cumulative counter it may make some sense, indeed I feel like that's something customers will ask for it ?
What happens if we do sum_over_time() ? Again for cumulative, it's ok to add a 0, but for delta metric, which value would we take ? the previous value or 0 ? In this case exclusive makes more sense.
How about count()? This is less about exclusive/inclusive, more about whether the time pointed out by the start time counts as the start of the series?
There was a problem hiding this comment.
I'm leaving this open, but we agreed to keep the semantics out of this proposal. See ### Proposed ST semantics and validation
Given majority of our delta data is coming from OTel I'd lean towards Otel semantics, so exclusive
|
This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker. |
|
Back from some PTO/leave, will try to address comments and finalize interface and TSDB piece soon |
There was a problem hiding this comment.
I'm adding just a few stylish/correction comments. I'm midway through the document and still haven't reviewed the proposed interfaces.
Regarding the CT vs ST discussion, I see the point that we'll always have more new users than old ones, but I feel like the CT terminology is so ingrained in the ecosystem that even if we change it now, people will continue to call it CT. Of course this is based on "voices in my head" and there's no real confirmation that this is gonna happen in the future 😅
bwplotka
left a comment
There was a problem hiding this comment.
I addressed first pass, thanks for reviews!
|
|
||
| TODO: Just a draft, to be discussed. | ||
| TODO: There are questions around: | ||
| * Should we do inclusive vs exclusive intervals? |
There was a problem hiding this comment.
What's the use case where inclusive/exclusive matters for users? The most common usage will be increase/rate calculation, where using the range T-ST or T-(ST+epsilon) doesn't make any difference, or even 1ms won't make practical difference.
Seems to me it only makes some difference if we implement promql by emulating zero sample again, not if we implement it "directly". I don't know enough about the read side plans to judge.
Related to prometheus/proposals#60 Signed-off-by: bwplotka <bwplotka@gmail.com>
Related to prometheus/proposals#60 Signed-off-by: bwplotka <bwplotka@gmail.com> Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
|
TODO: Propose path for remote read (see https://cloud-native.slack.com/archives/C08C6CMEUF6/p1771420614963339?thread_ts=1770901936.929979&cid=C08C6CMEUF6) |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Update PROM-60 with current development progress
Signed-off-by: Owen Williams <owen.williams@grafana.com>
add some links
Incorporate feedback from reviewers: update delta support goal to MUST, add section explaining unknown start-time resets mapping to 0 at ingestion, clarify storage flexibility across metric types, fix broken markdown links, and correct multiple minor typos. Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
|
Addressed all comments, plus it's already implemented 🎉 Waiting for formal approvals as discussed in Slack |
| * The exact consumption semantics is still experimental thus we want to stay flexible and don't block future use cases (e.g. exact semantics of ST > T). | ||
| * We can always add validation features opt-in. | ||
|
|
||
| For unknown start-time reset points (e.g. OpenTelemetry cumulative points where `ST == T`), ingestion layers (such as OTLP receivers or Remote Write endpoints) are expected to map these to `0` (unknown ST) when appending to storage. This optimizes storage (0 takes exactly 1 bit in chunk/WAL formats) and simplifies detection without requiring timestamp matching. |
There was a problem hiding this comment.
What about the points that follow in the series? Won't those have wildly incorrect rates?
There was a problem hiding this comment.
Yea, I thought we decided this and AI filled this - but it's good to unpack again. Let's stop and decide.
Benefit of keeping ST==T vs ST==0 separate is that you know if unknown ST is on cumulative vs delta. But is that separation needed? Do we want that?
Naive thinking... For cumulatives, it feels it should resolve to old value detection. For delta it should treat ST==T as ST is previous T, right? Unless we detect delta vs cumulatives some other way
EDIT: I read this as delta ST==T is converted to 0, but it's for cumulatives which we kind of have to support as 0 anyway? Why not making those consistent?
cc @vpranckaitis @krajorama what we assume on current query path? I have to catch with the latest updates there (:
There was a problem hiding this comment.
Actually it's only for cumulatives here... so could be ok?
It's just weird that on scrape unknown ST is 0 and from OTLP it's ST==T. We need to handle 0 anyway.
Deltas are allowed to have ST=T so this section might be what we want?
There was a problem hiding this comment.
So for context, the ST == T points come from this: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#cumulative-streams-inserting-true-reset-points.
Consider:
- Point 1 Has
ST == T1. This seems OK to set start time to unknown. - Point 2 has
ST == T1, so interval is [T1, T2]. This seems problematic because the interval is a valid time interval. But OTel sends the entire cumulative value with this point, so rates are way too high.
Ideally, we would somehow mark T1 as permanently unknown for the series, so that Point 2 would also get an unknown timestamp?
There was a problem hiding this comment.
Assuming we store ST==T as is, there are two cases at query and two ways we want to use the start time:
- the first sample is in range:
- reset detection (works on pairs of samples): we check that ST is between sample 2 and 1 and we can also check if it was unknow at the 1st sample, so we can avoid detecting the reset
- extrapolation: no pair to check, ignored
- the first sample is not in range
- reset detection: we don't do it for the left side of the range
- extrapolation: I think since the first sample wasn't in range, the zero point calculated from ST wouldn't be in range either??? so not relevant?
There was a problem hiding this comment.
Ah, ok, important context.
Sounds like @dashpole, you propose to optimize for another algorithm for unknown ST handling (true_reset_point vs the one we have now in Prometheus with st-synthesis, so subtract_initial_point that arguably fits well with the traditional rate logic).
true_reset_pointnever worked in Prometheus, does it make sense to optimize for it now vs recommending to not use it?- Can we even handle
true_reset_point?the first sample is in rangecase even.. what do we do if we see
(ST1, T2),(ST3,T3) aka "true reset point", (ST3,T4)
Between T2 and T3 we have zero idea if there was a reset or not, we just know the source didn't know either, no? am I missing sth?
- What if we some backend designed another one? Can we even keep up with different variations?
After all Prometheus is some backend and we could be opinionated in minimal semantics we need 🤔
There was a problem hiding this comment.
My main point was that I don't think we can't make the start timestamp of only the first point unknown without making all of the start timestamps after it also unknown.
I would suggest: If this case causes problems, reject points with ST == T. If it doesn't cause problems, just accept them as-is. I couldn't quite tell from @krajorama's comment above if it will actually cause issues or not.
I don't think we should introduce special handling for this case.
There was a problem hiding this comment.
My main point was that I don't think we can't make the start timestamp of only the first point unknown without making all of the start timestamps after it also unknown.
This isn't a problem for query side, and we still can recognize OTel's cumulative with unknown start time even if we modify just the first ST in the sequence.
The image above shows how start timestamps should behave in OTel for deltas and cumulative counters. If we would compare datapoints T1 and T2 between delta and cumulative with unknown start time, we would see that the only difference is the value of ST1. However, the treatment of these two should be different: there should be a reset between T1 and T2 for deltas, but no reset for cumulative with unknown start time. In order to decide, we check that ST2 == T1 to rule normal cumulative, and then we check that ST1 == T1 to single out cumulative with unknown start time.
Datapoints T3 and beyond don't matter, because for deltas you can apply the same logic as in previous paragraph, and for cumulatives ST3 < T2.
Now what if we decided to replace ST1 == T1 with some other value, let's say 0, but keep all the other ST values intact? Then you would still check ST2 == T1 to rule out normal cumulative, and then you would check ST1 == 0 to single out cumulative with unknown start time. You could even come up with some other magic number for unknowns, and then you could check ST1 == math.MinInt64 or whatever to single out cumulative with unknown start time case.
Or to explain this in a different way: to known whether a datapoint X belongs to a cumulative series with unknown start time, you look whether there's a datapoint at time ST_X, and check if it that datapoint has unknown start time. This works in reverse: if a datapoint Y has unknown start time, then any datapoint that has ST = T_Y also belong to the same cumulative series with unknown start time. Thus you can change only the datapoint Y, and the change could be traced back from any other datapoint in this cumulative series, as if the change has propagated to the whole cumulative series.
Now, should we commit to doing such replacement? In my opinion, if we don't come up with a very good reason why it would be useful to treat ST==T case as a reset, then we shouldn't do that. Otherwise, we're introducing complexity for little benefit.
Co-authored-by: David Ashpole <dashpole@google.com> Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
krajorama
left a comment
There was a problem hiding this comment.
Looking good, I think the unknown start time is the sticking point.
| * The exact consumption semantics is still experimental thus we want to stay flexible and don't block future use cases (e.g. exact semantics of ST > T). | ||
| * We can always add validation features opt-in. | ||
|
|
||
| For unknown start-time reset points (e.g. OpenTelemetry cumulative points where `ST == T`), ingestion layers (such as OTLP receivers or Remote Write endpoints) are expected to map these to `0` (unknown ST) when appending to storage. This optimizes storage (0 takes exactly 1 bit in chunk/WAL formats) and simplifies detection without requiring timestamp matching. |
There was a problem hiding this comment.
Assuming we store ST==T as is, there are two cases at query and two ways we want to use the start time:
- the first sample is in range:
- reset detection (works on pairs of samples): we check that ST is between sample 2 and 1 and we can also check if it was unknow at the 1st sample, so we can avoid detecting the reset
- extrapolation: no pair to check, ignored
- the first sample is not in range
- reset detection: we don't do it for the left side of the range
- extrapolation: I think since the first sample wasn't in range, the zero point calculated from ST wouldn't be in range either??? so not relevant?
| to optimize XOR chunk around joint DoD encoding for both timestamps and value change semantics for common scenarios. | ||
|
|
||
| [XOR2 encoding details](https://github.com/prometheus/prometheus/blob/e793b26713cc7052c7558ae6ceffaa66c2a5b39f/tsdb/docs/format/chunks.md#xor2-chunk-data) | ||
|
|
There was a problem hiding this comment.
Missing reference to work on native histograms storage. prometheus/prometheus#18609
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and delta WG) we decided to create a formal proposal on how CT/ST native Prometheus storage could look like and how to make it useful (unblock) delta temporality.