Proposal for memory limiting in the scrape loop#76
Conversation
5e8d860 to
9076c48
Compare
bwplotka
left a comment
There was a problem hiding this comment.
IMO this is a great plan, something I'd like to see 👍🏽
aaa6bdf to
951d80e
Compare
bwplotka
left a comment
There was a problem hiding this comment.
I'm supportive. It should break the ice on the non trivial issues you listed and it's elegant. This would also help with Otel-collector Prometheus scrapes.
- prometheus/prometheus#17109
- prometheus/prometheus#13939
- prometheus/prometheus#11306
- prometheus/prometheus#16917
Disclaimer @dashpole works with me at Google, so I'm not going to merge until we have buy-in from other maintainers.
|
|
||
| **2. The Prometheus Server Operator:** | ||
| Server operators need to understand the global impact of memory limiting so they can take corrective action (e.g., increasing memory limits, adding Prometheus replicas, or investigating massive targets). | ||
| * **Counter for aborted scrapes:** A new internal Prometheus metric (e.g., `prometheus_target_scrapes_skipped_memory_limit_total`) will be introduced to track the total number of aborted scrapes globally. Operators can set alerts on this metric to be notified of memory pressure, allowing them to intervene if data loss becomes too widespread. |
There was a problem hiding this comment.
Note to reviewers: The name can be debated/changed during review.
8200c74 to
db29e0b
Compare
9cbc215 to
677f3a0
Compare
|
Discussed this at the Prometheus Dev-summit today. Feedback:
|
|
Another memory mitigation that could be done would be to pause rule evaluations. Specifically recordings could be paused since they run queries which can be an alloc churn impact. |
Signed-off-by: David Ashpole <dashpole@google.com>
Signed-off-by: David Ashpole <dashpole@google.com>
Signed-off-by: David Ashpole <dashpole@google.com>
Signed-off-by: David Ashpole <dashpole@google.com>
|
@SuperQ Do you think it should just pause rule evaluations, or should it cause all PromQL queries to fail? If the issue is queries, seems like pausing all queries would be better? OTOH, intermittently not being able to query prometheus seems like it would make it hard to tell what is going wrong. |
|
That's a good question. Pausing rules is nice because you can, in theory, resume without loss. Pausing all PromQL is also a good option. For "hard to tell what's going on" is why meta-monitoring exists. Having a separate Prometheus-to-watch-Prometheus is usually recommended for that. 😁 |
There was a problem hiding this comment.
I like this proposal and we saw quite a lot of OOM kills of scraping due to quick churning during deployment or any sudden load increase.
One thing I am wondering is that if there is a way to deal with scraper OOM kills without losing data. That's definitely more for future but if the scraper can scrape the metric snapshot with a timestamp then the missing metrics can be backfilled with OOO
|
It would be interesting to have some heuristic that uses Canceling scrapes is, IMO, an action of last resort because it means we're dropping data permanently. |
|
I've had some time to think about how to incorporate recording rules and queries. Taking a broader look at the entire prometheus server, I think we could apply memory limiting in some form to:
I think the best way to address this is to make the feature naming more generic ("memory_limiter" instead of "scrape_memory_limiter") and to bundle memory limiting behavior behind a "hard" limit (data loss), and a "soft" limit (no data loss). That seems like it would be easy to understand, comprehensive, and configurable enough for most usage. When the "soft" limit is met:
When the "hard" limit is met:
Question for reviewers: Would you prefer I:
|
|
I think expanding the proposal to support multiple mitigation options is a good idea. Since all of the options discussed so far have pretty impactful downsides different users are going to want opt-in to different options. We likely want to support configuring different options, have some kind of tunable heuristics, etc. But all of it is going to need to be tested for various failure modes. |
9e8baa5 to
5ab4584
Compare
Signed-off-by: David Ashpole <dashpole@google.com>
5ab4584 to
dc95d8d
Compare
|
@SuperQ I expanded the proposal. The things I want to make sure we get right during the proposal stage are:
|
As @bwplotka presented in his "Scrape Trolley Dillema" talk last year at promcon, Prometheus could use a mechanism to prevent OOMs caused by short-term memory usage of scraping targets. This has also been a request from some users of the OTel collector's Prometheus receiver.
PoC: prometheus/prometheus@main...dashpole:prometheus:memory_limiter_simple
cc @bernot-dev @ArthurSens