Proposal for memory limiting in the scrape loop by dashpole · Pull Request #76 · prometheus/proposals

dashpole · 2026-03-10T19:22:36Z

As @bwplotka presented in his "Scrape Trolley Dilemma" talk last year at PromCon, Prometheus could use a mechanism to prevent OOMs caused by the short-term memory usage of scraping targets. This has also been requested by some users of the OTel Collector's Prometheus receiver.

PoC: prometheus/prometheus@main...dashpole:prometheus:memory_limiter_simple

cc @bernot-dev @ArthurSens

bwplotka · 2026-03-11T12:09:25Z

FYI: Talk link: https://youtu.be/ulHQUCarjjo?list=PLoz-W_CUquUlHOg314_YttjHL0iGTdE3O

bwplotka

IMO this is a great plan, something I'd like to see 👍🏽

bwplotka

I'm supportive. It should break the ice on the non-trivial issues you listed, and it's elegant. This would also help with OTel Collector Prometheus scrapes.

Disclaimer: @dashpole works with me at Google, so I'm not going to merge until we have buy-in from other maintainers.

WDYT @krajorama @roidelapluie @bboreham @ArthurSens ?

saswatamcode

Would love to see this!

dashpole · 2026-03-12T14:16:44Z

+
+**2. The Prometheus Server Operator:**
+Server operators need to understand the global impact of memory limiting so they can take corrective action (e.g., increasing memory limits, adding Prometheus replicas, or investigating massive targets).
+* **Counter for aborted scrapes:** A new internal Prometheus metric (e.g., `prometheus_target_scrapes_skipped_memory_limit_total`) will be introduced to track the total number of aborted scrapes globally. Operators can set alerts on this metric to be notified of memory pressure, allowing them to intervene if data loss becomes too widespread.


Note to reviewers: The name can be debated/changed during review.
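
For illustration, here is a minimal Go sketch of how a scrape could be skipped under memory pressure while a counter tracks the skips. It is not taken from the PoC: `ScrapeGate`, `ErrMemoryLimit`, and the injected `usage` function are hypothetical names, and the plain atomic counter only stands in for the proposed `prometheus_target_scrapes_skipped_memory_limit_total` metric.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrMemoryLimit is a hypothetical error returned when a scrape is skipped.
var ErrMemoryLimit = errors.New("scrape skipped: memory limit exceeded")

// ScrapeGate decides whether a scrape may start, based on current memory use.
// All names here are illustrative and are not taken from the PoC.
type ScrapeGate struct {
	limitBytes uint64
	usage      func() uint64 // injected so the gate can be exercised deterministically
	skipped    atomic.Uint64 // stands in for the proposed skipped-scrapes counter
}

// NewScrapeGate builds a gate with a byte limit and a memory usage source.
func NewScrapeGate(limitBytes uint64, usage func() uint64) *ScrapeGate {
	return &ScrapeGate{limitBytes: limitBytes, usage: usage}
}

// Scrape runs fn only when memory usage is below the limit. Otherwise it
// counts the skip and returns ErrMemoryLimit without calling fn.
func (g *ScrapeGate) Scrape(fn func() error) error {
	if g.usage() >= g.limitBytes {
		g.skipped.Add(1)
		return ErrMemoryLimit
	}
	return fn()
}

// Skipped returns how many scrapes have been skipped so far.
func (g *ScrapeGate) Skipped() uint64 { return g.skipped.Load() }

func main() {
	current := uint64(100)
	gate := NewScrapeGate(200, func() uint64 { return current })

	fmt.Println(gate.Scrape(func() error { return nil })) // below the limit: the scrape runs
	current = 250
	fmt.Println(gate.Scrape(func() error { return nil })) // above the limit: the scrape is skipped
	fmt.Println("skipped:", gate.Skipped())
}
```

In a real server the `usage` function would presumably read from the Go runtime (for example via `runtime.ReadMemStats`), which is an assumption here rather than something the proposal text quoted above specifies.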

dashpole · 2026-04-23T15:49:54Z

Discussed this at the Prometheus Dev-summit today. Feedback:

@SuperQ has a lot of ideas around this, and we should discuss further.
@bboreham says there is something similar in Mimir (though it is AGPL-licensed), so the general idea is backed by production usage.
Memory usage is cyclic, and tends to be highest right before compaction.
Series already in the index don't add to long-term memory usage. It is new series that grow the long-term memory usage.

SuperQ · 2026-04-23T16:35:05Z

Another possible memory mitigation would be to pause rule evaluations. Specifically, recording rules could be paused, since they run queries that can cause allocation churn.

dashpole · 2026-04-27T20:27:59Z

@SuperQ Do you think it should only pause rule evaluations, or should it cause all PromQL queries to fail? If the issue is queries, it seems like pausing all queries would be better. OTOH, intermittently not being able to query Prometheus seems like it would make it hard to tell what is going wrong.

SuperQ · 2026-04-27T21:11:36Z

That's a good question. Pausing rules is nice because you can, in theory, resume without loss.

Pausing all PromQL is also a good option.

As for "hard to tell what's going on": that is why meta-monitoring exists. Having a separate Prometheus-to-watch-Prometheus is usually recommended for that. 😁

yeya24

I like this proposal. We have seen quite a lot of scraper OOM kills caused by rapid series churn during deployments or by any sudden load increase.

One thing I am wondering is whether there is a way to deal with scraper OOM kills without losing data. That's definitely more for the future, but if the scraper can scrape the metric snapshot with a timestamp, then the missing metrics can be backfilled via out-of-order (OOO) ingestion.

SuperQ · 2026-04-27T21:58:25Z

It would be interesting to have some heuristic that uses scrape_series_added, or maybe something like this:

increase(prometheus_tsdb_head_series_created_total[5m])
/
prometheus_tsdb_head_series

Canceling scrapes is, IMO, an action of last resort because it means we're dropping data permanently.
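
To make the shape of that heuristic concrete, here is a small Go sketch of the same ratio: series created during a window divided by the current head series count. The function names and the threshold parameter are hypothetical; the thread does not propose a specific threshold value.

```go
package main

import "fmt"

// ChurnRatio mirrors the shape of the PromQL expression above: series created
// during a window divided by the series currently in the head block.
// It returns 0 when the head is empty to avoid dividing by zero.
func ChurnRatio(createdInWindow, headSeries float64) float64 {
	if headSeries <= 0 {
		return 0
	}
	return createdInWindow / headSeries
}

// HighChurn reports whether the ratio is at or above a threshold.
// The threshold is a hypothetical tunable, not a value from the thread.
func HighChurn(createdInWindow, headSeries, threshold float64) bool {
	return ChurnRatio(createdInWindow, headSeries) >= threshold
}

func main() {
	// 50,000 series created in the window against 1,000,000 head series.
	ratio := ChurnRatio(50000, 1000000)
	fmt.Printf("ratio=%.2f high=%v\n", ratio, HighChurn(50000, 1000000, 0.10))
}
```

A signal like this could let a limiter prefer less destructive mitigations first and keep scrape cancellation as the last resort.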

dashpole · 2026-05-01T18:53:16Z

I've had some time to think about how to incorporate recording rules and queries. Taking a broader look at the entire Prometheus server, I think we could apply memory limiting in some form to:

scrape_configs (this proposal)
otlp: Reject incoming OTLP metrics
remote_read: Reject incoming remote_read
rule_files: Pause recording rules (I assume we don't want to pause alerting rules!?)
compaction: Pause TSDB compaction.
Parts of the API (queries, targets, series, metadata)?

I think the best way to address this is to make the feature naming more generic ("memory_limiter" instead of "scrape_memory_limiter") and to bundle memory limiting behavior behind a "hard" limit (data loss), and a "soft" limit (no data loss). That seems like it would be easy to understand, comprehensive, and configurable enough for most usage.

When the "soft" limit is met:

Pause compaction
Pause recording rules
Reject PromQL queries (and other API queries?)

When the "hard" limit is met:

Fail scrapes
Reject OTLP
Reject remote read
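
As a rough sketch of the soft/hard split described above, the Go snippet below classifies current memory usage against two thresholds and lists the mitigations for each level. All type and function names are hypothetical, and it assumes that soft-limit mitigations stay active once the hard limit is reached, which the comment does not state explicitly.

```go
package main

import "fmt"

// Level is the memory-pressure level derived from the two thresholds.
type Level int

const (
	LevelOK Level = iota
	LevelSoft
	LevelHard
)

// Limiter holds hypothetical soft and hard thresholds in bytes.
type Limiter struct {
	SoftBytes uint64
	HardBytes uint64
}

// Classify maps current usage onto a level. Hard takes precedence over soft.
func (l Limiter) Classify(usedBytes uint64) Level {
	switch {
	case usedBytes >= l.HardBytes:
		return LevelHard
	case usedBytes >= l.SoftBytes:
		return LevelSoft
	default:
		return LevelOK
	}
}

// Mitigations lists the actions named in the comment above for each level.
// Assumption: soft mitigations remain active at the hard level.
func Mitigations(lv Level) []string {
	var out []string
	if lv >= LevelSoft {
		out = append(out, "pause compaction", "pause recording rules", "reject PromQL queries")
	}
	if lv >= LevelHard {
		out = append(out, "fail scrapes", "reject OTLP", "reject remote read")
	}
	return out
}

func main() {
	l := Limiter{SoftBytes: 800, HardBytes: 1000}
	for _, used := range []uint64{500, 900, 1200} {
		fmt.Println(used, Mitigations(l.Classify(used)))
	}
}
```

Keeping classification separate from the list of actions would make it straightforward to enable or disable individual mitigations later, if reviewers want that level of configurability.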

Question for reviewers:

Would you prefer I:

Expand the scope of the proposal to include all of the above behavior?
Just change the name of "scrape_memory_limiter" to "memory_limiter", and leave the rest of the topics (soft limits, recording rules, compaction, etc) as future enhancements?
Expand the scope of the proposal to only include pausing recording rules (in addition to failing scrapes).

SuperQ · 2026-05-04T16:30:19Z

I think expanding the proposal to support multiple mitigation options is a good idea.

Since all of the options discussed so far have pretty impactful downsides, different users are going to want to opt in to different options.

We likely want to support configuring different options, have some kind of tunable heuristics, etc.

But all of it is going to need to be tested for various failure modes.

dashpole · 2026-05-07T16:28:33Z

@SuperQ I expanded the proposal. The things I want to make sure we get right during the proposal stage are:

Categories: Did I miss any important sources of short-term memory usage? I've initially excluded standard PromQL queries and other debugging parts of the API surface (e.g. targets or metadata pages).
Soft and hard limits: Does this categorization of mitigations into "destructive" (hard) vs "non-destructive" (soft) make sense? Or should we allow setting more granular memory thresholds for each mitigation?
Configuration: Is turning on/off each mitigation enough configurability? Or will we need something more granular at the global level?

dashpole force-pushed the memory_limiter_proposal branch 2 times, most recently from 5e8d860 to 9076c48 Compare March 10, 2026 19:33

bwplotka reviewed Mar 11, 2026

Comment thread proposals/0076-scrape-memory-limiter.md Outdated

dashpole marked this pull request as ready for review March 11, 2026 13:49

dashpole force-pushed the memory_limiter_proposal branch from aaa6bdf to 951d80e Compare March 11, 2026 13:52

bwplotka mentioned this pull request Mar 12, 2026

Consider adding built-in metric for scrape failure reason prometheus/prometheus#18284

Open

bwplotka approved these changes Mar 12, 2026

saswatamcode approved these changes Mar 12, 2026

dashpole commented Mar 12, 2026

GiedriusS reviewed Mar 12, 2026

GiedriusS reviewed Mar 12, 2026

dashpole force-pushed the memory_limiter_proposal branch from 8200c74 to db29e0b Compare March 17, 2026 01:25

Aneurysm9 mentioned this pull request Mar 23, 2026

[receiver/prometheus] Enable using memory limiter extension open-telemetry/opentelemetry-collector-contrib#45439

Open

nicolastakashi reviewed Mar 28, 2026

dashpole force-pushed the memory_limiter_proposal branch from 9cbc215 to 677f3a0 Compare March 30, 2026 20:16

dashpole added 4 commits April 27, 2026 20:17

proposal: Add memory limiting in the scrape loop

3a4e9c5

Signed-off-by: David Ashpole <dashpole@google.com>

mention increased cardinality from a target

930499b

Signed-off-by: David Ashpole <dashpole@google.com>

adddress comments

5d509e6

Signed-off-by: David Ashpole <dashpole@google.com>

address interaction between GOMEMLIMIT and scrape memory limiting

78072c2

Signed-off-by: David Ashpole <dashpole@google.com>

yeya24 reviewed Apr 27, 2026

dashpole force-pushed the memory_limiter_proposal branch 2 times, most recently from 9e8baa5 to 5ab4584 Compare May 7, 2026 16:20

expand proposal to include other memory mitigations

dc95d8d

Signed-off-by: David Ashpole <dashpole@google.com>

dashpole force-pushed the memory_limiter_proposal branch from 5ab4584 to dc95d8d Compare May 7, 2026 16:20

