From 3a4e9c56da2b864478b1eae65e8e0c2c3ca8f2b2 Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Tue, 10 Mar 2026 19:15:59 +0000 Subject: [PATCH 1/5] proposal: Add memory limiting in the scrape loop Signed-off-by: David Ashpole --- proposals/0076-scrape-memory-limiter.md | 126 ++++++++++++++++++++++++ 1 file changed, 126 insertions(+) create mode 100644 proposals/0076-scrape-memory-limiter.md diff --git a/proposals/0076-scrape-memory-limiter.md b/proposals/0076-scrape-memory-limiter.md new file mode 100644 index 0000000..339a637 --- /dev/null +++ b/proposals/0076-scrape-memory-limiter.md @@ -0,0 +1,126 @@ +# Scrape Memory Limiter + +* **Owners:** + * @dashpole + +* **Implementation Status:** `Not started` + +* **Related Issues and PRs:** + * https://github.com/prometheus/prometheus/issues/17109 + * https://github.com/prometheus/prometheus/issues/13939 + * https://github.com/prometheus/prometheus/issues/11306 + * https://github.com/prometheus/prometheus/issues/16917 + +* **Other docs or links:** + * PromCon 2025 - Scrape Trolley Dilemma talk (credit to @bwplotka) + * [YouTube Recording](https://www.youtube.com/watch?v=ulHQUCarjjo) + * [Slides](https://docs.google.com/presentation/d/1jKrUklPdAor9292HrPWtJkIa6ruUhOGo9IFO7fNj-DE/edit?slide=id.p#slide=id.p) + +> TL;DR: This proposal introduces a Scrape Memory Limiter. It allows Prometheus to proactively and gracefully drop scrapes when the server's memory usage approaches a configured limit, preventing out-of-memory (OOM) crashes. + +## Why + +Dynamic service discovery can lead to growth in the number of targets (e.g., when new workloads are spun up in Kubernetes). These new targets, which may have high cardinality or expose large amounts of metrics, can cause memory growth in Prometheus, leading to OOM kills and total monitoring unavailability. + +When Prometheus runs out of memory, it crashes. 
This not only stops data collection for the newly added workloads but also stops data collection for all other workloads being monitored by that Prometheus instance. + +### Pitfalls of the current solution + +Current mitigations, such as the static per-job `sample_limit`, are insufficient since they require prior knowledge of target sizes and apply on a per-scrape basis. They do not dynamically protect the global heap across all targets. + +Relying on OS-level boundaries (such as a container memory limit) guarantees a hard crash of the entire Prometheus process when memory is exhausted, affecting the monitoring of all other targets. + +## Goals + +- Prevent Prometheus from crashing due to memory exhaustion when scrape load increases beyond what the server can handle. +- Provide a simple, top-level global configuration to enable the feature. +- Provide clear debuggability when scrapes are failed due to memory pressure. +- Maintain transactionality when a scrape is failed due to memory pressure. + +### Audience + +Prometheus operators running in memory-constrained environments (like Kubernetes) who have to deal with OOM kills, and/or who do not have full control over the applications being scraped. + +## Non-Goals + +- Soft limits, fairness, and per-job QoS controls are out of scope for the initial implementation. +- This does not address long-term memory leaks. It is only designed to prevent OOMs caused by short-term spikes in memory usage from scraping. + +## How + +The Scrape Memory Limiter acts as a proactive circuit breaker for the Prometheus server. Periodically, a background routine checks the current memory usage of the Prometheus process against a configured global limit. + +Right before initiating an HTTP request to scrape a target, the scrape loop will check the memory limiter status. If the memory usage is currently above the configured limit, the scrape transaction is aborted early. 
This ensures transactionality: the scrape is skipped in its entirety, preventing the allocation of memory for a potentially large influx of metrics that the system cannot currently handle. + +### Configuration + +A new top-level `scrape_memory_limiter` configuration block will be introduced in the Prometheus configuration file. + +The configuration is a subset of the configuration of the OpenTelemetry Collector's memory limiter processor, which has been used widely in production. It will be defined as a top-level block in the Prometheus configuration file. + +```yaml +# A new top-level block for the Scrape Memory Limiter. +scrape_memory_limiter: + # Target a maximum of 80% of total system memory. + # If total memory usage exceeds this percentage, scrapes are dropped. + limit_percentage: 80 + + # Alternatively, an absolute limit in MiB can be used: + # limit_mib: 1000 +``` + +### Feature Flag + +While the feature is experimental, the Scrape Memory Limiter will be gated behind a command-line feature flag: `--enable-feature=scrape-memory-limiter`, and will follow the usual process for feature graduation. + +If this flag is absent, the memory limiter will not be active and the configuration block will be ignored, even if configured in the Prometheus configuration. + +### Debuggability and User Experience + +Understanding that data is missing and *why* it is missing is a critical part of the user experience. This feature caters to two personas: + +**1. The Application Owner:** +Application owners need to understand why their specific application failed to be scraped. +* **Up Metric:** The `up` metric for their dropped target will record a `0`. This is the standard mechanism to indicate a failed scrape, which preserves their existing alerts on the `up` metric. +* **UI /targets Page:** A descriptive scrape error (e.g., `scrape memory limit exceeded`) will be attached to the target's state. 
This error message will be visible on the Prometheus `/targets` UI page so the application owner knows the failure was due to Prometheus memory limits rather than their own application being down. + +**2. The Prometheus Server Operator:** +Server operators need to understand the global impact of memory limiting so they can take corrective action (e.g., increasing memory limits, adding Prometheus replicas, or investigating massive targets). +* **Counter for aborted scrapes:** A new internal Prometheus metric (e.g., `prometheus_target_scrapes_skipped_memory_limit_total`) will be introduced to track the total number of aborted scrapes globally. Operators can set alerts on this metric to be notified of memory pressure, allowing them to intervene if data loss becomes too widespread. + +## Future Enhancements + +### Gradual Degradation (Soft Limits) + +Future support for soft memory limits (e.g., a `spike_limit_mib` parameter) will allow the limiter to degrade scrape load gradually before the hard limit is reached. Instead of a binary drop-everything approach, the limiter would drop an increasing percentage of scrapes as memory usage approaches the hard limit. + +### Fairness Mechanisms + +The initial implementation of the memory limiter proposed above might inadvertently starve small, critical targets when a noisy neighbor introduces memory pressure. Future iterations could introduce scheduling algorithms to ensure fairness. Advanced approaches like [Deficit Round Robin (DRR)](https://en.wikipedia.org/wiki/Deficit_round_robin) can mathematically guarantee fairness across targets during memory pressure, isolating the disruption to high-cardinality targets. +To implement fairness, the mechanism will need to predict the cost of a scrape. This prediction should be based on the **total number of samples** from the target's previous scrape, *not* the number of *new series* added. 
New series are highly volatile (a target rotating a label will add many new series in one scrape, but zero in the next), making them a poor heuristic for proactive load shedding. Total samples accurately correlate with the short-lived parsing overhead the scrape loop will incur. + +### Per-Job Controls + +Future enhancements could provide support for overriding or specifying memory bounds at the individual scrape-job level. This would grant operators granular control to protect critical monitoring jobs at the expense of less important jobs during memory shortages. +To implement this, Prometheus could leverage Quality of Service (QoS) or criticality metadata (e.g., `severity="critical"`) attached to specific metrics or jobs. This would allow the limiter to intelligently determine which scrapes or series are safe to drop. There is a weighted variant of [DRR](https://en.wikipedia.org/wiki/Deficit_round_robin) that could be used to implement this mechanism. + +## Alternatives + +1. **Do nothing** +2. **Rejecting only new series ([#16917](https://github.com/prometheus/prometheus/issues/16917), [PR #11124](https://github.com/prometheus/prometheus/pull/11124))**: Instead of dropping the entire scrape, Prometheus would accept updates for time series it already knows about but reject the allocation of *new* series. This violates scrape transactionality, as scrapes should be ingested in full or not at all. Partial ingestion leads to unpredictable query skew (e.g., a success rate query where the success metric is ingested but the newly created error metric is dropped) and breaks fundamental system behavior assumptions. This creates confusing, inconsistent data for the application owner that goes against the principle of least surprise. +3. **Slowing down scrapes**: Dynamically backing off the scrape interval (e.g., from 15s to 60s) for targets under memory pressure. 
While this might temporarily reduce memory intake, skipping scrapes entirely sends a clearer signal to users (`up = 0`) that something is wrong. Skipping a single scrape is usually acceptable because the query window generally covers at least twice the scrape interval. Conversely, dynamically slowing down scrapes might silently break assumptions users have built into their alerts and recording rules. + +### Complementary Ideas + +The following ideas are compatible and complementary with a Scrape Memory Limiter, but do not try to prevent memory exhaustion from scraping. They instead deal with recovering from an OOM crash loop, or target other sources of memory usage: +1. **Automated WAL Deletion on OOM ([#13939](https://github.com/prometheus/prometheus/issues/13939))**: Automatically deleting the Write-Ahead Log (WAL) when Prometheus is recovering from an OOM crash. While this allows the server to eventually start again, it is a reactive measure that still allows the server to crash (causing global monitoring downtime) and forces the deletion of recent data. +2. **Force Head Compaction/WAL Truncation Before Scraping ([#11306](https://github.com/prometheus/prometheus/issues/11306))**: Pausing scraping on startup until the WAL is fully replayed and compacted. This helps break a specific OOM crash cycle during startup but does not prevent the process from exhausting memory during normal operation. +3. **Limit Label Churn / New Series Over Time ([#17109](https://github.com/prometheus/prometheus/issues/17109))**: Introduce a per-instance or per-job configuration that tracks and limits the number of *new* series a specific target can introduce into the TSDB over a given time window. A Scrape Memory Limiter protects the *active heap* from sudden bursts during a scrape, while a label churn limiter protects the *TSDB* from slow cardinality growth memory leaks over time. They are complementary safeguards. +4. 
**Early Compaction / Forced GC**: Proactively forcing a Go Garbage Collection or triggering an early TSDB Head compaction when memory pressure builds to flush data to disk and free memory. While this might temporarily relieve pressure, the primary driver of OOMs in sudden-growth scenarios is new series cardinality, not just sample volume. Thus, the new series would immediately cause memory to balloon again. + +## Action Plan + +* [ ] Propose and finalize initial design +* [ ] Expose configuration via feature flag +* [ ] Implement configuration and memory tracking logic +* [ ] Add scrape-abort logic and debuggability metrics From 930499b385a67c4de71e582ebd73bc1676d40c33 Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Wed, 11 Mar 2026 13:51:13 +0000 Subject: [PATCH 2/5] mention increased cardinality from a target Signed-off-by: David Ashpole --- proposals/0076-scrape-memory-limiter.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/0076-scrape-memory-limiter.md b/proposals/0076-scrape-memory-limiter.md index 339a637..f024175 100644 --- a/proposals/0076-scrape-memory-limiter.md +++ b/proposals/0076-scrape-memory-limiter.md @@ -20,7 +20,7 @@ ## Why -Dynamic service discovery can lead to growth in the number of targets (e.g., when new workloads are spun up in Kubernetes). These new targets, which may have high cardinality or expose large amounts of metrics, can cause memory growth in Prometheus, leading to OOM kills and total monitoring unavailability. +Dynamic service discovery can lead to growth in the number of targets (e.g., when new workloads are spun up in Kubernetes). Existing targets can also occasionally have sharp increases in the cardinality of metrics they expose, or have slow "leaks" of new series over time. This additional load causes increased memory usage in Prometheus, which can lead to OOM kills and total monitoring unavailability. When Prometheus runs out of memory, it crashes. 
This not only stops data collection for the newly added workloads but also stops data collection for all other workloads being monitored by that Prometheus instance. From 5d509e6f73cfc4d468bde6438f856705674feb1f Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Tue, 17 Mar 2026 01:24:28 +0000 Subject: [PATCH 3/5] address comments Signed-off-by: David Ashpole --- proposals/0076-scrape-memory-limiter.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proposals/0076-scrape-memory-limiter.md b/proposals/0076-scrape-memory-limiter.md index f024175..dbbfddd 100644 --- a/proposals/0076-scrape-memory-limiter.md +++ b/proposals/0076-scrape-memory-limiter.md @@ -48,7 +48,7 @@ Prometheus operators running in memory-constrained environments (like Kubernetes ## How -The Scrape Memory Limiter acts as a proactive circuit breaker for the Prometheus server. Periodically, a background routine checks the current memory usage of the Prometheus process against a configured global limit. +The Scrape Memory Limiter acts as a proactive circuit breaker for the Prometheus server. Periodically (e.g. every second), a background routine checks the current memory usage of the Prometheus process against a configured global limit. Right before initiating an HTTP request to scrape a target, the scrape loop will check the memory limiter status. If the memory usage is currently above the configured limit, the scrape transaction is aborted early. This ensures transactionality: the scrape is skipped in its entirety, preventing the allocation of memory for a potentially large influx of metrics that the system cannot currently handle. @@ -97,7 +97,7 @@ Future support for soft memory limits (e.g., a `spike_limit_mib` parameter) will ### Fairness Mechanisms The initial implementation of the memory limiter proposed above might inadvertently starve small, critical targets when a noisy neighbor introduces memory pressure. 
Future iterations could introduce scheduling algorithms to ensure fairness. Advanced approaches like [Deficit Round Robin (DRR)](https://en.wikipedia.org/wiki/Deficit_round_robin) can mathematically guarantee fairness across targets during memory pressure, isolating the disruption to high-cardinality targets. -To implement fairness, the mechanism will need to predict the cost of a scrape. This prediction should be based on the **total number of samples** from the target's previous scrape, *not* the number of *new series* added. New series are highly volatile (a target rotating a label will add many new series in one scrape, but zero in the next), making them a poor heuristic for proactive load shedding. Total samples accurately correlate with the short-lived parsing overhead the scrape loop will incur. +To implement fairness, the mechanism will need to predict the relative cost of a scrape so that it can throttle targets proportionally to the expected short-term memory usage they will incur. This prediction should be based on the **total number of samples** from the target's previous scrape, *not* the number of *new series* added. New series are highly volatile (a target rotating a label will add many new series in one scrape, but zero in the next), making them a poor heuristic for proactive load shedding. Total samples accurately correlate with the short-lived parsing overhead the scrape loop will incur. 
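The cost prediction described above could feed a deficit round robin pass along the following lines. This is only a sketch under stated assumptions, not part of the proposal: the `Target` type, the `Round` function, and the `quantum` parameter are hypothetical names, and the predicted cost is simply the sample count of the previous scrape.

```go
package main

// Target is a hypothetical scheduling view of a scrape target.
type Target struct {
	Name        string
	LastSamples int // total samples in the previous scrape, used as the predicted cost
	deficit     int // credit accumulated over rounds in which the target was skipped
}

// Round performs one deficit round robin pass while memory is under pressure.
// Every target earns quantum credits; a target is admitted only when its
// accumulated credit covers its predicted cost. After admission the target has
// no pending work until its next interval, so its deficit resets to zero, as
// DRR does for a queue that has emptied.
func Round(targets []*Target, quantum int) []string {
	var admitted []string
	for _, t := range targets {
		t.deficit += quantum
		if t.deficit >= t.LastSamples {
			t.deficit = 0
			admitted = append(admitted, t.Name)
		}
	}
	return admitted
}
```

With a quantum of 1000, a target that exposed 100 samples last time is admitted every round, while a target that exposed 2500 samples is admitted only every third round, so the disruption concentrates on the expensive target.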
### Per-Job Controls From 78072c2608c8a8ec7a7df71b66c37420d0f0964d Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Mon, 30 Mar 2026 20:15:42 +0000 Subject: [PATCH 4/5] address interaction between GOMEMLIMIT and scrape memory limiting Signed-off-by: David Ashpole --- proposals/0076-scrape-memory-limiter.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/proposals/0076-scrape-memory-limiter.md b/proposals/0076-scrape-memory-limiter.md index dbbfddd..a4cd0f6 100644 --- a/proposals/0076-scrape-memory-limiter.md +++ b/proposals/0076-scrape-memory-limiter.md @@ -61,14 +61,18 @@ The configuration is a subset of the configuration of the OpenTelemetry Collecto ```yaml # A new top-level block for the Scrape Memory Limiter. scrape_memory_limiter: - # Target a maximum of 80% of total system memory. + # Target a maximum of 90% of total system memory. # If total memory usage exceeds this percentage, scrapes are dropped. - limit_percentage: 80 + limit_percentage: 90 # Alternatively, an absolute limit in MiB can be used: # limit_mib: 1000 ``` +#### Interaction with GOMEMLIMIT + +Prometheus automatically configures GOMEMLIMIT to 90% of its memory limit. When scrape memory limiting is enabled, the configured GOMEMLIMIT ratio will be applied to the scrape memory limiter's limit. This ensures that GOMEMLIMIT is always lower than the scrape memory limiter's limit, ensuring scrapes are only failed when memory usage could not be reduced by garbage collection. For example, if the scrape memory limiter is configured to 90% of total memory, GOMEMLIMIT will be set to 81% of total memory by default. + ### Feature Flag While the feature is experimental, the Scrape Memory Limiter will be gated behind a command-line feature flag: `--enable-feature=scrape-memory-limiter`, and will follow the usual process for feature graduation. @@ -109,6 +113,7 @@ To implement this, Prometheus could leverage Quality of Service (QoS) or critica 1. **Do nothing** 2. 
**Rejecting only new series ([#16917](https://github.com/prometheus/prometheus/issues/16917), [PR #11124](https://github.com/prometheus/prometheus/pull/11124))**: Instead of dropping the entire scrape, Prometheus would accept updates for time series it already knows about but reject the allocation of *new* series. This violates scrape transactionality, as scrapes should be ingested in full or not at all. Partial ingestion leads to unpredictable query skew (e.g., a success rate query where the success metric is ingested but the newly created error metric is dropped) and breaks fundamental system behavior assumptions. This creates confusing, inconsistent data for the application owner that goes against the principle of least surprise. 3. **Slowing down scrapes**: Dynamically backing off the scrape interval (e.g., from 15s to 60s) for targets under memory pressure. While this might temporarily reduce memory intake, skipping scrapes entirely sends a clearer signal to users (`up = 0`) that something is wrong. Skipping a single scrape is usually acceptable because the query window generally covers at least twice the scrape interval. Conversely, dynamically slowing down scrapes might silently break assumptions users have built into their alerts and recording rules. +4. **Independent GOMEMLIMIT configuration**: Instead of applying the GOMEMLIMIT ratio to the scrape memory limiter's limit, we could keep the two configuration knobs entirely separate. This would allow someone to set a higher GOMEMLIMIT compared to their scrape limit, which isn't really something users would want to do. It would also make the configuration more confusing to reason about. 
### Complementary Ideas From dc95d8db1c96727c7f3540dfae8221ffc529d8e8 Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Thu, 7 May 2026 16:18:43 +0000 Subject: [PATCH 5/5] expand proposal to include other memory mitigations Signed-off-by: David Ashpole --- ...mory-limiter.md => 0076-memory-limiter.md} | 128 ++++++++++++------ 1 file changed, 90 insertions(+), 38 deletions(-) rename proposals/{0076-scrape-memory-limiter.md => 0076-memory-limiter.md} (51%) diff --git a/proposals/0076-scrape-memory-limiter.md b/proposals/0076-memory-limiter.md similarity index 51% rename from proposals/0076-scrape-memory-limiter.md rename to proposals/0076-memory-limiter.md index a4cd0f6..18d3b1e 100644 --- a/proposals/0076-scrape-memory-limiter.md +++ b/proposals/0076-memory-limiter.md @@ -1,4 +1,4 @@ -# Scrape Memory Limiter +# Memory Limiter * **Owners:** * @dashpole @@ -16,87 +16,133 @@ * [YouTube Recording](https://www.youtube.com/watch?v=ulHQUCarjjo) * [Slides](https://docs.google.com/presentation/d/1jKrUklPdAor9292HrPWtJkIa6ruUhOGo9IFO7fNj-DE/edit?slide=id.p#slide=id.p) -> TL;DR: This proposal introduces a Scrape Memory Limiter. It allows Prometheus to proactively and gracefully drop scrapes when the server's memory usage approaches a configured limit, preventing out-of-memory (OOM) crashes. +> TL;DR: This proposal introduces a Memory Limiter for Prometheus. It allows the server to proactively and gracefully apply mitigations (such as pausing compaction, pausing recording rules, and dropping scrapes or rejecting OTLP metrics) when memory usage approaches configured limits, preventing out-of-memory (OOM) crashes. ## Why -Dynamic service discovery can lead to growth in the number of targets (e.g., when new workloads are spun up in Kubernetes). Existing targets can also occasionally have sharp increases in the cardinality of metrics they expose, or have slow "leaks" of new series over time. 
This additional load causes increased memory usage in Prometheus, which can lead to OOM kills and total monitoring unavailability. +Memory exhaustion is a common cause of Prometheus crashes (OOM kills). This can be triggered by many factors: +- Spikes in scrape load or metric cardinality (e.g., new workloads spun up in Kubernetes). +- Expensive PromQL queries or recording rules. +- High volume of incoming OTLP metrics or remote read requests. +- TSDB compaction requiring significant memory. -When Prometheus runs out of memory, it crashes. This not only stops data collection for the newly added workloads but also stops data collection for all other workloads being monitored by that Prometheus instance. +When Prometheus runs out of memory and crashes, it causes total monitoring unavailability, affecting all targets and users. ### Pitfalls of the current solution -Current mitigations, such as the static per-job `sample_limit`, are insufficient since they require prior knowledge of target sizes and apply on a per-scrape basis. They do not dynamically protect the global heap across all targets. - -Relying on OS-level boundaries (such as a container memory limit) guarantees a hard crash of the entire Prometheus process when memory is exhausted, affecting the monitoring of all other targets. +Current mitigations are fragmented and often static: +- `sample_limit` applies on a per-scrape basis and requires prior knowledge of target sizes. +- There is no global mechanism to coordinate load shedding across different sources of memory usage (scrapes, OTLP, rules, etc.). +- Relying on OS-level boundaries (like cgroup limits) guarantees a hard crash of the entire process. ## Goals -- Prevent Prometheus from crashing due to memory exhaustion when scrape load increases beyond what the server can handle. -- Provide a simple, top-level global configuration to enable the feature. -- Provide clear debuggability when scrapes are failed due to memory pressure. 
-- Maintain transactionality when a scrape is failed due to memory pressure. +- Prevent Prometheus from crashing due to memory exhaustion by applying graceful mitigations. +- Provide a unified, top-level global configuration similar to the OpenTelemetry Collector's memory limiter. +- Support both "soft" limits (non-destructive mitigations like pausing compaction) and "hard" limits (destructive mitigations like dropping data). +- Allow operators to enable/disable specific mitigations based on their needs. +- Provide clear debuggability when mitigations are triggered. ### Audience -Prometheus operators running in memory-constrained environments (like Kubernetes) who have to deal with OOM kills, and/or who do not have full control over the applications being scraped. +Prometheus operators running in memory-constrained environments who need to protect the server from unpredictable memory spikes from various sources. ## Non-Goals -- Soft limits, fairness, and per-job QoS controls are out of scope for the initial implementation. -- This does not address long-term memory leaks. It is only designed to prevent OOMs caused by short-term spikes in memory usage from scraping. +- Fairness and per-job QoS controls are out of scope for the initial implementation. +- This does not address long-term memory leaks. It is designed to handle spikes and overload scenarios. ## How -The Scrape Memory Limiter acts as a proactive circuit breaker for the Prometheus server. Periodically (e.g. every second), a background routine checks the current memory usage of the Prometheus process against a configured global limit. +The Memory Limiter acts as a proactive circuit breaker. Periodically (configured by `check_interval`), a background routine checks the current memory usage of the Prometheus process. -Right before initiating an HTTP request to scrape a target, the scrape loop will check the memory limiter status. 
If the memory usage is currently above the configured limit, the scrape transaction is aborted early. This ensures transactionality: the scrape is skipped in its entirety, preventing the allocation of memory for a potentially large influx of metrics that the system cannot currently handle. +The limiter maintains a **Soft Limit** and a **Hard Limit**. +* **Soft Limit** = `limit_mib` - `spike_limit_mib` (or calculated via percentages). +* **Hard Limit** = `limit_mib` (or calculated via `limit_percentage`). -### Configuration +### Mitigations -A new top-level `scrape_memory_limiter` configuration block will be introduced in the Prometheus configuration file. +When memory usage exceeds the limits, the following mitigations are applied (if enabled): -The configuration is a subset of the configuration of the OpenTelemetry Collector's memory limiter processor, which has been used widely in production. It will be defined as a top-level block in the Prometheus configuration file. +**At Soft Limit:** +- **Pause Compaction**: Pause background TSDB compaction. +- **Pause Recording Rules**: Pause evaluation of recording rules (alerting rules are not paused). + +**At Hard Limit:** +- **Fail Scrapes**: Skip scrapes to prevent allocation of memory for new samples. +- **Reject OTLP**: Reject incoming OTLP metrics requests. +- **Reject Remote Read**: Reject incoming remote read requests. + +### Configuration + +The configuration closely follows the OpenTelemetry Collector's memory limiter processor, with added toggles for specific mitigations. ```yaml -# A new top-level block for the Scrape Memory Limiter. -scrape_memory_limiter: - # Target a maximum of 90% of total system memory. - # If total memory usage exceeds this percentage, scrapes are dropped. - limit_percentage: 90 - - # Alternatively, an absolute limit in MiB can be used: +memory_limiter: + # Time between measurements of memory usage. Recommended value is 1s. 
+ check_interval: 1s + + # Maximum amount of memory, in MiB, targeted to be allocated. Defines the hard limit. # limit_mib: 1000 + + # Maximum spike expected between measurements. + # Soft limit = limit_mib - spike_limit_mib + # spike_limit_mib: 200 + + # Maximum amount of total memory targeted to be allocated (percentage). + limit_percentage: 90 + + # Maximum spike expected between measurements (percentage). + # Soft limit = limit_percentage - spike_limit_percentage + spike_limit_percentage: 20 + + # Granular controls to enable/disable specific mitigations + enforcement: + pause_compaction: true + pause_recording_rules: true + fail_scrapes: true + reject_otlp: true + reject_remote_read: true ``` -#### Interaction with GOMEMLIMIT +#### Interaction with `GOMEMLIMIT` -Prometheus automatically configures GOMEMLIMIT to 90% of its memory limit. When scrape memory limiting is enabled, the configured GOMEMLIMIT ratio will be applied to the scrape memory limiter's limit. This ensures that GOMEMLIMIT is always lower than the scrape memory limiter's limit, ensuring scrapes are only failed when memory usage could not be reduced by garbage collection. For example, if the scrape memory limiter is configured to 90% of total memory, GOMEMLIMIT will be set to 81% of total memory by default. +Prometheus already automatically sets `GOMEMLIMIT` to 90% of its total memory limit. When the memory limiter is enabled, we will maintain this automatic behavior but refine it to set `GOMEMLIMIT` to a percentage (default 90%) of the calculated **Soft Limit**. + +For example, if the Soft Limit is calculated to be 700 MiB, `GOMEMLIMIT` will be set to 630 MiB. This lowers the threshold for Go's garbage collector, ensuring it attempts to reclaim memory before Prometheus starts pausing background tasks. 
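The limit arithmetic above can be sketched in Go. This is a minimal illustration under stated assumptions rather than the proposed implementation: `Limits`, `DeriveLimits`, `Classify`, and the `State` constants are hypothetical names, only the absolute MiB settings are handled, and the 90% ratio is passed in as a parameter.

```go
package main

// Limits holds the thresholds derived from the memory_limiter block.
// All names in this sketch are illustrative, not Prometheus APIs.
type Limits struct {
	HardMiB       uint64 // limit_mib
	SoftMiB       uint64 // limit_mib - spike_limit_mib
	GoMemLimitMiB uint64 // share of the soft limit handed to the Go runtime
}

// DeriveLimits computes the thresholds from the absolute MiB settings.
// goMemLimitPercent is the ratio applied to the soft limit (90 in the text).
func DeriveLimits(limitMiB, spikeLimitMiB, goMemLimitPercent uint64) Limits {
	var soft uint64
	if limitMiB > spikeLimitMiB { // guard against unsigned underflow
		soft = limitMiB - spikeLimitMiB
	}
	return Limits{
		HardMiB:       limitMiB,
		SoftMiB:       soft,
		GoMemLimitMiB: soft * goMemLimitPercent / 100,
	}
}

// State describes which set of mitigations should currently be active.
type State int

const (
	Normal       State = iota // no mitigation
	SoftExceeded              // pause compaction and recording rules
	HardExceeded              // additionally fail scrapes and reject OTLP and remote read
)

// Classify maps a memory usage reading onto a limiter state.
func (l Limits) Classify(usedMiB uint64) State {
	switch {
	case usedMiB > l.HardMiB:
		return HardExceeded
	case usedMiB > l.SoftMiB:
		return SoftExceeded
	default:
		return Normal
	}
}
```

For instance, `DeriveLimits(900, 200, 90)` gives a soft limit of 700 MiB and a `GOMEMLIMIT` of 630 MiB, matching the example above. A real server could apply the last value with `debug.SetMemoryLimit` from `runtime/debug`, which takes the limit in bytes.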
### Feature Flag -While the feature is experimental, the Scrape Memory Limiter will be gated behind a command-line feature flag: `--enable-feature=scrape-memory-limiter`, and will follow the usual process for feature graduation. +While the feature is experimental, the Memory Limiter will be gated behind a command-line feature flag: `--enable-feature=memory-limiter`, and will follow the usual process for feature graduation. If this flag is absent, the memory limiter will not be active and the configuration block will be ignored, even if configured in the Prometheus configuration. ### Debuggability and User Experience -Understanding that data is missing and *why* it is missing is a critical part of the user experience. This feature caters to two personas: +Understanding that data is missing or delayed and *why* is critical. This feature caters to two personas: **1. The Application Owner:** -Application owners need to understand why their specific application failed to be scraped. -* **Up Metric:** The `up` metric for their dropped target will record a `0`. This is the standard mechanism to indicate a failed scrape, which preserves their existing alerts on the `up` metric. -* **UI /targets Page:** A descriptive scrape error (e.g., `scrape memory limit exceeded`) will be attached to the target's state. This error message will be visible on the Prometheus `/targets` UI page so the application owner knows the failure was due to Prometheus memory limits rather than their own application being down. +Application owners need to understand why their specific application failed to be scraped or why their OTLP metrics were rejected. +* **Up Metric:** The `up` metric for their dropped target will record a `0`. +* **UI /targets Page:** A descriptive scrape error (e.g., `memory limit exceeded`) will be attached to the target's state. 
+* **OTLP/Remote-Read Rejections:** OTLP and remote read requests will receive a 503 Service Unavailable error, indicating overload and signaling clients to retry with backoff. **2. The Prometheus Server Operator:** -Server operators need to understand the global impact of memory limiting so they can take corrective action (e.g., increasing memory limits, adding Prometheus replicas, or investigating massive targets). -* **Counter for aborted scrapes:** A new internal Prometheus metric (e.g., `prometheus_target_scrapes_skipped_memory_limit_total`) will be introduced to track the total number of aborted scrapes globally. Operators can set alerts on this metric to be notified of memory pressure, allowing them to intervene if data loss becomes too widespread. +Server operators need to understand the global impact of mitigations, including: +* **Compaction Backlog**: [New] `prometheus_tsdb_compaction_pending_blocks`: Tracks how far behind compaction is in blocks. +* **Scrape Skips:** [New] `prometheus_target_scrapes_skipped_total`: Tracks how many scrapes the server has skipped. +* **Rule Evaluation Pipeline**: [Existing] `prometheus_rule_group_iterations_missed_total`: Tracks how many times rule group iterations have been missed. +* **Rejected Metrics**: [Existing] `prometheus_http_requests_total`: Tracks rejections of OTLP and remote read requests. ## Future Enhancements -### Gradual Degradation (Soft Limits) +### Reject PromQL Queries + +Rejecting expensive PromQL queries (or all queries) when memory pressure is high. This was deferred from the initial proposal because determining which queries to reject is complex, and intermittent query failures make debugging hard. + +### Gradual Degradation -Future support for soft memory limits (e.g., a `spike_limit_mib` parameter) will allow the limiter to degrade scrape load gradually before the hard limit is reached. 
Instead of a binary drop-everything approach, the limiter would drop an increasing percentage of scrapes as memory usage approaches the hard limit. +Future support for degrading scrape load gradually before the hard limit is reached. Instead of a binary drop-everything approach, the limiter would drop an increasing percentage of scrapes as memory usage approaches the hard limit. ### Fairness Mechanisms @@ -128,4 +174,10 @@ The following ideas are compatible and complementary with a Scrape Memory Limite * [ ] Propose and finalize initial design * [ ] Expose configuration via feature flag * [ ] Implement configuration and memory tracking logic -* [ ] Add scrape-abort logic and debuggability metrics +* [ ] Implement scrape-abort logic and debuggability metrics (Hard Limit) + * Metric to add: `prometheus_target_scrapes_skipped_total`. +* [ ] Implement logic to pause/resume TSDB compaction (Soft Limit) + * Metric to add: `prometheus_tsdb_compaction_pending_blocks`. +* [ ] Implement logic to pause/resume recording rule evaluation (Soft Limit) +* [ ] Implement OTLP request rejection logic (Hard Limit) +* [ ] Implement Remote Read request rejection logic (Hard Limit)