2 changes: 1 addition & 1 deletion modules/ROOT/nav.adoc
@@ -358,7 +358,7 @@
** xref:sql:manage/index.adoc[Manage Redpanda SQL]
** xref:sql:troubleshoot/index.adoc[Troubleshoot]
*** xref:sql:troubleshoot/degraded-state-handling.adoc[]
*** xref:sql:troubleshoot/memory-management.adoc[Memory Management]
*** xref:sql:troubleshoot/oom-cancellations.adoc[OOM Cancellations]

* xref:develop:index.adoc[Develop]
** xref:develop:kafka-clients.adoc[]
66 changes: 66 additions & 0 deletions modules/sql/pages/troubleshoot/oom-cancellations.adoc
@@ -0,0 +1,66 @@
= Troubleshoot Memory-related Query Cancellations
:description: Recover from query cancellations triggered by Redpanda SQL's automatic OOM prevention.
:page-topic-type: troubleshooting
:personas: platform_admin, data_engineer
:learning-objective-1: Identify when query cancellations are caused by OOM prevention
:learning-objective-2: Recover from OOM-cancelled queries
:learning-objective-3: Monitor node memory usage to anticipate cancellations

// TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad.
// Options:
// "Troubleshoot memory-related query cancellations" (current; matches Troubleshoot section voice)
// "Recover from OOM cancellation" (concise; uses internal term)
// Keep "Memory management" (matches current nav label but doesn't signal action)

Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its memory limit. If your application sees the following error, your queries have hit this protection:

@kbatuigas here I think we need to explain the overall memory limit principles of RP SQL so users understand why they might be seeing this kind of error (especially if we don't have a separate scale guide, let's give them a hint here).

Something like: "While RP SQL queries can process very large input sources (many TB), RP SQL query results (and intermediate results created by operations like joins and aggregations) must fit into the aggregate memory available to all nodes in the cluster (as reported by this metric ...._). All concurrently running queries contribute to total memory consumption, and any one query can cause the node memory limits to be hit based on other concurrent queries ..."

I think @Greketrotny should draft/update this ^^^


[source,text]
----
cancelled due to OOM prevention

@Greketrotny Greketrotny May 20, 2026

The `cancelled due to OOM prevention` error is a sibling of the primary user-facing error: Query Out of Memory.
Query Out of Memory is reported when a particular query exhausts all memory resources and has to be cancelled. This is normal behavior: the engine counts the allocated memory and prevents it from entering an unexpected state or a deadlock. With this error, the advice is to retry the query, or to cancel or wait for other concurrently running tasks to finish before retrying. I feel like this page is describing this case, but with the wrong error message.
The thing is, the engine doesn't track all allocations, so it doesn't have full control over the allocated memory. This is where the `cancelled due to OOM prevention` error comes in.

The OOM prevention mechanism is an overseer. It addresses this gap by monitoring overall memory usage in an external, independent way. It's more of an emergency handler, which quickly frees reclaimable resources so the node remains operational. However, triggering it is a result of either the untracked pool growing unexpectedly or a serious problem with memory tracking, and should almost always result in a bug report by the client with access to the logs. This, I suspect, is more of a runbook/customer support scenario.

I don't know exactly what should be visible in the public documentation, but I feel like this page blends two problems, and there are two parts to describe/discuss. The first should definitely be visible to the user, with an explanation of why it happens. The second (the emergency one) is more of an issue/emergency. Maybe it should be in the docs too, but on a different page.

Right, let's focus on the user-facing error @kbatuigas

@kbatuigas where exactly does the user see this error? This doesn't appear to be a complete example.

What does that actually look like in the psql client? Let's show an actual real-world example.

Is this a standard Postgres error code?

----

// TODO: SME — confirm the exact client-facing error envelope. The string above is the error reason raised internally by the engine. Clients connecting through `psql` or a PostgreSQL driver typically receive it wrapped in a PostgreSQL error message. Confirm:
// - Is a SQLSTATE code set on this error? If so, which one?
// - Does the message reach the client verbatim, or is the wording different?

Only queries running on the affected node at the time of reclamation are cancelled. Other nodes in the cluster continue to serve queries. The node resumes accepting new queries as soon as reclamation completes, so in most cases a retry of the failed query succeeds.
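The retry advice above could be applied on the client side along these lines. This is a minimal sketch, not a confirmed client pattern: `run_with_retry`, `is_oom_cancellation`, and the backoff values are hypothetical names and defaults, and `run_query` stands in for whatever callable your driver exposes. Because the SQLSTATE for this error is not yet confirmed, the sketch matches on the reason string quoted on this page, which may not reach the client verbatim.

```python
import time

# Reason string quoted on this page. Assumption: it appears somewhere in the
# message of the exception your driver raises. The SQLSTATE is unconfirmed.
OOM_MARKER = "cancelled due to OOM prevention"


def is_oom_cancellation(exc):
    """Return True if the exception message contains the OOM prevention reason."""
    return OOM_MARKER in str(exc)


def run_with_retry(run_query, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call run_query(), retrying only when it fails with an OOM cancellation.

    run_query is any zero-argument callable that executes the query and
    returns its result. The delay doubles after each failed attempt
    (base_delay, 2 * base_delay, ...). Any other error, or an OOM
    cancellation on the last attempt, is re-raised unchanged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            if not is_oom_cancellation(exc) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With a PostgreSQL driver, `run_query` would typically be a small function or lambda that opens a cursor, executes the statement, and fetches the rows. Keeping `max_attempts` small avoids adding load to a node that is already short on memory.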

Use this page to:

* [ ] {learning-objective-1}
* [ ] {learning-objective-2}
* [ ] {learning-objective-3}

== If the error keeps happening

If queries are repeatedly cancelled with this error, the workload is consistently approaching the node's memory limit.

// TODO: SME — runbook depth. Confirm which of the following actions to recommend, and in what order. Suggested guidance to validate:
// - Reduce query concurrency on the affected workload.
// - Simplify the query — narrow the scan range, add filters, reduce parallel CTEs.
// - Scale up the cluster.
// Also confirm: is there a heuristic for choosing among them (for example, look at oxla_process_memory_total over time)?

== Why this happens

Redpanda SQL monitors each node's resident memory usage and triggers a brief reclamation phase when the node approaches its memory limit. During reclamation, the node cancels its running queries and frees memory so it can keep serving new queries. The protection runs on each node independently and is always on. There is no configuration option to enable, disable, or tune it.

== Monitor memory usage

Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit:

[cols="1,3"]
|===
| Metric | Description

| `oxla_process_memory_total`
| Process Resident Set Size (RSS) in bytes, reported per node.
|===
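As a rough illustration of watching this gauge, the sketch below parses Prometheus text exposition output and flags nodes whose RSS is close to a limit. Only the metric name comes from this page. The function names, the label shown in the usage note, the 0.8 threshold, and the idea of passing the node's memory limit in by hand are all assumptions for illustration. How you fetch the metrics text (endpoint, port, authentication) is deliberately left out.

```python
METRIC = "oxla_process_memory_total"  # gauge name from the table above


def parse_gauge(exposition_text, metric=METRIC):
    """Extract (labels, value) samples for one metric from Prometheus text output.

    labels is the raw text between the braces, or "" when the sample has none.
    Comment lines (# HELP, # TYPE) and other metrics are skipped.
    """
    samples = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or not line.startswith(metric):
            continue
        tail = line[len(metric):]
        if not tail or tail[0] not in "{ \t":
            continue  # a different metric that merely shares the prefix
        labels = ""
        if tail[0] == "{":
            end = tail.rindex("}")
            labels = tail[1:end]
            tail = tail[end + 1:]
        # The first token after the labels is the value; an optional
        # timestamp may follow and is ignored.
        samples.append((labels, float(tail.split()[0])))
    return samples


def nodes_near_limit(samples, limit_bytes, threshold=0.8):
    """Return (labels, fraction_of_limit) for samples at or above the threshold.

    limit_bytes is the node memory limit, supplied by the caller (assumption:
    you know it from your deployment). threshold is a hypothetical default.
    """
    return [
        (labels, rss / limit_bytes)
        for labels, rss in samples
        if rss / limit_bytes >= threshold
    ]
```

For example, given scraped text in `text` and a 16 GiB node limit, `nodes_near_limit(parse_gauge(text), 16 * 1024**3)` returns the samples that have reached 80% of that limit. In practice, an alerting rule in your monitoring stack is likely a better home for this check than an ad hoc script.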

// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented and add a cross-link from "Suggested reading" below to that page.

== Suggested reading

* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster.
* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state.