diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 142049a66..964b8efc2 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -358,7 +358,7 @@
 ** xref:sql:manage/index.adoc[Manage Redpanda SQL]
 ** xref:sql:troubleshoot/index.adoc[Troubleshoot]
 *** xref:sql:troubleshoot/degraded-state-handling.adoc[]
-*** xref:sql:troubleshoot/memory-management.adoc[Memory Management]
+*** xref:sql:troubleshoot/query-out-of-memory.adoc[Query Out-of-Memory Errors]
 * xref:develop:index.adoc[Develop]
 ** xref:develop:kafka-clients.adoc[]
diff --git a/modules/sql/pages/troubleshoot/query-out-of-memory.adoc b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc
new file mode 100644
index 000000000..7e8bb5966
--- /dev/null
+++ b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc
@@ -0,0 +1,86 @@
+= Troubleshoot Query Out-of-Memory Errors
+:description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution.
+:page-topic-type: troubleshooting
+:personas: platform_admin, data_engineer
+:learning-objective-1: Identify when a query was canceled because it ran out of memory
+:learning-objective-2: Recover from a `Query Out of Memory` error and reduce its frequency
+:learning-objective-3: Monitor node memory usage to anticipate memory pressure
+
+If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client:
+
+// TODO: SME - the OOM error currently surfaces with SQLSTATE XX000
+// (InternalError, the fall-through default in session.cpp). That's a generic
+// "internal error" class, not a memory-class code like 53200 (out_of_memory)
+// or 53400 (configuration_limit_exceeded). Confirm whether this is intentional
+// or a bug to be fixed. If the SQLSTATE changes, add the
+// verbose-mode example.
+[source,text]
+----
+ERROR: Query Out of Memory!
+----
+
+Canceling the query frees its memory and allows the engine to continue serving other queries. This is a normal protection mechanism and is not a sign of cluster failure.
+
+Use this page to:
+
+* [ ] {learning-objective-1}
+* [ ] {learning-objective-2}
+* [ ] {learning-objective-3}
+
+== How Redpanda SQL uses memory
+
+// TODO: SME - rewrite this section per the PR 584 review thread.
+// The placeholder below is the suggested draft from the review.
+//
+// Goal of the section: explain enough about RP SQL's memory model that a
+// user reading this troubleshooting page understands *why* a Query Out of
+// Memory error can happen even on large clusters, and what about their
+// query / workload makes it more or less likely.
+
+Redpanda SQL queries can read very large input sources (many terabytes). However, the result set and any intermediate results produced by operations such as joins and aggregations must fit into the aggregate memory available across all nodes in the cluster. All concurrently running queries contribute to total memory consumption, so a query can hit the node memory limit because of pressure from other queries running at the same time.
+
+== Recover from the error
+
+When a single query fails with `Query Out of Memory`, retry it. Canceling the failed query frees its memory, so the next attempt often succeeds, especially if other concurrent queries have completed in the meantime.
+
+If the same query keeps failing, either the query needs more memory than the current cluster size provides, or too many other queries are competing for memory at the same time. Reduce concurrent load, reduce the query's memory footprint, or add capacity:
+
+* Reduce concurrency.
++
+Run fewer queries in parallel against the cluster. Other queries running at the same time contribute to the total memory pressure.
+* Simplify the query.
++
+Narrow the scan range with tighter `WHERE` filters, reduce the number of `JOIN` clauses, or break a large aggregation into smaller ones. Operations that materialize wide intermediate results (joins, sorts, distinct aggregations) drive memory consumption the most.
+* Scale the cluster.
++
+Add SQL nodes to increase the aggregate memory available to queries. See xref:sql:get-started/deploy-sql-cluster.adoc#scale-redpanda-sql[Scale Redpanda SQL].
+
+// TODO: SME - confirm the recovery order above and whether a heuristic
+// exists for choosing among them (for example, watching
+// `oxla_process_memory_total` over time before deciding to scale).
+
+== Monitor memory usage
+
+Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit:
+
+[cols="1,3"]
+|===
+| Metric | Description
+
+| `oxla_process_memory_total`
+| Process Resident Set Size (RSS) in bytes, reported per node.
+|===
+
+// TODO: Once the Redpanda SQL metrics catalog is finalized, replace this
+// inline table with a cross-link to the metrics page.
+
+== If you see `cancelled due to OOM prevention` instead
+
+The `cancelled due to OOM prevention` error is a separate case. Redpanda SQL's engine includes an overseer that monitors overall node memory independently of per-query accounting. When the overseer detects that the untracked memory pool has grown unexpectedly, it cancels running queries on the affected node to keep the engine operational.
+
+This condition is rare and almost always indicates a bug in memory accounting or an unexpected workload pattern. Collect the cluster logs from around the time of the error and contact https://support.redpanda.com/hc/en-us/requests/new[Redpanda Support^].
+
+== Suggested reading
+
+* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster.
+* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state.