Skip to content

feat(appkit): allowlist DBSQL error codes whose messages are safe to forward#399

Open
jamesbroadhead wants to merge 3 commits into
stack/arrow-3-inline-arrow-fixfrom
stack/arrow-5-error-allowlist
Open

feat(appkit): allowlist DBSQL error codes whose messages are safe to forward#399
jamesbroadhead wants to merge 3 commits into
stack/arrow-3-inline-arrow-fixfrom
stack/arrow-5-error-allowlist

Conversation

@jamesbroadhead
Copy link
Copy Markdown
Contributor

Summary

Follow-up to the error-sanitization work in #329. Restores user-facing SQL error messages (e.g. `Table 'foo' not found`, `Syntax error near ','`) without re-introducing the CWE-209 leak from raw control-plane wording.

#329 collapses every `ExecutionError.statementFailed` to a generic "Query execution failed" client message. That was the right safe-default for the sanitization pass, but it loses real user value — anyone with a typo in their SQL now debugs blind, because the error is the user's own SQL artifact and was authored by DBR for them to read.

This PR splits the two error sources by `ServiceErrorCode`:

  1. DBR (data plane) — populates `errorDisplayMessage` for user-authored SQL errors. Source of truth: `sqlgateway/scheduler/src/main/scala/DriverRequests.scala:181-208` — `ErrorInfo.withMessage(command.errorDisplayMessage).withErrorReturnedFromDataPlane(true)`. Codes: `BAD_REQUEST`, `NOT_FOUND`, `ALREADY_EXISTS`, `DEADLINE_EXCEEDED`, `CANCELLED`, `UNAUTHENTICATED`. These messages are the user's own SQL — passthrough is correct.

  2. Control plane (proxy / scheduler / metaservice) — wraps internal RPC failures, scheduler exceptions, capacity rejections. Source: `sqlgateway/scheduler/src/main/scala/utils/CommandUtils.scala`, `exceptions.scala`. Codes: `INTERNAL_ERROR`, `IO_ERROR`, `UNKNOWN`, `RESOURCE_EXHAUSTED`, `SERVICE_UNDER_MAINTENANCE`, `TEMPORARILY_UNAVAILABLE`, `WORKSPACE_TEMPORARILY_UNAVAILABLE`, `ABORTED`. These can carry correlation IDs, RPC paths, stack traces — must not pass through.

Design

  • New module `packages/appkit/src/errors/dbsql-error-allowlist.ts` exports `isSqlErrorPassthrough(errorCode)` — a `ReadonlySet`-backed predicate over `ServiceErrorCode`.
  • `ExecutionError.statementFailed` consults the allowlist before defaulting `clientMessage`. Allowlisted codes get the verbatim warehouse text; anything else (including future SDK additions) collapses to the generic "Query execution failed".
  • An explicit `clientMessage` argument always wins — the allowlist is a default, not a ceiling.
  • Server-side `.message` always preserves the raw upstream text for `logger.error` capture; sanitization is purely at the wire boundary.

Stack

Stacks on #329. Targets `stack/arrow-3-inline-arrow-fix` (not `main`) so it merges as the next layer once the parent lands.

Test plan

  • `describe("statementFailed clientMessage passthrough")` — 13 new cases covering each allowlisted code, key denied codes, default-deny on unknown / undefined / lowercased input, explicit clientMessage override, and raw-message preservation for log debugging.
  • `dbsql-error-allowlist.test.ts` — focused tests on the predicate itself (allowlist members, explicit denylist members, undefined / empty / unknown / lowercase).
  • Full `appkit` suite: 2,129 tests pass (+31 new).
  • `tsc --noEmit` clean; `biome check` clean.

This pull request and its description were written by Isaac.

…forward

Follow-up to the error-sanitization work on #329. The clientMessage
default of "Query execution failed" was applied uniformly to every
ExecutionError.statementFailed call, which hid genuinely user-facing
errors — the user's own SQL "Table 'foo' not found", "Syntax error
near ',' on line 3", etc. — and forced users to debug in the dark.

The fix is *not* to remove sanitization (CWE-209 is real — control
plane messages carry correlation IDs, internal RPC paths, stack traces),
but to distinguish the two sources by `ServiceErrorCode`:

1. **DBR (data plane)** — populates `errorDisplayMessage` for
   user-authored SQL errors and surfaces them via the gateway untouched.
   Source: `sqlgateway/scheduler/src/main/scala/DriverRequests.scala:181-208`,
   `ErrorInfo.withMessage(command.errorDisplayMessage).withErrorReturnedFromDataPlane(true)`.
   Codes: BAD_REQUEST, NOT_FOUND, ALREADY_EXISTS, DEADLINE_EXCEEDED,
   CANCELLED, UNAUTHENTICATED.

2. **Control plane** — wraps internal RPC failures, scheduler exceptions,
   capacity rejections. Source:
   `sqlgateway/scheduler/src/main/scala/utils/CommandUtils.scala`,
   `exceptions.scala`. Codes: INTERNAL_ERROR, IO_ERROR, UNKNOWN,
   RESOURCE_EXHAUSTED, SERVICE_UNDER_MAINTENANCE,
   TEMPORARILY_UNAVAILABLE, WORKSPACE_TEMPORARILY_UNAVAILABLE, ABORTED.

`dbsql-error-allowlist.ts` enumerates source-1 codes. The Set is
default-deny: any code not on the list — including future SDK additions
the allowlist hasn't yet been reviewed against — collapses to the
generic clientMessage. An explicit `clientMessage` argument to
`statementFailed()` always wins.

Server-side `.message` always preserves the raw upstream text for
`logger.error` capture; sanitization is purely at the wire boundary.

Validation: 2,129 appkit tests pass (+31 new); tsc clean; biome clean.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
@jamesbroadhead jamesbroadhead requested a review from a team as a code owner May 21, 2026 22:16
@jamesbroadhead jamesbroadhead requested review from MarioCadenas and removed request for a team May 21, 2026 22:16
Reviewer pushback on the original allowlist: TEMPORARILY_UNAVAILABLE,
RESOURCE_EXHAUSTED, SERVICE_UNDER_MAINTENANCE, WORKSPACE_TEMPORARILY_-
UNAVAILABLE, and ABORTED carry messages that ARE the entire point of
having an error string at the wire — telling users "Query execution
failed" when the actual cause is a quota / maintenance / shard-moving
condition forces them to guess. Allowlisting these.

Verified against universe sources before reclassifying:

- TEMPORARILY_UNAVAILABLE: stable template "DBSQL temporarily
  unavailable. Please try again in a few minutes." constructed at
  `sqlgateway/scheduler/api/src/main/scala/client/SchedulerRpcClient.scala`
  and `client/linkstore/SqlGatewayClerk.scala`.

- RESOURCE_EXHAUSTED: stable user-actionable templates from
  `SchedulerRpcValidatorHook.scala`: "The maximum number of warehouses
  has been reached. Please contact Databricks support.", "You've hit
  the limit for warehouses for free usage. Stop or delete existing
  warehouses to free up capacity."

- WORKSPACE_TEMPORARILY_UNAVAILABLE: workspace-level shard-routing
  messages at `sqlgateway/proxy/src/main/scala/thrift/Exceptions.scala`
  — names a shard ID, which is internal topology metadata but
  low-sensitivity and operationally useful for retry.

- ABORTED: short reason strings like "version mismatch" at
  `ReydenWarehouseMonitor`.

- SERVICE_UNDER_MAINTENANCE: maintenance-window template messages.

INTERNAL_ERROR, IO_ERROR, UNKNOWN stay denied — these wrap raw
`ex.getMessage` and interpolate orgIds, warehouse IDs, stack traces.
`EndpointModel.scala` emits "SQL ${conf.warehouse} cannot be created
due to repeated collisions on unique IDs." — exactly the kind of
internal-state leak the sanitization exists to prevent.

Reframed the module docs from "DBR vs control plane" to
"designed-for-user vs designed-for-debugging" — the same code can be
constructed at user-facing or debug-facing sites, but the contested
5xx-class codes are overwhelmingly the former in practice.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
@jamesbroadhead
Copy link
Copy Markdown
Contributor Author

Pushed a5c8e4a4 — expanded the allowlist after reviewer pushback on the 5xx-class codes.

Newly allowlisted (all verified against universe construction sites):

  • TEMPORARILY_UNAVAILABLE — stable template "DBSQL temporarily unavailable. Please try again in a few minutes." (SchedulerRpcClient.scala, SqlGatewayClerk.scala)
  • RESOURCE_EXHAUSTED — explicit user-actionable templates: "The maximum number of warehouses has been reached. Please contact Databricks support.", "Stop or delete existing warehouses to free up capacity." (SchedulerRpcValidatorHook.scala)
  • WORKSPACE_TEMPORARILY_UNAVAILABLE — shard-routing messages (proxy/thrift/Exceptions.scala); names a shard ID, low-sensitivity, retry-relevant
  • SERVICE_UNDER_MAINTENANCE — maintenance-window templates
  • ABORTED — short reason strings ("version mismatch")

Still denied — these wrap raw ex.getMessage or interpolate internal identifiers, so leak risk is concrete:

  • INTERNAL_ERROREndpointModel.scala emits "SQL ${conf.warehouse} cannot be created due to repeated collisions on unique IDs."; SchedulerRpcContextValidatorHook.workspaceNotOwned(orgId) interpolates orgId
  • IO_ERROR — storage / network paths
  • UNKNOWN — unclassified by definition

Also reframed the module-level docs from "DBR vs control plane" to "designed-for-user vs designed-for-debugging" — same construction sites can land in either bucket, and the latter framing matches how the codes are actually used in practice.

… triage

Threat model walkthrough — the recipient of these messages is an
authenticated workspace user, not an anonymous internet visitor. They
already know:
- Their workspace URL and orgId (visible in every SDK call)
- The warehouses they have grants on (listable via the SDK)
- Their catalog contents

Interpolated identifiers in INTERNAL_ERROR / IO_ERROR messages
(warehouse names, orgIds, internal class names) are not new information
to this audience — the classic CWE-209 leak-to-anonymous-attacker
scenario does not apply. The cost of denial (every internal error
becomes "Query execution failed" with zero signal) is certain; the
benefit (defense-in-depth against unbounded wrapped `ex.getMessage`
content) is speculative.

Allowlisting both, with a requestId on the SSE error frame as the
safety net:

- `SSEError` now carries `requestId` (the SSE event id, same value the
  server logs in its `Stream execution failed` line — quoting it in a
  support ticket lets staff grep logs directly).
- `SSEWriter.writeError` populates `requestId` from the event id.
- `stream-manager.ts` logs `requestId=...` alongside `rawMsg`,
  `errorCode`, `upstreamCode` so a user-supplied requestId pinpoints
  the exact log line.
- Analytics `/arrow-result` 410/404 paths generate their own requestId
  and emit it in both the JSON response and the server log line.
- `useAnalyticsQuery` exposes `requestId: string | null` on
  `UseAnalyticsQueryResult` so UI code can render "Error ref: <id>".

Only `UNKNOWN` remains denied — unclassified by definition, so the
contents are unbounded in both shape and source.

Validation: 2,133 appkit + 292 appkit-ui tests pass; tsc clean.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant