feat(appkit): allowlist DBSQL error codes whose messages are safe to forward by jamesbroadhead · Pull Request #399 · databricks/appkit

jamesbroadhead · 2026-05-21T22:16:01Z

Summary

Follow-up to the error-sanitization work in #329. Restores user-facing SQL error messages (e.g. `Table 'foo' not found`, `Syntax error near ','`) without re-introducing the CWE-209 leak from raw control-plane wording.

#329 collapses every `ExecutionError.statementFailed` to a generic "Query execution failed" client message. That was the right safe-default for the sanitization pass, but it loses real user value — anyone with a typo in their SQL now debugs blind, because the error is the user's own SQL artifact and was authored by DBR for them to read.

This PR splits the two error sources by `ServiceErrorCode`:

DBR (data plane) — populates `errorDisplayMessage` for user-authored SQL errors. Source of truth: `sqlgateway/scheduler/src/main/scala/DriverRequests.scala:181-208` — `ErrorInfo.withMessage(command.errorDisplayMessage).withErrorReturnedFromDataPlane(true)`. Codes: `BAD_REQUEST`, `NOT_FOUND`, `ALREADY_EXISTS`, `DEADLINE_EXCEEDED`, `CANCELLED`, `UNAUTHENTICATED`. These messages are the user's own SQL — passthrough is correct.
Control plane (proxy / scheduler / metaservice) — wraps internal RPC failures, scheduler exceptions, capacity rejections. Source: `sqlgateway/scheduler/src/main/scala/utils/CommandUtils.scala`, `exceptions.scala`. Codes: `INTERNAL_ERROR`, `IO_ERROR`, `UNKNOWN`, `RESOURCE_EXHAUSTED`, `SERVICE_UNDER_MAINTENANCE`, `TEMPORARILY_UNAVAILABLE`, `WORKSPACE_TEMPORARILY_UNAVAILABLE`, `ABORTED`. These can carry correlation IDs, RPC paths, stack traces — must not pass through.

Design

New module `packages/appkit/src/errors/dbsql-error-allowlist.ts` exports `isSqlErrorPassthrough(errorCode)` — a `ReadonlySet`-backed predicate over `ServiceErrorCode`.
`ExecutionError.statementFailed` consults the allowlist before defaulting `clientMessage`. Allowlisted codes get the verbatim warehouse text; anything else (including future SDK additions) collapses to the generic "Query execution failed".
An explicit `clientMessage` argument always wins — the allowlist is a default, not a ceiling.
Server-side `.message` always preserves the raw upstream text for `logger.error` capture; sanitization is purely at the wire boundary.

Stack

Stacks on #329. Targets `stack/arrow-3-inline-arrow-fix` (not `main`) so it merges as the next layer once the parent lands.

Test plan

`describe("statementFailed clientMessage passthrough")` — 13 new cases covering each allowlisted code, key denied codes, default-deny on unknown / undefined / lowercased input, explicit clientMessage override, and raw-message preservation for log debugging.
`dbsql-error-allowlist.test.ts` — focused tests on the predicate itself (allowlist members, explicit denylist members, undefined / empty / unknown / lowercase).
Full `appkit` suite: 2,129 tests pass (+31 new).
`tsc --noEmit` clean; `biome check` clean.

This pull request and its description were written by Isaac.

…forward Follow-up to the error-sanitization work on #329. The clientMessage default of "Query execution failed" was applied uniformly to every ExecutionError.statementFailed call, which hid genuinely user-facing errors — the user's own SQL "Table 'foo' not found", "Syntax error near ',' on line 3", etc. — and forced users to debug in the dark. The fix is *not* to remove sanitization (CWE-209 is real — control plane messages carry correlation IDs, internal RPC paths, stack traces), but to distinguish the two sources by `ServiceErrorCode`: 1. **DBR (data plane)** — populates `errorDisplayMessage` for user-authored SQL errors and surfaces them via the gateway untouched. Source: `sqlgateway/scheduler/src/main/scala/DriverRequests.scala:181-208`, `ErrorInfo.withMessage(command.errorDisplayMessage).withErrorReturnedFromDataPlane(true)`. Codes: BAD_REQUEST, NOT_FOUND, ALREADY_EXISTS, DEADLINE_EXCEEDED, CANCELLED, UNAUTHENTICATED. 2. **Control plane** — wraps internal RPC failures, scheduler exceptions, capacity rejections. Source: `sqlgateway/scheduler/src/main/scala/utils/CommandUtils.scala`, `exceptions.scala`. Codes: INTERNAL_ERROR, IO_ERROR, UNKNOWN, RESOURCE_EXHAUSTED, SERVICE_UNDER_MAINTENANCE, TEMPORARILY_UNAVAILABLE, WORKSPACE_TEMPORARILY_UNAVAILABLE, ABORTED. `dbsql-error-allowlist.ts` enumerates source-1 codes. The Set is default-deny: any code not on the list — including future SDK additions the allowlist hasn't yet been reviewed against — collapses to the generic clientMessage. An explicit `clientMessage` argument to `statementFailed()` always wins. Server-side `.message` always preserves the raw upstream text for `logger.error` capture; sanitization is purely at the wire boundary. Validation: 2,129 appkit tests pass (+31 new); tsc clean; biome clean. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>

Reviewer pushback on the original allowlist: TEMPORARILY_UNAVAILABLE, RESOURCE_EXHAUSTED, SERVICE_UNDER_MAINTENANCE, WORKSPACE_TEMPORARILY_- UNAVAILABLE, and ABORTED carry messages that ARE the entire point of having an error string at the wire — telling users "Query execution failed" when the actual cause is a quota / maintenance / shard-moving condition forces them to guess. Allowlisting these. Verified against universe sources before reclassifying: - TEMPORARILY_UNAVAILABLE: stable template "DBSQL temporarily unavailable. Please try again in a few minutes." constructed at `sqlgateway/scheduler/api/src/main/scala/client/SchedulerRpcClient.scala` and `client/linkstore/SqlGatewayClerk.scala`. - RESOURCE_EXHAUSTED: stable user-actionable templates from `SchedulerRpcValidatorHook.scala`: "The maximum number of warehouses has been reached. Please contact Databricks support.", "You've hit the limit for warehouses for free usage. Stop or delete existing warehouses to free up capacity." - WORKSPACE_TEMPORARILY_UNAVAILABLE: workspace-level shard-routing messages at `sqlgateway/proxy/src/main/scala/thrift/Exceptions.scala` — names a shard ID, which is internal topology metadata but low-sensitivity and operationally useful for retry. - ABORTED: short reason strings like "version mismatch" at `ReydenWarehouseMonitor`. - SERVICE_UNDER_MAINTENANCE: maintenance-window template messages. INTERNAL_ERROR, IO_ERROR, UNKNOWN stay denied — these wrap raw `ex.getMessage` and interpolate orgIds, warehouse IDs, stack traces. `EndpointModel.scala` emits "SQL ${conf.warehouse} cannot be created due to repeated collisions on unique IDs." — exactly the kind of internal-state leak the sanitization exists to prevent. Reframed the module docs from "DBR vs control plane" to "designed-for-user vs designed-for-debugging" — the same code can be constructed at user-facing or debug-facing sites, but the contested 5xx-class codes are overwhelmingly the former in practice. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>

jamesbroadhead · 2026-05-22T10:58:19Z

Pushed a5c8e4a4 — expanded the allowlist after reviewer pushback on the 5xx-class codes.

Newly allowlisted (all verified against universe construction sites):

TEMPORARILY_UNAVAILABLE — stable template "DBSQL temporarily unavailable. Please try again in a few minutes." (SchedulerRpcClient.scala, SqlGatewayClerk.scala)
RESOURCE_EXHAUSTED — explicit user-actionable templates: "The maximum number of warehouses has been reached. Please contact Databricks support.", "Stop or delete existing warehouses to free up capacity." (SchedulerRpcValidatorHook.scala)
WORKSPACE_TEMPORARILY_UNAVAILABLE — shard-routing messages (proxy/thrift/Exceptions.scala); names a shard ID, low-sensitivity, retry-relevant
SERVICE_UNDER_MAINTENANCE — maintenance-window templates
ABORTED — short reason strings ("version mismatch")

Still denied — these wrap raw ex.getMessage or interpolate internal identifiers, so leak risk is concrete:

INTERNAL_ERROR — EndpointModel.scala emits "SQL ${conf.warehouse} cannot be created due to repeated collisions on unique IDs."; SchedulerRpcContextValidatorHook.workspaceNotOwned(orgId) interpolates orgId
IO_ERROR — storage / network paths
UNKNOWN — unclassified by definition

Also reframed the module-level docs from "DBR vs control plane" to "designed-for-user vs designed-for-debugging" — same construction sites can land in either bucket, and the latter framing matches how the codes are actually used in practice.

… triage Threat model walkthrough — the recipient of these messages is an authenticated workspace user, not an anonymous internet visitor. They already know: - Their workspace URL and orgId (visible in every SDK call) - The warehouses they have grants on (listable via the SDK) - Their catalog contents Interpolated identifiers in INTERNAL_ERROR / IO_ERROR messages (warehouse names, orgIds, internal class names) are not new information to this audience — the classic CWE-209 leak-to-anonymous-attacker scenario does not apply. The cost of denial (every internal error becomes "Query execution failed" with zero signal) is certain; the benefit (defense-in-depth against unbounded wrapped `ex.getMessage` content) is speculative. Allowlisting both, with a requestId on the SSE error frame as the safety net: - `SSEError` now carries `requestId` (the SSE event id, same value the server logs in its `Stream execution failed` line — quoting it in a support ticket lets staff grep logs directly). - `SSEWriter.writeError` populates `requestId` from the event id. - `stream-manager.ts` logs `requestId=...` alongside `rawMsg`, `errorCode`, `upstreamCode` so a user-supplied requestId pinpoints the exact log line. - Analytics `/arrow-result` 410/404 paths generate their own requestId and emit it in both the JSON response and the server log line. - `useAnalyticsQuery` exposes `requestId: string | null` on `UseAnalyticsQueryResult` so UI code can render "Error ref: <id>". Only `UNKNOWN` remains denied — unclassified by definition, so the contents are unbounded in both shape and source. Validation: 2,133 appkit + 292 appkit-ui tests pass; tsc clean. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>

jamesbroadhead requested a review from a team as a code owner May 21, 2026 22:16

jamesbroadhead requested review from MarioCadenas and removed request for a team May 21, 2026 22:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(appkit): allowlist DBSQL error codes whose messages are safe to forward#399

feat(appkit): allowlist DBSQL error codes whose messages are safe to forward#399
jamesbroadhead wants to merge 3 commits into
stack/arrow-3-inline-arrow-fixfrom
stack/arrow-5-error-allowlist

jamesbroadhead commented May 21, 2026

Uh oh!

jamesbroadhead commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jamesbroadhead commented May 21, 2026

Summary

Design

Stack

Test plan

Uh oh!

jamesbroadhead commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant