feat(appkit): allowlist DBSQL error codes whose messages are safe to forward#399
Open
jamesbroadhead wants to merge 3 commits into
Open
feat(appkit): allowlist DBSQL error codes whose messages are safe to forward#399jamesbroadhead wants to merge 3 commits into
jamesbroadhead wants to merge 3 commits into
Conversation
…forward Follow-up to the error-sanitization work on #329. The clientMessage default of "Query execution failed" was applied uniformly to every ExecutionError.statementFailed call, which hid genuinely user-facing errors — the user's own SQL "Table 'foo' not found", "Syntax error near ',' on line 3", etc. — and forced users to debug in the dark. The fix is *not* to remove sanitization (CWE-209 is real — control plane messages carry correlation IDs, internal RPC paths, stack traces), but to distinguish the two sources by `ServiceErrorCode`: 1. **DBR (data plane)** — populates `errorDisplayMessage` for user-authored SQL errors and surfaces them via the gateway untouched. Source: `sqlgateway/scheduler/src/main/scala/DriverRequests.scala:181-208`, `ErrorInfo.withMessage(command.errorDisplayMessage).withErrorReturnedFromDataPlane(true)`. Codes: BAD_REQUEST, NOT_FOUND, ALREADY_EXISTS, DEADLINE_EXCEEDED, CANCELLED, UNAUTHENTICATED. 2. **Control plane** — wraps internal RPC failures, scheduler exceptions, capacity rejections. Source: `sqlgateway/scheduler/src/main/scala/utils/CommandUtils.scala`, `exceptions.scala`. Codes: INTERNAL_ERROR, IO_ERROR, UNKNOWN, RESOURCE_EXHAUSTED, SERVICE_UNDER_MAINTENANCE, TEMPORARILY_UNAVAILABLE, WORKSPACE_TEMPORARILY_UNAVAILABLE, ABORTED. `dbsql-error-allowlist.ts` enumerates source-1 codes. The Set is default-deny: any code not on the list — including future SDK additions the allowlist hasn't yet been reviewed against — collapses to the generic clientMessage. An explicit `clientMessage` argument to `statementFailed()` always wins. Server-side `.message` always preserves the raw upstream text for `logger.error` capture; sanitization is purely at the wire boundary. Validation: 2,129 appkit tests pass (+31 new); tsc clean; biome clean. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Reviewer pushback on the original allowlist: TEMPORARILY_UNAVAILABLE,
RESOURCE_EXHAUSTED, SERVICE_UNDER_MAINTENANCE, WORKSPACE_TEMPORARILY_-
UNAVAILABLE, and ABORTED carry messages that ARE the entire point of
having an error string at the wire — telling users "Query execution
failed" when the actual cause is a quota / maintenance / shard-moving
condition forces them to guess. Allowlisting these.
Verified against universe sources before reclassifying:
- TEMPORARILY_UNAVAILABLE: stable template "DBSQL temporarily
unavailable. Please try again in a few minutes." constructed at
`sqlgateway/scheduler/api/src/main/scala/client/SchedulerRpcClient.scala`
and `client/linkstore/SqlGatewayClerk.scala`.
- RESOURCE_EXHAUSTED: stable user-actionable templates from
`SchedulerRpcValidatorHook.scala`: "The maximum number of warehouses
has been reached. Please contact Databricks support.", "You've hit
the limit for warehouses for free usage. Stop or delete existing
warehouses to free up capacity."
- WORKSPACE_TEMPORARILY_UNAVAILABLE: workspace-level shard-routing
messages at `sqlgateway/proxy/src/main/scala/thrift/Exceptions.scala`
— names a shard ID, which is internal topology metadata but
low-sensitivity and operationally useful for retry.
- ABORTED: short reason strings like "version mismatch" at
`ReydenWarehouseMonitor`.
- SERVICE_UNDER_MAINTENANCE: maintenance-window template messages.
INTERNAL_ERROR, IO_ERROR, UNKNOWN stay denied — these wrap raw
`ex.getMessage` and interpolate orgIds, warehouse IDs, stack traces.
`EndpointModel.scala` emits "SQL ${conf.warehouse} cannot be created
due to repeated collisions on unique IDs." — exactly the kind of
internal-state leak the sanitization exists to prevent.
Reframed the module docs from "DBR vs control plane" to
"designed-for-user vs designed-for-debugging" — the same code can be
constructed at user-facing or debug-facing sites, but the contested
5xx-class codes are overwhelmingly the former in practice.
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Contributor
Author
|
Pushed Newly allowlisted (all verified against universe construction sites):
Still denied — these wrap raw
Also reframed the module-level docs from "DBR vs control plane" to "designed-for-user vs designed-for-debugging" — same construction sites can land in either bucket, and the latter framing matches how the codes are actually used in practice. |
… triage Threat model walkthrough — the recipient of these messages is an authenticated workspace user, not an anonymous internet visitor. They already know: - Their workspace URL and orgId (visible in every SDK call) - The warehouses they have grants on (listable via the SDK) - Their catalog contents Interpolated identifiers in INTERNAL_ERROR / IO_ERROR messages (warehouse names, orgIds, internal class names) are not new information to this audience — the classic CWE-209 leak-to-anonymous-attacker scenario does not apply. The cost of denial (every internal error becomes "Query execution failed" with zero signal) is certain; the benefit (defense-in-depth against unbounded wrapped `ex.getMessage` content) is speculative. Allowlisting both, with a requestId on the SSE error frame as the safety net: - `SSEError` now carries `requestId` (the SSE event id, same value the server logs in its `Stream execution failed` line — quoting it in a support ticket lets staff grep logs directly). - `SSEWriter.writeError` populates `requestId` from the event id. - `stream-manager.ts` logs `requestId=...` alongside `rawMsg`, `errorCode`, `upstreamCode` so a user-supplied requestId pinpoints the exact log line. - Analytics `/arrow-result` 410/404 paths generate their own requestId and emit it in both the JSON response and the server log line. - `useAnalyticsQuery` exposes `requestId: string | null` on `UseAnalyticsQueryResult` so UI code can render "Error ref: <id>". Only `UNKNOWN` remains denied — unclassified by definition, so the contents are unbounded in both shape and source. Validation: 2,133 appkit + 292 appkit-ui tests pass; tsc clean. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to the error-sanitization work in #329. Restores user-facing SQL error messages (e.g. `Table 'foo' not found`, `Syntax error near ','`) without re-introducing the CWE-209 leak from raw control-plane wording.
#329 collapses every `ExecutionError.statementFailed` to a generic "Query execution failed" client message. That was the right safe-default for the sanitization pass, but it loses real user value — anyone with a typo in their SQL now debugs blind, because the error is the user's own SQL artifact and was authored by DBR for them to read.
This PR splits the two error sources by `ServiceErrorCode`:
DBR (data plane) — populates `errorDisplayMessage` for user-authored SQL errors. Source of truth: `sqlgateway/scheduler/src/main/scala/DriverRequests.scala:181-208` — `ErrorInfo.withMessage(command.errorDisplayMessage).withErrorReturnedFromDataPlane(true)`. Codes: `BAD_REQUEST`, `NOT_FOUND`, `ALREADY_EXISTS`, `DEADLINE_EXCEEDED`, `CANCELLED`, `UNAUTHENTICATED`. These messages are the user's own SQL — passthrough is correct.
Control plane (proxy / scheduler / metaservice) — wraps internal RPC failures, scheduler exceptions, capacity rejections. Source: `sqlgateway/scheduler/src/main/scala/utils/CommandUtils.scala`, `exceptions.scala`. Codes: `INTERNAL_ERROR`, `IO_ERROR`, `UNKNOWN`, `RESOURCE_EXHAUSTED`, `SERVICE_UNDER_MAINTENANCE`, `TEMPORARILY_UNAVAILABLE`, `WORKSPACE_TEMPORARILY_UNAVAILABLE`, `ABORTED`. These can carry correlation IDs, RPC paths, stack traces — must not pass through.
Design
Stack
Stacks on #329. Targets `stack/arrow-3-inline-arrow-fix` (not `main`) so it merges as the next layer once the parent lands.
Test plan
This pull request and its description were written by Isaac.