Make secure token storage the default storage mode#5272
Conversation
U2M tokens for the databricks-cli auth type now write to the OS-native keyring by default. Users who need the previous file-backed cache can opt back via DATABRICKS_AUTH_STORAGE=plaintext or auth_storage = plaintext under [__settings__] in .databrickscfg; the env var takes precedence. The login-time keyring probe and fallback (already on main) activate with this change. Co-authored-by: Isaac
When the resolver returns secure from the default (no env, no config), login now writes auth_storage = secure to [__settings__] after the keyring Store succeeds. Subsequent invocations see source=Config, so the explicit-secure branch of applyLoginFallback surfaces a transient keyring probe failure as an error instead of silently demoting the user to plaintext. Without this pin, a working secure-storage user could get stranded on the file cache after a single flaky probe. No-op when mode is plaintext (silent fallback already happened) or when the user already chose a mode explicitly. Persistence failures are logged at debug and never block login. Co-authored-by: Isaac
- login.go: move PinSecureMode call out of the existing comment block so the "At this point... / The rest of the command focuses on" narration stays together - cache.go: trim PinSecureMode doc comment and acknowledge that concurrent logins racing the write is benign because both write the same value - cache_test.go: drop the unused wantSkipMsg struct field; strengthen TestPinSecureMode_PersistFailureIsSwallowed to assert no file was written (and that the underlying os.OpenFile failure is the real trigger) - u2m-secure-default test.toml: rephrase the fixture comment to keep internal Go API names out of test config Co-authored-by: Isaac
After the default flipped to secure, any test that runs the login command on linux (no D-Bus) hits applyLoginFallback, which silently persists auth_storage = plaintext to whatever DATABRICKS_CONFIG_FILE points at. TestProfileHostCompatibleViaCobra points it at the checked- in cmd/auth/testdata/.databrickscfg fixture, so the test run leaves a dirty working tree and CI's `git diff --exit-code` step fails. Two changes: 1. Move ResolveCacheForLogin in login.go to run after input validation (cluster/serverless mutex + positional-arg check) rather than before. Trivially-invalid commands now fail without probing the keyring, so TestLoginRejectsPositionalArgWithHostFlag / WithProfileFlag no longer hit applyLoginFallback. The "resolve before browser step" property the original comment cared about is preserved: cache resolution still happens before NewPersistentAuth and Challenge. 2. Force plaintext via DATABRICKS_AUTH_STORAGE in TestProfileHostCompatibleViaCobra, which legitimately passes all input validation and reaches the resolver. The test is about flag compatibility, not storage mode; pinning it to plaintext keeps it hermetic. Co-authored-by: Isaac
Before: a databricks-cli auth command on a system without a usable OS keyring (Linux without D-Bus, headless containers, certain SSH sessions) surfaced the raw backend error, typically something like "Cannot autolaunch D-Bus without X11 $DISPLAY". The user had no way to tell that switching storage modes would fix it. After: keyringCache.Lookup wraps non-ErrNotFound errors with actionable guidance pointing at DATABRICKS_AUTH_STORAGE=plaintext and `databricks auth login` (which auto-falls back to plaintext when the keyring is unreachable). ErrNotFound passes through unchanged because a clean miss is the "run auth login" signal, not an unreachability problem. Store is left alone: the login Challenge path is already handled by applyLoginFallback with mode-aware messaging, so wrapping Store would duplicate the hint inside "secure storage was requested but the OS keyring is not reachable: ...". Co-authored-by: Isaac
| // Force plaintext so the storage resolver does not probe the OS keyring | ||
| // and silently persist auth_storage = plaintext to the checked-in fixture | ||
| // on CI runners without a usable keyring. | ||
| t.Setenv(storage.EnvVar, "plaintext") |
There was a problem hiding this comment.
The outcome is the same, right?
Or does setting this env var omit the auth_storage field in the cfg?
There was a problem hiding this comment.
The env var did change cfg behavior — without it, applyLoginFallback's silent-downgrade branch would persist auth_storage = plaintext to whatever DATABRICKS_CONFIG_FILE points at, which here was the checked-in fixture (Linux CI: probe fails → silent fallback fires). Refactored in 03c3d02: copy the fixture to t.TempDir() and point DATABRICKS_CONFIG_FILE there, so the test runs the full resolver path with whatever writes happen contained to the temp file. Env override is gone.
| } | ||
| configPath := env.Get(ctx, "DATABRICKS_CONFIG_FILE") | ||
| if err := databrickscfg.SetConfiguredAuthStorage(ctx, string(StorageModeSecure), configPath); err != nil { | ||
| log.Debugf(ctx, "pin secure mode: persist: %v", err) |
There was a problem hiding this comment.
Same as function on L164 except debug logging. Shouldn't both log at debug?
There was a problem hiding this comment.
Good catch. Done in 03c3d02 — persistPlaintextFallback now also logs internally at debug and returns void. Both functions have the same shape now; applyLoginFallback's call site drops the err check.
| if mode != StorageModeSecure { | ||
| return | ||
| } | ||
| _, source, err := ResolveStorageModeWithSource(ctx, "") |
There was a problem hiding this comment.
Not blocking: empty string for a typed value. This should say something like StorageModeUnspecified or something to make it clear we're not applying an override here.
There was a problem hiding this comment.
Done in 03c3d02 — switched to StorageModeUnknown, which is the existing 'no mode selected' zero value (see its doc comment). Happy to rename it to StorageModeUnspecified across the codebase as a follow-up if that reads better; left out here to keep the diff focused.
|
|
||
| ### Notable Changes | ||
|
|
||
| * Breaking change: U2M (`auth_type = databricks-cli`) tokens are now stored in the OS-native secure store by default (Keychain on macOS, Credential Manager on Windows, Secret Service on Linux) instead of `~/.databricks/token-cache.json`. After upgrading, run `databricks auth login` once per profile to re-authenticate; cached tokens from older versions are not migrated. To keep the previous file-backed storage, set `DATABRICKS_AUTH_STORAGE=plaintext` or add `auth_storage = plaintext` under `[__settings__]` in `~/.databrickscfg` (the env var takes precedence over the config setting), then re-run `databricks auth login`. |
There was a problem hiding this comment.
u2m --> "OAuth tokens for interactive logins (auth_type = databricks-cli)"...
There was a problem hiding this comment.
Done in 03c3d02 — replaced with 'OAuth tokens for interactive logins (auth_type = databricks-cli)'.
Before: the SDK's cache.ErrNotFound rendered as "cache: token not found" at the CLI surface. Pre-flip, plaintext users who never ran auth login or were missing a profile saw that message. Post-flip, every user with an existing plaintext token sees it too until they re-login, since the secure resolver does not read token-cache.json. After: ResolveCache and ResolveCacheForLogin wrap their result in a notFoundHintCache. When Lookup returns ErrNotFound: - If mode=secure and ~/.databricks/token-cache.json exists with entries, the error reads "stored credentials from older CLI versions are no longer used; run `databricks auth login` to sign in again, or set DATABRICKS_AUTH_STORAGE=plaintext to keep using the file cache". This is the upgrade scenario from the CLI Secure Storage Troubleshooting doc. - Otherwise the error reads "no cached credentials; run `databricks auth login` to sign in". A custom notFoundHint type backs both messages so they replace the SDK's "token not found" string rather than getting appended to it; the SDK's outer "cache: %w" wrap then produces clean output like "cache: no cached credentials; run `databricks auth login` to sign in" instead of "cache: <hint>: token not found". errors.Is(err, cache.ErrNotFound) continues to return true via the Unwrap method, so the SDK's existing branches on ErrNotFound still fire. The wrap lives in the exported ResolveCache / ResolveCacheForLogin, not in the internal resolveCacheWith / resolveCacheForLoginWith test helpers, so existing unit tests that type-assert on the inner cache type still work. Co-authored-by: Isaac
1. NEXT_CHANGELOG: drop "U2M" jargon in favor of "OAuth tokens for interactive logins (auth_type = databricks-cli)". 2. TestProfileHostCompatibleViaCobra: copy the testdata fixture into a t.TempDir() and point DATABRICKS_CONFIG_FILE there, instead of forcing DATABRICKS_AUTH_STORAGE=plaintext to suppress the silent plaintext-fallback write. The temp-file approach is more hermetic and removes the question Pieter raised about whether the env var alters cfg behavior. The storage import in auth_test.go is dropped. 3. persistPlaintextFallback: log internally at debug instead of returning the error for the caller to log. Same shape as PinSecureMode, which was already doing this. Call site in applyLoginFallback drops the err check. 4. PinSecureMode: pass StorageModeUnknown (the named "no override selected" constant) instead of "" to ResolveStorageModeWithSource. Co-authored-by: Isaac
Mirrors the login-path fallback for the read path. When the resolver returns (mode=Secure, source=Default) and the keyring probes as definitively unreachable, ResolveCache now returns the file cache instead of the keyring. This means users upgrading on systems without a usable keyring (e.g. Linux containers without a D-Bus session bus) keep reading from their pre-upgrade token-cache.json without any manual configuration. Two behaviours intentionally differ from the login-path fallback: 1. The probe is read-only (ProbeKeyringRead does a Get on a non-existent account name, treating keyring.ErrNotFound as "reachable"). The login-path probe still uses the write+delete cycle because login needs to validate write capability before committing tokens. 2. The read-path fallback does NOT persist auth_storage = plaintext to [__settings__]. A read failure could be transient (e.g. a momentarily broken D-Bus connection) and we don't have strong enough evidence to pin the user out of the keyring permanently. Pinning stays the responsibility of the login command, which has the stronger write-probe signal and an explicit user action. Timeouts in the read probe stay on the keyring (same logic as the login path): a locked keyring being unlocked is the most common timeout case, and a misdiagnosed hang fails the actual Lookup, which is preferable to silently routing reads to a different backend. Explicit-secure callers (env var or config) get the keyring cache regardless of probe result, so an explicit "I want secure" request surfaces the unreachability instead of being silently downgraded. Co-authored-by: Isaac
Two findings: 1. The persist failure was logged at debug, which silently weakens the stated pin-on-success guarantee: login succeeds, the keyring write succeeds, but auth_storage = secure is never persisted. A later default-secure read with a transient keyring probe failure would then fall back to plaintext, undoing the protection the pin was supposed to provide. Promote to log.Warnf so the user sees the failure during login and can investigate (file permissions, etc.). Login is still not blocked: pinning is best-effort. 2. PinSecureMode re-resolved the storage mode with StorageModeUnknown, so a caller-supplied override was invisible to the source check. The function's doc said "No-op when the user already chose a mode explicitly", but only env and config were detected; an override secure mode would still pin auth_storage = secure to config, silently turning an ephemeral per-invocation choice into a persistent one. No caller passes a non-empty override today, so the bug was dormant, but fix it before someone wires up a flag and notices the behavior. PinSecureMode now takes an override StorageMode parameter, and the three call sites (main login, discoveryLogin, runInlineLogin) pass storage.StorageModeUnknown explicitly. New table-driven case covers the override path. Co-authored-by: Isaac
|
Codex review on a15feae. Addressed in a01ac02. 1) Default-secure acceptance tests not unsetting the env var — false positive. Both 2) PinSecureMode persistence failures at debug — fixed. Promoted to 3) PinSecureMode re-resolving with StorageModeUnknown loses the override — fixed. Added |
token.go's loadToken rewrites cache.ErrNotFound to the constant "cache: databricks OAuth is not configured for this host" so older SDK versions (pre-v0.77.0) can substring-match the error and fall through to other auth providers. That rewrite, written before the secure-storage default, silently threw away the notFoundHintCache message a default-secure user would see post-upgrade: instead of the actionable "stored credentials from older CLI versions are no longer used; run auth login or set DATABRICKS_AUTH_STORAGE=plaintext" copy, they got the generic "Try logging in again ... If this fails, please report this issue" trailer, which steers expected upgrade behavior toward bug reports. Add storage.HintForNotFound to extract the notFoundHint message from an error chain, plus storage.NewNotFoundHint as a test factory. token.go now keeps the SDK-compat substring AND appends the hint when present, and skips the helpfulError trailer in that case. Co-authored-by: Isaac
The previous commit (Surface notFoundHint message in auth token error path) changed loadToken to drop the "Try logging in again with ... before retrying. If this fails, please report this issue ..." trailer when a notFoundHint is attached, and surface the hint instead. The acceptance output.txt fixtures for the cmd/auth code paths that produce ErrNotFound were not regenerated in that commit, so all task test jobs failed on a stale-fixture diff. Regenerated via ./task test-update for: - cmd/auth/token - cmd/auth/token/default-profile - cmd/auth/token/force-refresh-no-cache - cmd/auth/logout/stale-account-id-workspace-host Co-authored-by: Isaac
Why
Part of CLI GA. Storing long-lived OAuth refresh tokens for interactive logins (
auth_type = databricks-cli) in a plain JSON file in the user's home directory is a security weakness: any process with home-directory access can read them. The CLI is increasingly the entry point for local agent workflows, so we want tokens in the OS-native secure store by default.The flip has to handle one big "but": not every system has a usable OS keyring. Linux containers, headless SSH sessions, WSL1, and some CI runners do not have a D-Bus session bus. On those systems the keyring is not merely empty, it is not reachable at all. If we shipped the default flip alone, every command on those systems would error out with a cryptic backend message ("Cannot autolaunch D-Bus without X11 $DISPLAY", etc.) until the user manually set
DATABRICKS_AUTH_STORAGE=plaintextor re-randatabricks auth login. That is a real support burden for the GA window.This PR ships the default flip together with the supporting UX. Three pieces, each scoped narrowly:
token-cache.jsonentries stay accessible without manual configuration. This does not fire when the keyring is reachable but empty — that is the normal post-upgrade case, where we still surface a clear "rundatabricks auth loginto sign in" nudge so the user moves their tokens into the secure store rather than silently keeping them in plaintext.ErrNotFoundmessages that tell the user what to do, with upgrade-specific copy when a legacytoken-cache.jsonis present.Behavior matrix
The distinction between "keyring not available" and "keyring available but empty" drives most of the design:
ErrNotFoundwrapped with "no cached credentials; rundatabricks auth loginto sign in" (or upgrade copy iftoken-cache.jsonexists). User re-authenticates and writes the new token to the keyring.Lookupsurfaces the unreachability rather than being silently downgraded against the user's stated intent.Changes
Before: tokens were written to
~/.databricks/token-cache.json. SettingDATABRICKS_AUTH_STORAGE=secureopted in to the OS keyring.Now:
databricks auth loginonce after upgrade.DATABRICKS_AUTH_STORAGE=plaintextor[__settings__].auth_storage = plaintextopts back to the file cache. Env wins over config.auth_storage = secureto[__settings__]. Pins the choice so a later transient probe failure cannot silently demote the user.Geton a non-existent account. If the backend is unreachable on the default-secure path, the file cache is returned instead. The probe and fall-back are scoped strictly to backend-unavailability, not to empty results. Read-path fallback does NOT pin; pinning stays exclusive to login, which has the stronger write-probe signal and an explicit user action.ErrNotFoundfromLookupis wrapped with actionable copy: generic case "no cached credentials; rundatabricks auth loginto sign in"; upgrade case (mode=secure AND~/.databricks/token-cache.jsonhas entries) "stored credentials from older CLI versions are no longer used; rundatabricks auth loginto sign in again, or setDATABRICKS_AUTH_STORAGE=plaintextto keep using the file cache".ErrNotFoundkeyring errors get wrapped with the same actionable hint so users on no-keyring systems who somehow bypass the probe (e.g. explicit-secure callers) see "OS keyring unreachable: ... (setDATABRICKS_AUTH_STORAGE=plaintextor rundatabricks auth login)" instead of a raw D-Bus message.mainas dormant infrastructure) activates and pins.Implementation:
libs/auth/storage/mode.go: resolver default flips fromStorageModePlaintexttoStorageModeSecure. Constant doc comments updated.libs/auth/storage/cache.go: drops "dormant today" comments. NewPinSecureMode(login-side pin) andapplyReadFallback(read-side fallback).cacheFactoriesgainsprobeKeyringRead.persistPlaintextFallbacknow logs internally at debug for shape-consistency withPinSecureMode.libs/auth/storage/keyring.go: newProbeKeyringRead(read-only probe).Lookupwraps non-ErrNotFounderrors with the unreachability hint.libs/auth/storage/not_found_hint.go(new):notFoundHintCachewrapsResolveCache/ResolveCacheForLoginsoErrNotFoundfromLookupcarries an actionable hint without getting sandwiched between the SDK'scache:prefix andErrNotFound's tail.cmd/auth/login.go,cmd/auth/token.go: callstorage.PinSecureModeafter eachpersistentAuth.Challenge().login.goalso movesResolveCacheForLoginto run after input validation so trivially-invalid commands no longer probe the keyring.PinSecureModecases,applyReadFallbackcases,ProbeKeyringRead,notFoundHintCache,legacyCacheHasTokens).acceptance/script.prepareforcesDATABRICKS_AUTH_STORAGE=plaintextat the root so existing auth acceptance tests keep exercising the file-backed path. Tests that want the resolver default override it.acceptance/cmd/auth/describe/u2m-plaintext-defaultrenamed tou2m-secure-default; itstest.tomladds a[[Repls]]regex normalizing the platform-dependent keyring lookup error.acceptance/cmd/auth/describe/u2m-json-output,u2m-plaintext-env,u2m-plaintext-config: regenerated to match the new error copy.cmd/auth/auth_test.go:TestProfileHostCompatibleViaCobracopies the fixture into a temp directory so the resolver's writes can never dirty the checked-in file.NEXT_CHANGELOG.md: breaking-change entry under Notable Changes covering the flip, the re-login requirement, both opt-out paths, and the read-path fallback for systems without a usable keyring.Test plan
task checkscleantask lint-qcleango test ./libs/auth/... ./cmd/auth/... ./libs/databrickscfg/...passesgo test ./acceptance -run 'TestAccept/cmd/auth'passes on macOSgo test ./acceptance -run 'TestAccept/cmd/configure'passes (covers adatabricks-cliauth path outsidecmd/auth)[[Repls]]regex inu2m-secure-default/test.toml(macOS clean miss vs. Linux backend error).DATABRICKS_AUTH_STORAGEunset,databricks auth login --profile Xwrites to the keyring and persistsauth_storage = secureto[__settings__].DATABRICKS_AUTH_STORAGE=plaintext databricks auth login --profile Xcontinues to write to~/.databricks/token-cache.jsonwith the host-key dual-write entry;[__settings__]is not modified.databricks auth loginto sign in" nudge (not a silent fall-back).token-cache.jsonkeeps working.This pull request and its description were written by Isaac.