Skip to content

Make secure token storage the default storage mode#5272

Merged
simonfaltum merged 15 commits into
mainfrom
simonfaltum/cli-ga-ms4-secure-default
May 20, 2026
Merged

Make secure token storage the default storage mode#5272
simonfaltum merged 15 commits into
mainfrom
simonfaltum/cli-ga-ms4-secure-default

Conversation

@simonfaltum
Copy link
Copy Markdown
Member

@simonfaltum simonfaltum commented May 19, 2026

Why

Part of CLI GA. Storing long-lived OAuth refresh tokens for interactive logins (auth_type = databricks-cli) in a plain JSON file in the user's home directory is a security weakness: any process with home-directory access can read them. The CLI is increasingly the entry point for local agent workflows, so we want tokens in the OS-native secure store by default.

The flip has to handle one big "but": not every system has a usable OS keyring. Linux containers, headless SSH sessions, WSL1, and some CI runners do not have a D-Bus session bus. On those systems the keyring is not merely empty, it is not reachable at all. If we shipped the default flip alone, every command on those systems would error out with a cryptic backend message ("Cannot autolaunch D-Bus without X11 $DISPLAY", etc.) until the user manually set DATABRICKS_AUTH_STORAGE=plaintext or re-ran databricks auth login. That is a real support burden for the GA window.

This PR ships the default flip together with the supporting UX. Three pieces, each scoped narrowly:

  1. Pin-on-success after a successful login on the default-secure path, so a later transient keyring failure cannot silently demote a working secure-storage user back to plaintext.
  2. Read-path fallback for unavailable keyrings only. When the keyring backend itself is unreachable (no D-Bus, no Secret Service daemon, etc.), reads fall through to the file cache so pre-upgrade token-cache.json entries stay accessible without manual configuration. This does not fire when the keyring is reachable but empty — that is the normal post-upgrade case, where we still surface a clear "run databricks auth login to sign in" nudge so the user moves their tokens into the secure store rather than silently keeping them in plaintext.
  3. Friendlier ErrNotFound messages that tell the user what to do, with upgrade-specific copy when a legacy token-cache.json is present.

Behavior matrix

The distinction between "keyring not available" and "keyring available but empty" drives most of the design:

Scenario Read-path behavior
Keyring reachable, has token for profile Return token from keyring.
Keyring reachable, no token for profile ErrNotFound wrapped with "no cached credentials; run databricks auth login to sign in" (or upgrade copy if token-cache.json exists). User re-authenticates and writes the new token to the keyring.
Keyring NOT reachable, default-secure mode Silent fall-back to file cache. Pre-upgrade tokens keep working. No nudge, no error.
Keyring NOT reachable, user explicitly chose secure (env / config) Return the keyring cache anyway. The actual Lookup surfaces the unreachability rather than being silently downgraded against the user's stated intent.
Keyring probe times out Stay on the keyring. A locked keyring being unlocked is the common timeout case; misdiagnosing it as "unavailable" would silently route reads to a different backend.

Changes

Before: tokens were written to ~/.databricks/token-cache.json. Setting DATABRICKS_AUTH_STORAGE=secure opted in to the OS keyring.

Now:

  • Default storage is the OS keyring (Keychain / Credential Manager / Secret Service). Users re-run databricks auth login once after upgrade.
  • DATABRICKS_AUTH_STORAGE=plaintext or [__settings__].auth_storage = plaintext opts back to the file cache. Env wins over config.
  • After a successful login on the default-secure path, the CLI writes auth_storage = secure to [__settings__]. Pins the choice so a later transient probe failure cannot silently demote the user.
  • Read paths cheap-probe the keyring with a read-only Get on a non-existent account. If the backend is unreachable on the default-secure path, the file cache is returned instead. The probe and fall-back are scoped strictly to backend-unavailability, not to empty results. Read-path fallback does NOT pin; pinning stays exclusive to login, which has the stronger write-probe signal and an explicit user action.
  • ErrNotFound from Lookup is wrapped with actionable copy: generic case "no cached credentials; run databricks auth login to sign in"; upgrade case (mode=secure AND ~/.databricks/token-cache.json has entries) "stored credentials from older CLI versions are no longer used; run databricks auth login to sign in again, or set DATABRICKS_AUTH_STORAGE=plaintext to keep using the file cache".
  • Non-ErrNotFound keyring errors get wrapped with the same actionable hint so users on no-keyring systems who somehow bypass the probe (e.g. explicit-secure callers) see "OS keyring unreachable: ... (set DATABRICKS_AUTH_STORAGE=plaintext or run databricks auth login)" instead of a raw D-Bus message.
  • Login-time silent fallback (already on main as dormant infrastructure) activates and pins.

Implementation:

  • libs/auth/storage/mode.go: resolver default flips from StorageModePlaintext to StorageModeSecure. Constant doc comments updated.
  • libs/auth/storage/cache.go: drops "dormant today" comments. New PinSecureMode (login-side pin) and applyReadFallback (read-side fallback). cacheFactories gains probeKeyringRead. persistPlaintextFallback now logs internally at debug for shape-consistency with PinSecureMode.
  • libs/auth/storage/keyring.go: new ProbeKeyringRead (read-only probe). Lookup wraps non-ErrNotFound errors with the unreachability hint.
  • libs/auth/storage/not_found_hint.go (new): notFoundHintCache wraps ResolveCache / ResolveCacheForLogin so ErrNotFound from Lookup carries an actionable hint without getting sandwiched between the SDK's cache: prefix and ErrNotFound's tail.
  • cmd/auth/login.go, cmd/auth/token.go: call storage.PinSecureMode after each persistentAuth.Challenge(). login.go also moves ResolveCacheForLogin to run after input validation so trivially-invalid commands no longer probe the keyring.
  • Unit tests cover all of the above (PinSecureMode cases, applyReadFallback cases, ProbeKeyringRead, notFoundHintCache, legacyCacheHasTokens).
  • acceptance/script.prepare forces DATABRICKS_AUTH_STORAGE=plaintext at the root so existing auth acceptance tests keep exercising the file-backed path. Tests that want the resolver default override it.
  • acceptance/cmd/auth/describe/u2m-plaintext-default renamed to u2m-secure-default; its test.toml adds a [[Repls]] regex normalizing the platform-dependent keyring lookup error.
  • acceptance/cmd/auth/describe/u2m-json-output, u2m-plaintext-env, u2m-plaintext-config: regenerated to match the new error copy.
  • cmd/auth/auth_test.go: TestProfileHostCompatibleViaCobra copies the fixture into a temp directory so the resolver's writes can never dirty the checked-in file.
  • NEXT_CHANGELOG.md: breaking-change entry under Notable Changes covering the flip, the re-login requirement, both opt-out paths, and the read-path fallback for systems without a usable keyring.

Test plan

  • task checks clean
  • task lint-q clean
  • go test ./libs/auth/... ./cmd/auth/... ./libs/databrickscfg/... passes
  • go test ./acceptance -run 'TestAccept/cmd/auth' passes on macOS
  • go test ./acceptance -run 'TestAccept/cmd/configure' passes (covers a databricks-cli auth path outside cmd/auth)
  • Linux CI is the real test for the [[Repls]] regex in u2m-secure-default/test.toml (macOS clean miss vs. Linux backend error).
  • Manual: with DATABRICKS_AUTH_STORAGE unset, databricks auth login --profile X writes to the keyring and persists auth_storage = secure to [__settings__].
  • Manual: DATABRICKS_AUTH_STORAGE=plaintext databricks auth login --profile X continues to write to ~/.databricks/token-cache.json with the host-key dual-write entry; [__settings__] is not modified.
  • Manual: keyring reachable but empty for the current profile, an auth command produces the "run databricks auth login to sign in" nudge (not a silent fall-back).
  • Manual: keyring NOT reachable (Linux container, headless SSH), an auth command silently uses the file cache; a populated pre-upgrade token-cache.json keeps working.

This pull request and its description were written by Isaac.

U2M tokens for the databricks-cli auth type now write to the OS-native
keyring by default. Users who need the previous file-backed cache can
opt back via DATABRICKS_AUTH_STORAGE=plaintext or auth_storage =
plaintext under [__settings__] in .databrickscfg; the env var takes
precedence. The login-time keyring probe and fallback (already on main)
activate with this change.

Co-authored-by: Isaac
When the resolver returns secure from the default (no env, no config),
login now writes auth_storage = secure to [__settings__] after the
keyring Store succeeds. Subsequent invocations see source=Config, so the
explicit-secure branch of applyLoginFallback surfaces a transient keyring
probe failure as an error instead of silently demoting the user to
plaintext. Without this pin, a working secure-storage user could get
stranded on the file cache after a single flaky probe.

No-op when mode is plaintext (silent fallback already happened) or when
the user already chose a mode explicitly. Persistence failures are
logged at debug and never block login.

Co-authored-by: Isaac
- login.go: move PinSecureMode call out of the existing comment block so
  the "At this point... / The rest of the command focuses on" narration
  stays together
- cache.go: trim PinSecureMode doc comment and acknowledge that
  concurrent logins racing the write is benign because both write the
  same value
- cache_test.go: drop the unused wantSkipMsg struct field; strengthen
  TestPinSecureMode_PersistFailureIsSwallowed to assert no file was
  written (and that the underlying os.OpenFile failure is the real
  trigger)
- u2m-secure-default test.toml: rephrase the fixture comment to keep
  internal Go API names out of test config

Co-authored-by: Isaac
After the default flipped to secure, any test that runs the login
command on linux (no D-Bus) hits applyLoginFallback, which silently
persists auth_storage = plaintext to whatever DATABRICKS_CONFIG_FILE
points at. TestProfileHostCompatibleViaCobra points it at the checked-
in cmd/auth/testdata/.databrickscfg fixture, so the test run leaves a
dirty working tree and CI's `git diff --exit-code` step fails.

Two changes:

1. Move ResolveCacheForLogin in login.go to run after input validation
   (cluster/serverless mutex + positional-arg check) rather than before.
   Trivially-invalid commands now fail without probing the keyring, so
   TestLoginRejectsPositionalArgWithHostFlag / WithProfileFlag no
   longer hit applyLoginFallback. The "resolve before browser step"
   property the original comment cared about is preserved: cache
   resolution still happens before NewPersistentAuth and Challenge.

2. Force plaintext via DATABRICKS_AUTH_STORAGE in
   TestProfileHostCompatibleViaCobra, which legitimately passes all
   input validation and reaches the resolver. The test is about flag
   compatibility, not storage mode; pinning it to plaintext keeps it
   hermetic.

Co-authored-by: Isaac
Before: a databricks-cli auth command on a system without a usable OS
keyring (Linux without D-Bus, headless containers, certain SSH sessions)
surfaced the raw backend error, typically something like "Cannot
autolaunch D-Bus without X11 $DISPLAY". The user had no way to tell
that switching storage modes would fix it.

After: keyringCache.Lookup wraps non-ErrNotFound errors with actionable
guidance pointing at DATABRICKS_AUTH_STORAGE=plaintext and
`databricks auth login` (which auto-falls back to plaintext when the
keyring is unreachable).

ErrNotFound passes through unchanged because a clean miss is the "run
auth login" signal, not an unreachability problem. Store is left alone:
the login Challenge path is already handled by applyLoginFallback with
mode-aware messaging, so wrapping Store would duplicate the hint inside
"secure storage was requested but the OS keyring is not reachable: ...".

Co-authored-by: Isaac
Comment thread cmd/auth/auth_test.go Outdated
// Force plaintext so the storage resolver does not probe the OS keyring
// and silently persist auth_storage = plaintext to the checked-in fixture
// on CI runners without a usable keyring.
t.Setenv(storage.EnvVar, "plaintext")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outcome is the same, right?

Or does setting this env var omit the auth_storage field in the cfg?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The env var did change cfg behavior — without it, applyLoginFallback's silent-downgrade branch would persist auth_storage = plaintext to whatever DATABRICKS_CONFIG_FILE points at, which here was the checked-in fixture (Linux CI: probe fails → silent fallback fires). Refactored in 03c3d02: copy the fixture to t.TempDir() and point DATABRICKS_CONFIG_FILE there, so the test runs the full resolver path with whatever writes happen contained to the temp file. Env override is gone.

Comment thread libs/auth/storage/cache.go Outdated
}
configPath := env.Get(ctx, "DATABRICKS_CONFIG_FILE")
if err := databrickscfg.SetConfiguredAuthStorage(ctx, string(StorageModeSecure), configPath); err != nil {
log.Debugf(ctx, "pin secure mode: persist: %v", err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as function on L164 except debug logging. Shouldn't both log at debug?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done in 03c3d02 — persistPlaintextFallback now also logs internally at debug and returns void. Both functions have the same shape now; applyLoginFallback's call site drops the err check.

Comment thread libs/auth/storage/cache.go Outdated
if mode != StorageModeSecure {
return
}
_, source, err := ResolveStorageModeWithSource(ctx, "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking: empty string for a typed value. This should say something like StorageModeUnspecified or something to make it clear we're not applying an override here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 03c3d02 — switched to StorageModeUnknown, which is the existing 'no mode selected' zero value (see its doc comment). Happy to rename it to StorageModeUnspecified across the codebase as a follow-up if that reads better; left out here to keep the diff focused.

Comment thread NEXT_CHANGELOG.md Outdated

### Notable Changes

* Breaking change: U2M (`auth_type = databricks-cli`) tokens are now stored in the OS-native secure store by default (Keychain on macOS, Credential Manager on Windows, Secret Service on Linux) instead of `~/.databricks/token-cache.json`. After upgrading, run `databricks auth login` once per profile to re-authenticate; cached tokens from older versions are not migrated. To keep the previous file-backed storage, set `DATABRICKS_AUTH_STORAGE=plaintext` or add `auth_storage = plaintext` under `[__settings__]` in `~/.databrickscfg` (the env var takes precedence over the config setting), then re-run `databricks auth login`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u2m --> "OAuth tokens for interactive logins (auth_type = databricks-cli)"...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 03c3d02 — replaced with 'OAuth tokens for interactive logins (auth_type = databricks-cli)'.

Before: the SDK's cache.ErrNotFound rendered as "cache: token not found"
at the CLI surface. Pre-flip, plaintext users who never ran auth login
or were missing a profile saw that message. Post-flip, every user with
an existing plaintext token sees it too until they re-login, since the
secure resolver does not read token-cache.json.

After: ResolveCache and ResolveCacheForLogin wrap their result in a
notFoundHintCache. When Lookup returns ErrNotFound:

- If mode=secure and ~/.databricks/token-cache.json exists with entries,
  the error reads "stored credentials from older CLI versions are no
  longer used; run `databricks auth login` to sign in again, or set
  DATABRICKS_AUTH_STORAGE=plaintext to keep using the file cache". This
  is the upgrade scenario from the CLI Secure Storage Troubleshooting
  doc.
- Otherwise the error reads "no cached credentials; run
  `databricks auth login` to sign in".

A custom notFoundHint type backs both messages so they replace the
SDK's "token not found" string rather than getting appended to it; the
SDK's outer "cache: %w" wrap then produces clean output like
"cache: no cached credentials; run `databricks auth login` to sign in"
instead of "cache: <hint>: token not found".

errors.Is(err, cache.ErrNotFound) continues to return true via the
Unwrap method, so the SDK's existing branches on ErrNotFound still fire.

The wrap lives in the exported ResolveCache / ResolveCacheForLogin,
not in the internal resolveCacheWith / resolveCacheForLoginWith test
helpers, so existing unit tests that type-assert on the inner cache
type still work.

Co-authored-by: Isaac
1. NEXT_CHANGELOG: drop "U2M" jargon in favor of "OAuth tokens for
   interactive logins (auth_type = databricks-cli)".

2. TestProfileHostCompatibleViaCobra: copy the testdata fixture into a
   t.TempDir() and point DATABRICKS_CONFIG_FILE there, instead of
   forcing DATABRICKS_AUTH_STORAGE=plaintext to suppress the silent
   plaintext-fallback write. The temp-file approach is more hermetic
   and removes the question Pieter raised about whether the env var
   alters cfg behavior. The storage import in auth_test.go is dropped.

3. persistPlaintextFallback: log internally at debug instead of returning
   the error for the caller to log. Same shape as PinSecureMode, which
   was already doing this. Call site in applyLoginFallback drops the
   err check.

4. PinSecureMode: pass StorageModeUnknown (the named "no override
   selected" constant) instead of "" to ResolveStorageModeWithSource.

Co-authored-by: Isaac
Mirrors the login-path fallback for the read path. When the resolver
returns (mode=Secure, source=Default) and the keyring probes as
definitively unreachable, ResolveCache now returns the file cache
instead of the keyring. This means users upgrading on systems without
a usable keyring (e.g. Linux containers without a D-Bus session bus)
keep reading from their pre-upgrade token-cache.json without any
manual configuration.

Two behaviours intentionally differ from the login-path fallback:

1. The probe is read-only (ProbeKeyringRead does a Get on a non-existent
   account name, treating keyring.ErrNotFound as "reachable"). The
   login-path probe still uses the write+delete cycle because login
   needs to validate write capability before committing tokens.

2. The read-path fallback does NOT persist auth_storage = plaintext to
   [__settings__]. A read failure could be transient (e.g. a momentarily
   broken D-Bus connection) and we don't have strong enough evidence to
   pin the user out of the keyring permanently. Pinning stays the
   responsibility of the login command, which has the stronger
   write-probe signal and an explicit user action.

Timeouts in the read probe stay on the keyring (same logic as the login
path): a locked keyring being unlocked is the most common timeout case,
and a misdiagnosed hang fails the actual Lookup, which is preferable to
silently routing reads to a different backend.

Explicit-secure callers (env var or config) get the keyring cache
regardless of probe result, so an explicit "I want secure" request
surfaces the unreachability instead of being silently downgraded.

Co-authored-by: Isaac
Two findings:

1. The persist failure was logged at debug, which silently weakens the
   stated pin-on-success guarantee: login succeeds, the keyring write
   succeeds, but auth_storage = secure is never persisted. A later
   default-secure read with a transient keyring probe failure would
   then fall back to plaintext, undoing the protection the pin was
   supposed to provide. Promote to log.Warnf so the user sees the
   failure during login and can investigate (file permissions, etc.).
   Login is still not blocked: pinning is best-effort.

2. PinSecureMode re-resolved the storage mode with StorageModeUnknown,
   so a caller-supplied override was invisible to the source check.
   The function's doc said "No-op when the user already chose a mode
   explicitly", but only env and config were detected; an override
   secure mode would still pin auth_storage = secure to config,
   silently turning an ephemeral per-invocation choice into a
   persistent one. No caller passes a non-empty override today, so the
   bug was dormant, but fix it before someone wires up a flag and
   notices the behavior.

   PinSecureMode now takes an override StorageMode parameter, and the
   three call sites (main login, discoveryLogin, runInlineLogin) pass
   storage.StorageModeUnknown explicitly. New table-driven case covers
   the override path.

Co-authored-by: Isaac
@simonfaltum simonfaltum temporarily deployed to test-trigger-is May 20, 2026 11:10 — with GitHub Actions Inactive
@simonfaltum
Copy link
Copy Markdown
Member Author

Codex review on a15feae. Addressed in a01ac02.

1) Default-secure acceptance tests not unsetting the env var — false positive. Both u2m-secure-default/script and u2m-json-output/script already do unset DATABRICKS_AUTH_STORAGE on line 6 (the script files weren't in the diff because they were unchanged or only renamed). Local acceptance run confirms: both tests pass and their output asserts (from default).

2) PinSecureMode persistence failures at debug — fixed. Promoted to log.Warnf with copy that names the config path and tells the user what to set if the pin didn't take. Login still doesn't block on a pin failure (best-effort), but the user sees it. Kept persistPlaintextFallback at debug since the user is already in a degraded fallback state there.

3) PinSecureMode re-resolving with StorageModeUnknown loses the override — fixed. Added override StorageMode parameter, plumbed through the three call sites (main login, discoveryLogin, runInlineLogin), all pass storage.StorageModeUnknown today. New table-driven case covers the override path. No caller has a --secure-storage flag yet, but the API is now correct for when one is added.

@simonfaltum simonfaltum temporarily deployed to test-trigger-is May 20, 2026 11:10 — with GitHub Actions Inactive
token.go's loadToken rewrites cache.ErrNotFound to the constant
"cache: databricks OAuth is not configured for this host" so older SDK
versions (pre-v0.77.0) can substring-match the error and fall through
to other auth providers. That rewrite, written before the secure-storage
default, silently threw away the notFoundHintCache message a
default-secure user would see post-upgrade: instead of the actionable
"stored credentials from older CLI versions are no longer used; run
auth login or set DATABRICKS_AUTH_STORAGE=plaintext" copy, they got the
generic "Try logging in again ... If this fails, please report this
issue" trailer, which steers expected upgrade behavior toward bug
reports.

Add storage.HintForNotFound to extract the notFoundHint message from
an error chain, plus storage.NewNotFoundHint as a test factory.
token.go now keeps the SDK-compat substring AND appends the hint when
present, and skips the helpfulError trailer in that case.

Co-authored-by: Isaac
The previous commit (Surface notFoundHint message in auth token error
path) changed loadToken to drop the "Try logging in again with ...
before retrying. If this fails, please report this issue ..." trailer
when a notFoundHint is attached, and surface the hint instead. The
acceptance output.txt fixtures for the cmd/auth code paths that produce
ErrNotFound were not regenerated in that commit, so all task test jobs
failed on a stale-fixture diff.

Regenerated via ./task test-update for:
- cmd/auth/token
- cmd/auth/token/default-profile
- cmd/auth/token/force-refresh-no-cache
- cmd/auth/logout/stale-account-id-workspace-host

Co-authored-by: Isaac
@simonfaltum simonfaltum temporarily deployed to test-trigger-is May 20, 2026 13:23 — with GitHub Actions Inactive
@simonfaltum simonfaltum temporarily deployed to test-trigger-is May 20, 2026 13:23 — with GitHub Actions Inactive
@simonfaltum simonfaltum merged commit 3204e67 into main May 20, 2026
22 of 23 checks passed
@simonfaltum simonfaltum deleted the simonfaltum/cli-ga-ms4-secure-default branch May 20, 2026 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants