Context
feat/keycloak-jwt-validation (currently a pushed branch awaiting PR) makes the SmartEM Decisions backend require an Authorization: Bearer <jwt> header on every non-exempt endpoint, validated offline against the configured Keycloak realm's JWKS. The SmartEM frontend handles this via the standard OIDC auth-code + PKCE flow, gated by silent-check-sso.html so it doesn't redirect-loop on third-party-cookie blocking.
The agent doesn't fit that flow. It runs unattended on Windows EPU workstations, ingests EPU filesystem output, and POSTs to the backend over HTTP/REST. There is no browser, no human, no way to do an interactive consent. Once the backend is auth-required in staging/production, the existing agent immediately breaks unless we give it a service-to-service auth path.
We need to pick that path now so the rollout from the JWT PR doesn't get blocked when it reaches an environment that has agents pointing at it.
Options
| Option | What it is | Pros | Cons |
| --- | --- | --- | --- |
| Client credentials grant (recommended) | Dedicated Keycloak client (e.g. SmartEM-agent) with serviceAccountsEnabled: true and a client secret. Agent POSTs to /protocol/openid-connect/token with grant_type=client_credentials, gets a JWT, uses it. | Standard OAuth2 service-to-service pattern. Zero backend changes - the existing verify_token already accepts any RS256 token from the realm. Same JWKS rotation story as user tokens. Easy to revoke or rotate. | Shared secret needs distribution to agent operators. One secret per agent population is the simple version; per-agent secrets would mean realm churn. |
| JWT client authentication | Same shape as above but the agent authenticates to Keycloak with a signed JWT (private key on the agent) instead of a shared secret. | Stronger than shared secret. No secret in transit. Per-agent keys are natural. | Agent has to manage a private key. Setup overhead. Likely the right next step after client credentials is shipped. |
| mTLS | Mutual TLS terminated at the ingress; backend trusts the cert subject. | Strongest. No token mechanics inside the agent. | k3s/ingress config, cert lifecycle, doesn't compose with our JWT-validation code path - we'd be running two parallel auth systems. |
| Static API key | Backend accepts a long-lived secret in a custom header. | Trivial to implement on both ends. | Reinvents auth. No rotation story. Parallel auth path means more code, more attack surface. |
| Exempt agent endpoints | Add /agents/... and friends to EXEMPT_PATHS; rely on network policies. | No code. | "Internal-only" tends not to stay that way. Hard to enforce on shared k8s clusters. Loses any per-agent attribution. |
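To make the client credentials row concrete, here is a minimal sketch of the token request the agent would build. It uses only the standard library and does not send anything. The build_token_request helper is hypothetical, and the URL layout (/realms/<realm>/protocol/openid-connect/token with no legacy /auth prefix) is an assumption about how the Keycloak instance is deployed.

```python
import urllib.parse
import urllib.request


def build_token_request(base_url: str, realm: str,
                        client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client_credentials token request.

    Hypothetical helper; assumes a Keycloak URL layout without the
    legacy /auth prefix.
    """
    url = (
        f"{base_url.rstrip('/')}/realms/{urllib.parse.quote(realm)}"
        "/protocol/openid-connect/token"
    )
    # The token endpoint expects a form-encoded body, not JSON.
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Sending the request (for example with urllib.request.urlopen) should return a JSON body whose access_token and expires_in fields are what the agent would cache.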
Recommendation
Client credentials, with JWT client auth as a planned future hardening once the basic flow is in production.
Rationale:
- Reuses the JWT validation already on the backend; no parallel auth path.
- Tokens carry azp (authorized party) = the agent client ID, so the backend can later split agent vs user permissions without changing the auth mechanism.
- Shared secret is good enough for first ship: agents are deployed by DLS infra into trusted hosts; the secret never leaves a controlled deployment.
- JWT client auth is a drop-in upgrade later - same flow, different client_authenticator_type in Keycloak.
Concrete work (sketch, for a future implementation issue)
Keycloak realm config (both mock and DLS realms):
- Add a SmartEM-agent client with publicClient: false, serviceAccountsEnabled: true, directAccessGrantsEnabled: false, standardFlowEnabled: false, clientAuthenticatorType: "client-secret".
- Optional: a smartem-agent realm role assigned to the client's service account, so later we can authorize "only agents can write data" if we want it.
- For the mock at smartem-devtools/keycloak-mock/dls-realm.json: hard-code a dev secret like dev-agent-secret so the agent can self-configure from .env.local.
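As a rough illustration of the client entry described above, the sketch below builds the JSON fragment that would sit in the realm file's clients array. The flags are the ones listed in the bullets; clientId, enabled, and secret are standard Keycloak client representation fields added here as an assumption about what the final entry needs. build_agent_client is a hypothetical helper, and dev-agent-secret is the example dev value from the text, not a real credential.

```python
import json


def build_agent_client(secret: str) -> dict:
    """Return an illustrative Keycloak client representation for the agent."""
    return {
        "clientId": "SmartEM-agent",
        "enabled": True,
        "publicClient": False,               # confidential client
        "serviceAccountsEnabled": True,      # allows the client_credentials grant
        "directAccessGrantsEnabled": False,  # no password grant
        "standardFlowEnabled": False,        # no browser auth-code flow
        "clientAuthenticatorType": "client-secret",
        "secret": secret,
    }


if __name__ == "__main__":
    # Print the fragment for pasting into a realm export (mock realm only).
    print(json.dumps(build_agent_client("dev-agent-secret"), indent=2))
```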
Backend (smartem-decisions/src/smartem_backend/auth.py):
- Already accepts service-account tokens as-is.
- One small hardening: optional KEYCLOAK_ALLOWED_AZP env var (comma-separated). When set, verify_token checks the azp claim is in the allowlist (e.g. SmartEM,SmartEM-agent). Default empty -> no check, current behaviour.
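The allowlist hardening could look roughly like the sketch below. It is not the existing verify_token code: parse_allowed_azp and azp_allowed are hypothetical helpers, and the claims mapping is assumed to be the already-verified token payload.

```python
import os
from typing import Mapping, Optional


def parse_allowed_azp(raw: Optional[str]) -> frozenset:
    """Split a comma-separated value into a set; unset or empty -> empty set."""
    if not raw:
        return frozenset()
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def azp_allowed(claims: Mapping[str, object], allowed: frozenset) -> bool:
    """Return True if the token's azp claim passes the allowlist.

    An empty allowlist disables the check, matching current behaviour.
    """
    if not allowed:
        return True
    azp = claims.get("azp")
    # A missing or non-string azp is rejected once the allowlist is active.
    return isinstance(azp, str) and azp in allowed


if __name__ == "__main__":
    allowed = parse_allowed_azp(os.environ.get("KEYCLOAK_ALLOWED_AZP"))
    print(azp_allowed({"azp": "SmartEM-agent"}, allowed))
```

Rejecting tokens with a missing azp once the allowlist is set is a design choice in this sketch; the alternative (treat missing as allowed) would weaken the check.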
Agent (smartem-decisions/src/smartem_agent/...):
- A KeycloakClient class that reads KEYCLOAK_URL, KEYCLOAK_REALM, AGENT_CLIENT_ID, AGENT_CLIENT_SECRET from env/config, calls the token endpoint with grant_type=client_credentials, caches the access token, and refreshes when within ~30s of exp (no refresh-token flow for client_credentials - just re-request).
- The existing requests-based HTTP layer in the agent gets an auth callable that injects Authorization: Bearer <token> on every request and, on a 401, forces one token refresh + retry.
- agent.exe config schema gains the four keys above (KEYCLOAK_URL, KEYCLOAK_REALM, AGENT_CLIENT_ID, AGENT_CLIENT_SECRET).
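The caching and retry behaviour described for the agent can be sketched independently of the HTTP library. In the sketch below, TokenCache and send_with_auth are hypothetical names, fetch stands in for the POST to the token endpoint, and send stands in for one request through the agent's existing HTTP layer; the real KeycloakClient would wire these to requests.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# fetch() stands in for the client_credentials token request; it returns
# (access_token, expires_in_seconds).
FetchToken = Callable[[], Tuple[str, float]]


class TokenCache:
    """Caches an access token and re-requests it shortly before expiry."""

    def __init__(self, fetch: FetchToken, skew: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._skew = skew      # re-request when within this many seconds of expiry
        self._clock = clock    # injectable so the logic can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self, force: bool = False) -> str:
        # No refresh-token flow for client_credentials: just ask again.
        expiring = self._clock() >= self._expires_at - self._skew
        if force or self._token is None or expiring:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = self._clock() + expires_in
        return self._token


def send_with_auth(send: Callable[[Dict[str, str]], int], tokens: TokenCache) -> int:
    """Call send(headers) with a bearer token; on 401, refresh once and retry."""
    status = send({"Authorization": f"Bearer {tokens.get()}"})
    if status == 401:
        # Exactly one forced refresh and retry, to avoid looping on a bad secret.
        status = send({"Authorization": f"Bearer {tokens.get(force=True)}"})
    return status
```

Keeping the clock and the fetch function injectable means the expiry and retry paths can be unit-tested without a running Keycloak.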
Open questions for whoever picks this up
- One shared SmartEM-agent client for all agent instances, or one client per workstation? Shared is simpler for v1; per-workstation gives finer attribution and revocation but requires realm management tooling.
- Do we want azp enforcement on the backend from day one, or ship without it and add later? Adding later is non-breaking.
- Should the agent fall back to "no auth" when AGENT_CLIENT_SECRET is unset (for local dev parity with the current behaviour), or hard-fail at startup? Hard-fail is safer; fall-back is more ergonomic.
Out of scope for this issue
- Choosing between users having read-only vs full access; agents having write access vs full access. That's an authorization (RBAC) question, separate from authentication.
- mTLS at the ingress - documented above as an alternative, not pursued.