feat(deployment): swap Keycloak for Authelia + lldap + registration-service#6419
Draft
theosanderson-agent wants to merge 31 commits into
First slice of the Keycloak→Authelia migration: lldap deployment+service, bootstrap configmap (groups + users including the test accounts) and bootstrap Job that idempotently creates users via lldap GraphQL. Authelia configmap with the OIDC clients (backend-client for the website, loculus-cli for device-code CLI flow). Still missing: Authelia deployment+service, secrets, ingress, registration service, values.yaml/schema changes, removal of Keycloak templates, and all the backend/website/CLI/integration-test changes downstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add Authelia deployment+service, lldap deployment+service+bootstrap, registration-service deployment+service (gated on auth.bundledLdap.enabled).
- Drop Keycloak templates and the Keycloak DB standin.
- _urls.tpl: replace loculus.keycloakUrl with loculus.autheliaUrl and a new loculus.registrationUrl.
- _config-processor.tpl: substitute lldapAdminPassword, autheliaSessionSecret, storageEncryptionKey, jwtSecret, oidcHmacSecret, oidcIssuerPrivateKey.
- _common-metadata.tpl: publish autheliaUrl/registrationUrl in runtime config; drop Keycloak-flavoured banner condition.
- loculus-backend.yaml: switch JWT issuer/jwk-set-uri to Authelia; replace --keycloak.* args with --loculus.ldap.* (host, base/user/group DN, bind).
- loculus-website-config.yaml: serverSide now exposes autheliaUrl + registrationUrl; drop backend-keycloak client secret.
- ingressroute.yaml: route the authentication subdomain to authelia and add a register.<host> rule for the registration-service.
- values.yaml/values.schema.json: drop keycloak/orcid secrets, add lldap-secrets and authelia-secrets, add bundledLdap+ldap blocks, swap resources.keycloak for authelia/lldap/registration-service.
- bulk URL substitution across silo/ingest/autoapprove/ena-submission/preprocessing configs (token endpoint now Authelia; the YAML key name "keycloak_token_url" still ships to those services - real rename comes in the Python/Kotlin work below).
Helm lint + template both succeed. None of the downstream code yet knows how to handle the new env shape; tests will fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New registration-service/ directory: FastAPI app, Jinja form template, CSS, Dockerfile, requirements.txt, README. Calls lldap's GraphQL admin API through the privileged admin login.
- New registration-service-image.yml workflow modelled on preprocessing-dummy-image.yml.
- Delete /keycloak/keycloakify/ tree and its two CI workflows.
- update-argocd-metadata.yml: wait on the registration-service image instead of the keycloakify image.
- build-arm-images.yaml + dependabot.yml: swap keycloakify references for registration-service.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename utils/KeycloakClientManager.ts → OidcClientManager.ts and derive the issuer URL from serverSide.autheliaUrl instead of composing keycloakUrl + realmPath.
- realmPath collapsed to empty string (Authelia has no realms; discovery lives at /.well-known/openid-configuration).
- types/runtimeConfig.ts: drop backendKeycloakClientSecret, replace keycloakUrl with autheliaUrl + autheliaPublicUrl + registrationUrl in the serverSide config.
- clientMetadata.ts: drop the secret (Authelia client is public with PKCE; no token_endpoint client auth).
- getAuthUrl.ts: ask for scopes Authelia issues (openid profile email groups offline_access); account/profile page now points at the Authelia portal root.
- api-documentation page: label switches to Authelia.
- loculus-info: hosts.authelia replaces hosts.keycloak.
- middleware/authMiddleware.ts: rename log strings + function name.
- vitest.setup.ts: matching test config shape.
`astro check` and `tsc --noEmit` both green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Delete service/KeycloakAdapter.kt and the keycloak-admin-client gradle dependency.
- New service/UserDirectory.kt: Spring Data LDAP-based replacement bound to loculus.ldap.* properties (host, port, base/user/group DN, bind credentials). Exposes getUsersWithName returning a fresh LoculusUser domain type (username, email, firstName, lastName, organization).
- SeqSetCitationsController and GroupManagementPreconditionValidator now consume UserDirectory; transformUserToAuthorProfile reads from LoculusUser instead of UserRepresentation.
- SecurityConfig: read roles from the JWT `groups` claim (Authelia) instead of `realm_access.roles` (Keycloak).
- FilesController: switch HttpStatus.SC_TEMPORARY_REDIRECT (Apache HTTP, previously pulled in transitively) to Spring's HttpStatus. Apache httpclient kept as testImplementation for the existing Files endpoint integration tests.
- application.properties test config: loculus.ldap.* placeholders replace keycloak.*.
- Test files: KeycloakAdapter / UserRepresentation refactored to UserDirectory / LoculusUser.
- gradle.lockfile regenerated.
`./gradlew compileKotlin compileTestKotlin` is green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
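The change of claim shape in SecurityConfig can be illustrated with a short sketch. This is Python rather than the backend's Kotlin, and the function names are hypothetical; only the two claim paths (`groups` and `realm_access.roles`) come from the commit message.

```python
from typing import Any


def roles_from_authelia_claims(claims: dict[str, Any]) -> list[str]:
    """Read roles from a top-level `groups` list (the Authelia shape)."""
    groups = claims.get("groups", [])
    # A missing or malformed claim yields no roles rather than an error.
    if not isinstance(groups, list):
        return []
    return [g for g in groups if isinstance(g, str)]


def roles_from_keycloak_claims(claims: dict[str, Any]) -> list[str]:
    """Read roles from `realm_access.roles` (the previous Keycloak shape)."""
    realm_access = claims.get("realm_access", {})
    if not isinstance(realm_access, dict):
        return []
    roles = realm_access.get("roles", [])
    if not isinstance(roles, list):
        return []
    return [r for r in roles if isinstance(r, str)]
```

A token minted in the old shape carries no `groups` claim, so it would resolve to an empty role list under the new reader; test helpers that build tokens have to emit `groups` as well.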
Authelia's OIDC does not support the Resource Owner Password Credentials (password) grant, so the CLI's previous username+password login no longer works. This rewires it onto the device authorization grant (RFC 8628):
- auth/client.py: device_authorization_endpoint discovery + polling loop. Tokens still cached in the system keyring keyed by the authentication base URL and the JWT subject (preferred_username if present). Refresh tokens used opportunistically.
- commands/auth.py: drop --username / --password flags; `loculus auth login` is now a single interactive command that prints/opens the verification URL and waits for the user to complete sign-in.
- config.py: drop keycloak_realm / keycloak_client_id; add oidc_client_id (defaults to "loculus-cli"); expose authelia_url property in place of keycloak_url.
- commands/instance.py: matching argument rename.
- instance_info.py: hosts.authelia is the new key, served by the website's /loculus-info endpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
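For reviewers unfamiliar with RFC 8628, the polling loop roughly follows the shape below. This is a minimal sketch, not the code in auth/client.py: `poll_for_token`, the `PostForm` transport type, and the injected `sleep` are hypothetical names chosen so the loop can be exercised without a network. The grant type string and the `authorization_pending` / `slow_down` handling come from the RFC.

```python
import time
from typing import Any, Callable

# Hypothetical transport: takes (url, form_fields), returns (http_status, json_body).
PostForm = Callable[[str, dict[str, str]], tuple[int, dict[str, Any]]]

DEVICE_GRANT = "urn:ietf:params:oauth:grant-type:device_code"


def poll_for_token(
    post: PostForm,
    token_endpoint: str,
    client_id: str,
    device_code: str,
    interval: float = 5.0,
    expires_in: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll the token endpoint until the user finishes sign-in or the code expires."""
    waited = 0.0
    while waited < expires_in:
        sleep(interval)
        waited += interval
        status, body = post(
            token_endpoint,
            {
                "grant_type": DEVICE_GRANT,
                "device_code": device_code,
                "client_id": client_id,
            },
        )
        if status == 200:
            return body  # contains access_token (and possibly refresh_token)
        error = body.get("error")
        if error == "authorization_pending":
            continue  # user has not completed sign-in yet
        if error == "slow_down":
            interval += 5.0  # RFC 8628: back off by 5 seconds
            continue
        raise RuntimeError(f"device authorization failed: {error}")
    raise TimeoutError("device code expired before sign-in completed")
```

Passing `post` and `sleep` as parameters is a testing convenience in this sketch; a real client would wrap its HTTP library of choice behind the same two-argument call.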
…n-service
- auth.page.ts: target the registration-service form by data-testid (username, email, first-name, last-name, organization, password, confirm-password, accept-terms, register-submit). Login expects the Authelia username/password form; error message regex accepts Authelia text variants.
- my-account.page.ts + edit-account.spec.ts: rename Keycloak-specific helper, drop /realms/loculus assertion (Authelia hosts its portal at the auth root), and accept the new page title.
- backend/authentication.spec.ts: rename a test that mentioned Keycloak by name.
These match the shape of the new Helm/code; tests still need the cluster to be deployed and the registration link in the Authelia template wired to the registration service before they can pass end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- 05/07 PlantUML sources updated (SVGs are pre-rendered and will refresh when next regenerated).
- 05_building_block_view: log-in references swapped to Authelia.
- 09_architecture_decisions: new section explains the rationale for replacing Keycloak with Authelia + lldap + registration-service, the BYO-LDAP mode, and the move from ROPC to device-code in the CLI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR may be related to the following open issues (by replacing Keycloak entirely):
- authelia-deployment: switch from --config arg to AUTHELIA_CONFIG env; authelia/authelia image's s6-overlay entrypoint refuses leading "--" args.
- lldap bootstrap script: login body uses `username` (lldap 0.6 schema) not `name`. Tolerate non-JSON error responses so 4xx with a text/plain body reports the real reason instead of a JSONDecodeError stack trace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
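The error-tolerance fix in the bootstrap script amounts to something like the following sketch. The helper names `login_body` and `describe_error` are hypothetical; only the `username` field name and the "don't crash on a text/plain body" behaviour come from the commit message.

```python
import json


def login_body(username: str, password: str) -> dict[str, str]:
    # The login payload is keyed by `username` (not `name`), per the commit above.
    return {"username": username, "password": password}


def describe_error(status: int, body: str) -> str:
    """Turn an HTTP error response into a readable message, whatever its content type."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        # text/plain or empty body: report it verbatim instead of raising.
        detail = body.strip() or "<empty body>"
    else:
        detail = json.dumps(parsed)
    return f"HTTP {status}: {detail}"
```

With this shape, a `401` carrying the plain-text body `Unauthorized` is reported as `HTTP 401: Unauthorized` rather than as a parse failure.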
Authelia's OIDC provider can't inject fixed `groups` claims into
client_credentials tokens (its claims_policies anchor to authentication-
backend attributes, which are empty for machine clients), so the previous
ROPC-password flow used by preprocessing/ingest/ena-submission services
can't be replaced 1:1 with an OIDC equivalent.
This commit takes a simpler path: a static pre-shared header.
- backend/auth/ServiceTokenAuthenticationFilter.kt: new
`OncePerRequestFilter` reads `X-Service-Token`, matches against four
configured tokens (preprocessing_pipeline / external_metadata_updater /
insdc_ingest_user / backend), and on a match sets a
ServiceTokenAuthentication with the right SimpleGrantedAuthority set.
Wired into the security chain ahead of the JWT resource server.
- backend/auth/AuthenticatedUser.kt: now constructed from either a
JwtAuthenticationToken or a ServiceTokenAuthentication. The
HandlerMethodArgumentResolver delegates to whichever is present.
- backend/config/SecurityConfig.kt: addFilterBefore for the new filter.
- loculus-backend.yaml: pipe the four service-accounts secret values
into LOCULUS_SERVICE_TOKENS_* env vars (Spring relaxed-binding maps
to loculus.service-tokens.<key>).
- preprocessing-nextclade/backend.py: replace get_jwt() with
auth_headers() that returns {"X-Service-Token": <token>}.
- preprocessing-dummy/main.py: get_jwt() now returns the token directly
(kept as a one-line shim for callsites); new service_token_header().
- ingest/scripts/loculus_client.py: drop JWT fetch + Bearer header,
send X-Service-Token directly.
- ena-deposition/call_loculus.py: same swap.
Backend compileKotlin + compileTestKotlin green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
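The header check described above can be sketched as follows. This is a Python illustration, not the Kotlin filter: the function names are hypothetical, the header lookup assumes an exact-case key in a plain mapping, and the constant-time comparison via `hmac.compare_digest` is a choice made for this sketch that the commit message does not mention. The four role names are the ones listed in the commit.

```python
import hmac
from typing import Mapping, Optional

SERVICE_ROLES = (
    "preprocessing_pipeline",
    "external_metadata_updater",
    "insdc_ingest_user",
    "backend",
)


def role_for_service_token(
    headers: Mapping[str, str],
    configured_tokens: Mapping[str, str],
) -> Optional[str]:
    """Return the role whose configured token matches X-Service-Token, else None."""
    presented = headers.get("X-Service-Token")
    if not presented:
        return None  # no header: fall through to JWT authentication
    matched = None
    for role in SERVICE_ROLES:
        token = configured_tokens.get(role, "")
        # Skip unconfigured (empty) tokens so an empty header can never match.
        if token and hmac.compare_digest(presented.encode(), token.encode()):
            matched = role
    return matched


def auth_headers(token: str) -> dict[str, str]:
    """Client side: what the pipeline services send instead of a Bearer JWT."""
    return {"X-Service-Token": token}
```

A request without the header returns `None`, which corresponds to the filter leaving the request for the JWT resource server further down the chain.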
- backend: ktlint formatted ServiceTokenAuthenticationFilter, SecurityConfig, AuthenticatedUser, FilesController, UserDirectory, AuthorsEndpointsTest.
- backend: RequestAuthorization (test helper) now emits the `groups` claim instead of `realm_access.roles` so unit tests match the new SecurityConfig.getRoles shape.
- cli: black-format auth/client.py; type the response dict so mypy is happy.
- website: getAuthBaseUrl / getUrlForAccountPage are now sync (drop unused async); GET in loculus-info no longer async; callers updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Inline the OIDC issuer RSA key directly from values.yaml into the configmap (via `index $.Values.secrets ... | nindent 14`) so the config-processor doesn't have to substitute a multi-line PEM at inconsistent indentation. Drops the `[[autheliaOidcIssuerPrivateKey]]` substitution from _config-processor.tpl and the unused secret file mount from authelia-deployment.yaml.
- Cookie domain strips port from $.Values.host so values like "localhost:3000" become "localhost".
- authelia_url in local mode now uses the same host the website is on (rather than the localHost IP), so the cookie scope check is at least closer to satisfied.
Known remaining blocker for local k3d e2e: Authelia 4.39 requires the cookie domain to contain a period (or be an IP) AND requires the authelia_url to use https://. A "localhost"-flavoured dev cluster can't satisfy both. We'll need to either switch the dev host to a periodful domain like "loculus.test" plus self-signed TLS through traefik, or pin to a more permissive Authelia version. Flagged in PR description; CI deploy will hit the same wall.
CLI lint:
- src/loculus_cli/auth/client.py: ruff E501 fixes; mypy now uses dict[str, Any] for the OIDC token response shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… OIDC metadata
Three tied changes that together get Authelia happy:
1. dev host moves from "localhost:3000" to "loculus.localhost:3000" with subdomainSeparator "." so the cookie domain is "loculus.localhost", a valid period-bearing domain per Authelia 4.39's strict check. Browsers resolve *.localhost to 127.0.0.1 automatically, so no /etc/hosts edits are needed.
2. deploy.py maps host port 8443 → traefik 443, so the website can be accessed at "http://loculus.localhost:3000" but Authelia is exposed via "https://authentication.loculus.localhost:8443". k3d's traefik already ships with a default self-signed cert (the `k3s-serving` secret in kube-system), satisfying the https-scheme check on `authelia_url`.
3. OidcClientManager (website) no longer calls `Issuer.discover`. The discovery doc 500s when the server-side request hits the in-cluster Authelia service directly (Authelia can't infer the issuer URL without X-Forwarded-Proto headers, which only traefik sets). We construct the Issuer from a fixed metadata table: internal endpoints for backend communication, public endpoints (the 8443 URL) for redirects.
Authelia container Running 1/1 after the redeploy with no restarts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
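The "fixed metadata table" in point 3 might look roughly like this. It is a sketch in Python rather than the website's TypeScript, and several details are assumptions: the function name, the choice of the public URL as `issuer`, and the endpoint paths (`/api/oidc/...`, `/jwks.json`), which should be checked against the Authelia version in use. Only the internal/public split comes from the commit message.

```python
def build_issuer_metadata(internal_base: str, public_base: str) -> dict[str, str]:
    """Hand-built OIDC provider metadata, replacing a discovery request."""
    internal = internal_base.rstrip("/")
    public = public_base.rstrip("/")
    return {
        # Assumption: tokens are issued under the public URL.
        "issuer": public,
        # The browser is redirected here, so it must be publicly reachable.
        "authorization_endpoint": f"{public}/api/oidc/authorization",
        # Server-to-server calls stay inside the cluster.
        "token_endpoint": f"{internal}/api/oidc/token",
        "userinfo_endpoint": f"{internal}/api/oidc/userinfo",
        "jwks_uri": f"{internal}/jwks.json",
    }
```

For example, `build_issuer_metadata("http://authelia.internal:9091", "https://authentication.loculus.localhost:8443")` keeps the authorize redirect on the 8443 URL while the token exchange goes to the in-cluster address; the internal hostname here is made up for illustration.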
`helm lint` (without a values file) leaves $.Values.host undefined, so
splitList errored with "wrong type for value; expected string; got
interface {}". Default to "" before splitting; rendering with a real
values file is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
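In Python terms, the template logic after this fix behaves like the sketch below, assuming the split in question is the port-stripping one used for the cookie domain. The function name is hypothetical; the real logic lives in the Helm template.

```python
from typing import Optional


def cookie_domain(host: Optional[str]) -> str:
    """Strip any :port suffix; treat an undefined host as the empty string."""
    # Defaulting to "" first mirrors the fix: splitting an undefined value
    # is what made `helm lint` fail when no values file was supplied.
    return (host or "").split(":")[0]
```

So `cookie_domain("localhost:3000")` gives `"localhost"`, and `cookie_domain(None)` gives `""` instead of raising.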
# Conflicts:
# kubernetes/loculus/templates/keycloak-database-service.yaml
# kubernetes/loculus/templates/keycloak-database-standin.yaml
- preprocessing/nextclade: ruff dropped the now-unused `jwt` import.
- website/OidcClientManager.ts: `getClient` kept async for callsite
compatibility (callers `await` it); silence the new
`@typescript-eslint/require-await` for that one method. Drop the
`error as unknown as string` cast — use `String(error)`. Prettier-rerun
trimmed the redundant per-line `naming-convention` disables.
- registration-service: cap email at 254 chars and tighten the regex to
single-char classes ({1,N} bounds) before matching. Removes a CodeQL
"polynomial regex on user-controlled data" finding.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
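The registration-service email check could take a shape like the one below. The commit does not show the actual pattern, so `EMAIL_RE` and its specific bounds are hypothetical; what the sketch keeps from the commit is the 254-character cap applied before matching and the use of single-character classes with explicit `{1,N}` bounds.

```python
import re

MAX_EMAIL_LENGTH = 254

# Hypothetical pattern: every repetition is a bounded single-character class,
# so matching cost stays bounded even on adversarial input.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]{1,64}@[A-Za-z0-9.-]{1,185}\.[A-Za-z]{2,24}")


def is_plausible_email(value: str) -> bool:
    """Cheap syntactic check; deliverability is not verified."""
    if len(value) > MAX_EMAIL_LENGTH:
        return False  # reject before the regex ever sees oversized input
    return EMAIL_RE.fullmatch(value) is not None
```

Checking the length first means the regex only ever runs on short strings, which is what bounds the worst case independently of the pattern itself.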
The new ServiceTokenAuthenticationFilter runs after Spring's CsrfFilter, so an X-Service-Token POST from the preprocessing/ingest services was rejected with 403 (MissingCsrfTokenException) before the request ever reached my auth filter. Loculus is a JWT/header-authenticated API — every request is authenticated from scratch, no session cookie — so CSRF protection has no purchase and disabling it matches the security model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…us tests CodeQL flagged the bare `csrf.disable()` as a security regression. The right idiom for a stateless API is to also set the session creation policy to STATELESS, which lets the security chain skip session creation and CSRF enforcement together. The disable is annotated with a codeql suppression comment justifying it (header-based bearer/token auth, no session cookies). Knock-on test fixes: the modifying-request branch of expectUnauthorizedResponse asserted 403, because the CSRF filter used to return Forbidden for POSTs lacking a CSRF token before the OAuth2 resource server got a chance to respond. With CSRF disabled, those same requests now correctly return 401 with the Bearer challenge — which is what the test names already claimed they expected. Helper collapsed to a single 401 path. RequestUploadEndpointTest's "request without authentication" likewise rewritten to assert 401. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSRF protection is intentionally disabled in SecurityConfig (stateless, header-authenticated API). Inline `// codeql[...]` suppression comments aren't honored by GitHub's CodeQL action, so exclude the rule via a config-file referenced from the analyze workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Dev host moves to loculus.test (rather than .localhost) so cluster pods can route Authelia traffic via CoreDNS to traefik; glibc hardcodes *.localhost to 127.0.0.1 inside containers, blocking that path.
- Playwright host-resolver-rules maps *.loculus.test → 127.0.0.1 so developers/CI don't need /etc/hosts entries on the runner.
- Authelia config: a single fixed /auth/callback redirect_uri replaces the previous wildcard list, since Authelia requires exact matches.
- Website OIDC client: generate proper PKCE (code_verifier + code_challenge S256), encode the return URL plus the verifier into the state parameter so the callback handler can resume the navigation and finish the token exchange. Strip query string from the callback's redirect_uri before token exchange (must match the original).
- Custom http_options on the openid-client adds X-Forwarded-* headers so Authelia derives the correct issuer URL on the internal token call.
- lldap bootstrap rewired to use lldap's own /app/bootstrap.sh as a postStart hook (OPAQUE password setting works there); custom Python job dropped. Service-account passwords lengthened to ≥ 8 chars to clear lldap's password validation.
- CoreDNS NodeHosts patched to route *.loculus.test to the traefik ClusterIP so SSR-side calls to authentication.loculus.test reach Authelia via traefik's TLS termination.
- ingressroute.yaml: authentication.loculus.test and register.loculus.test ingresses now render in local mode too (were previously gated on environment=server).
- deploy.py: host-port 8443 → traefik:443.
- integration-tests/auth.page.ts: handle Authelia's OIDC consent screen (best-effort click of the "Accept" button); pre-seed the readonly fixture user in lldap so login works without OPAQUE-from-Python.
Local progress: Authelia 1/1, full website→Authelia OAuth flow now completes the authorize step. Token exchange, consent dismissal and the "Welcome to Loculus" post-login check still need debugging, but the foundation is in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
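The PKCE and state handling from the website bullet can be sketched as follows, in Python rather than the website's TypeScript. The S256 derivation follows RFC 7636; the function names and the JSON field names inside the state payload (`returnUrl`, `codeVerifier`) are hypothetical.

```python
import base64
import hashlib
import json
import secrets


def _b64url(data: bytes) -> str:
    """URL-safe base64 without padding, as PKCE requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def new_code_verifier() -> str:
    # 32 random bytes encode to 43 characters, the minimum RFC 7636 length.
    return _b64url(secrets.token_bytes(32))


def code_challenge_s256(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier)))."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


def encode_state(return_url: str, verifier: str) -> str:
    """Pack the return URL and verifier so the callback can resume the flow."""
    payload = json.dumps({"returnUrl": return_url, "codeVerifier": verifier})
    return _b64url(payload.encode("utf-8"))


def decode_state(state: str) -> dict[str, str]:
    padded = state + "=" * (-len(state) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def strip_query(url: str) -> str:
    """redirect_uri sent at token exchange must equal the registered one exactly."""
    return url.split("?", 1)[0]
```

At the callback, `decode_state` recovers the verifier for the token request and the URL to navigate back to, and `strip_query` turns the incoming callback URL (which carries `code` and `state`) back into the bare `/auth/callback` value that was sent on the authorize request.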
…site
Authelia 4.39 rejects http:// redirect_uris for public clients ("http is
only allowed for confidential clients or hosts with suffix 'localhost'"),
and our dev/CI runs the website on http://loculus.test:3000. Switch the
backend-client to confidential:
- authelia-configmap.yaml: backend-client is now `public: false` with a
PBKDF2-SHA512 PHC-hashed client_secret (substituted from
authelia-secrets/backendClientSecretHash), require_pkce: true, S256.
token_endpoint_auth_method: client_secret_basic.
- _config-processor.tpl: pipe backendClientSecretHash AND the matching
plaintext (backendClientSecretPlain) through the substitution layer.
The hash goes into Authelia's configuration.yml; the plaintext goes
into the website's runtime_config.json so openid-client can include it
in the token exchange.
- values.yaml: authelia-secrets gains both the plaintext and the hash
for the dev secret "loculus-dev-client-secret". Operators rotate both
together in production.
- loculus-website-config.yaml: serverSide.oidcClientSecret added; the
config-processor injects the plaintext.
- website/types/runtimeConfig.ts: oidcClientSecret required in
serverConfig.
- website/utils/clientMetadata.ts: token_endpoint_auth_method is now
client_secret_basic; client_secret comes from runtime config.
- vitest.setup.ts: matching dummy.
Local readonly setup still flaky — the cluster keeps a stale Authelia
JWKS-issuer error and a 404 console message that aren't yet root-caused.
Commits incremental until those are unblocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status
Work in progress — exploratory branch. The Helm chart, backend code, website code, CLI, registration service, and integration-test page objects have all been rewired against Authelia + lldap, but the cluster has not yet been deployed end-to-end and tests have not been run green. Pushing now so the architecture is reviewable.
What changed
Deployment (kubernetes/)
- New templates: `authelia-deployment/service/configmap`, `lldap-deployment/service`, `lldap-bootstrap-configmap` (Helm-templated user/group lists) + `lldap-bootstrap-job` (idempotent Job that creates users via lldap's GraphQL API), `registration-service-deployment/service`.
- Removed: `keycloak-*` templates and the `keycloak/keycloakify/` source tree.
- `_urls.tpl`: `loculus.autheliaUrl` + `loculus.autheliaUrlInternal` + `loculus.registrationUrl` replace `loculus.keycloakUrl`.
- `_config-processor.tpl`: substitutes `lldapAdminPassword`, `autheliaSessionSecret`, `autheliaStorageEncryptionKey`, `autheliaJwtSecret`, `autheliaOidcHmacSecret`, `autheliaOidcIssuerPrivateKey`.
- `values.yaml` / `values.schema.json`: drop `keycloak-*` / `orcid` secrets, add `lldap-secrets` + `authelia-secrets`, new `auth.bundledLdap.enabled` toggle and `auth.ldap.*` block (host, port, base/user/group DN, user filter, bind DN). New `images.registrationService`, `resources.{authelia,lldap,registration-service}`. Removed `runDevelopmentKeycloakDatabase` and `resources.keycloak`.
- `ingressroute.yaml`: `authentication.<host>` now routes to Authelia; new `register.<host>` rule routes to the registration service in bundled mode.
- Bulk URL substitution across `silo/ingest/autoapprove/ena-submission/preprocessing` configs so they point at Authelia's token endpoint. (The YAML key name `keycloak_token_url` is still shipped to those Python services; that rename is follow-up work.)
- `helm lint` + `helm template` both succeed.
Backend (backend/)
- `KeycloakAdapter.kt` removed. New `UserDirectory.kt` is a Spring Data LDAP component bound to `loculus.ldap.*` properties; exposes `getUsersWithName(username)` returning a fresh `LoculusUser` domain type (username, email, firstName, lastName, organization).
- `SecurityConfig.getRoles` now reads from the JWT `groups` claim (Authelia) instead of `realm_access.roles` (Keycloak).
- `SeqSetCitationsController` + `GroupManagementPreconditionValidator` switched to `UserDirectory`. `transformKeycloakUserToAuthorProfile` → `transformUserToAuthorProfile` reading `LoculusUser`.
- `FilesController`: switched `org.apache.http.HttpStatus` (previously pulled in transitively by keycloak-admin-client) to Spring's `HttpStatus`. Apache `httpclient` kept as a `testImplementation` for the existing Files endpoint integration tests.
- `build.gradle`: dropped `org.keycloak:keycloak-admin-client`, added `spring-boot-starter-data-ldap`; lock file regenerated.
- `./gradlew compileKotlin compileTestKotlin` is green.
Website (website/)
- `KeycloakClientManager` → `OidcClientManager`; issuer URL now comes from `serverSide.autheliaUrl` directly (no realm path).
- `runtimeConfig.ts`: `serverSide` schema now has `autheliaUrl`, `autheliaPublicUrl`, `registrationUrl`; `backendKeycloakClientSecret` removed.
- `getAuthUrl`: scope changed to `openid profile email groups offline_access`; account page now links to the Authelia portal root.
- `/loculus-info` returns `hosts.authelia` instead of `hosts.keycloak`.
- `astro check` + `tsc --noEmit` both green.
CLI (cli/)
- `InstanceConfig`: `keycloak_realm` / `keycloak_client_id` replaced by `oidc_client_id` (default `loculus-cli`); `authelia_url` property reads `hosts.authelia` from `/loculus-info`.
- `loculus instance add` flags updated accordingly.
Registration service (registration-service/)
- FastAPI app with a Jinja form template (`username`, `email`, `first_name`, `last_name`, `organization`, `password`, `confirm_password`, `accept_terms`) + minimal CSS.
- Calls lldap's GraphQL admin API via the admin login.
- Health endpoint, success-redirect to Authelia.
- `registration-service-image.yml` modelled on `preprocessing-dummy-image.yml`; old `keycloakify-image.yml` and `keycloakify-test.yml` deleted; `update-argocd-metadata.yml` waits on the new image.
🤖 Generated with Claude Code
🚀 Preview: Add `preview` label to enable