feat: add Helm chart for Kubernetes deployment #159
Adds deploy/charts/litellm-agent-platform/ with:
- Web Deployment (server-proxy.mjs with IN_CLUSTER=true)
- Worker Deployment (reconciler)
- DB migration Job (Helm pre-install/pre-upgrade hook)
- ServiceAccount + RBAC (namespaced sandbox CRD access + cluster-wide node read)
- PriorityClasses (platform-critical > sandbox-active > sandbox-warm)
- Secret with auto-generated master key and harness auth token
- Service (ClusterIP by default, configurable)
- Ingress (optional)
- Bitnami PostgreSQL subchart or external database support
- Full values.yaml with per-harness image overrides

Tested with helm template and helm lint (0 failures).

Prerequisite: the agent-sandbox CRD must be installed separately:
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.5/manifest.yaml
Greptile Summary
This PR adds a new Helm chart (
Confidence Score: 2/5
The chart has several issues that would cause broken deployments or data loss in production before any real traffic hits the application. Three independent defects affect core deployment correctness: secrets rotate on every upgrade (breaking all active sessions), the migration command can silently drop data on schema changes, and the external database secret path produces a non-functional DATABASE_URL. Any one of these would cause a production incident.

| Filename | Overview |
|---|---|
| deploy/charts/litellm-agent-platform/templates/_helpers.tpl | Defines chart helpers including auto-generated secrets using randAlphaNum (no lookup guard causes key rotation on every upgrade) and a broken externalDatabase.existingSecret code path that writes a literal placeholder as DATABASE_URL. |
| deploy/charts/litellm-agent-platform/templates/migration-job.yaml | Helm pre-install/pre-upgrade migration job; uses --accept-data-loss flag that silently applies destructive schema changes, and lacks the migration.enabled guard that values.yaml implies. |
| deploy/charts/litellm-agent-platform/templates/rbac.yaml | Creates namespaced Role/RoleBinding (sandbox operations) + ClusterRole/ClusterRoleBinding (node discovery); scoped correctly but uses a separate sandboxNamespace value that can diverge from k8s.namespace. |
| deploy/charts/litellm-agent-platform/templates/secret.yaml | Renders the platform Secret; conditionally writes per-harness image keys correctly, but inherits the broken DATABASE_URL from the externalDatabase.existingSecret code path in the helpers. |
| deploy/charts/litellm-agent-platform/templates/web-deployment.yaml | Web Deployment with liveness/readiness probes, resource limits, and conditional harness image env vars; structure is sound assuming secrets are generated correctly. |
| deploy/charts/litellm-agent-platform/templates/worker-deployment.yaml | Worker Deployment mirroring web env structure; no probes defined (acceptable for a background reconciler), otherwise well-formed. |
| deploy/charts/litellm-agent-platform/values.yaml | Comprehensive defaults for all chart knobs; exposes migration.enabled which is not respected by the migration-job template, and has the dual sandboxNamespace/k8s.namespace ambiguity. |
| deploy/charts/litellm-agent-platform/templates/priorityclasses.yaml | Creates three cluster-scoped PriorityClasses with correct priority values ensuring platform pods are never preempted by sandbox pods. |
| deploy/charts/litellm-agent-platform/templates/ingress.yaml | Standard optional Ingress template; correctly routes to the web service and supports TLS and ingressClassName. |
| deploy/charts/litellm-agent-platform/templates/service.yaml | ClusterIP Service targeting the web deployment; selector and targetPort are consistent with the web deployment labels. |
| deploy/charts/litellm-agent-platform/Chart.yaml | Valid Helm v2 chart descriptor; no Bitnami PostgreSQL subchart dependency declared here (would need a dependencies entry and Chart.lock for the subchart to actually be pulled). |
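
The Chart.yaml row above notes that the Bitnami PostgreSQL subchart is only pulled when it is declared as a dependency. A minimal sketch of such an entry follows; the version range, the repository URL, and the `postgresql.enabled` condition key are assumptions for illustration and are not taken from this PR.

```yaml
# Chart.yaml (fragment, illustrative only)
apiVersion: v2
name: litellm-agent-platform
dependencies:
  - name: postgresql
    # Assumed range; pin to whichever subchart version was actually tested.
    version: "15.x.x"
    # Assumed repository location for the Bitnami charts.
    repository: oci://registry-1.docker.io/bitnamicharts
    # Assumed values key that lets users switch to an external database.
    condition: postgresql.enabled
```

With an entry like this in place, `helm dependency update` run against the chart directory fetches the subchart and writes Chart.lock.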
Comments Outside Diff (3)

- deploy/charts/litellm-agent-platform/templates/_helpers.tpl, lines 378-387: Auto-generated secrets rotate on every `helm upgrade`

  `randAlphaNum` is called at render time with no guard against an already-existing secret. Every `helm upgrade` regenerates `MASTER_KEY` and `CONTAINER_ENV_HARNESS_AUTH_TOKEN`, overwriting the Secret in place. This invalidates all existing API sessions and disconnects live agent WebSocket connections immediately after each upgrade. The standard fix is to use `lookup` to read the current secret value and fall back to generating one only when the secret does not yet exist. Both `litellm-agent-platform.masterKey` and `litellm-agent-platform.harnessAuthToken` need this treatment.

- deploy/charts/litellm-agent-platform/templates/_helpers.tpl, lines 363-373: `externalDatabase.existingSecret` path writes a broken DATABASE_URL

  When `externalDatabase.existingSecret` is set (and `secrets.existingSecret` is not), `secret.yaml` still creates a new Secret because it only guards on `secrets.existingSecret`. The `databaseUrl` helper returns the literal string `$(DB_URL_FROM_SECRET)`, which gets base64-encoded and stored as-is. The web and worker pods then receive `DATABASE_URL=$(DB_URL_FROM_SECRET)` (a literal string, not a database URL), and every database call will fail. The helper should instead reference the existing external secret directly in the pod's `env`, bypassing the generated secret entirely for `DATABASE_URL`.

- deploy/charts/litellm-agent-platform/templates/rbac.yaml, line 537: `sandboxNamespace` and `k8s.namespace` are separate values that must stay in sync

  `rbac.yaml` scopes the `Role` and `RoleBinding` to `.Values.sandboxNamespace`, while the web and worker deployments set `K8S_NAMESPACE` from `.Values.k8s.namespace`. Both default to `default`, but if a user sets `k8s.namespace: agents` to point the runtime at a custom sandbox namespace without also setting `sandboxNamespace: agents`, the RBAC will be bound to the wrong namespace and every sandbox operation will fail with a 403. Consider consolidating to a single key or having one value derive from the other.
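
The `lookup` guard suggested in the first comment could look roughly like the helper below. This is a sketch under stated assumptions, not the chart's actual code: it assumes the generated Secret is named after the chart's fullname helper and stores the key under `MASTER_KEY`, and the 32-character length is arbitrary.

```yaml
{{/*
Sketch only: reuse the stored MASTER_KEY when the Secret already exists,
otherwise generate a new one. Secret name and key length are assumptions.
*/}}
{{- define "litellm-agent-platform.masterKey" -}}
{{- $name := include "litellm-agent-platform.fullname" . -}}
{{- /* lookup returns an empty map when the Secret is not found */ -}}
{{- $existing := lookup "v1" "Secret" .Release.Namespace $name -}}
{{- /* dig walks data.MASTER_KEY and falls back to "" if any level is missing */ -}}
{{- $current := dig "data" "MASTER_KEY" "" $existing -}}
{{- if $current -}}
{{- $current | b64dec -}}
{{- else -}}
{{- randAlphaNum 32 -}}
{{- end -}}
{{- end -}}
```

`lookup` returns an empty result under `helm template` and client-side dry runs, so those renders still show a fresh random value; only an install or upgrade that talks to the cluster reuses the stored key. The same pattern would apply to `litellm-agent-platform.harnessAuthToken`.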
Reviews (1): Last reviewed commit: "feat: add Helm chart for Kubernetes depl..."
    imagePullPolicy: {{ .Values.image.pullPolicy }}
    command: ["/bin/sh", "-c"]
    args:
      - "npx prisma db push --accept-data-loss --skip-generate"
`--accept-data-loss` silently drops columns/tables in production
`prisma db push --accept-data-loss` suppresses the interactive confirmation prompt that Prisma normally requires when a schema change would delete data (e.g. dropping a column). Every `helm upgrade` will therefore apply destructive schema changes without any warning, risking silent data loss. Replace it with `prisma migrate deploy` for production: it applies only versioned, reviewed migration files, so a destructive change has to appear explicitly in a migration before it can run.
Suggested change (before):
    - "npx prisma db push --accept-data-loss --skip-generate"
Suggested change (after):
    - "npx prisma migrate deploy"
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: {{ include "litellm-agent-platform.fullname" . }}-migrate
`migration.enabled` flag is never checked: the job always runs
values.yaml exposes `migration.enabled: true`, implying users can disable the migration job. However, migration-job.yaml has no `{{- if .Values.migration.enabled }}` guard, so the job is rendered regardless of what users set, and `migration.enabled=false` has no effect.
Suggested change (before):
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: {{ include "litellm-agent-platform.fullname" . }}-migrate
Suggested change (after):
    {{- if .Values.migration.enabled }}
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: {{ include "litellm-agent-platform.fullname" . }}-migrate
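
The suggestion above shows only the opening of the guard; the template also needs a matching `{{- end }}` after the Job spec. A sketch of the full shape follows. The hook annotation mirrors the pre-install/pre-upgrade hook described in the PR and the command is the one recommended in the review; the container name, the `image.repository` and `image.tag` value keys, and the `restartPolicy` are assumptions.

```yaml
{{- if .Values.migration.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "litellm-agent-platform.fullname" . }}-migrate
  annotations:
    # Run before the web and worker Deployments are installed or upgraded.
    "helm.sh/hook": pre-install,pre-upgrade
spec:
  template:
    spec:
      restartPolicy: Never            # assumed
      containers:
        - name: migrate               # hypothetical container name
          # Assumed value keys for the image reference.
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          command: ["/bin/sh", "-c"]
          args:
            - "npx prisma migrate deploy"
{{- end }}
```

Rendering with `helm template --set migration.enabled=false` should then produce no Job at all, which is a quick way to confirm the guard works.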
@narigondelsiglo address greptile comments. +add proof this works as expected Also can u add this to docs https://github.com/BerriAI/litellm-agent-platform-docs
- Fix DB host helper: use {{ .Release.Name }}-postgresql instead of fullname
- Replace 'prisma db push --accept-data-loss' with 'prisma migrate deploy'
- Add lookup for existing secrets to prevent rotation on helm upgrade
- Move non-sensitive config (NODEPORT, WARM_POOL, MODEL, REPO) to ConfigMap
- Remove deprecated kubernetes.io/service-account-token Secret (K8s >=1.24)
- Add securityContext to web and worker pods (runAsNonRoot, drop ALL caps)
- Add migration.enabled guard to migration-job.yaml
- Increase ttlSecondsAfterFinished on migration job (20 -> 120)
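
For illustration, the securityContext item in the list above could translate into a pod spec fragment along these lines. This is a sketch rather than the chart's actual template; the container name is hypothetical.

```yaml
# Fragment of a Deployment's spec.template.spec (illustrative only)
securityContext:
  # Pod-level: refuse to start containers that would run as UID 0.
  runAsNonRoot: true
containers:
  - name: web                # hypothetical container name
    securityContext:
      capabilities:
        # Container-level: drop every Linux capability.
        drop: ["ALL"]
```

With `runAsNonRoot: true`, the kubelet rejects a container whose image would run as root, so the image has to define a non-root user or the pod has to set `runAsUser` explicitly.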
@narigondelsiglo please get to a greptile 4/5+
- externalDatabase.existingSecret: pods now reference the external secret directly via secretKeyRef instead of storing a broken placeholder in the generated secret
- sandboxNamespace consolidated into k8s.namespace (single source of truth for sandbox RBAC and K8S_NAMESPACE env var)
- removed redundant sandboxNamespace value from values.yaml
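
A rough sketch of how the `secretKeyRef` change described above might look in the web and worker container specs. The `externalDatabase.existingSecretKey` value and its `DATABASE_URL` default are hypothetical, and the fallback assumes the generated Secret is named after the chart's fullname helper.

```yaml
# Fragment of a container spec (illustrative only)
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        {{- if .Values.externalDatabase.existingSecret }}
        # Read the URL straight from the user-supplied Secret.
        name: {{ .Values.externalDatabase.existingSecret }}
        # Hypothetical value key; defaults to DATABASE_URL.
        key: {{ .Values.externalDatabase.existingSecretKey | default "DATABASE_URL" }}
        {{- else }}
        # Otherwise fall back to the chart-generated Secret (assumed name).
        name: {{ include "litellm-agent-platform.fullname" . }}
        key: DATABASE_URL
        {{- end }}
  # Single source of truth for the sandbox namespace after the consolidation.
  - name: K8S_NAMESPACE
    value: {{ .Values.k8s.namespace | quote }}
```

Under the same consolidation, the sandbox `Role` and `RoleBinding` in rbac.yaml would take their `metadata.namespace` from `.Values.k8s.namespace` as well, so the two can no longer diverge.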
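Summary
Adds deploy/charts/litellm-agent-platform/, a production-ready Helm chart for deploying the LiteLLM Agent Platform to any Kubernetes cluster (not just AWS EKS + Render).

What's included
| File | Purpose |
|---|---|
| web-deployment.yaml | Web Deployment (server-proxy.mjs with IN_CLUSTER=true) |
| worker-deployment.yaml | Worker Deployment (reconciler) |
| migration-job.yaml | DB migration Job (Helm pre-install/pre-upgrade hook) |
| serviceaccount.yaml | ServiceAccount |
| rbac.yaml | RBAC (namespaced sandbox CRD access + cluster-wide node read) |
| priorityclasses.yaml | PriorityClasses (platform-critical > sandbox-active > sandbox-warm) |
| secret.yaml | Secret with auto-generated master key and harness auth token |
| service.yaml | Service (ClusterIP by default, configurable) |
| ingress.yaml | Ingress (optional) |

Features
- External database support (externalDatabase.url / existingSecret)
- Migration Job runs prisma db push before web/worker start

Prerequisites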
Quick start
Testing
Notes
- … bin/kind-up.sh and k8s/ approach).
- … docs/k8s-backend.md).
- … k8s/web.yaml, k8s/worker.yaml, k8s/rbac-platform.yaml, k8s/priority-classes.yaml, and k8s/secrets.yaml manifests, no behavior changes.