feat: add Helm chart for Kubernetes deployment #159

Open: narigondelsiglo wants to merge 3 commits into BerriAI:main from narigondelsiglo:helm-chart

@narigondelsiglo

Summary

Adds deploy/charts/litellm-agent-platform/ — a production-ready Helm chart for deploying the LiteLLM Agent Platform to any Kubernetes cluster (not just AWS EKS + Render).

What's included

| Template | Resource |
|---|---|
| web-deployment.yaml | Web Deployment (server-proxy.mjs, IN_CLUSTER=true) |
| worker-deployment.yaml | Worker Deployment (reconciler) |
| migration-job.yaml | Prisma DB migration (Helm pre-install/pre-upgrade hook) |
| serviceaccount.yaml | ServiceAccount for platform pods |
| rbac.yaml | Role + RoleBinding (namespaced sandbox CRD access) + ClusterRole (node read) |
| priorityclasses.yaml | PriorityClasses (platform-critical > sandbox-active > sandbox-warm) |
| secret.yaml | Secret with auto-generated master key and harness auth token |
| service.yaml | Service (ClusterIP by default) |
| ingress.yaml | Optional Ingress |

Features

  • Bitnami PostgreSQL subchart or external database (externalDatabase.url / existingSecret)
  • Per-harness image overrides (opencode, claude-sdk, claude-code, codex, hermes, gemini)
  • RBAC scoped to sandbox namespace + cluster-wide node discovery
  • PriorityClasses ensure platform pods are never preempted by sandbox pods
  • Auto-generated secrets (master key, harness auth token) when not provided
  • Helm hook migration runs prisma db push before web/worker start

Prerequisites

Quick start

# 1. Install agent-sandbox CRD (once)
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.5/manifest.yaml

# 2. Install the chart
helm install litellm-agents deploy/charts/litellm-agent-platform \
  --set image.repository=your-registry/litellm-agent-platform \
  --set image.tag=latest \
  --set secrets.litellmApiKey=sk-your-key \
  --set secrets.litellmApiBase=http://litellm-proxy:4000 \
  --set k8s.harnessImage=your-registry/opencode-sandbox:latest
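
The same settings can be kept in a values file instead of repeated `--set` flags. This is a minimal sketch, assuming the key paths are exactly the ones used in the command above; the file name `my-values.yaml` is hypothetical.

```yaml
# my-values.yaml (hypothetical file name); keys mirror the --set flags above
image:
  repository: your-registry/litellm-agent-platform
  tag: latest
secrets:
  litellmApiKey: sk-your-key
  litellmApiBase: http://litellm-proxy:4000
k8s:
  harnessImage: your-registry/opencode-sandbox:latest
```

It would then be passed with `helm install litellm-agents deploy/charts/litellm-agent-platform -f my-values.yaml`.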

Testing

helm lint deploy/charts/litellm-agent-platform    # 0 failures
helm template test-release deploy/charts/litellm-agent-platform  # renders all 13 resources

Notes

  • This chart deploys the platform layer (web + worker + postgres + RBAC). The agent-sandbox CRD/controller must be installed separately (same as the existing bin/kind-up.sh and k8s/ approach).
  • NodePort range (30000-30099) is the default; for >100 concurrent sandboxes, ClusterIP + Ingress topology is recommended (see docs/k8s-backend.md).
  • The chart is based directly on the existing k8s/web.yaml, k8s/worker.yaml, k8s/rbac-platform.yaml, k8s/priority-classes.yaml, and k8s/secrets.yaml manifests — no behavior changes.

Adds deploy/charts/litellm-agent-platform/ with:

- Web Deployment (server-proxy.mjs with IN_CLUSTER=true)
- Worker Deployment (reconciler)
- DB migration Job (Helm pre-install/pre-upgrade hook)
- ServiceAccount + RBAC (namespaced sandbox CRD access + cluster-wide node read)
- PriorityClasses (platform-critical > sandbox-active > sandbox-warm)
- Secret with auto-generated master key and harness auth token
- Service (ClusterIP by default, configurable)
- Ingress (optional)
- Bitnami PostgreSQL subchart or external database support
- Full values.yaml with per-harness image overrides

Tested with helm template and helm lint (0 failures).

Prerequisite: agent-sandbox CRD must be installed separately:
  kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.5/manifest.yaml
@greptile-apps
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR adds a new Helm chart (deploy/charts/litellm-agent-platform/) for deploying the LiteLLM Agent Platform to any Kubernetes cluster, covering web, worker, migration job, RBAC, PriorityClasses, and optional Ingress — based directly on the existing k8s/ manifests.

  • Secret regeneration on upgrade: randAlphaNum in the helpers has no lookup guard, so MASTER_KEY and CONTAINER_ENV_HARNESS_AUTH_TOKEN are rotated on every helm upgrade, invalidating all existing API keys and live agent sessions.
  • Destructive migration: The migration job runs prisma db push --accept-data-loss on every install/upgrade, which will silently drop columns or tables whenever a schema change requires it; and the migration.enabled=false guard in values.yaml is never checked in the template.
  • Broken externalDatabase.existingSecret path: When this option is used without also setting secrets.existingSecret, the generated Secret stores the literal string $(DB_URL_FROM_SECRET) as DATABASE_URL, causing all database connections to fail.

Confidence Score: 2/5

The chart has several issues that would cause broken deployments or data loss in production before any real traffic hits the application.

Three independent defects affect core deployment correctness: secrets rotate on every upgrade (breaking all active sessions), the migration command can silently drop data on schema changes, and the external database secret path produces a non-functional DATABASE_URL. Any one of these would cause a production incident.

_helpers.tpl and migration-job.yaml carry the most consequential problems; secret.yaml and rbac.yaml need the secondary fixes before this chart is safe to deploy.

Important Files Changed

| Filename | Overview |
|---|---|
| deploy/charts/litellm-agent-platform/templates/_helpers.tpl | Defines chart helpers including auto-generated secrets using randAlphaNum (no lookup guard causes key rotation on every upgrade) and a broken externalDatabase.existingSecret code path that writes a literal placeholder as DATABASE_URL. |
| deploy/charts/litellm-agent-platform/templates/migration-job.yaml | Helm pre-install/pre-upgrade migration job; uses --accept-data-loss flag that silently applies destructive schema changes, and lacks the migration.enabled guard that values.yaml implies. |
| deploy/charts/litellm-agent-platform/templates/rbac.yaml | Creates namespaced Role/RoleBinding (sandbox operations) + ClusterRole/ClusterRoleBinding (node discovery); scoped correctly but uses a separate sandboxNamespace value that can diverge from k8s.namespace. |
| deploy/charts/litellm-agent-platform/templates/secret.yaml | Renders the platform Secret; conditionally writes per-harness image keys correctly, but inherits the broken DATABASE_URL from the externalDatabase.existingSecret code path in the helpers. |
| deploy/charts/litellm-agent-platform/templates/web-deployment.yaml | Web Deployment with liveness/readiness probes, resource limits, and conditional harness image env vars; structure is sound assuming secrets are generated correctly. |
| deploy/charts/litellm-agent-platform/templates/worker-deployment.yaml | Worker Deployment mirroring web env structure; no probes defined (acceptable for a background reconciler), otherwise well-formed. |
| deploy/charts/litellm-agent-platform/values.yaml | Comprehensive defaults for all chart knobs; exposes migration.enabled which is not respected by the migration-job template, and has the dual sandboxNamespace/k8s.namespace ambiguity. |
| deploy/charts/litellm-agent-platform/templates/priorityclasses.yaml | Creates three cluster-scoped PriorityClasses with correct priority values ensuring platform pods are never preempted by sandbox pods. |
| deploy/charts/litellm-agent-platform/templates/ingress.yaml | Standard optional Ingress template; correctly routes to the web service and supports TLS and ingressClassName. |
| deploy/charts/litellm-agent-platform/templates/service.yaml | ClusterIP Service targeting the web deployment; selector and targetPort are consistent with the web deployment labels. |
| deploy/charts/litellm-agent-platform/Chart.yaml | Valid Helm v2 chart descriptor; no Bitnami PostgreSQL subchart dependency declared here (would need a dependencies entry and Chart.lock for the subchart to actually be pulled). |

Comments Outside Diff (3)

  1. deploy/charts/litellm-agent-platform/templates/_helpers.tpl, lines 378-387

    P1 Auto-generated secrets rotate on every helm upgrade

    randAlphaNum is called at render time with no guard against an already-existing secret. Every helm upgrade regenerates MASTER_KEY and CONTAINER_ENV_HARNESS_AUTH_TOKEN, overwriting the Secret in place. This invalidates all existing API sessions and disconnects live agent WebSocket connections immediately after each upgrade. The standard fix is to use lookup to read the current secret value and fall back to generating one only when the secret does not yet exist. Both litellm-agent-platform.masterKey and litellm-agent-platform.harnessAuthToken need this treatment.

  2. deploy/charts/litellm-agent-platform/templates/_helpers.tpl, lines 363-373

    P1 externalDatabase.existingSecret path writes a broken DATABASE_URL

    When externalDatabase.existingSecret is set (and secrets.existingSecret is not), secret.yaml still creates a new Secret because it only guards on secrets.existingSecret. The databaseUrl helper returns the literal string $(DB_URL_FROM_SECRET), which gets base64-encoded and stored as-is. The web and worker pods then receive DATABASE_URL=$(DB_URL_FROM_SECRET) — a literal string, not a database URL — and every database call will fail. The helper should instead reference the existing external secret directly in the pod's env, bypassing the generated secret entirely for DATABASE_URL.

  3. deploy/charts/litellm-agent-platform/templates/rbac.yaml, line 537

    P2 sandboxNamespace and k8s.namespace are separate values that must stay in sync

    rbac.yaml scopes the Role and RoleBinding to .Values.sandboxNamespace, while the web and worker deployments set K8S_NAMESPACE from .Values.k8s.namespace. Both default to default, but if a user sets k8s.namespace: agents to point the runtime at a custom sandbox namespace without also setting sandboxNamespace: agents, the RBAC will be bound to the wrong namespace and every sandbox operation will fail with a 403. Consider consolidating to a single key or having one value derive from the other.
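
For the first comment above (secret rotation), the lookup-based guard could look roughly like the sketch below. It is not taken from the chart. It assumes the generated Secret is named after the `litellm-agent-platform.fullname` helper and that an explicit override lives at `.Values.secrets.masterKey`; both are assumptions, and the key length of 32 is only illustrative.

```yaml
{{/*
Sketch only. Assumptions: the generated Secret is named after the
"litellm-agent-platform.fullname" helper, and an explicit override lives at
.Values.secrets.masterKey (hypothetical key). Key length 32 is illustrative.
*/}}
{{- define "litellm-agent-platform.masterKey" -}}
{{- $name := include "litellm-agent-platform.fullname" . -}}
{{- /* lookup returns an empty map when the Secret does not exist yet */ -}}
{{- $secret := (lookup "v1" "Secret" .Release.Namespace $name) | default dict -}}
{{- $data := (get $secret "data") | default dict -}}
{{- if .Values.secrets.masterKey -}}
{{- .Values.secrets.masterKey -}}
{{- else if hasKey $data "MASTER_KEY" -}}
{{- /* reuse the stored value so upgrades do not rotate the key */ -}}
{{- get $data "MASTER_KEY" | b64dec -}}
{{- else -}}
{{- randAlphaNum 32 -}}
{{- end -}}
{{- end -}}
```

Note that `lookup` returns an empty result under `helm template` and client-side dry runs, so those renders still show a freshly generated value. The same pattern would apply to `litellm-agent-platform.harnessAuthToken` with `CONTAINER_ENV_HARNESS_AUTH_TOKEN` as the key.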

Reviews (1): Last reviewed commit: "feat: add Helm chart for Kubernetes depl..."

imagePullPolicy: {{ .Values.image.pullPolicy }}
command: ["/bin/sh", "-c"]
args:
- "npx prisma db push --accept-data-loss --skip-generate"

P1 --accept-data-loss silently drops columns/tables in production

prisma db push --accept-data-loss suppresses the interactive confirmation prompt that Prisma normally requires when a schema change would delete data (e.g. dropping a column). Every helm upgrade will apply destructive schema changes without any warning, risking silent data loss. Replace with prisma migrate deploy for production — it applies versioned, reviewed migrations only, and will never silently destroy data.

Suggested change
- - "npx prisma db push --accept-data-loss --skip-generate"
+ - "npx prisma migrate deploy"

Comment on lines +1 to +4
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "litellm-agent-platform.fullname" . }}-migrate

P1 migration.enabled flag is never checked — the job always runs

values.yaml exposes migration.enabled: true, implying users can disable the migration job. However migration-job.yaml has no {{- if .Values.migration.enabled }} guard, so the job is always rendered regardless of what users set. This means setting migration.enabled=false has no effect.

Suggested change
- apiVersion: batch/v1
- kind: Job
- metadata:
-   name: {{ include "litellm-agent-platform.fullname" . }}-migrate
+ {{- if .Values.migration.enabled }}
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: {{ include "litellm-agent-platform.fullname" . }}-migrate
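
The suggestion shows only the top of the file, and the opening `if` also needs a matching `{{- end }}` as the last line of the template. A minimal sketch of the whole guarded Job follows. It assumes the image values used elsewhere in this PR; the container name `migrate` is hypothetical, and the `DATABASE_URL` env wiring is omitted for brevity.

```yaml
{{- if .Values.migration.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "litellm-agent-platform.fullname" . }}-migrate
  annotations:
    # run before the web/worker resources are installed or upgraded
    "helm.sh/hook": pre-install,pre-upgrade
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate  # hypothetical container name
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          command: ["/bin/sh", "-c"]
          args:
            - "npx prisma migrate deploy"
{{- end }}
```

With this guard in place, `helm template test-release deploy/charts/litellm-agent-platform --set migration.enabled=false` should render no Job.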

@ishaan-berri
Contributor

@narigondelsiglo please address the Greptile comments, and add proof that this works as expected.

Also, can you add this to the docs: https://github.com/BerriAI/litellm-agent-platform-docs

- Fix DB host helper: use {{ .Release.Name }}-postgresql instead of fullname
- Replace 'prisma db push --accept-data-loss' with 'prisma migrate deploy'
- Add lookup for existing secrets to prevent rotation on helm upgrade
- Move non-sensitive config (NODEPORT, WARM_POOL, MODEL, REPO) to ConfigMap
- Remove deprecated kubernetes.io/service-account-token Secret (K8s >=1.24)
- Add securityContext to web and worker pods (runAsNonRoot, drop ALL caps)
- Add migration.enabled guard to migration-job.yaml
- Increase ttlSecondsAfterFinished on migration job (20 -> 120)
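
The securityContext item in this commit could be sketched as the pod-template fragment below. Only `runAsNonRoot` and dropping all capabilities come from the commit message; the container name `web` is hypothetical and the exact placement in the chart's templates is an assumption.

```yaml
# Fragment of a Deployment; not the chart's actual template.
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true        # pod-level: refuse to start as UID 0
      containers:
        - name: web               # hypothetical container name
          securityContext:
            capabilities:
              drop: ["ALL"]       # container-level: drop every Linux capability
```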
@krrish-berri-2
Copy link
Copy Markdown
Contributor

@narigondelsiglo please get this to a Greptile confidence score of 4/5 or higher

- externalDatabase.existingSecret: pods now reference the external
  secret directly via secretKeyRef instead of storing a broken
  placeholder in the generated secret
- sandboxNamespace consolidated into k8s.namespace (single source
  of truth for sandbox RBAC and K8S_NAMESPACE env var)
- removed redundant sandboxNamespace value from values.yaml
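
The secretKeyRef change described in this commit could look roughly like the container env fragment below. This is a sketch, not the committed template: `externalDatabase.existingSecretKey` is a hypothetical value, and the generated Secret is assumed to be named after the `litellm-agent-platform.fullname` helper.

```yaml
# Fragment of the web/worker container spec.
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        {{- if .Values.externalDatabase.existingSecret }}
        # read the URL straight from the user-supplied Secret
        name: {{ .Values.externalDatabase.existingSecret }}
        key: {{ .Values.externalDatabase.existingSecretKey | default "DATABASE_URL" }}
        {{- else }}
        # fall back to the Secret rendered by this chart (assumed name)
        name: {{ include "litellm-agent-platform.fullname" . }}
        key: DATABASE_URL
        {{- end }}
```

Because the pod reads the external Secret directly, the generated Secret no longer needs a `DATABASE_URL` entry in that mode.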