feat: add Helm chart for Kubernetes deployment by narigondelsiglo · Pull Request #159 · BerriAI/litellm-agent-platform

narigondelsiglo · 2026-05-18T20:02:37Z

Summary

Adds deploy/charts/litellm-agent-platform/ — a production-ready Helm chart for deploying the LiteLLM Agent Platform to any Kubernetes cluster (not just AWS EKS + Render).

What's included

Template	Resource
`web-deployment.yaml`	Web Deployment (server-proxy.mjs, `IN_CLUSTER=true`)
`worker-deployment.yaml`	Worker Deployment (reconciler)
`migration-job.yaml`	Prisma DB migration (Helm pre-install/pre-upgrade hook)
`serviceaccount.yaml`	ServiceAccount for platform pods
`rbac.yaml`	Role + RoleBinding (namespaced sandbox CRD access) + ClusterRole (node read)
`priorityclasses.yaml`	PriorityClasses (platform-critical > sandbox-active > sandbox-warm)
`secret.yaml`	Secret with auto-generated master key and harness auth token
`service.yaml`	Service (ClusterIP by default)
`ingress.yaml`	Optional Ingress

Features

Bitnami PostgreSQL subchart or external database (externalDatabase.url / existingSecret)
Per-harness image overrides (opencode, claude-sdk, claude-code, codex, hermes, gemini)
RBAC scoped to sandbox namespace + cluster-wide node discovery
PriorityClasses ensure platform pods are never preempted by sandbox pods
Auto-generated secrets (master key, harness auth token) when not provided
Helm hook migration runs prisma db push before web/worker start

Prerequisites

Kubernetes 1.21+
Helm 3.8.0+
agent-sandbox CRD v0.4.5+ installed on the cluster

Quick start

# 1. Install agent-sandbox CRD (once)
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.5/manifest.yaml

# 2. Install the chart
helm install litellm-agents deploy/charts/litellm-agent-platform \
  --set image.repository=your-registry/litellm-agent-platform \
  --set image.tag=latest \
  --set secrets.litellmApiKey=sk-your-key \
  --set secrets.litellmApiBase=http://litellm-proxy:4000 \
  --set k8s.harnessImage=your-registry/opencode-sandbox:latest
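
The same settings can also live in a values file instead of a string of --set flags. This is a direct translation of the flags above; the filename my-values.yaml is a hypothetical choice, and the placeholder registry, key, and URL are the same ones used in the command.

```yaml
# my-values.yaml (hypothetical filename): mirrors the --set flags above.
image:
  repository: your-registry/litellm-agent-platform
  tag: latest
secrets:
  litellmApiKey: sk-your-key
  litellmApiBase: http://litellm-proxy:4000
k8s:
  harnessImage: your-registry/opencode-sandbox:latest
```

It would then be passed with `helm install litellm-agents deploy/charts/litellm-agent-platform -f my-values.yaml`, which keeps the API key out of shell history.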

Testing

helm lint deploy/charts/litellm-agent-platform    # 0 failures
helm template test-release deploy/charts/litellm-agent-platform  # renders all 13 resources

Notes

This chart deploys the platform layer (web + worker + postgres + RBAC). The agent-sandbox CRD/controller must be installed separately (same as the existing bin/kind-up.sh and k8s/ approach).
NodePort range (30000-30099) is the default; for >100 concurrent sandboxes, ClusterIP + Ingress topology is recommended (see docs/k8s-backend.md).
The chart is based directly on the existing k8s/web.yaml, k8s/worker.yaml, k8s/rbac-platform.yaml, k8s/priority-classes.yaml, and k8s/secrets.yaml manifests — no behavior changes.

Adds deploy/charts/litellm-agent-platform/ with:

- Web Deployment (server-proxy.mjs with IN_CLUSTER=true)
- Worker Deployment (reconciler)
- DB migration Job (Helm pre-install/pre-upgrade hook)
- ServiceAccount + RBAC (namespaced sandbox CRD access + cluster-wide node read)
- PriorityClasses (platform-critical > sandbox-active > sandbox-warm)
- Secret with auto-generated master key and harness auth token
- Service (ClusterIP by default, configurable)
- Ingress (optional)
- Bitnami PostgreSQL subchart or external database support
- Full values.yaml with per-harness image overrides

Tested with helm template and helm lint (0 failures).

Prerequisite: agent-sandbox CRD must be installed separately:

kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.5/manifest.yaml

greptile-apps · 2026-05-18T20:05:12Z

Greptile Summary

This PR adds a new Helm chart (deploy/charts/litellm-agent-platform/) for deploying the LiteLLM Agent Platform to any Kubernetes cluster, covering web, worker, migration job, RBAC, PriorityClasses, and optional Ingress — based directly on the existing k8s/ manifests.

Secret regeneration on upgrade: randAlphaNum in the helpers has no lookup guard, so MASTER_KEY and CONTAINER_ENV_HARNESS_AUTH_TOKEN are rotated on every helm upgrade, invalidating all existing API keys and live agent sessions.
Destructive migration: The migration job runs prisma db push --accept-data-loss on every install/upgrade, which will silently drop columns or tables whenever a schema change requires it. In addition, the migration.enabled flag in values.yaml is never checked in the template, so setting it to false has no effect.
Broken externalDatabase.existingSecret path: When this option is used without also setting secrets.existingSecret, the generated Secret stores the literal string $(DB_URL_FROM_SECRET) as DATABASE_URL, causing all database connections to fail.

Confidence Score: 2/5

The chart has several issues that would cause broken deployments or data loss in production before any real traffic hits the application.

Three independent defects affect core deployment correctness: secrets rotate on every upgrade (breaking all active sessions), the migration command can silently drop data on schema changes, and the external database secret path produces a non-functional DATABASE_URL. Any one of these would cause a production incident.

_helpers.tpl and migration-job.yaml carry the most consequential problems; secret.yaml and rbac.yaml need the secondary fixes before this chart is safe to deploy.

Important Files Changed

Filename	Overview
deploy/charts/litellm-agent-platform/templates/_helpers.tpl	Defines chart helpers including auto-generated secrets using `randAlphaNum` (no `lookup` guard causes key rotation on every upgrade) and a broken `externalDatabase.existingSecret` code path that writes a literal placeholder as DATABASE_URL.
deploy/charts/litellm-agent-platform/templates/migration-job.yaml	Helm pre-install/pre-upgrade migration job; uses `--accept-data-loss` flag that silently applies destructive schema changes, and lacks the `migration.enabled` guard that values.yaml implies.
deploy/charts/litellm-agent-platform/templates/rbac.yaml	Creates namespaced Role/RoleBinding (sandbox operations) + ClusterRole/ClusterRoleBinding (node discovery); scoped correctly but uses a separate `sandboxNamespace` value that can diverge from `k8s.namespace`.
deploy/charts/litellm-agent-platform/templates/secret.yaml	Renders the platform Secret; conditionally writes per-harness image keys correctly, but inherits the broken DATABASE_URL from the `externalDatabase.existingSecret` code path in the helpers.
deploy/charts/litellm-agent-platform/templates/web-deployment.yaml	Web Deployment with liveness/readiness probes, resource limits, and conditional harness image env vars; structure is sound assuming secrets are generated correctly.
deploy/charts/litellm-agent-platform/templates/worker-deployment.yaml	Worker Deployment mirroring web env structure; no probes defined (acceptable for a background reconciler), otherwise well-formed.
deploy/charts/litellm-agent-platform/values.yaml	Comprehensive defaults for all chart knobs; exposes `migration.enabled` which is not respected by the migration-job template, and has the dual `sandboxNamespace`/`k8s.namespace` ambiguity.
deploy/charts/litellm-agent-platform/templates/priorityclasses.yaml	Creates three cluster-scoped PriorityClasses with correct priority values ensuring platform pods are never preempted by sandbox pods.
deploy/charts/litellm-agent-platform/templates/ingress.yaml	Standard optional Ingress template; correctly routes to the web service and supports TLS and ingressClassName.
deploy/charts/litellm-agent-platform/templates/service.yaml	ClusterIP Service targeting the web deployment; selector and targetPort are consistent with the web deployment labels.
deploy/charts/litellm-agent-platform/Chart.yaml	Valid chart descriptor (chart `apiVersion: v2`, the Helm 3 format); no Bitnami PostgreSQL subchart dependency is declared here (a `dependencies` entry and `Chart.lock` would be needed for the subchart to actually be pulled).

Comments Outside Diff (3)

deploy/charts/litellm-agent-platform/templates/_helpers.tpl, lines 378-387

Auto-generated secrets rotate on every helm upgrade

randAlphaNum is called at render time with no guard against an already-existing secret. Every helm upgrade regenerates MASTER_KEY and CONTAINER_ENV_HARNESS_AUTH_TOKEN, overwriting the Secret in place. This invalidates all existing API sessions and disconnects live agent WebSocket connections immediately after each upgrade. The standard fix is to use lookup to read the current secret value and fall back to generating one only when the secret does not yet exist. Both litellm-agent-platform.masterKey and litellm-agent-platform.harnessAuthToken need this treatment.
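
A minimal sketch of the lookup guard described above, written directly in secret.yaml rather than in the helpers. It assumes the Secret is named by the litellm-agent-platform.fullname helper, lives in the release namespace, and uses 32-character keys; none of that is confirmed by the PR, and handling of user-supplied values is left out.

```yaml
# Sketch only: the Secret name and the 32-character length are assumptions.
{{- $name := include "litellm-agent-platform.fullname" . }}
# lookup returns an empty map when the Secret does not exist yet.
{{- $existing := (lookup "v1" "Secret" .Release.Namespace $name) | default dict }}
{{- $data := (get $existing "data") | default dict }}
apiVersion: v1
kind: Secret
metadata:
  name: {{ $name }}
type: Opaque
data:
  # Values read back through lookup are already base64-encoded, so only a
  # freshly generated value is passed through b64enc.
  MASTER_KEY: {{ (get $data "MASTER_KEY") | default (randAlphaNum 32 | b64enc) | quote }}
  CONTAINER_ENV_HARNESS_AUTH_TOKEN: {{ (get $data "CONTAINER_ENV_HARNESS_AUTH_TOKEN") | default (randAlphaNum 32 | b64enc) | quote }}
```

Note that helm template does not contact the cluster, so lookup returns an empty result there and a fresh value is rendered each time; the guard only takes effect on a real install or upgrade.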

deploy/charts/litellm-agent-platform/templates/_helpers.tpl, lines 363-373

externalDatabase.existingSecret path writes a broken DATABASE_URL

When externalDatabase.existingSecret is set (and secrets.existingSecret is not), secret.yaml still creates a new Secret because it only guards on secrets.existingSecret. The databaseUrl helper returns the literal string $(DB_URL_FROM_SECRET), which gets base64-encoded and stored as-is. The web and worker pods then receive DATABASE_URL=$(DB_URL_FROM_SECRET) — a literal string, not a database URL — and every database call will fail. The helper should instead reference the existing external secret directly in the pod's env, bypassing the generated secret entirely for DATABASE_URL.
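
One way the pod env could reference the external secret directly, as the comment suggests. This is a sketch under assumptions: externalDatabase.existingSecretKey and the name of the chart-generated Secret are hypothetical, since the PR does not show the final template.

```yaml
# Sketch only: externalDatabase.existingSecretKey and the fallback Secret name
# are hypothetical.
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        {{- if .Values.externalDatabase.existingSecret }}
        # Read the URL straight from the user-managed Secret.
        name: {{ .Values.externalDatabase.existingSecret }}
        key: {{ .Values.externalDatabase.existingSecretKey | default "DATABASE_URL" }}
        {{- else }}
        # Otherwise fall back to the Secret generated by the chart.
        name: {{ include "litellm-agent-platform.fullname" . }}
        key: DATABASE_URL
        {{- end }}
```

With this shape the generated Secret never needs to hold a placeholder, because the kubelet resolves the reference when the container starts.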

deploy/charts/litellm-agent-platform/templates/rbac.yaml, line 537

sandboxNamespace and k8s.namespace are separate values that must stay in sync

rbac.yaml scopes the Role and RoleBinding to .Values.sandboxNamespace, while the web and worker deployments set K8S_NAMESPACE from .Values.k8s.namespace. Both default to default, but if a user sets k8s.namespace: agents to point the runtime at a custom sandbox namespace without also setting sandboxNamespace: agents, the RBAC will be bound to the wrong namespace and every sandbox operation will fail with a 403. Consider consolidating to a single key or having one value derive from the other.
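
A sketch of the single-key option, with both templates reading .Values.k8s.namespace. The Role name suffix is illustrative and the rules are omitted; only the two namespace references matter here.

```yaml
# Sketch only: the "-sandbox" name suffix is hypothetical and rules are omitted.
# Excerpt from rbac.yaml: scope the Role to the namespace the runtime will use.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ include "litellm-agent-platform.fullname" . }}-sandbox
  namespace: {{ .Values.k8s.namespace | default "default" }}

# Excerpt from the web and worker container specs: same value, so the RBAC
# scope and K8S_NAMESPACE cannot drift apart.
# env:
#   - name: K8S_NAMESPACE
#     value: {{ .Values.k8s.namespace | default "default" | quote }}
```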

Reviews (1): Last reviewed commit: "feat: add Helm chart for Kubernetes depl..."

greptile-apps · 2026-05-18T20:05:20Z

+          imagePullPolicy: {{ .Values.image.pullPolicy }}
+          command: ["/bin/sh", "-c"]
+          args:
+            - "npx prisma db push --accept-data-loss --skip-generate"


--accept-data-loss silently drops columns/tables in production

prisma db push --accept-data-loss suppresses the interactive confirmation prompt that Prisma normally requires when a schema change would delete data (e.g. dropping a column). Every helm upgrade will apply destructive schema changes without any warning, risking silent data loss. Replace it with prisma migrate deploy for production: it applies only versioned, reviewed migrations, so any destructive change must appear explicitly in a committed migration file.

Suggested change

-            - "npx prisma db push --accept-data-loss --skip-generate"
+            - "npx prisma migrate deploy"

greptile-apps · 2026-05-18T20:05:22Z

+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: {{ include "litellm-agent-platform.fullname" . }}-migrate


migration.enabled flag is never checked — the job always runs

values.yaml exposes migration.enabled: true, implying users can disable the migration job. However migration-job.yaml has no {{- if .Values.migration.enabled }} guard, so the job is always rendered regardless of what users set. This means setting migration.enabled=false has no effect.

Suggested change

-apiVersion: batch/v1
-kind: Job
-metadata:
-  name: {{ include "litellm-agent-platform.fullname" . }}-migrate
+{{- if .Values.migration.enabled }}
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: {{ include "litellm-agent-platform.fullname" . }}-migrate
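
The suggestion only covers the top of the file; the opening if also needs a matching end as the last line of the template. A sketch of the whole guarded Job follows. The container name, image fields, and env handling are assumptions pieced together from the PR description and the follow-up commit, not the actual template.

```yaml
# Sketch only: container name and image fields are assumptions; the
# DATABASE_URL env wiring is omitted.
{{- if .Values.migration.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "litellm-agent-platform.fullname" . }}-migrate
  annotations:
    # Run before the web and worker Deployments are installed or upgraded.
    "helm.sh/hook": pre-install,pre-upgrade
spec:
  ttlSecondsAfterFinished: 120
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          command: ["/bin/sh", "-c"]
          args:
            - "npx prisma migrate deploy"
{{- end }}
```

Rendering with `--set migration.enabled=false` should then produce no Job at all, which is an easy check to include as proof.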

ishaan-berri · 2026-05-18T20:12:02Z

@narigondelsiglo please address the Greptile comments, and add proof that this works as expected.

Also, can you add this to the docs? https://github.com/BerriAI/litellm-agent-platform-docs

- Fix DB host helper: use {{ .Release.Name }}-postgresql instead of fullname
- Replace 'prisma db push --accept-data-loss' with 'prisma migrate deploy'
- Add lookup for existing secrets to prevent rotation on helm upgrade
- Move non-sensitive config (NODEPORT, WARM_POOL, MODEL, REPO) to ConfigMap
- Remove deprecated kubernetes.io/service-account-token Secret (K8s >=1.24)
- Add securityContext to web and worker pods (runAsNonRoot, drop ALL caps)
- Add migration.enabled guard to migration-job.yaml
- Increase ttlSecondsAfterFinished on migration job (20 -> 120)

krrish-berri-2 · 2026-05-18T23:13:12Z

@narigondelsiglo please get to a greptile 4/5+

- externalDatabase.existingSecret: pods now reference the external secret directly via secretKeyRef instead of storing a broken placeholder in the generated secret
- sandboxNamespace consolidated into k8s.namespace (single source of truth for sandbox RBAC and K8S_NAMESPACE env var)
- removed redundant sandboxNamespace value from values.yaml

narigondelsiglo mentioned this pull request May 18, 2026

feat: add Helm chart for Kubernetes deployment narigondelsiglo/litellm-agent-platform#1

Open

narigondelsiglo wants to merge 3 commits into BerriAI:main from narigondelsiglo:helm-chart

3 participants
