fix(worker): ship tsconfig.worker.json in worker image (fixes worker crash-loop) #294

Open
ishaan-berri wants to merge 2 commits into main from litellm_fix-worker-tsconfig-copy

ishaan-berri (Contributor) commented May 23, 2026

What

The litellm-worker (reconciler) pod was crash-looping in prod, taking down the reconciler that GCs orphaned session pods and maintains the warm pool. With it down, orphan pods accumulated → node memory hit 96–98% → sessions couldn't schedule → platform slow / 502s.

Two missing pieces in the worker Docker stage, both surfaced in sequence:

  1. "Cannot resolve tsconfig at path: /app/tsconfig.worker.json": npm run worker runs tsx --tsconfig tsconfig.worker.json … but the stage only copied tsconfig.json.
  2. Cannot find module '@/lib/egress-hosts' — the stage copied src/server + src/worker but the worker transitively imports from src/lib (via src/server/types.ts).

Fix

  • Copy tsconfig.worker.json into the worker stage.
  • Copy the whole src/ tree instead of selective subdirs, so all transitive imports resolve (robust against this recurring whack-a-mole).
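
As a rough sketch, the two changes amount to something like the following in the worker stage. The stage name, base image, and dependency-install step are assumptions for illustration and are not taken from the repository's Dockerfile. Only the `/app` working directory, the two tsconfig files, and the `src/` copy come from the description above.

```dockerfile
# Hypothetical worker stage: base image and stage name are assumed.
FROM node:20-slim AS worker
WORKDIR /app

# Assumed dependency-install step; the real stage may differ.
COPY package.json package-lock.json ./
RUN npm ci

# Previously only tsconfig.json was copied, so
# `tsx --tsconfig tsconfig.worker.json` could not find its config.
COPY tsconfig.json tsconfig.worker.json ./

# Previously only src/server and src/worker were copied, so imports
# such as @/lib/egress-hosts (under src/lib) failed to resolve.
COPY src ./src
```

Copying the whole `src/` tree trades a slightly larger image for not having to track every transitive import by hand.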

Verified

Applied the equivalent fix live in prod via ConfigMap mounts (tsconfig.worker.json + src/lib): the worker started cleanly and the reconciler resumed —

reconciler worker started (warm_pool_size=2)
reconcile: ok=true … idle_killed=7 … warm_provisioned=2 automations_fired=1

Web went from flapping 502s back to stable 200s. This PR makes that fix permanent (the live ConfigMap hotfix would be reverted by the next deploy).

The worker runs `tsx --tsconfig tsconfig.worker.json src/worker/index.ts`
(via `npm run worker`), but the worker Docker stage only copied tsconfig.json,
not tsconfig.worker.json. Every worker pod crash-looped with
"Cannot resolve tsconfig at path: /app/tsconfig.worker.json", taking down the
reconciler — so the warm pool stopped being maintained and orphaned session
pods were never garbage-collected, causing node memory exhaustion and slow/
stuck session provisioning. Copy tsconfig.worker.json into the worker stage.

greptile-apps Bot (Contributor) commented May 23, 2026

Greptile Summary

This PR ships a one-line Dockerfile fix to stop the worker crash-loop by adding tsconfig.worker.json to the worker stage's COPY command, which is required by the npm run worker script (tsx --tsconfig tsconfig.worker.json …).

  • The worker stage fix is correct and directly addresses the crash-loop, but the runner stage (stage 5) copies src/worker and is documented as also acting as the reconciler worker, yet still omits tsconfig.worker.json from its COPY on line 110 — leaving the same failure possible there.
  • The worker stage CMD invokes tsx src/worker/index.ts directly without --tsconfig tsconfig.worker.json, so the image's default entrypoint silently skips the worker tsconfig; only deployments that override the command with npm run worker get the intended configuration.

Confidence Score: 3/5

The core fix is correct, but the runner stage — documented to also run as the worker — has the same missing file, and the worker CMD diverges from the npm script.

The runner stage explicitly documents that it doubles as the worker image in some k8s configs (line 112–113) and copies worker source, but line 110 still omits tsconfig.worker.json. Any deployment using the runner image to run npm run worker will hit the same crash-loop that this PR is trying to fix. The CMD discrepancy is a secondary concern that could cause subtle configuration drift.

Dockerfile — both the runner stage COPY on line 110 and the worker stage CMD on line 75 need a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Dockerfile | Adds tsconfig.worker.json to the worker stage COPY command to fix crash-loop; the runner stage (which can also act as the worker) still omits the same file, and the worker CMD diverges from the npm script by not passing --tsconfig. |

Comments Outside Diff (2)

  1. Dockerfile, line 110-115 (link)

    P1 Runner stage also missing tsconfig.worker.json

    The comment on line 112–113 explicitly states the runner image is also used as the reconciler worker ("k8s/worker.yaml runs npm run worker = tsx src/worker/index.ts"), and the runner stage copies both src/server and src/worker. However, line 110 only copies tsconfig.json — not tsconfig.worker.json. If npm run worker (which uses --tsconfig tsconfig.worker.json) is invoked against the runner image, it will hit the exact same crash-loop this PR is fixing in the worker stage.

  2. Dockerfile, line 75 (link)

    P2 CMD bypasses --tsconfig flag

    The CMD at this line invokes tsx src/worker/index.ts directly — without --tsconfig tsconfig.worker.json — so if the container runs with its default command, tsconfig.worker.json is never consulted and its settings (path aliases, module resolution, etc.) are silently ignored. The crash-loop is fixed when k8s overrides this with npm run worker, but if any deployment uses the image's default CMD, the worker will behave differently from the intended configuration. Consider aligning the CMD to mirror the npm run worker script (e.g. ["npx", "tsx", "--tsconfig", "tsconfig.worker.json", "src/worker/index.ts"]).
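
The two suggestions above could look roughly like this. It is a sketch only: it assumes the runner stage copies its config files with a single COPY into the working directory, which is inferred from the comments rather than from the actual Dockerfile. The two instructions belong to different stages and are shown together only for brevity.

```dockerfile
# Runner stage (comment 1): ship the worker tsconfig next to tsconfig.json,
# so `npm run worker` also works when run against the runner image.
COPY tsconfig.json tsconfig.worker.json ./

# Worker stage (comment 2): make the default command mirror `npm run worker`,
# so the image behaves the same with or without a k8s command override.
CMD ["npx", "tsx", "--tsconfig", "tsconfig.worker.json", "src/worker/index.ts"]
```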

Reviews (1): Last reviewed commit: "fix(worker): ship tsconfig.worker.json i..."

The worker stage selectively copied src/server + src/worker, but the worker
transitively imports from src/lib (e.g. @/lib/egress-hosts via src/server/types.ts).
At runtime this crash-looped with "Cannot find module '@/lib/egress-hosts'".
Copy the whole src/ tree so all transitive imports resolve, alongside the
tsconfig.worker.json fix from the previous commit.