fix(worker): ship tsconfig.worker.json in worker image (fixes worker crash-loop)#294
ishaan-berri wants to merge 2 commits
The worker runs `tsx --tsconfig tsconfig.worker.json src/worker/index.ts` (via `npm run worker`), but the worker Docker stage copied only tsconfig.json, not tsconfig.worker.json. Every worker pod crash-looped with "Cannot resolve tsconfig at path: /app/tsconfig.worker.json", taking down the reconciler. As a result, the warm pool stopped being maintained and orphaned session pods were never garbage-collected, causing node memory exhaustion and slow or stuck session provisioning. Copy tsconfig.worker.json into the worker stage.
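For illustration, the change to the worker stage might look roughly like the sketch below. It is not the actual diff: the base image and stage name are assumptions, and dependency installation is omitted. Only the two tsconfig file names and the `/app` working directory come from the description above.

```dockerfile
# Hypothetical worker stage; base image and stage name are assumptions.
FROM node:20-slim AS worker
WORKDIR /app

# Before: only tsconfig.json was copied, so
# `tsx --tsconfig tsconfig.worker.json ...` failed with
# "Cannot resolve tsconfig at path: /app/tsconfig.worker.json".
# COPY tsconfig.json ./

# After: ship both config files into /app.
COPY tsconfig.json tsconfig.worker.json ./
```

With more than one source file, `COPY` requires the destination to be a directory ending in a slash, which `./` satisfies.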
Greptile Summary

This PR ships a one-line Dockerfile fix to stop the worker crash-loop by adding
Confidence Score: 3/5

The core fix is correct, but the runner stage, which is documented to also run as the worker, has the same missing file, and the worker CMD diverges from the npm script. The runner stage explicitly documents that it doubles as the worker image in some k8s configs (lines 112–113) and copies worker source, but line 110 still omits tsconfig.worker.json. Any deployment using the runner image to run `npm run worker` will hit the same crash-loop that this PR is trying to fix. The CMD discrepancy is a secondary concern that could cause subtle configuration drift. In the Dockerfile, both the runner stage COPY on line 110 and the worker stage CMD on line 75 need a second look.
| Filename | Overview |
|---|---|
| Dockerfile | Adds tsconfig.worker.json to the worker stage COPY command to fix crash-loop; the runner stage (which can also act as the worker) still omits the same file, and the worker CMD diverges from the npm script by not passing --tsconfig. |
Comments Outside Diff (2)
- Dockerfile, lines 110-115: Runner stage also missing `tsconfig.worker.json`

  The comment on lines 112–113 explicitly states the runner image is also used as the reconciler worker ("k8s/worker.yaml runs `npm run worker` = tsx src/worker/index.ts"), and the runner stage copies both `src/server` and `src/worker`. However, line 110 only copies `tsconfig.json`, not `tsconfig.worker.json`. If `npm run worker` (which uses `--tsconfig tsconfig.worker.json`) is invoked against the runner image, it will hit the exact same crash-loop this PR is fixing in the worker stage.

- Dockerfile, line 75

  The `CMD` at this line invokes `tsx src/worker/index.ts` directly, without `--tsconfig tsconfig.worker.json`, so if the container runs with its default command, `tsconfig.worker.json` is never consulted and its settings (path aliases, module resolution, etc.) are silently ignored. The crash-loop is fixed when k8s overrides this with `npm run worker`, but if any deployment uses the image's default `CMD`, the worker will behave differently from the intended configuration. Consider aligning the `CMD` to mirror the `npm run worker` script (e.g. `["npx", "tsx", "--tsconfig", "tsconfig.worker.json", "src/worker/index.ts"]`).
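Taken together, the two review comments could be addressed along the lines of the following sketch. The stage names and base image are assumptions, and the install and build steps are left out; the `CMD` array is the one suggested in the comment above.

```dockerfile
# Hypothetical stage layout; base image and stage names are assumptions.
FROM node:20-slim AS worker
WORKDIR /app
COPY tsconfig.json tsconfig.worker.json ./
# Default command mirrors `npm run worker`, so the image behaves the same
# whether or not k8s overrides the command.
CMD ["npx", "tsx", "--tsconfig", "tsconfig.worker.json", "src/worker/index.ts"]

FROM node:20-slim AS runner
WORKDIR /app
# The runner image may also be started with `npm run worker`,
# so it needs the worker tsconfig as well.
COPY tsconfig.json tsconfig.worker.json ./
```

Using the exec (JSON array) form of `CMD` avoids wrapping the process in a shell, so signals from the container runtime reach the command directly.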
The worker stage selectively copied src/server + src/worker, but the worker transitively imports from src/lib (e.g. @/lib/egress-hosts via src/server/types.ts). At runtime this crash-looped with "Cannot find module '@/lib/egress-hosts'". Copy the whole src/ tree so all transitive imports resolve, alongside the tsconfig.worker.json fix from the previous commit.
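A sketch of the second change, assuming the same hypothetical stage name and base image as before (neither is taken from the real Dockerfile, and the exact original `COPY` lines are a guess):

```dockerfile
# Hypothetical worker stage; base image and stage name are assumptions.
FROM node:20-slim AS worker
WORKDIR /app
COPY tsconfig.json tsconfig.worker.json ./

# Before (selective): src/lib was absent at runtime, so
# '@/lib/egress-hosts' could not be resolved.
# COPY src/server ./src/server
# COPY src/worker ./src/worker

# After: copy the whole tree so every transitive import resolves.
COPY src ./src
```

Copying all of `src/` makes the image somewhat larger, but it removes the need to track which subdirectories the worker reaches through transitive imports.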
What

The `litellm-worker` (reconciler) pod was crash-looping in prod, taking down the reconciler that GCs orphaned session pods and maintains the warm pool. With it down, orphan pods accumulated → node memory hit 96–98% → sessions couldn't schedule → platform slow / 502s.

Two missing pieces in the worker Docker stage, both surfaced in sequence:

1. `Cannot resolve tsconfig at path: /app/tsconfig.worker.json`: `npm run worker` runs `tsx --tsconfig tsconfig.worker.json …` but the stage only copied `tsconfig.json`.
2. `Cannot find module '@/lib/egress-hosts'`: the stage copied `src/server` + `src/worker` but the worker transitively imports from `src/lib` (via `src/server/types.ts`).

Fix

- Copy `tsconfig.worker.json` into the worker stage.
- Copy the whole `src/` tree instead of selective subdirs, so all transitive imports resolve (robust against this recurring whack-a-mole).

Verified

Applied the equivalent fix live in prod via ConfigMap mounts (tsconfig.worker.json + src/lib): the worker started cleanly and the reconciler resumed. Web went from flapping 502s back to stable 200s. This PR makes that fix permanent (the live ConfigMap hotfix would be reverted by the next deploy).