Description
When openab runs as PID 1 in a container (the default for the published openab-codex image), zombie processes left by agent grandchildren are never reaped, because openab does not implement the SIGCHLD reaping expected of an init process.
Observed in a K3s pod running ghcr.io/openabdev/openab-codex:0.7.2 with codex-acp as the agent. After ~85 minutes of normal Discord usage, 44 zombie processes had accumulated — all with PPid = 1 (i.e. openab itself).
$ cat /proc/1/comm
openab
$ for p in /proc/[0-9]*; do
s=$(awk '/^State:/{print $2}' $p/status 2>/dev/null)
pp=$(awk '/^PPid:/{print $2}' $p/status 2>/dev/null)
[ "$s" = "Z" ] && echo "Z pid=$(basename $p) ppid=$pp"
done | head
Z pid=10070 ppid=1
Z pid=10838 ppid=1
Z pid=11431 ppid=1
Z pid=12169 ppid=1
Z pid=12750 ppid=1
Z pid=13621 ppid=1
Z pid=14232 ppid=1
Z pid=14936 ppid=1
Z pid=15160 ppid=1
Z pid=1575 ppid=1
# state distribution across all processes
3 (kernel/empty)
12 S
44 Z
The chain:
- openab (PID 1) spawns codex-acp
- codex-acp spawns shell tools (git, grep, next, ...) for tool calls
- When codex-acp exits or restarts a session, its remaining children get reparented to PID 1 (the kernel default)
- openab does not call wait() on these inherited children → they become zombies and stay forever
This is distinct from #269: that issue is about kiro-cli-chat orphans (still alive, ~300 MB RSS each, a real memory leak). This issue is about never-reaped zombies (almost no memory, but each one permanently occupies a PID table entry).
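The zombie mechanics themselves are easy to demonstrate outside the pod. A minimal sketch (Linux-only, since it reads /proc; the function name is made up for illustration):

```python
import os
import time

def zombie_state_before_reap():
    """Fork a child that exits immediately. Until the parent calls
    waitpid(), the kernel keeps the child's PID table entry around
    as a zombie -- state 'Z' in /proc/<pid>/stat."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)                  # child exits; nobody has reaped it yet
    time.sleep(0.2)                  # let the child finish exiting
    with open(f"/proc/{pid}/stat") as f:
        # the field after the parenthesised comm is the process state
        state = f.read().rsplit(")", 1)[1].split()[0]
    os.waitpid(pid, 0)               # reaping removes the zombie entry
    return state

print(zombie_state_before_reap())    # prints 'Z' on Linux
```

In the pod, nobody ever performs that final waitpid() on the reparented children, which is exactly why the entries persist.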
Steps to Reproduce
- Deploy openab with the codex-acp agent using the standard ghcr.io/openabdev/openab-codex:0.7.2 image (no init wrapper)
- Use the Discord bot for normal coding sessions involving shell tool calls (git status, pnpm build, next build, etc.)
- After ~30–60 minutes of activity, exec into the pod:
kubectl exec <pod> -- sh -c '
for p in /proc/[0-9]*; do
awk "/^State:/{print \$2}" $p/status
done | sort | uniq -c'
- Observe the count of Z (zombie) state processes growing roughly linearly with session activity, all with PPid = 1
Expected Behavior
Zombie processes should be reaped automatically. Either:
- Option A (recommended, image-level fix): ship the image with a proper init binary as PID 1. tini is the standard answer; one apt install plus ENTRYPOINT ["tini", "--", "openab"] in Dockerfile.codex. This also fixes signal forwarding for free.
- Option B (app-level fix): have openab install a SIGCHLD handler that reaps any waitable children whenever it detects it is running as PID 1. The Rust ecosystem has crates for this, or a small signal-hook loop calling waitpid(-1, WNOHANG).
Option A is much smaller and is the standard container best practice. It also helps every other agent image (openab-claude, openab-gemini, openab-copilot) since the same Dockerfile structure exists across them.
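For reference, the Option B loop is tiny. A sketch in Python rather than Rust (the shape is identical with signal-hook plus libc::waitpid); the function name is illustrative, and a real implementation would only install the handler after checking it is actually PID 1:

```python
import os
import signal

def install_sigchld_reaper():
    """Reap every waitable child on SIGCHLD -- the init-process duty
    described in Option B. A real implementation would first check
    os.getpid() == 1 before installing this, so it does not swallow
    exit statuses the app wants to collect itself."""
    def on_sigchld(signum, frame):
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:    # no children at all
                return
            if pid == 0:                 # children exist, none waitable yet
                return
    signal.signal(signal.SIGCHLD, on_sigchld)
```

The WNOHANG loop matters: one SIGCHLD delivery can stand in for several exited children, so the handler must keep reaping until waitpid reports nothing waitable.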
Environment
- chart: openab 0.7.2 (Helm)
- image: ghcr.io/openabdev/openab-codex:0.7.2
- agent: codex-acp (@zed-industries/codex-acp@0.9.5)
- runtime: K3s (containerd, not docker)
- pod securityContext: runAsNonRoot: true, runAsUser: 1000, capabilities.drop: [ALL]
- no shareProcessNamespace, no init wrapper
- args for codex passed via Helm:
  -c approval_policy="never"
  -c sandbox_mode="workspace-write"
  -c sandbox_workspace_write.network_access=true
Screenshots / Logs
Process listing snippet at the time of measurement (PPid column omitted for brevity, full command lines shown for live processes only):
PID 1 S openab /etc/openab/config.toml
PID 13 S node /usr/local/bin/codex-acp -c approval_policy="never" -c
sandbox_mode="workspace-write" -c
sandbox_workspace_write.network_access=true
PID 16715 S node /usr/local/bin/corepack pnpm build
PID 16747 S node /usr/local/bin/corepack pnpm -r build
PID 16783 R node .../packages/web/.../next build
... plus 44 Z (zombie) entries, all with PPid=1, empty cmdline ...
Total state distribution: 12 S, 1 R, 44 Z after ~85 minutes of uptime, i.e. roughly 30 new zombies per hour of normal use. With Linux's default pid_max=32768 that means PID table exhaustion after roughly 1,000 hours (~45 days) of uptime; long-lived pods will eventually fail to fork.
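The exhaustion estimate follows directly from the observed rate; a back-of-envelope check using only the numbers in this report:

```python
# Figures observed in this report plus the Linux default.
pid_max = 32768                      # default /proc/sys/kernel/pid_max
zombies_per_hour = 44 / (85 / 60)    # 44 zombies in ~85 minutes ≈ 31/hour

hours_to_exhaustion = pid_max / zombies_per_hour
print(round(hours_to_exhaustion))    # ~1055 hours, i.e. roughly 44 days
```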
Workaround
- Periodically restart the deployment with kubectl rollout restart deploy/openab
- Or build a downstream image that wraps the entrypoint in tini:
FROM ghcr.io/openabdev/openab-codex:0.7.2
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends tini && \
    apt-get clean
USER node
ENTRYPOINT ["tini", "--", "openab"]