
bug: openab as PID 1 leaves zombie processes from agent subprocesses #290

@thekkagent

Description

When openab runs as PID 1 in a container (the default for the published openab-codex image), zombie processes from agent grandchildren are never reaped, because openab does not implement the SIGCHLD-driven child reaping expected of an init process.

Observed in a K3s pod running ghcr.io/openabdev/openab-codex:0.7.2 with codex-acp as the agent. After ~85 minutes of normal Discord usage, 44 zombie processes had accumulated — all with PPid = 1 (i.e. openab itself).

$ cat /proc/1/comm
openab

$ for p in /proc/[0-9]*; do
    s=$(awk '/^State:/{print $2}' $p/status 2>/dev/null)
    pp=$(awk '/^PPid:/{print $2}' $p/status 2>/dev/null)
    [ "$s" = "Z" ] && echo "Z pid=$(basename $p) ppid=$pp"
  done | head
Z pid=10070 ppid=1
Z pid=10838 ppid=1
Z pid=11431 ppid=1
Z pid=12169 ppid=1
Z pid=12750 ppid=1
Z pid=13621 ppid=1
Z pid=14232 ppid=1
Z pid=14936 ppid=1
Z pid=15160 ppid=1
Z pid=1575  ppid=1

# state distribution across all processes
      3 (kernel/empty)
     12 S
     44 Z

The chain:

  1. openab (PID 1) spawns codex-acp
  2. codex-acp spawns shell tools (git, grep, next, ...) for tool calls
  3. When codex-acp exits or restarts a session, its remaining children are
     reparented to PID 1 (the kernel default)
  4. openab does not call wait() on arbitrary children → they become
     zombies and stay forever

This is distinct from #269 — that one is about kiro-cli-chat orphans (still alive, ~300 MB RSS each, a real memory leak). This issue is about exited-but-unreaped zombies (almost no memory, but each one holds a PID table entry).

Steps to Reproduce

  1. Deploy openab with the codex-acp agent using the standard
    ghcr.io/openabdev/openab-codex:0.7.2 image (no init wrapper)
  2. Use the Discord bot for normal coding sessions involving shell tool
    calls (git status, pnpm build, next build, etc.)
  3. After ~30–60 minutes of activity, exec into the pod:
    kubectl exec <pod> -- sh -c '
      for p in /proc/[0-9]*; do
        awk "/^State:/{print \$2}" $p/status
      done | sort | uniq -c'
  4. Observe the count of Z (zombie) state processes growing roughly
    linearly with session activity, all with PPid = 1

Expected Behavior

Zombie processes should be reaped automatically. Either:

  • Option A (recommended, image-level fix): ship the image with a proper
    init binary as PID 1. tini is the standard answer: one apt install plus
    ENTRYPOINT ["tini", "--", "openab"] in Dockerfile.codex. This also
    fixes signal forwarding for free.
  • Option B (app-level fix): have openab install a SIGCHLD handler that
    reaps any waitable children whenever it detects it is running as PID 1.
    The Rust ecosystem has crates for this, or a small signal-hook loop
    calling waitpid(-1, WNOHANG).

Option A is much smaller and is the standard container best practice. It also helps every other agent image (openab-claude, openab-gemini, openab-copilot) since the same Dockerfile structure exists across them.

Environment

  • chart: openab 0.7.2 (Helm)
  • image: ghcr.io/openabdev/openab-codex:0.7.2
  • agent: codex-acp (@zed-industries/codex-acp@0.9.5)
  • runtime: K3s (containerd, not docker)
  • pod securityContext: runAsNonRoot: true, runAsUser: 1000,
    capabilities.drop: [ALL]
  • no shareProcessNamespace, no init wrapper
  • args for codex passed via Helm:
    -c approval_policy="never"
    -c sandbox_mode="workspace-write"
    -c sandbox_workspace_write.network_access=true
    
    

Screenshots / Logs

Process listing snippet at the time of measurement (PPid column omitted for brevity, full command lines shown for live processes only):

PID 1     S  openab /etc/openab/config.toml
PID 13    S  node /usr/local/bin/codex-acp -c approval_policy="never" -c sandbox_mode="workspace-write" -c sandbox_workspace_write.network_access=true
PID 16715 S  node /usr/local/bin/corepack pnpm build
PID 16747 S  node /usr/local/bin/corepack pnpm -r build
PID 16783 R  node .../packages/web/.../next build
... plus 44 Z (zombie) entries, all with PPid=1, empty cmdline ...

Total state distribution: 12 S, 1 R, 44 Z after ~85 minutes uptime, ~30 zombies/hour during normal use. With Linux's default pid_max=32768 this means PID exhaustion at roughly 1000 hours uptime. Long-lived pods will eventually fail to fork.

Workaround

  • Periodically restart the deployment with kubectl rollout restart deploy/openab
  • Or build a downstream image that wraps the entrypoint in tini:
FROM ghcr.io/openabdev/openab-codex:0.7.2
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends tini && \
    apt-get clean
USER node
ENTRYPOINT ["tini", "--", "openab"]

Metadata

    Labels

    bug (Something isn't working) · codex · p2 (Medium, planned work)
