Session pool leaks memory: orphaned kiro-cli processes and no eviction #309

@chaodu-agent

Description

When running openab with kiro-cli on a constrained host (e.g. 3.6 GB RAM on Zeabur), the session pool fills up with idle sessions that are never reclaimed in time. Once max_sessions is reached, new requests are rejected and the host eventually OOMs.

Each kiro-cli acp spawns a child kiro-cli-chat acp process (~230-390 MB each). When the pool drops a session, kill_on_drop only kills the direct child — the grandchild kiro-cli-chat process becomes orphaned and keeps consuming memory.

Observed on a live deployment — 10 stale kiro-cli-chat acp processes consuming 3 GB total:

PID     Started  RSS
459820  Apr12    290 MB
581161  Apr12    306 MB
625730  Apr12    300 MB
633382  Apr12    282 MB
673360  Apr12    388 MB
724688  00:43    273 MB
872305  08:48    274 MB
872764  08:50    236 MB
907784  10:39    227 MB
913618  11:00    230 MB

Four root causes identified:

  1. Orphaned grandchild processes — kill_on_drop(true) only SIGKILLs the direct child PID. The grandchild kiro-cli-chat survives and keeps consuming memory. Fix: use process groups (setsid/setpgid) and kill the entire group on cleanup.

  2. No cleanup on Discord thread archive — the EventHandler only implements message and ready. Archiving a thread leaves the session alive until its TTL expires. Fix: implement a thread_update handler.

  3. No LRU eviction — when pool is full, get_or_create() rejects with "pool exhausted" instead of evicting the oldest idle session. Fix: evict oldest last_active session when at capacity.

  4. Default TTL too long — session_ttl_hours defaults to 24. On a 3.6 GB host, 10 sessions × ~300 MB = 3 GB of idle processes. Fix: lower the default or document the memory implications.

Industry Comparison

A survey of agent harnesses from Picrew/awesome-agent-harness shows openab sits in the riskiest position — process-level isolation without proper process group management:

Harness         Isolation       Orphan Risk  Cleanup Strategy
Gemini CLI a2a  In-process Map  None ✅      task.dispose() + Map delete
openab          Process         HIGH ☠️      kill_on_drop (broken for grandchildren)
acpx            Process         Low ✅       3-stage shutdown (stdin.end() → SIGTERM → SIGKILL) + self-terminating TTL
Scion           Container       None ✅      docker rm -f kills everything
Daytona / E2B   VM/microVM      None ✅      Destroy sandbox API

Key insight from acpx: they use a 3-stage graceful shutdown (stdin.end() → SIGTERM 1.5s → SIGKILL 1s → detach all handles) and self-terminating queue-owner processes that exit when idle. This eliminates both the orphan problem and the need for a central cleanup task.

Key insight from Scion: container-per-agent makes orphans impossible by design (docker rm -f kills the entire process tree). This is the most robust long-term architecture but requires more infrastructure.

Steps to Reproduce

  1. Deploy openab with kiro-cli on a host with limited RAM (e.g. 3.6 GB)
  2. Send messages from Discord that create multiple threads (up to max_sessions)
  3. Archive/close the Discord threads
  4. Observe that kiro-cli and kiro-cli-chat processes remain running
  5. Run ps aux | grep kiro-cli — orphaned processes accumulate
  6. Eventually the host runs out of memory and the pod/container is killed

Expected Behavior

  • When a Discord thread is archived, the associated session and all its child processes should be terminated
  • When the pool is full, the oldest idle session should be evicted to make room
  • When a session is dropped, all descendant processes (including grandchildren) should be killed via process group signal
  • Default TTL should be reasonable for small hosts, or clearly documented
