Session pool leaks memory: orphaned kiro-cli processes and no eviction #309

@chaodu-agent

Description

When running openab with kiro-cli on a constrained host (e.g. 3.6 GB RAM on Zeabur), the session pool fills up with idle sessions that are never reclaimed in time. Once max_sessions is reached, new requests are rejected and the host eventually OOMs.

Each kiro-cli acp spawns a child kiro-cli-chat acp process (~230-390 MB each). When the pool drops a session, kill_on_drop only kills the direct child — the grandchild kiro-cli-chat process becomes orphaned and keeps consuming memory.

Observed on a live deployment — 10 stale kiro-cli-chat acp processes consuming 3 GB total:

PID     Started  RSS
459820  Apr12    290 MB
581161  Apr12    306 MB
625730  Apr12    300 MB
633382  Apr12    282 MB
673360  Apr12    388 MB
724688  00:43    273 MB
872305  08:48    274 MB
872764  08:50    236 MB
907784  10:39    227 MB
913618  11:00    230 MB

Four root causes identified:

  1. Orphaned grandchild processes — kill_on_drop(true) only SIGKILLs the direct child PID. The grandchild kiro-cli-chat survives and keeps consuming memory. Fix: use process groups (setsid/setpgid) and kill the entire group on cleanup.

  2. No cleanup on Discord thread archive — the EventHandler only implements message and ready. Archiving a thread leaves the session alive until its TTL expires. Fix: implement a thread_update handler.

  3. No LRU eviction — when pool is full, get_or_create() rejects with "pool exhausted" instead of evicting the oldest idle session. Fix: evict oldest last_active session when at capacity.

  4. Default TTL too long — session_ttl_hours defaults to 24. On a 3.6 GB host, 10 sessions × ~300 MB = 3 GB of idle processes. Fix: lower the default or document the memory implications.

Industry Comparison

A survey of agent harnesses from Picrew/awesome-agent-harness shows openab sits in the riskiest position — process-level isolation without proper process group management:

Harness         Isolation       Orphan Risk  Cleanup Strategy
Gemini CLI a2a  In-process Map  None ✅      task.dispose() + Map delete
openab          Process         HIGH ☠️      kill_on_drop (broken for grandchildren)
acpx            Process         Low ✅       3-stage shutdown (stdin.end() → SIGTERM → SIGKILL) + self-terminating TTL
Scion           Container       None ✅      docker rm -f kills everything
Daytona / E2B   VM/microVM      None ✅      Destroy sandbox API

Key insight from acpx: they use a 3-stage graceful shutdown (stdin.end() → SIGTERM 1.5s → SIGKILL 1s → detach all handles) and self-terminating queue-owner processes that exit when idle. This eliminates both the orphan problem and the need for a central cleanup task.

Key insight from Scion: container-per-agent makes orphans impossible by design (docker rm -f kills the entire process tree). This is the most robust long-term architecture but requires more infrastructure.

Steps to Reproduce

  1. Deploy openab with kiro-cli on a host with limited RAM (e.g. 3.6 GB)
  2. Send messages from Discord that create multiple threads (up to max_sessions)
  3. Archive/close the Discord threads
  4. Observe that kiro-cli and kiro-cli-chat processes remain running
  5. Run ps aux | grep kiro-cli — orphaned processes accumulate
  6. Eventually the host runs out of memory and the pod/container is killed

Expected Behavior

  • When a Discord thread is archived, the associated session and all its child processes should be terminated
  • When the pool is full, the oldest idle session should be evicted to make room
  • When a session is dropped, all descendant processes (including grandchildren) should be killed via process group signal
  • Default TTL should be reasonable for small hosts, or clearly documented
