Bot runtime stability: recurring outages from node/brew/TCC interactions, no detection signal #114

## Problem

The bot has experienced multiple silent outages caused by interactions between its Node.js runtime, Homebrew package management, and macOS TCC (Transparency, Consent, and Control, the privacy-permission system). The longest outage so far lasted **~9.7 days (231.9h)**, from 2026-05-02 to 2026-05-11. During that period the operator received zero messages from the bot, including from cron jobs that normally send daily notifications. The outage was discovered only when an unrelated bash-based cron (`brew-weekly-upgrade`, which doesn't go through node) finally surfaced an error.

This issue collects the incident history and characterizes the failure modes. **It does NOT propose a solution** — that comes in a follow-up after the problem is well-defined.

## Incident history

Times are local (Europe/Moscow). The bot delivers to Telegram via grammY and runs as a launchd service with `KeepAlive` (auto-respawn on crash).

### 1. 2026-05-02 04:07 → 2026-05-11 19:46 — ~9.7 days, ~232h

**Trigger:** Homebrew upgraded a transitive dependency `llhttp` from `9.3.x` to `9.4.1`. The pinned `node 25.9.0` binary had a hard reference to `libllhttp.9.3.dylib`, which the upgrade removed.

**Symptom:** every node spawn died at startup with:
```
dyld[<pid>]: Library not loaded: /opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib
  Referenced from: /opt/homebrew/Cellar/node/25.9.0_2/bin/node
```

**Effect:** launchd `KeepAlive` kept respawning the service, and every spawn failed instantly. The bot never reached a state where it could initialize and report. The `stderr.log` accumulated hundreds of identical `dyld` failures. No monitoring signal reached the operator; the only notice was an out-of-band failure from a bash-script cron.
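
To make the symptom concrete, here is a minimal sketch of how the repeated `dyld` failures in `stderr.log` could be counted. It assumes the log lines look like the excerpt above with a numeric pid; the function name `dyld_failures` is hypothetical and nothing like it exists in the bot today.

```python
import re
from collections import Counter

# Matches the first line of a dyld load failure, e.g.
# "dyld[12345]: Library not loaded: /opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib"
DYLD_RE = re.compile(r"^dyld\[\d+\]: Library not loaded: (?P<lib>\S+)")


def dyld_failures(stderr_text: str) -> Counter:
    """Count 'Library not loaded' failures per missing library path."""
    counts: Counter = Counter()
    for line in stderr_text.splitlines():
        match = DYLD_RE.match(line)
        if match:
            counts[match.group("lib")] += 1
    return counts
```

A count in the hundreds for a single library path, as in this incident, is the signature of a respawn loop rather than an isolated crash.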

**Root context:** node had been deliberately pinned earlier (~2026-04-28) to prevent macOS TCC permissions from being invalidated by node binary updates (TCC tracks binary path/signature). Pinning the formula does not prevent brew from upgrading the formula's dependencies, so the pin produced the worst-case outcome: stale code that doesn't run.
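
The gap between "formula pinned" and "binary still loadable" can be illustrated by checking whether the libraries a binary links against still exist on disk. The sketch below parses the output of `otool -L` and is an illustration only, not a proposed fix. It assumes the usual `otool -L` layout (binary name on the first line, one indented dependency per following line); `missing_dylibs` is a hypothetical name.

```python
import os
from typing import Callable, List

# Libraries under these prefixes are normally served from the dyld shared
# cache on modern macOS and may be absent from disk, so they are skipped.
SYSTEM_PREFIXES = ("/usr/lib/", "/System/")


def missing_dylibs(otool_output: str,
                   exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Return linked library paths from `otool -L` output that are not on disk."""
    missing = []
    # The first line names the inspected binary; dependencies follow.
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Entries look like "<path> (compatibility version X, current version Y)".
        path = line.split(" (compatibility", 1)[0]
        if path.startswith("@") or path.startswith(SYSTEM_PREFIXES):
            continue  # @rpath/@loader_path entries would need separate resolution
        if not exists(path):
            missing.append(path)
    return missing
```

The input would come from something like `subprocess.run(["otool", "-L", node_path], capture_output=True, text=True).stdout`. For incident 1 the expected result would have been the single `libllhttp.9.3.dylib` path.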

### 2. 2026-04-22 21:00 → 2026-04-24 02:13 — ~29h

**Trigger:** Not fully traced from logs (no diary entry detailing root cause).

**Effect:** No deliveries from the bot for 29 hours. The pattern is consistent with a runtime/launchd-level issue (`cron-delivery.log` shows a clean gap, not retried-and-failed entries).

### 3. 2026-04-18 — ~17 min

**Trigger:** Manual `launchctl bootout` followed immediately by `launchctl bootstrap`. The bootout in the `gui` domain is asynchronous; the immediate bootstrap raced launchd's teardown and left the service unregistered.

**Effect:** Service unregistered, `KeepAlive` had nothing to keep alive. Bot stayed down until operator noticed and ran `launchctl bootstrap` again.

**Mitigation already shipped:** `bot/scripts/restart-bot.sh` (PR #100 area) polls launchd teardown to completion before bootstrap; documented in the platform rule `.claude/rules/platform/bot-operations.md`. The race condition itself is fixed if the script is always used. **Risk remains** if any operator (human or agent) reverts to raw `launchctl` commands.
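
The poll-until-torn-down idea can be sketched as follows. This is not the contents of `restart-bot.sh`, only an illustration of the same ordering in Python. The function names, the timeout values, and the injectable `run` and `sleep` parameters are assumptions made so the logic can be exercised without touching launchd.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_launchctl(args: Sequence[str]) -> int:
    """Run `launchctl <args>` and return its exit code."""
    return subprocess.run(["launchctl", *args], capture_output=True).returncode


def restart_service(domain: str, label: str, plist_path: str,
                    run: Callable[[Sequence[str]], int] = run_launchctl,
                    timeout_s: float = 30.0, poll_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Boot the service out, wait until launchd has forgotten it, then bootstrap."""
    target = f"{domain}/{label}"
    run(["bootout", target])
    for _ in range(int(timeout_s / poll_s)):
        # `launchctl print` exits non-zero once the service is no longer loaded.
        if run(["print", target]) != 0:
            return run(["bootstrap", domain, plist_path]) == 0
        sleep(poll_s)
    return False  # teardown never completed; bootstrapping now would race it
```

A call might look like `restart_service("gui/501", "com.example.bot", "/path/to/bot.plist")`, where the uid, label, and plist path are placeholders.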

### 4. 2026-04-13 → 2026-04-15 — ~38h

**Trigger:** macOS TCC permissions reset after laptop lid was closed and reopened. `node` and dependent tools (e.g., `ical` CLI) lost permissions to access protected services (Calendar, Reminders, etc.).

**Effect:** Tools hung indefinitely waiting on TCC dialogs that nobody dismissed. Bot operationally degraded.

**Workaround at the time:** Manual TCC re-approval. The "pin node" decision (incident 1) traces back to a wish to avoid this class of reset by keeping the binary identity stable across brew operations.

### 5. 2026-04-02 — crash-loop (HTTP 409 Conflict)

**Trigger:** Two bot instances polling `getUpdates` simultaneously (Telegram returns 409 Conflict to the second). Two instances arose from a race during restart.

**Effect:** The service crash-looped: `KeepAlive` respawned it, the new instance immediately collided with the other one, both exited, and the cycle repeated.

**Mitigation already shipped:** Fixed in the bot code (PR around the same time).
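
For readers unfamiliar with this failure class, one common guard against two pollers running at once is an exclusive, non-blocking file lock taken at startup. The sketch below is not the fix that shipped (its details are not recorded here); the function name and the lock path are hypothetical.

```python
import fcntl
import os
from typing import Optional


def acquire_single_instance_lock(lock_path: str) -> Optional[int]:
    """Return an open fd holding an exclusive lock, or None if another holder exists."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd  # keep this fd open for the lifetime of the process
```

The kernel releases the lock when the holding process exits, so a crashed instance does not leave a stale lock behind.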

### 6. Earlier — OAuth break from `CLAUDE_CODE_SIMPLE=1`

**Trigger:** An env-var toggle broke the OAuth flow for spawning Claude Code subprocesses (issue #87, PR #88). Bot started but couldn't actually spawn Claude — every user message resulted in a failed session.

**Effect:** Bot appeared up (no crash) but every interaction silently failed.

## Pattern

Cross-cutting axes from the incidents above:

1. **System package management vs. node binary identity.**
   - Node updates → TCC permission reset → silent feature loss (#4).
   - Pinning node → brew upgrades transitive deps → ABI break → node won't run at all (#1).
   - Both poles fail. There is no current strategy that survives both.

2. **No detection signal.**
   - launchd `KeepAlive` respawns the process on every exit, so from launchd's perspective the service is "always running", even though each run lasts only an instant. From the operator's perspective, the service is dead.
   - There is no positive-signal heartbeat the bot emits on a schedule that an external monitor could check.
   - Prometheus `bot_sessions_active` is scraped from the bot, so if the bot is dead the scrape itself fails. The monitoring side (node_exporter / Prometheus) can detect the failed scrape, but no alert is currently routed to a channel the operator reads.
   - The operator discovered #1 only because a bash-based cron (`brew-weekly-upgrade`) emitted a non-bot delivery failure that did land in Telegram.

3. **Restart fragility.**
   - Raw `launchctl bootout`/`bootstrap` races (#3). The wrapper script fixed it, but the failure mode reappears whenever someone bypasses the script.
   - Manual operations during a turn that itself spawned the bot can produce parent-kill scenarios (referenced in `bot-operations.md`).

4. **Bot-internal breakage (with or without a crash).**
   - #5 (409 crash-loop) and #6 (OAuth break, no crash at all) are bot-internal regressions. They are a different class from #1-4 (environment), but they contribute to the same observable outcome: messages don't get through.
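
The "positive-signal heartbeat" mentioned under axis 2 can be pictured as a file the bot touches on a schedule and an external process that checks its age. The sketch below shows only the checking side and is an illustration of the concept, not a chosen design. The file path, the threshold, and both function names are assumptions.

```python
import os
import time
from typing import Optional


def heartbeat_age_s(path: str, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the heartbeat file was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return (time.time() if now is None else now) - mtime


def is_stale(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """A missing file counts as stale: absence of a signal is itself the alarm."""
    age = heartbeat_age_s(path, now)
    return age is None or age > max_age_s
```

The key property is that the check fails closed: a bot that never starts, as in incident 1, produces no heartbeat and is therefore reported as stale.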

## What needs to be defined (scope of follow-up work)

These are the questions the next issue/discussion should answer. This issue does not answer them.

- **Heartbeat / liveness signal.** Should the bot emit a positive "I'm alive" signal on a schedule that an out-of-band watcher checks? Where does the watcher live (separate launchd service, external uptime service, a cron that pings and alerts)?
- **Node version strategy.** Stop using brew's `node` formula and switch to a project-local node managed by `volta`, `mise`, or `fnm`, with the version pinned in the repo (e.g. `.nvmrc` where the tool reads it)? Accept TCC resets and add a re-approval workflow? Other?
- **Dependency-change firewall.** If something in the bot's runtime environment changes (node version, dylib soname, OAuth flow), can the bot detect that *before* trying to serve traffic, and refuse to silently respawn-loop?
- **Alerting path.** Where does an "outage detected" message reach the operator? It can't go through the bot itself (the failing component). Out-of-band Telegram delivery via `deliver.sh` (bash, no node) is one option already proven in incident #1.
- **Restart-script enforcement.** Should raw `launchctl bootout/bootstrap` be impossible (e.g., blocked by a hook), or is documentation enough?
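
To show what "out-of-band" means for the alerting path question, here is a minimal sketch of building a Telegram Bot API `sendMessage` request with the Python standard library only, so it does not depend on node. It is not the contents of `deliver.sh`; the function name is hypothetical, and the token and chat id would have to come from the operator's existing configuration.

```python
import urllib.parse
import urllib.request


def build_alert_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    # Form-encoded body; Telegram also accepts JSON, this keeps the sketch short.
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")
```

Sending would be a single `urllib.request.urlopen(request, timeout=10)` call. Whether alerts should travel over Telegram at all is one of the open questions above.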

## Acceptance for this issue

This issue is a problem-definition document. It is "done" when:

- [ ] The incident history above is reviewed for accuracy by the operator
- [ ] The "what needs to be defined" questions are validated as the right ones to answer
- [ ] A follow-up issue is opened with the chosen direction (separate scope)
