Bot runtime stability — recurring outages from node/brew/TCC interactions, no detection signal #114

@fitz123

Problem

The bot has experienced multiple silent outages caused by interactions between its Node.js runtime, Homebrew package management, and macOS TCC. The longest outage so far lasted ~9.7 days (231.9h), from 2026-05-02 to 2026-05-11. During that period the operator received zero messages from the bot, including from cron jobs that normally send daily notifications. The outage was discovered only when an unrelated bash-based cron (brew-weekly-upgrade, which doesn't go through node) finally surfaced an error.

This issue collects the incident history and characterizes the failure modes. It does NOT propose a solution — that comes in a follow-up after the problem is well-defined.

Incident history

Times are local (Europe/Moscow). The bot delivers to Telegram via grammY and runs as a launchd service with KeepAlive (auto-respawn on crash).

1. 2026-05-02 04:07 → 2026-05-11 19:46 — ~9.7 days, ~232h

Trigger: Homebrew upgraded a transitive dependency llhttp from 9.3.x to 9.4.1. The pinned node 25.9.0 binary had a hard reference to libllhttp.9.3.dylib, which the upgrade removed.

Symptom: every node spawn died at startup with:

dyld[<pid>]: Library not loaded: /opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib
  Referenced from: /opt/homebrew/Cellar/node/25.9.0_2/bin/node

Effect: launchd KeepAlive kept respawning the service. Every spawn failed instantly, so the bot never reached a state where it could initialize and report. The stderr.log accumulated hundreds of identical dyld failures. No monitoring signal reached the operator; the outage surfaced only through an out-of-band bash-script cron failure.

Root context: node had been deliberately pinned earlier (~2026-04-28) to prevent macOS TCC permissions from being invalidated by node binary updates (TCC tracks binary path/signature). Pinning the formula does not prevent brew from upgrading the formula's dependencies, so the pin produced the worst-case outcome: stale code that doesn't run.
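The dyld failure above has a fixed, machine-readable shape, so a repeated crash of this kind could in principle be recognized from stderr.log. The following is a minimal sketch, not part of the bot: the function name `missing_dylibs` and the sample PIDs are made up for illustration, and the regex assumes the line format shown in the symptom above.

```python
import re
from collections import Counter

# Matches the line dyld prints when a linked library cannot be found.
# Assumption: the format is exactly the one quoted in the symptom above.
DYLD_MISSING = re.compile(
    r"^dyld\[\d+\]: Library not loaded: (?P<path>\S+)", re.MULTILINE
)


def missing_dylibs(stderr_text: str) -> Counter:
    """Count dyld load failures per missing library path."""
    return Counter(m.group("path") for m in DYLD_MISSING.finditer(stderr_text))


# Sample log text; the PIDs (4101, 4102) are invented for the example.
sample = (
    "dyld[4101]: Library not loaded: /opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib\n"
    "  Referenced from: /opt/homebrew/Cellar/node/25.9.0_2/bin/node\n"
    "dyld[4102]: Library not loaded: /opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib\n"
    "  Referenced from: /opt/homebrew/Cellar/node/25.9.0_2/bin/node\n"
)

for path, count in missing_dylibs(sample).items():
    print(f"{count} failed spawns: missing {path}")
```

A check like this would have to run outside node (for example from a bash or Python cron), since node itself cannot start in this failure mode.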

2. 2026-04-22 21:00 → 2026-04-24 02:13 — ~29h

Trigger: Not fully traced from logs (no diary entry detailing root cause).

Effect: No deliveries from the bot for 29 hours. Pattern is consistent with a runtime/launchd-level issue (cron-delivery.log shows a clean gap, not retried-and-failed entries).

3. 2026-04-18 — ~17 min

Trigger: Manual launchctl bootout followed immediately by launchctl bootstrap. The bootout in the gui domain is asynchronous; the immediate bootstrap raced launchd's teardown and left the service unregistered.

Effect: With the service unregistered, KeepAlive had nothing to keep alive. The bot stayed down until the operator noticed and ran launchctl bootstrap again.

Mitigation already shipped: bot/scripts/restart-bot.sh (PR #100 area) polls launchd teardown to completion before bootstrap; documented in the platform rule .claude/rules/platform/bot-operations.md. The race condition itself is fixed if the script is always used. Risk remains if any operator (human or agent) reverts to raw launchctl commands.
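The "poll teardown to completion before bootstrap" idea can be sketched as follows. This is an illustration of the technique only, not the contents of restart-bot.sh. The helper names, the timeout values, and the assumption that `launchctl print <service-target>` exits non-zero once the service is no longer registered are all mine.

```python
import subprocess
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 30.0,
               interval_s: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def service_is_gone(service_target: str) -> bool:
    """True once launchd no longer reports the service.

    Assumption: `launchctl print` exits non-zero for an unknown service.
    """
    result = subprocess.run(["launchctl", "print", service_target],
                            capture_output=True)
    return result.returncode != 0


def restart(domain: str, label: str, plist_path: str) -> bool:
    """Boot the service out, wait for teardown to finish, then bootstrap it."""
    target = f"{domain}/{label}"
    subprocess.run(["launchctl", "bootout", target], capture_output=True)
    if not wait_until(lambda: service_is_gone(target)):
        return False  # teardown never completed; do not race it
    done = subprocess.run(["launchctl", "bootstrap", domain, plist_path],
                          capture_output=True)
    return done.returncode == 0
```

A call would look like `restart("gui/501", "com.example.bot", "/path/to/bot.plist")`, where the UID, label, and plist path are placeholders rather than the bot's real values.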

4. 2026-04-13 → 2026-04-15 — ~38h

Trigger: macOS TCC permissions reset after laptop lid was closed and reopened. node and dependent tools (e.g., ical CLI) lost permissions to access protected services (Calendar, Reminders, etc.).

Effect: Tools hung indefinitely waiting on TCC dialogs that nobody dismissed. Bot operationally degraded.

Workaround at the time: Manual TCC re-approval. The "pin node" decision (incident 1) traces back to a wish to avoid this class of reset by keeping the binary identity stable across brew operations.

5. 2026-04-02 — crash-loop (HTTP 409 Conflict)

Trigger: Two bot instances polling getUpdates simultaneously (Telegram returns 409 Conflict to the second). Two instances arose from a race during restart.

Effect: Service crash-loop. KeepAlive respawned, the second instance immediately collided with the first, both exited, repeat.

Mitigation already shipped: Fixed in the bot code (PR around the same time).
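The shipped fix is not described in this issue. Purely as an illustration of how two pollers can be kept from coexisting, one common guard is an exclusive, non-blocking file lock taken at startup. The function name and lock path below are hypothetical, and the sketch is in Python rather than the bot's own runtime.

```python
import fcntl
from typing import IO, Optional


def acquire_instance_lock(lock_path: str) -> Optional[IO[str]]:
    """Return an open file holding an exclusive lock, or None if it is taken.

    The caller must keep the returned file open for the life of the process;
    closing it (or exiting) releases the lock.
    """
    handle = open(lock_path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

A second instance that gets `None` back could exit before it ever calls getUpdates, which avoids the 409 collision instead of recovering from it.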

6. Earlier — OAuth break from CLAUDE_CODE_SIMPLE=1

Trigger: An env-var toggle broke the OAuth flow for spawning Claude Code subprocesses (issue #87, PR #88). Bot started but couldn't actually spawn Claude — every user message resulted in a failed session.

Effect: Bot appeared up (no crash) but every interaction silently failed.

Pattern

Cross-cutting axes from the incidents above:

  1. System package management vs. node binary identity.

  2. No detection signal.

    • launchd KeepAlive respawns the process. From launchd's perspective, the service is "always running" (just briefly). From the operator's perspective, the service is dead.
    • There is no positive-signal heartbeat the bot emits on a schedule that an external monitor could check.
    • Prometheus bot_sessions_active is scraped from the bot itself, so if the bot is dead, the scrape fails. The monitoring side (node_exporter / Prometheus) can detect the scrape failure, but no alert is currently routed to a channel the operator reads.
    • The operator discovered incident 1 only because a bash-based cron (brew-weekly-upgrade) emitted a non-bot delivery failure that did land in Telegram.
  3. Restart fragility.

  4. Silent functional break (no crash).
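To make the "positive-signal heartbeat" idea from axis 2 concrete, here is a minimal sketch of the watcher side. It assumes the bot touches a file on a schedule. The file-based design, the function names, and the threshold are illustrative assumptions, not a proposal from this issue.

```python
import os
import time
from typing import Optional


def heartbeat_age_s(path: str, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the heartbeat file was last modified, or None if missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return (time.time() if now is None else now) - mtime


def is_stale(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """A missing heartbeat counts as stale: absence of signal is the alarm."""
    age = heartbeat_age_s(path, now)
    return age is None or age > max_age_s
```

Because the check reads only file metadata, it keeps working when node cannot start at all, and a respawn loop that never reaches initialization never refreshes the file.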

What needs to be defined (scope of follow-up work)

These are the questions the next issue/discussion should answer. This issue does not answer them.

  • Heartbeat / liveness signal. Should the bot emit a positive "I'm alive" signal on a schedule that an out-of-band watcher checks? Where does the watcher live (separate launchd service, external uptime service, a cron that pings and alerts)?
  • Node version strategy. Stop using brew's node formula and switch to a project-local node (.nvmrc + volta, mise, or fnm)? Accept TCC resets and add a re-approval workflow? Other?
  • Dependency-change firewall. If something in the bot's runtime environment changes (node version, dylib soname, OAuth flow), can the bot detect that before trying to serve traffic, and refuse to silently respawn-loop?
  • Alerting path. Where does an "outage detected" message reach the operator? It can't go through the bot itself (the failing component). Out-of-band Telegram delivery via deliver.sh (bash, no node) is one option already proven in incident 1.
  • Restart-script enforcement. Should raw launchctl bootout/bootstrap be impossible (e.g., blocked by a hook), or is documentation enough?
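As one way to picture the "dependency-change firewall" question, a preflight step could try to start the runtime with a trivial command before the service is launched. This is a sketch under my own assumptions (the function name, the timeout, and the choice of `--version` as the probe), not a chosen direction.

```python
import subprocess
import sys
from typing import List, Tuple


def runtime_starts(command: List[str], timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run a trivial command and report (started_cleanly, diagnostic_text).

    A binary killed at load time (as in the dyld case) yields a non-zero
    return code, and its message ends up in the captured stderr.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing executable and a probe that hangs.
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()


# Demo against the current Python interpreter, which is known to exist here.
ok, detail = runtime_starts([sys.executable, "--version"])
print("runtime ok" if ok else f"runtime broken: {detail}")
```

For the bot, the probe would be something like `runtime_starts(["/opt/homebrew/Cellar/node/25.9.0_2/bin/node", "--version"])`, using the binary path from incident 1. On failure, the caller could alert out-of-band instead of letting launchd respawn a process that cannot load.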

Acceptance for this issue

This issue is a problem-definition document. It is "done" when:

  • The incident history above is reviewed for accuracy by the operator
  • The "what needs to be defined" questions are validated as the right ones to answer
  • A follow-up issue is opened with the chosen direction (separate scope)
