Problem
The bot has experienced multiple silent outages caused by interactions between its Node.js runtime, Homebrew package management, and macOS TCC. The longest outage so far lasted ~9.7 days (231.9h), from 2026-05-02 to 2026-05-11. During that period the operator received zero messages from the bot, including from cron jobs that normally send daily notifications. The outage was discovered only when an unrelated bash-based cron (brew-weekly-upgrade, which doesn't go through node) finally surfaced an error.
This issue collects the incident history and characterizes the failure modes. It does NOT propose a solution — that comes in a follow-up after the problem is well-defined.
Incident history
Times are local (Europe/Moscow). The bot delivers to Telegram via grammY and runs as a launchd service with KeepAlive (auto-respawn on crash).
1. 2026-05-02 04:07 → 2026-05-11 19:46 — ~9.7 days, ~232h
Trigger: Homebrew upgraded a transitive dependency, llhttp, from 9.3.x to 9.4.1. The pinned node 25.9.0 binary had a hard reference to libllhttp.9.3.dylib, which the upgrade removed.
Symptom: every node spawn died at startup with:
dyld[<pid>]: Library not loaded: /opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib
Referenced from: /opt/homebrew/Cellar/node/25.9.0_2/bin/node
Effect: launchd KeepAlive kept respawning the service. Every spawn failed instantly, so the bot never reached a state where it could initialize and report. The stderr.log accumulated hundreds of identical dyld failures. No monitoring signal reached the operator; the only notification was an out-of-band bash-script cron failure.
Root context: node had been deliberately pinned earlier (~2026-04-28) to prevent macOS TCC permissions from being invalidated by node binary updates (TCC tracks binary path/signature). Pinning the formula does not prevent brew from upgrading the formula's dependencies, so the pin produced the worst-case outcome: stale code that doesn't run.
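To make the failure mode in incident 1 concrete, here is a minimal sketch of a check that flags Homebrew-managed dylibs a binary links against but that no longer exist on disk. It assumes the text comes from `otool -L <binary>`; the function name `missing_dylibs`, the `/opt/homebrew/` prefix filter, and the sample version numbers are illustrative assumptions, not part of the bot's code. It is written in Python only so that the check itself does not depend on node.

```python
import os
from typing import Callable, List

# Illustrative `otool -L` output. The paths match the incident; the version
# numbers in parentheses are made up for the example.
SAMPLE_OTOOL_OUTPUT = """\
/opt/homebrew/Cellar/node/25.9.0_2/bin/node:
\t/opt/homebrew/opt/llhttp/lib/libllhttp.9.3.dylib (compatibility version 9.3.0, current version 9.3.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1.0.0)
"""


def missing_dylibs(
    otool_output: str,
    prefix: str = "/opt/homebrew/",
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return linked dylib paths under `prefix` that are not present on disk."""
    missing = []
    # The first line names the binary itself; dependencies follow, one per line.
    for line in otool_output.splitlines()[1:]:
        path = line.strip().split(" (", 1)[0]
        # Only Homebrew-managed paths are checked: system libraries may live in
        # the dyld shared cache and not exist as regular files.
        if path.startswith(prefix) and not exists(path):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Simulate the state after the upgrade removed the old dylib.
    print(missing_dylibs(SAMPLE_OTOOL_OUTPUT, exists=lambda p: False))
```

A check of this shape only covers direct load commands listed by `otool -L`; it says nothing about TCC state or other runtime dependencies.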
2. 2026-04-22 21:00 → 2026-04-24 02:13 — ~29h
Trigger: Not fully traced from logs (no diary entry detailing root cause).
Effect: No deliveries from the bot for 29 hours. Pattern is consistent with a runtime/launchd-level issue (cron-delivery.log shows a clean gap, not retried-and-failed entries).
3. 2026-04-18 — ~17 min
Trigger: Manual launchctl bootout followed immediately by launchctl bootstrap. The bootout in the gui domain is asynchronous; the immediate bootstrap raced launchd's teardown and left the service unregistered.
Effect: Service unregistered, KeepAlive had nothing to keep alive. Bot stayed down until operator noticed and ran launchctl bootstrap again.
Mitigation already shipped: bot/scripts/restart-bot.sh (PR #100 area) polls launchd teardown to completion before bootstrap; documented in the platform rule .claude/rules/platform/bot-operations.md. The race condition itself is fixed if the script is always used. Risk remains if any operator (human or agent) reverts to raw launchctl commands.
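The "poll teardown to completion before bootstrap" idea can be sketched as follows. This is not the content of restart-bot.sh, which is not shown in this issue; it is a hedged illustration that assumes `launchctl print <domain>/<label>` exits non-zero once the service is no longer registered. The label `com.example.bot` and both function names are hypothetical.

```python
import os
import subprocess
import time
from typing import Callable, Optional


def launchd_service_registered(label: str, domain: Optional[str] = None) -> bool:
    """Assumption: `launchctl print` exits 0 only while the service is registered."""
    target = f"{domain or f'gui/{os.getuid()}'}/{label}"
    result = subprocess.run(
        ["launchctl", "print", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_unregistered(
    is_registered: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the service is gone. Return False if the timeout expires first."""
    deadline = clock() + timeout_s
    while is_registered():
        if clock() >= deadline:
            return False
        sleep(interval_s)
    return True


if __name__ == "__main__":
    # Hypothetical label; bootout and bootstrap themselves are omitted here.
    gone = wait_until_unregistered(lambda: launchd_service_registered("com.example.bot"))
    print("safe to bootstrap" if gone else "teardown still in progress")
```

The status check is injected as a callable so the polling logic can be exercised without touching launchd.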
4. 2026-04-13 → 2026-04-15 — ~38h
Trigger: macOS TCC permissions reset after laptop lid was closed and reopened. node and dependent tools (e.g., ical CLI) lost permissions to access protected services (Calendar, Reminders, etc.).
Effect: Tools hung indefinitely waiting on TCC dialogs that nobody dismissed. Bot operationally degraded.
Workaround at the time: Manual TCC re-approval. The "pin node" decision (incident 1) traces back to a wish to avoid this class of reset by keeping the binary identity stable across brew operations.
5. 2026-04-02 — crash-loop (HTTP 409 Conflict)
Trigger: Two bot instances polling getUpdates simultaneously (Telegram returns 409 Conflict to the second). Two instances arose from a race during restart.
Effect: Service crash-loop. KeepAlive respawned, the second instance immediately collided with the first, both exited, repeat.
Mitigation already shipped: Fixed in the bot code (PR around the same time).
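For readers unfamiliar with this class of bug, one common way to keep two pollers from running at once is an exclusive advisory lock taken at startup. The sketch below illustrates that general technique in Python; it is not the fix that shipped in the bot code, and the function name and lock file path are assumptions.

```python
import fcntl
import os
import tempfile
from typing import Optional


def try_acquire_instance_lock(lock_path: str) -> Optional[int]:
    """Try to take an exclusive, non-blocking flock on `lock_path`.

    Returns the open file descriptor on success (keep it open for the life of
    the process), or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: a second instance should exit here
        # instead of starting to poll getUpdates.
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "bot.lock")  # hypothetical path
    first = try_acquire_instance_lock(lock_path)
    second = try_acquire_instance_lock(lock_path)
    print("first holds lock:", first is not None)
    print("second refused:", second is None)
```

The lock is released when the descriptor is closed or the process exits, so a crashed instance does not leave a stale lock behind.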
6. Earlier — OAuth break from CLAUDE_CODE_SIMPLE=1
Trigger: An env-var toggle broke the OAuth flow for spawning Claude Code subprocesses (issue #87, PR #88). Bot started but couldn't actually spawn Claude — every user message resulted in a failed session.
Effect: Bot appeared up (no crash) but every interaction silently failed.
Pattern
Cross-cutting axes from the incidents above:
System package management vs. node binary identity.
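Both poles fail. Leaving node unpinned lets brew replace the binary, which can invalidate its TCC permissions; pinning node keeps the binary stable but does not stop brew from upgrading its dependencies, which broke it outright (incident 1). There is no current strategy that survives both.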
No detection signal.
launchd KeepAlive respawns the process. From launchd's perspective, the service is "always running" (just briefly). From the operator's perspective, the service is dead.
There is no positive-signal heartbeat the bot emits on a schedule that an external monitor could check.
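Prometheus bot_sessions_active is scrape-from-bot: if the bot is dead, the scrape itself fails. The exporter side (node_exporter / Prometheus) can detect the scrape failure, but no alert is currently routed to a channel the operator reads.
Incident 1 surfaced only because a bash cron (brew-weekly-upgrade) emitted a non-bot delivery failure that did land in Telegram.
Restart fragility.
launchctl bootout/bootstrap races (incident 3). The wrapper script fixed it, but the failure mode reappears whenever someone bypasses the script.
Use of the script is documented but not enforced (bot-operations.md).
Silent functional break (no crash).
Incident 6: the bot appeared up, but every interaction silently failed.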
What needs to be defined (scope of follow-up work)
These are the questions the next issue/discussion should answer. This issue does not answer them.
Heartbeat / liveness signal. Should the bot emit a positive "I'm alive" signal on a schedule that an out-of-band watcher checks? Where does the watcher live (separate launchd service, external uptime service, a cron that pings and alerts)?
Node version strategy. Stop using brew's node formula and switch to a project-local node (.nvmrc + volta, mise, or fnm)? Accept TCC resets and add a re-approval workflow? Other?
Dependency-change firewall. If something in the bot's runtime environment changes (node version, dylib soname, OAuth flow), can the bot detect that before trying to serve traffic, and refuse to silently respawn-loop?
Alerting path. Where does an "outage detected" message reach the operator? It can't go through the bot itself (the failing component). Out-of-band Telegram delivery via deliver.sh (bash, no node) is one option already proven in incident 1.
Restart-script enforcement. Should raw launchctl bootout/bootstrap be impossible (e.g., blocked by a hook), or is documentation enough?
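To make the heartbeat question above concrete without choosing a design, here is a minimal sketch of the check an out-of-band watcher might run. It assumes the bot touches a file on a schedule; the file path, the threshold, and both function names are hypothetical. The property it illustrates is that a crash-looping process never refreshes the file, so staleness is visible even while launchd keeps respawning the service.

```python
import os
import tempfile
import time
from typing import Optional


def heartbeat_age_s(path: str, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the heartbeat file was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return (time.time() if now is None else now) - mtime


def is_stale(age_s: Optional[float], max_age_s: float) -> bool:
    """A missing heartbeat counts as stale: absence of signal is the alert condition."""
    return age_s is None or age_s > max_age_s


if __name__ == "__main__":
    # Hypothetical heartbeat file, created here only for the demonstration.
    path = os.path.join(tempfile.mkdtemp(), "bot.heartbeat")
    open(path, "w").close()
    os.utime(path, (1000, 1000))  # pretend the last beat was at t=1000
    age = heartbeat_age_s(path, now=1090)
    print("age:", age, "stale:", is_stale(age, max_age_s=60))
```

Where the watcher runs and how it alerts are exactly the open questions listed above; this sketch covers only the staleness test.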
Acceptance for this issue
This issue is a problem-definition document. It is "done" when:
The incident history above is reviewed for accuracy by the operator
The "what needs to be defined" questions are validated as the right ones to answer
A follow-up issue is opened with the chosen direction (separate scope)