
tunnel: tunnel-history MCP tool with per-client event ring#57

Open
joelgwebber wants to merge 2 commits into main from joel/tunnel-history

Conversation

@joelgwebber
Contributor

Summary

Adds a per-TunnelClient lifecycle event ring (capacity 64) and a new tunnel-history MCP tool that surfaces it. Captures the full WS lifecycle:

  • connect-start, disconnect-requested, reconnect-scheduled
  • ws-open, hello-sent, ws-close, ws-error, unexpected-response
  • ready, handshake-error
  • ping-sent, stale-fired
  • transport-error, need-live-tunnel

Each event has a wall-clock timestamp (ms since epoch — aligns with lidar logs) and an optional small structured detail.
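A minimal sketch of what such a ring might look like — `TunnelEventKind`, `EventRing`, and the `push`/`snapshot` API here are illustrative names, not the PR's actual code:

```typescript
// Illustrative shape of a per-client lifecycle event ring (capacity 64).
type TunnelEventKind =
  | "connect-start" | "disconnect-requested" | "reconnect-scheduled"
  | "ws-open" | "hello-sent" | "ws-close" | "ws-error" | "unexpected-response"
  | "ready" | "handshake-error"
  | "ping-sent" | "stale-fired"
  | "transport-error" | "need-live-tunnel";

interface TunnelEvent {
  kind: TunnelEventKind;
  at: number;                        // wall-clock ms since epoch, aligns with lidar logs
  detail?: Record<string, unknown>;  // optional small structured payload
}

class EventRing {
  private buf: TunnelEvent[] = [];
  constructor(private capacity = 64) {}

  push(kind: TunnelEventKind, detail?: Record<string, unknown>): void {
    this.buf.push({ kind, at: Date.now(), detail });
    if (this.buf.length > this.capacity) this.buf.shift(); // drop oldest
  }

  snapshot(): TunnelEvent[] {
    return [...this.buf]; // copy so callers can't mutate the ring
  }
}

const ring = new EventRing(64);
ring.push("connect-start");
ring.push("ws-open", { url: "wss://example.invalid/relay" });
```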

Also keeps the last 4 dead-tunnel histories in main.ts so callers can still inspect a tunnel after need_live_tunnel fires and its entry has been removed from the live clients map (the most common case worth debugging).
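The retention policy can be sketched with an insertion-ordered Map, where the oldest dead tunnel is evicted first (`DEAD_CAP`, `retireTunnel`, and the event shape are hypothetical names, not the PR's actual implementation):

```typescript
// Illustrative: keep histories for the last 4 tunnels to die.
interface Ev { kind: string; at: number }

const DEAD_CAP = 4;
const deadHistories = new Map<string, Ev[]>(); // Maps preserve insertion order

function retireTunnel(id: string, history: Ev[]): void {
  deadHistories.delete(id);      // re-insert so this id counts as newest
  deadHistories.set(id, history);
  while (deadHistories.size > DEAD_CAP) {
    const oldest = deadHistories.keys().next().value as string;
    deadHistories.delete(oldest); // evict the least-recently-dead tunnel
  }
}

// Retire five tunnels; only the newest four histories survive.
for (let i = 0; i < 5; i++) {
  retireTunnel(`t${i}`, [{ kind: "need-live-tunnel", at: i }]);
}
```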

Motivation

When tunnels die during an agent-driven session (Remy especially), neither the agent driving the browser nor the human watching has any way to inspect what happened — stderr from the npx subprocess goes to the parent agent process, not to the MCP tool surface. tunnel-history closes that gap.

The presence/absence of ping-sent events during a long quiet window is a particularly high-signal probe:

  • No ping-sent between connect and stale-fired → keepalive timer wasn't firing → likely event-loop starvation in the npx child (e.g. stdio backpressure when the parent isn't draining).
  • ping-sent × N then stale-fired → pings going out but no inbound → silent half-close upstream (linkerd, intermediate LB).
  • ws-close arrives before stale-fired → server tore the WS down (look at lidar logs for matching tunnel: %s disconnected).
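As a rough illustration, the decision table above could be mechanized like this — `diagnose` is a hypothetical helper for this writeup; the PR itself leaves interpretation to the caller:

```typescript
// Classify a chronological slice of the event ring into the failure
// modes described above. Purely illustrative, not part of the PR.
interface Ev { kind: string; at: number }

function diagnose(events: Ev[]): string {
  const idx = (k: string) => events.findIndex((e) => e.kind === k);
  const stale = idx("stale-fired");
  const close = idx("ws-close");

  // ws-close before stale-fired: the server tore the WS down.
  if (close !== -1 && (stale === -1 || close < stale)) {
    return "server closed the WS — check lidar logs for 'tunnel: %s disconnected'";
  }
  if (stale !== -1) {
    const staleAt = events[stale].at;
    const pings = events.filter((e) => e.kind === "ping-sent" && e.at < staleAt);
    return pings.length === 0
      ? "no pings before stale-fired — keepalive timer likely starved"
      : "pings sent but nothing inbound — suspect silent half-close upstream";
  }
  return "no terminal event — tunnel looks healthy so far";
}

const starved = diagnose([
  { kind: "connect-start", at: 1 },
  { kind: "stale-fired", at: 60_001 },
]);
const halfClose = diagnose([
  { kind: "connect-start", at: 1 },
  { kind: "ping-sent", at: 30_001 },
  { kind: "stale-fired", at: 60_001 },
]);
const serverClose = diagnose([{ kind: "ws-close", at: 5 }]);
```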

Wiring details

opts.onPingSent threads through TransportOptions → YamuxSession.#sendPing. The hook fires only on a successful ws.send() enqueue, so a recorded ping-sent event reliably means the ping actually went out on the socket, not merely that the timer fired — preserving the diagnostic signal.
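That ordering guarantee can be sketched as follows — `WsLike` and this standalone `sendPing` are stand-ins for illustration, not the real YamuxSession internals:

```typescript
// Illustrative: the hook runs only after ws.send() returns without
// throwing, so a recorded ping-sent implies the frame was enqueued.
interface WsLike {
  send(data: Uint8Array): void;
}

function sendPing(ws: WsLike, onPingSent?: () => void): boolean {
  try {
    ws.send(new Uint8Array(1)); // stand-in for a yamux ping frame
  } catch {
    return false;               // send threw: no ping-sent recorded
  }
  onPingSent?.();               // fires only on successful enqueue
  return true;
}

let recorded = 0;
const okWs: WsLike = { send: () => {} };
const badWs: WsLike = { send: () => { throw new Error("socket gone"); } };

const okResult = sendPing(okWs, () => recorded++);
const badResult = sendPing(badWs, () => recorded++);
```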

The tunnel-history tool description tells callers how to read the ring as a chronological story so agents can self-diagnose. Example call shapes:

// All tunnels (live + recently-dead)
tunnel-history({})

// One specific tunnel by id (active or recently-dead)
tunnel-history({ tunnelId: "abc-123" })

Test plan

  • Unit test (history ring records connect, ws-open, hello-sent, ready, ws-close) covers basic ordering and detail content.
  • Full suite passes (128/128).
  • Manual: reproduce a Remy stale-fire scenario after publishing 0.1.15 and verify the events surface the failure mode.

Bumps to 0.1.15. Companion to #54 (Fix A.client) — together these address SUBTEXT-338 and give us the diagnostic surface for the next class of tunnel issues.

🤖 Generated with Claude Code

joelgwebber and others added 2 commits May 5, 2026 22:06
Adds a per-TunnelClient lifecycle event ring (capacity 64) and a new
tunnel-history MCP tool that surfaces it. Captures connect-start,
ws-open, hello-sent, ready, ping-sent, ws-close, ws-error, stale-fired,
reconnect-scheduled, unexpected-response, handshake-error, transport-error,
and need-live-tunnel — each with a wall-clock timestamp and small
structured detail.

Also keeps the last 4 dead-tunnel histories in main.ts so callers can
still inspect a tunnel after need_live_tunnel fires and the live entry is
removed from the map (the most useful case to debug).

Motivation: when tunnels die during a Remy run, neither the agent driving
the browser nor the human watching has any way to inspect what happened
— stderr from the npx subprocess goes to the parent agent, not to the
MCP tool surface. tunnel-history fills that gap. The presence/absence of
ping-sent events between connect and stale-fired is a particularly
high-signal probe for "is the keepalive timer firing?" vs. "is the
inbound side silently half-closed?".

Wires opts.onPingSent through TransportOptions → YamuxSession.#sendPing
so the ping event is recorded only on successful enqueue (absence then
reliably means "timer fired but send threw" rather than "timer didn't
fire").

Bumps to 0.1.15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A small Node script that mints a relayUrl via the live-tunnel MCP tool,
then drives a TunnelClient directly (no MCP server wrapper, no Remy).
Periodic snapshots dump the full event ring so the operator can watch
the WS lifecycle play out in one terminal.

Use this to reproduce intermittent disconnect / reconnect failures
against staging without rebuilding remy-agent or going through the MCP
stdio child process.

Skips the MCP initialize handshake intentionally — initialize creates
server-side session state that's only valid on the pod that handled it,
and the affinity router doesn't bind subsequent requests to that pod.
A stateless one-shot tools/call is enough for live-tunnel and avoids
that whole class of routing race.

Default --mcp-url is staging; --api-key-env auto-picks the env var
based on URL (SUBTEXT_STAGING_API_KEY for *.staging.fullstory.com,
etc.). Override with --api-key-env if needed.

Usage:
  npm run build
  node scripts/probe.mjs                                    # 30-min idle test
  node scripts/probe.mjs --ping-ms 60000                    # exercise stale timer
  node scripts/probe.mjs --allow http://localhost:3000      # different allowlist
  node scripts/probe.mjs --mcp-url https://api.onfire.fyi/mcp/subtext

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
