Skip to content

feat: Heartbeat / liveness monitoring for long-running agent invocations #385

@jafreck

Description

@jafreck

Summary

Add heartbeat emission and liveness monitoring for long-running agent invocations, enabling watchdog UIs, timeout warnings, and progress reporting for agents that run for extended periods.

Motivation

Agent invocations can take anywhere from 30 seconds to 30+ minutes depending on task complexity. The framework's AgentLauncher.launchAgent() is a single async call that returns an AgentResult when done — there's no visibility into what's happening during execution. The timedOut: boolean field on AgentResult only tells you after the fact.

AAMF emits periodic agent-heartbeat events with elapsed time and agent-timed-out events when limits are approached. These drive:

  • CLI progress indicators ("code-migrator running for 4m32s...")
  • Timeout warnings before hard-kills ("approaching 10min limit, 8m45s elapsed")
  • Monitoring dashboards for fleet-scale runs

Proposed API

Heartbeat Configuration

interface HeartbeatConfig {
  /** Interval between heartbeat emissions in milliseconds (default: 30_000). */
  intervalMs?: number;
  /** Emit a warning event when this percentage of the timeout has elapsed (default: 0.8 = 80%). */
  timeoutWarningThreshold?: number;
}

interface AgentLauncherConfig {
  // ... existing fields ...
  heartbeat?: HeartbeatConfig;
}

Events

interface AgentHeartbeatEvent {
  type: 'agent-heartbeat';
  agent: string;
  issueNumber?: number;
  taskId?: string;
  sessionId?: string;
  /** Seconds elapsed since invocation started. */
  elapsedSeconds: number;
  /** Configured timeout in seconds (if set). */
  timeoutSeconds?: number;
  /** Percentage of timeout consumed (0-1). */
  timeoutPercentage?: number;
}

interface AgentTimeoutWarningEvent {
  type: 'agent-timeout-warning';
  agent: string;
  issueNumber?: number;
  taskId?: string;
  elapsedSeconds: number;
  timeoutSeconds: number;
  /** How many seconds remain before the hard timeout. */
  remainingSeconds: number;
}

interface AgentTimedOutEvent {
  type: 'agent-timed-out';
  agent: string;
  issueNumber?: number;
  taskId?: string;
  timeoutSeconds: number;
}

Add all three to the FrameworkLifecycleEvent union.

AgentLauncher Integration

The launcher starts a heartbeat interval timer when invoke() begins and clears it when the invocation completes. During the interval:

  1. Emit AgentHeartbeatEvent with current elapsed time
  2. If elapsed > timeout * timeoutWarningThreshold, emit AgentTimeoutWarningEvent (once)
  3. When the process is killed due to timeout, emit AgentTimedOutEvent

Callback Hook (for non-event-bus consumers)

interface AgentInvocation {
  // ... existing fields ...

  /** Optional callback invoked periodically during execution with elapsed time. */
  onHeartbeat?: (elapsed: { seconds: number; timeoutPercentage?: number }) => void;
}

Implementation Notes

  • The heartbeat timer runs in the launcher's process, not inside the agent subprocess — no agent cooperation required
  • setInterval with clearInterval on completion; use unref() so the timer doesn't prevent process exit
  • Heartbeat events go through the standard FleetEventBus dispatch, which means notification providers (Slack, webhook) can react to long-running agents
  • The onHeartbeat callback is a lightweight alternative for consumers who don't use the event bus
  • Default interval of 30s keeps event volume low while providing useful liveness signal
  • The timeout warning should fire exactly once per invocation (not on every heartbeat after crossing the threshold)
  • Consider adding a lastHeartbeat timestamp to AgentResult for diagnostic purposes

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions