Skip to content

feat: Structured terminal exhaustion tracking for CheckpointState #378

@jafreck

Description

@jafreck

Summary

Add structured terminal exhaustion tracking to CheckpointState and the event system, giving framework consumers a machine-readable record of why execution stopped and enabling intelligent resume decisions.

Motivation

When a multi-phase pipeline aborts, checkpoint currently records the last completed phase and failedTasks with string error messages. There's no structured way to answer: "Did we stop because we blew the budget? Hit max retries on a critical path? Exceeded a convergence limit? Got killed by a signal?"

This matters for resume logic — a budget-exceeded stop should prompt the user to increase the budget before resuming, while a convergence-limit stop means retrying the same tasks won't help. Without structured reason codes, consumers must parse error strings or lose this context entirely.

AAMF implements this as:

  • A TerminalReasonCode enum ('budget-exceeded' | 'max-retries' | 'convergence-limit' | 'critical-phase-failure' | ...)
  • A TerminalExhaustionState stored in checkpoint with the reason code, optional wave/task/check identifiers, and a human-readable summary
  • A TerminalExhaustionError class that captures this structured data and is thrown to unwind the stack
  • A terminal-exhaustion event type dispatched before the error propagates

Proposed API

Reason Codes

/** Why execution stopped beyond simple success/failure. */
export type TerminalReasonCode =
  | 'budget-exceeded'          // Token budget exhausted
  | 'max-retries-exhausted'    // A task/step hit its retry ceiling
  | 'convergence-limit'        // Iterative loop didn't converge within max iterations
  | 'critical-phase-failure'   // A critical phase failed with no recovery path
  | 'signal-interrupted'       // Process received SIGINT/SIGTERM
  | 'dependency-blocked'       // All remaining work items depend on failed/blocked items
  | 'gate-hard-fail';          // A phase gate returned 'fail' (non-retriable)

Checkpoint Extension

interface CheckpointState {
  // ... existing fields ...

  /** Structured abort metadata, set when execution stopped abnormally. */
  terminalExhaustion?: {
    reasonCode: TerminalReasonCode;
    /** Work unit (issue number) where exhaustion occurred. */
    workUnit?: number;
    /** Phase/stage where exhaustion occurred. */
    phase?: number;
    /** Task or step identifier within the phase. */
    taskId?: string;
    /** Additional qualifier (e.g., gate name, convergence check). */
    qualifier?: string;
    /** Human-readable explanation. */
    summary: string;
    /** ISO timestamp of the exhaustion event. */
    timestamp: string;
  };
}

Event Type

interface TerminalExhaustionEvent {
  type: 'terminal-exhaustion';
  reasonCode: TerminalReasonCode;
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
  summary: string;
}

Add to FrameworkLifecycleEvent union and wire through FleetEventBus.

Error Class

class TerminalExhaustionError extends Error {
  readonly reasonCode: TerminalReasonCode;
  readonly metadata: { workUnit?: number; phase?: number; taskId?: string; qualifier?: string };
  
  constructor(reasonCode: TerminalReasonCode, summary: string, metadata?: { ... });
}

CheckpointManager Integration

class CheckpointManager {
  // ... existing methods ...
  
  /** Record terminal exhaustion and persist immediately. */
  recordTerminalExhaustion(exhaustion: TerminalExhaustionState): Promise<void>;
  
  /** Check if the last run ended with terminal exhaustion. */
  getTerminalExhaustion(): TerminalExhaustionState | undefined;
  
  /** Clear terminal exhaustion state (called on successful resume). */
  clearTerminalExhaustion(): Promise<void>;
}

Implementation Notes

  • recordTerminalExhaustion should persist atomically — it's called during error unwinding when the process may be about to exit
  • Resume logic should check getTerminalExhaustion() on load and log a diagnostic explaining why the previous run stopped
  • clearTerminalExhaustion is called after the first successful phase/task on resume, confirming the blocking condition has been resolved
  • The TerminalExhaustionError should be catchable at the top-level runner to trigger checkpoint save + event dispatch before process exit
  • Consumers can extend TerminalReasonCode via module augmentation for domain-specific codes

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions