feat: Structured terminal exhaustion tracking for CheckpointState #378

## Summary

Add structured terminal exhaustion tracking to `CheckpointState` and the event system, giving framework consumers a machine-readable record of *why* execution stopped and enabling intelligent resume decisions.

## Motivation

When a multi-phase pipeline aborts, the checkpoint currently records the last completed phase and `failedTasks` with string error messages. There is no structured way to answer: "Did we stop because we blew the budget? Hit max retries on a critical path? Exceeded a convergence limit? Got killed by a signal?"

This matters for resume logic — a budget-exceeded stop should prompt the user to increase the budget before resuming, while a convergence-limit stop means retrying the same tasks won't help. Without structured reason codes, consumers must parse error strings or lose this context entirely.

AAMF implements this as:
- A `TerminalReasonCode` enum (`'budget-exceeded' | 'max-retries' | 'convergence-limit' | 'critical-phase-failure' | ...`)
- A `TerminalExhaustionState` stored in checkpoint with the reason code, optional wave/task/check identifiers, and a human-readable summary
- A `TerminalExhaustionError` class that captures this structured data and is thrown to unwind the stack
- A `terminal-exhaustion` event type dispatched before the error propagates

## Proposed API

### Reason Codes

```typescript
/** Why execution stopped beyond simple success/failure. */
export type TerminalReasonCode =
  | 'budget-exceeded'          // Token budget exhausted
  | 'max-retries-exhausted'    // A task/step hit its retry ceiling
  | 'convergence-limit'        // Iterative loop didn't converge within max iterations
  | 'critical-phase-failure'   // A critical phase failed with no recovery path
  | 'signal-interrupted'       // Process received SIGINT/SIGTERM
  | 'dependency-blocked'       // All remaining work items depend on failed/blocked items
  | 'gate-hard-fail';          // A phase gate returned 'fail' (non-retriable)
```
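The resume distinction described in Motivation could be expressed as an exhaustive mapping over these codes. This is only a sketch: `ResumeAdvice` and `adviseResume` are hypothetical names, and the advice chosen for every code other than `budget-exceeded` and `convergence-limit` is an assumption, not part of this proposal.

```typescript
type TerminalReasonCode =
  | 'budget-exceeded'
  | 'max-retries-exhausted'
  | 'convergence-limit'
  | 'critical-phase-failure'
  | 'signal-interrupted'
  | 'dependency-blocked'
  | 'gate-hard-fail';

// Hypothetical: what a consumer might do before attempting a resume.
type ResumeAdvice = 'raise-budget' | 'resume-as-is' | 'needs-intervention';

function adviseResume(code: TerminalReasonCode): ResumeAdvice {
  switch (code) {
    case 'budget-exceeded':
      // From Motivation: prompt the user to increase the budget first.
      return 'raise-budget';
    case 'signal-interrupted':
      // Assumption: an external interrupt does not imply the work itself failed.
      return 'resume-as-is';
    case 'convergence-limit': // From Motivation: retrying the same tasks won't help.
    case 'max-retries-exhausted':
    case 'critical-phase-failure':
    case 'dependency-blocked':
    case 'gate-hard-fail':
      // Assumption: these need a change of inputs or configuration.
      return 'needs-intervention';
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a code is unhandled.
      const unreachable: never = code;
      throw new Error(`Unhandled reason code: ${String(unreachable)}`);
    }
  }
}
```

The `never` assignment in the `default` branch means that adding a new member to the union produces a compile error here until the new code is given explicit advice.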

### Checkpoint Extension

```typescript
/** Structured abort metadata describing why execution stopped abnormally. */
interface TerminalExhaustionState {
  reasonCode: TerminalReasonCode;
  /** Work unit (issue number) where exhaustion occurred. */
  workUnit?: number;
  /** Phase/stage where exhaustion occurred. */
  phase?: number;
  /** Task or step identifier within the phase. */
  taskId?: string;
  /** Additional qualifier (e.g., gate name, convergence check). */
  qualifier?: string;
  /** Human-readable explanation. */
  summary: string;
  /** ISO timestamp of the exhaustion event. */
  timestamp: string;
}

interface CheckpointState {
  // ... existing fields ...

  /** Set when execution stopped abnormally; absent otherwise. */
  terminalExhaustion?: TerminalExhaustionState;
}
```
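Because a checkpoint is read back from storage on resume, a consumer may want to validate this field before trusting it. The guard below is a minimal sketch under that assumption; `isTerminalExhaustionState` and `TERMINAL_REASON_CODES` are hypothetical names, not part of the proposed API.

```typescript
type TerminalReasonCode =
  | 'budget-exceeded'
  | 'max-retries-exhausted'
  | 'convergence-limit'
  | 'critical-phase-failure'
  | 'signal-interrupted'
  | 'dependency-blocked'
  | 'gate-hard-fail';

// Runtime list mirroring the union above (types are erased at runtime).
const TERMINAL_REASON_CODES: readonly string[] = [
  'budget-exceeded',
  'max-retries-exhausted',
  'convergence-limit',
  'critical-phase-failure',
  'signal-interrupted',
  'dependency-blocked',
  'gate-hard-fail',
];

interface TerminalExhaustionState {
  reasonCode: TerminalReasonCode;
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
  summary: string;
  timestamp: string;
}

/** Hypothetical guard: true when `value` has the shape of a TerminalExhaustionState. */
function isTerminalExhaustionState(value: unknown): value is TerminalExhaustionState {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  const reasonCode = record['reasonCode'];
  const timestamp = record['timestamp'];
  // Optional fields may be absent, but must have the right type when present.
  const optional = (key: string, expected: 'number' | 'string'): boolean =>
    record[key] === undefined || typeof record[key] === expected;
  return (
    typeof reasonCode === 'string' &&
    TERMINAL_REASON_CODES.indexOf(reasonCode) !== -1 &&
    typeof record['summary'] === 'string' &&
    typeof timestamp === 'string' &&
    !isNaN(Date.parse(timestamp)) && // must be a parseable date string
    optional('workUnit', 'number') &&
    optional('phase', 'number') &&
    optional('taskId', 'string') &&
    optional('qualifier', 'string')
  );
}
```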

### Event Type

```typescript
interface TerminalExhaustionEvent {
  type: 'terminal-exhaustion';
  reasonCode: TerminalReasonCode;
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
  summary: string;
}
```

Add `TerminalExhaustionEvent` to the `FrameworkLifecycleEvent` union and wire it through `FleetEventBus`.
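The event carries the same fields as the checkpoint record, minus `timestamp` and plus the `type` discriminant. One way to keep the two in sync is to derive the event from the recorded state. In the sketch below, `toTerminalExhaustionEvent` is a hypothetical helper; how the result is handed to `FleetEventBus` is left out because that API is not shown in this issue.

```typescript
type TerminalReasonCode =
  | 'budget-exceeded'
  | 'max-retries-exhausted'
  | 'convergence-limit'
  | 'critical-phase-failure'
  | 'signal-interrupted'
  | 'dependency-blocked'
  | 'gate-hard-fail';

interface TerminalExhaustionState {
  reasonCode: TerminalReasonCode;
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
  summary: string;
  timestamp: string;
}

interface TerminalExhaustionEvent {
  type: 'terminal-exhaustion';
  reasonCode: TerminalReasonCode;
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
  summary: string;
}

/** Hypothetical helper: builds the lifecycle event from the checkpoint record. */
function toTerminalExhaustionEvent(state: TerminalExhaustionState): TerminalExhaustionEvent {
  // Drop `timestamp`, which the event type does not carry; keep everything else.
  const { timestamp: _timestamp, ...fields } = state;
  return { type: 'terminal-exhaustion', ...fields };
}
```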

### Error Class

```typescript
class TerminalExhaustionError extends Error {
  readonly reasonCode: TerminalReasonCode;
  readonly metadata: { workUnit?: number; phase?: number; taskId?: string; qualifier?: string };
  
  constructor(reasonCode: TerminalReasonCode, summary: string, metadata?: TerminalExhaustionError['metadata']);
}
```
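A possible implementation of this class, together with a top-level wrapper that catches it, might look like the sketch below. `ExhaustionMetadata` and `runToCompletion` are hypothetical names, and the wrapper is synchronous for brevity; a real runner would presumably await the checkpoint write inside the handler.

```typescript
type TerminalReasonCode =
  | 'budget-exceeded'
  | 'max-retries-exhausted'
  | 'convergence-limit'
  | 'critical-phase-failure'
  | 'signal-interrupted'
  | 'dependency-blocked'
  | 'gate-hard-fail';

// Hypothetical name for the metadata shape used by the error class.
interface ExhaustionMetadata {
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
}

class TerminalExhaustionError extends Error {
  readonly reasonCode: TerminalReasonCode;
  readonly metadata: ExhaustionMetadata;

  constructor(reasonCode: TerminalReasonCode, summary: string, metadata: ExhaustionMetadata = {}) {
    super(summary);
    // Keeps `instanceof` working if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = 'TerminalExhaustionError';
    this.reasonCode = reasonCode;
    this.metadata = metadata;
  }
}

/**
 * Hypothetical top-level wrapper. Runs `pipeline`; if it unwinds with a
 * TerminalExhaustionError, passes the error to `onExhaustion` (where a real
 * runner would save the checkpoint and dispatch the event) and returns false.
 * Returns true on success. Any other error propagates unchanged.
 */
function runToCompletion(
  pipeline: () => void,
  onExhaustion: (err: TerminalExhaustionError) => void,
): boolean {
  try {
    pipeline();
    return true;
  } catch (err) {
    if (err instanceof TerminalExhaustionError) {
      onExhaustion(err);
      return false;
    }
    throw err;
  }
}
```

Keeping the handler a callback leaves the ordering of checkpoint save and event dispatch to the caller, which this issue does not pin down.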

### CheckpointManager Integration

```typescript
class CheckpointManager {
  // ... existing methods ...
  
  /** Record terminal exhaustion and persist immediately. */
  recordTerminalExhaustion(exhaustion: TerminalExhaustionState): Promise<void>;
  
  /** Check if the last run ended with terminal exhaustion. */
  getTerminalExhaustion(): TerminalExhaustionState | undefined;
  
  /** Clear terminal exhaustion state (called on successful resume). */
  clearTerminalExhaustion(): Promise<void>;
}
```
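To illustrate how the three methods relate, here is a minimal in-memory sketch. It assumes the manager holds the checkpoint in memory and delegates durability to an injected `persist` callback; `InMemoryCheckpointManager` and `persist` are hypothetical, and the real `CheckpointManager` storage layer is not shown in this issue.

```typescript
type TerminalReasonCode =
  | 'budget-exceeded'
  | 'max-retries-exhausted'
  | 'convergence-limit'
  | 'critical-phase-failure'
  | 'signal-interrupted'
  | 'dependency-blocked'
  | 'gate-hard-fail';

interface TerminalExhaustionState {
  reasonCode: TerminalReasonCode;
  workUnit?: number;
  phase?: number;
  taskId?: string;
  qualifier?: string;
  summary: string;
  timestamp: string;
}

// Existing checkpoint fields are omitted from this sketch.
interface CheckpointState {
  terminalExhaustion?: TerminalExhaustionState;
}

class InMemoryCheckpointManager {
  private state: CheckpointState;
  // Assumption: durability (ideally an atomic write) lives behind this callback.
  private readonly persist: (state: CheckpointState) => Promise<void>;

  constructor(initial: CheckpointState, persist: (state: CheckpointState) => Promise<void>) {
    this.state = initial;
    this.persist = persist;
  }

  /** Record terminal exhaustion and persist immediately. */
  async recordTerminalExhaustion(exhaustion: TerminalExhaustionState): Promise<void> {
    this.state = { ...this.state, terminalExhaustion: exhaustion };
    await this.persist(this.state);
  }

  /** Return the recorded exhaustion, if the last run ended with one. */
  getTerminalExhaustion(): TerminalExhaustionState | undefined {
    return this.state.terminalExhaustion;
  }

  /** Remove the field entirely (rather than setting it to undefined) and persist. */
  async clearTerminalExhaustion(): Promise<void> {
    const { terminalExhaustion: _cleared, ...rest } = this.state;
    this.state = rest;
    await this.persist(this.state);
  }
}
```

On resume, a runner could read `getTerminalExhaustion()?.summary` for its diagnostic log and call `clearTerminalExhaustion()` once the first phase or task succeeds.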

## Implementation Notes

- `recordTerminalExhaustion` should persist atomically — it's called during error unwinding when the process may be about to exit
- Resume logic should check `getTerminalExhaustion()` on load and log a diagnostic explaining why the previous run stopped
- `clearTerminalExhaustion` is called after the first successful phase/task on resume, confirming the blocking condition has been resolved
- `TerminalExhaustionError` should be catchable at the top-level runner so that it can trigger a checkpoint save and event dispatch before the process exits
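- Consumers may want domain-specific codes. A string-literal union declared as a type alias cannot be extended via module augmentation, so supporting this would require deriving `TerminalReasonCode` from the keys of an augmentable interface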
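On the last point, TypeScript's declaration merging applies to interfaces, not to type aliases, so an extensible set of codes is usually modelled as the keys of an interface. The sketch below shows that pattern; `TerminalReasonCodeRegistry`, `formatReason`, and the `'license-quota-exceeded'` code are all hypothetical.

```typescript
// Built-in codes are the keys of an interface, which (unlike a type alias)
// can be reopened through declaration merging.
interface TerminalReasonCodeRegistry {
  'budget-exceeded': true;
  'max-retries-exhausted': true;
  'convergence-limit': true;
  'critical-phase-failure': true;
  'signal-interrupted': true;
  'dependency-blocked': true;
  'gate-hard-fail': true;
}

type TerminalReasonCode = keyof TerminalReasonCodeRegistry;

// A consumer adds a domain-specific code. Across packages this declaration
// would sit inside a `declare module` block naming the framework package;
// within a single file, redeclaring the interface merges the same way.
interface TerminalReasonCodeRegistry {
  'license-quota-exceeded': true;
}

// Accepts built-in and consumer-added codes alike.
function formatReason(code: TerminalReasonCode): string {
  return `terminal exhaustion: ${code}`;
}
```

A trade-off of this design is that exhaustive `switch` statements inside the framework can no longer assume the set of codes is closed, so they need a fallback branch for consumer-defined codes.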
