Summary
Add structured terminal exhaustion tracking to CheckpointState and the event system, giving framework consumers a machine-readable record of why execution stopped and enabling intelligent resume decisions.
Motivation
When a multi-phase pipeline aborts, checkpoint currently records the last completed phase and failedTasks with string error messages. There's no structured way to answer: "Did we stop because we blew the budget? Hit max retries on a critical path? Exceeded a convergence limit? Got killed by a signal?"
This matters for resume logic — a budget-exceeded stop should prompt the user to increase the budget before resuming, while a convergence-limit stop means retrying the same tasks won't help. Without structured reason codes, consumers must parse error strings or lose this context entirely.
AAMF implements this as:
- A
TerminalReasonCode enum ('budget-exceeded' | 'max-retries' | 'convergence-limit' | 'critical-phase-failure' | ...)
- A
TerminalExhaustionState stored in checkpoint with the reason code, optional wave/task/check identifiers, and a human-readable summary
- A
TerminalExhaustionError class that captures this structured data and is thrown to unwind the stack
- A
terminal-exhaustion event type dispatched before the error propagates
Proposed API
Reason Codes
/** Why execution stopped beyond simple success/failure. */
export type TerminalReasonCode =
| 'budget-exceeded' // Token budget exhausted
| 'max-retries-exhausted' // A task/step hit its retry ceiling
| 'convergence-limit' // Iterative loop didn't converge within max iterations
| 'critical-phase-failure' // A critical phase failed with no recovery path
| 'signal-interrupted' // Process received SIGINT/SIGTERM
| 'dependency-blocked' // All remaining work items depend on failed/blocked items
| 'gate-hard-fail'; // A phase gate returned 'fail' (non-retriable)
Checkpoint Extension
interface CheckpointState {
// ... existing fields ...
/** Structured abort metadata, set when execution stopped abnormally. */
terminalExhaustion?: {
reasonCode: TerminalReasonCode;
/** Work unit (issue number) where exhaustion occurred. */
workUnit?: number;
/** Phase/stage where exhaustion occurred. */
phase?: number;
/** Task or step identifier within the phase. */
taskId?: string;
/** Additional qualifier (e.g., gate name, convergence check). */
qualifier?: string;
/** Human-readable explanation. */
summary: string;
/** ISO timestamp of the exhaustion event. */
timestamp: string;
};
}
Event Type
interface TerminalExhaustionEvent {
type: 'terminal-exhaustion';
reasonCode: TerminalReasonCode;
workUnit?: number;
phase?: number;
taskId?: string;
qualifier?: string;
summary: string;
}
Add to FrameworkLifecycleEvent union and wire through FleetEventBus.
Error Class
class TerminalExhaustionError extends Error {
readonly reasonCode: TerminalReasonCode;
readonly metadata: { workUnit?: number; phase?: number; taskId?: string; qualifier?: string };
constructor(reasonCode: TerminalReasonCode, summary: string, metadata?: { ... });
}
CheckpointManager Integration
class CheckpointManager {
// ... existing methods ...
/** Record terminal exhaustion and persist immediately. */
recordTerminalExhaustion(exhaustion: TerminalExhaustionState): Promise<void>;
/** Check if the last run ended with terminal exhaustion. */
getTerminalExhaustion(): TerminalExhaustionState | undefined;
/** Clear terminal exhaustion state (called on successful resume). */
clearTerminalExhaustion(): Promise<void>;
}
Implementation Notes
recordTerminalExhaustion should persist atomically — it's called during error unwinding when the process may be about to exit
- Resume logic should check
getTerminalExhaustion() on load and log a diagnostic explaining why the previous run stopped
clearTerminalExhaustion is called after the first successful phase/task on resume, confirming the blocking condition has been resolved
- The
TerminalExhaustionError should be catchable at the top-level runner to trigger checkpoint save + event dispatch before process exit
- Consumers can extend
TerminalReasonCode via module augmentation for domain-specific codes
Summary
Add structured terminal exhaustion tracking to
CheckpointStateand the event system, giving framework consumers a machine-readable record of why execution stopped and enabling intelligent resume decisions.Motivation
When a multi-phase pipeline aborts, checkpoint currently records the last completed phase and
failedTaskswith string error messages. There's no structured way to answer: "Did we stop because we blew the budget? Hit max retries on a critical path? Exceeded a convergence limit? Got killed by a signal?"This matters for resume logic — a budget-exceeded stop should prompt the user to increase the budget before resuming, while a convergence-limit stop means retrying the same tasks won't help. Without structured reason codes, consumers must parse error strings or lose this context entirely.
AAMF implements this as:
TerminalReasonCodeenum ('budget-exceeded' | 'max-retries' | 'convergence-limit' | 'critical-phase-failure' | ...)TerminalExhaustionStatestored in checkpoint with the reason code, optional wave/task/check identifiers, and a human-readable summaryTerminalExhaustionErrorclass that captures this structured data and is thrown to unwind the stackterminal-exhaustionevent type dispatched before the error propagatesProposed API
Reason Codes
Checkpoint Extension
Event Type
Add to
FrameworkLifecycleEventunion and wire throughFleetEventBus.Error Class
CheckpointManager Integration
Implementation Notes
recordTerminalExhaustionshould persist atomically — it's called during error unwinding when the process may be about to exitgetTerminalExhaustion()on load and log a diagnostic explaining why the previous run stoppedclearTerminalExhaustionis called after the first successful phase/task on resume, confirming the blocking condition has been resolvedTerminalExhaustionErrorshould be catchable at the top-level runner to trigger checkpoint save + event dispatch before process exitTerminalReasonCodevia module augmentation for domain-specific codes