From 3aa03c6c8cb3dbc199fce76be7af5b7f2865a5ca Mon Sep 17 00:00:00 2001 From: Vishal Satam Date: Fri, 15 May 2026 11:33:11 -0700 Subject: [PATCH] docs: Expand troubleshooting event types for root cause diagnosis Add detailed event type documentation to step 2b of the troubleshooting guide, providing explicit guidance on key execution history events: - Execution-level failures (ExecutionFailed, ExecutionTimedOut, ExecutionStopped) - Context and step failures (ContextFailed, StepFailed with RetryDetails) - Callback issues (CallbackStarted, CallbackTimedOut, CallbackFailed) - Chained invocation failures (ChainedInvokeFailed/TimedOut/Stopped) - Other signals (WaitCancelled, InvocationCompleted with Error) This gives agents clearer guidance for interpreting execution history events rather than relying on implicit knowledge. Syncs with awslabs/agent-plugins#160. --- .../steering/troubleshooting-executions.md | 27 ++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md b/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md index 9b976b5..557c487 100644 --- a/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md +++ b/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md @@ -39,7 +39,32 @@ Steps: 2. If the command succeeds, analyze and provide a user-friendly diagnosis: a. Report the execution status (RUNNING/SUCCEEDED/FAILED/STOPPED/TIMED_OUT) - b. Identify the root cause: + b. Identify the root cause by looking for these key events in the history: + + **Execution-level failures:** + - `ExecutionFailed` — entire execution crashed; extract the error and cause fields + - `ExecutionTimedOut` — the execution exceeded its configured timeout + - `ExecutionStopped` — execution was manually stopped via StopDurableExecution + + **Context and step failures:** + - `ContextFailed` — a child context threw an unhandled error; check the parent context for what triggered it + - `StepFailed` — an individual step failed; includes RetryDetails (CurrentAttempt, NextAttemptDelaySeconds) showing retry state + + **Callback issues:** + - `CallbackStarted` with a Timeout field — confirms a timeout was registered; correlate with any subsequent `CallbackTimedOut` + - `CallbackTimedOut` — a timeout fired but may not have been caught by the function code + - `CallbackFailed` — the callback was resolved with an error + + **Chained invocation failures:** + - `ChainedInvokeFailed` — a chained (child) durable execution failed + - `ChainedInvokeTimedOut` — a chained execution exceeded its timeout + - `ChainedInvokeStopped` — a chained execution was stopped + + **Other signals:** + - `WaitCancelled` — a scheduled wait was cancelled before completing + - `InvocationCompleted` with an Error field — the Lambda invocation itself errored (e.g., runtime crash) + + **Diagnosis patterns:** - Failed operations: Show the EXACT error message verbatim in a code block - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting - Timeout: Show which operation was running when timeout occurred