fix(agent-runtime): hold cancelRun until executor session is committed#441
fix(agent-runtime): hold cancelRun until executor session is committed#441jerryliang64 merged 4 commits intomasterfrom
Conversation
cancelRun used to abort and persist the user message the moment it was called, even when the executor's underlying session (e.g. Claude Code SDK jsonl) had not yet been written to disk. The next resume on the same thread would then diverge from a session that was never created and fail at startup. Now cancelRun blocks until the executor yields a message that marks the session as committed (default heuristic: any non-system message) or the cancelCommitTimeoutMs watchdog (default 30s) fires, in which case the run is marked failed and the thread is left untouched so the next request starts with a fresh session. Adds an optional isSessionCommitted hook on AgentHandler for stricter executor policies.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe runtime now tracks per-run RunTaskState (including a committed flag and events) and consults an optional Changes
Sequence Diagram(s)sequenceDiagram
actor Caller as Caller / API
participant AgentRuntime
participant Executor
participant Watchdog as Watchdog/Timer
Caller->>AgentRuntime: cancelRun(runId)
AgentRuntime->>AgentRuntime: Inspect task.committed
alt not committed
AgentRuntime->>Watchdog: start cancelCommitTimeout
Note over AgentRuntime,Executor: Wait for committing message or timeout
Executor->>AgentRuntime: yield AgentMessage (msg)
AgentRuntime->>AgentRuntime: call isSessionCommitted(msg, history)?
alt returns true
AgentRuntime->>AgentRuntime: mark task.committed = true
AgentRuntime->>AgentRuntime: emit "commit"
Watchdog->>AgentRuntime: cancel timeout
else returns false
AgentRuntime->>AgentRuntime: continue waiting
end
alt timeout fires
Watchdog->>AgentRuntime: timeout
AgentRuntime->>AgentRuntime: persist run as Failed
AgentRuntime->>Executor: abort execution
AgentRuntime->>Caller: reject with AgentTimeoutError
end
else already committed
AgentRuntime->>AgentRuntime: proceed immediately
end
alt committed before abort
AgentRuntime->>Executor: abort execution
AgentRuntime->>AgentRuntime: persist abort-thread messages
AgentRuntime->>Caller: return cancelled result
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
core/agent-runtime/test/RunBuilder.test.ts (1)
157-161: Good coverage for the watchdog timeout transition.Test correctly pins down the new
Cancelling → Failedtransition added inRunBuilder.fail(). Consider also assertingupdate.lastError.code === AgentErrorCode.ExecErrorto parallel the existingin_progress → failedtest and catch regressions in the error-payload shape.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/agent-runtime/test/RunBuilder.test.ts` around lines 157 - 161, Add an assertion to the Cancelling → Failed test that the error payload shape matches other failure tests: after creating the RunBuilder with RunBuilder.create(..., 'thread_1') and calling rb.fail(new Error('commit timeout')), assert that update.lastError.code === AgentErrorCode.ExecError (and optionally that update.lastError.message contains 'commit timeout') so the test mirrors the existing in_progress → failed check and prevents regressions to the error payload created by RunBuilder.fail.core/controller-decorator/src/decorator/agent/AgentHandler.ts (1)
16-27: Hook signature is clear and backward compatible.Shape matches
AgentExecutor.isSessionCommittedincore/agent-runtime/src/AgentRuntime.ts(line 55), and thehistoryarray lets handlers implement stateful heuristics (e.g., “commit once the second non-system message lands”). One nit: documenting the guarantee thathistoryincludesmsgas its last element (matchesmarkCommittedIfNeededpassingstreamMessagesafterpush(msg)) would remove ambiguity for handler authors.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/controller-decorator/src/decorator/agent/AgentHandler.ts` around lines 16 - 27, Update the isSessionCommitted? hook documentation in AgentHandler.ts to state explicitly that the provided history array always includes the current msg as its last element (this matches how markCommittedIfNeeded passes streamMessages after push(msg)); reference the isSessionCommitted signature and note it is shape-compatible with AgentExecutor.isSessionCommitted in AgentRuntime.ts so handler authors can rely on history[history.length-1] === msg when implementing stateful heuristics.core/agent-runtime/src/AgentRuntime.ts (3)
62-68: Operational consideration: default 30s cancel latency.The default
cancelCommitTimeoutMs = 30_000means a client-initiated cancel during an executor cold-start can now block the HTTP request for up to 30s before returning 408. Two suggestions:
- Log a
warn(not justerror) when the watchdog fires incancelRunso operators have a signal for repeated cold-start commit timeouts — the current code path only logs ifstore.updateRunitself fails.- Consider exposing a metric or structured log line at
waitForCommittedentry with the configured timeout and the run's elapsed time, to make commit-latency distributions observable in production.This is also worth surfacing in the package README / changelog since existing deployments with short client-side timeouts may now see 408s where they previously saw 200s.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/agent-runtime/src/AgentRuntime.ts` around lines 62 - 68, The cancelCommitTimeoutMs default can block client cancels during cold-starts; update cancelRun to emit a processLogger.warn (not just error) when the watchdog times out before marking the run failed so operators see repeated commit timeouts (refer to cancelRun and store.updateRun paths), and add an observable log/metric at the start of waitForCommitted that records the configured cancelCommitTimeoutMs and the run's elapsed time (include runId and elapsedMs) so commit-latency distributions are measurable in production.
478-497: Heuristic fallback is reasonable; minor resilience note.Default
msg.type !== 'system'correctly treats Claude Code SDK'ssystem/initas uncommitted while countinguser/assistant/resultas committed (consistent withcore/types/agent-runtime/AgentMessage.ts). The try/catch around the hook (line 489) preventing a throwing handler from killing the run loop is a good defensive choice.One nit:
committedis reassigned in both branches; consider initializinglet committed = false;so the catch branch doesn't need an explicit assignment, slightly reducing cognitive load. Non-blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/agent-runtime/src/AgentRuntime.ts` around lines 478 - 497, In markCommittedIfNeeded, initialize committed to false (e.g., let committed = false;) before the try so the catch doesn't need to reassign; then keep the existing branch that sets committed via this.executor.isSessionCommitted(...) or the heuristic (msg.type !== 'system'), and leave the existing behavior that sets task.committed = true and calls task.emitter.emit('commit') when committed is truthy; this simplifies the control flow around executor.isSessionCommitted and preserves current semantics.
696-747:cancelRunsemantics are correct, but worth calling out two edge cases.
When
waitForCommittedresolves because the task'sendevent fired (notcommit),task.committedmay still be false. The subsequenttask.abortController.abort()is a no-op on an already-ended task, and thefreshRunterminal check (line 734) correctly avoids therb.cancel()transition if the task finished as Completed/Failed. This path looks sound but is subtle — a comment near line 727 noting "task may have already ended before committing; freshRun check covers it" would help future readers.In the timeout branch (lines 710–725), after
rb.fail()the localrbis in stateFailed, andtask.abortController.abort()then drives the executor's aborted branch, which callsfinaliseAbortedRun. That helper short-circuits on terminal states (line 580), so theFailedstate is preserved. Good — but this relies on ordering: the store write must complete before the task observes the abort and races through its finally. Since youawait store.updateRun(..., rb.fail(...))beforeabort(), the ordering holds. Recommend preserving that order explicitly with a comment (already hinted at in the existing comment, but tyingfinaliseAbortedRun's terminal-skip to it would be clearer).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/agent-runtime/src/AgentRuntime.ts` around lines 696 - 747, cancelRun has two subtle race-related edge cases that need inline comments: add a brief comment in cancelRun near the waitForCommitted resolution path noting that waitForCommitted can return because the task's 'end' event fired (so task.committed may still be false), making task.abortController.abort() a no-op for an already-ended task and explaining that the subsequent freshRun terminal-check (using AgentRuntime.TERMINAL_RUN_STATUSES and RunBuilder.fromRecord) intentionally prevents applying rb.cancel() to a run that already finished; and add a short comment in the timeout/error branch where you call this.store.updateRun(runId, rb.fail(...)) before task.abortController.abort() explaining the required ordering—store write must complete before abort so finaliseAbortedRun will observe the terminal state and skip transitioning (reference finaliseAbortedRun, rb.fail, and abortController.abort).core/agent-runtime/test/AgentRuntime.test.ts (1)
771-928: Comprehensive coverage of the new commit-gating paths.Scenarios cover the four important branches: (a) hold-until-commit via default heuristic, (b) timeout →
Failed+ empty thread +AgentTimeoutError, (c) fast-path when already committed, and (d) customisSessionCommittedhook. The deterministic gating (blocking executor + promise gate) avoids timing flake in principle.One optional hardening: in the timeout test (line 826–862), after
runtime.cancelRunrejects, assertingpersisted.failedAtandpersisted.lastError?.code === AgentErrorCode.ExecErrorwould pin down the exact shape produced byrb.fail(AgentTimeoutError)and catch any future change that silently swaps error propagation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/agent-runtime/test/AgentRuntime.test.ts` around lines 771 - 928, The timeout test should assert that the persisted run records the failure details produced by the runtime's error path; after the runtime.cancelRun(...) rejection in the failing-commit-timeout test, fetch the persisted run via store.getRun(result.id) and assert that persisted.failedAt is set and persisted.lastError?.code === AgentErrorCode.ExecError (the shape produced when rb.fail(AgentTimeoutError) is used), so update the test around the timeout branch to include these assertions referencing runtime.cancelRun, AgentTimeoutError, rb.fail, persisted.failedAt and persisted.lastError?.code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/agent-runtime/src/AgentRuntime.ts`:
- Around line 506-531: In waitForCommitted, ESLint flags cleanup as used before
it's declared; fix by forward-declaring timer and defining cleanup before
creating the timeout/handlers: declare let timer: ReturnType<typeof setTimeout>;
then define const cleanup = (): void => { if (timer) clearTimeout(timer);
task.emitter.off('commit', onCommit); task.emitter.off('end', onEnd); }; then
create onCommit/onEnd and assign timer = globalThis.setTimeout(..., timeoutMs)
(using AgentTimeoutError in the reject). Update uses of task.emitter.once/off
and the onCommit/onEnd handlers to reference the already-declared cleanup.
---
Nitpick comments:
In `@core/agent-runtime/src/AgentRuntime.ts`:
- Around line 62-68: The cancelCommitTimeoutMs default can block client cancels
during cold-starts; update cancelRun to emit a processLogger.warn (not just
error) when the watchdog times out before marking the run failed so operators
see repeated commit timeouts (refer to cancelRun and store.updateRun paths), and
add an observable log/metric at the start of waitForCommitted that records the
configured cancelCommitTimeoutMs and the run's elapsed time (include runId and
elapsedMs) so commit-latency distributions are measurable in production.
- Around line 478-497: In markCommittedIfNeeded, initialize committed to false
(e.g., let committed = false;) before the try so the catch doesn't need to
reassign; then keep the existing branch that sets committed via
this.executor.isSessionCommitted(...) or the heuristic (msg.type !== 'system'),
and leave the existing behavior that sets task.committed = true and calls
task.emitter.emit('commit') when committed is truthy; this simplifies the
control flow around executor.isSessionCommitted and preserves current semantics.
- Around line 696-747: cancelRun has two subtle race-related edge cases that
need inline comments: add a brief comment in cancelRun near the waitForCommitted
resolution path noting that waitForCommitted can return because the task's 'end'
event fired (so task.committed may still be false), making
task.abortController.abort() a no-op for an already-ended task and explaining
that the subsequent freshRun terminal-check (using
AgentRuntime.TERMINAL_RUN_STATUSES and RunBuilder.fromRecord) intentionally
prevents applying rb.cancel() to a run that already finished; and add a short
comment in the timeout/error branch where you call this.store.updateRun(runId,
rb.fail(...)) before task.abortController.abort() explaining the required
ordering—store write must complete before abort so finaliseAbortedRun will
observe the terminal state and skip transitioning (reference finaliseAbortedRun,
rb.fail, and abortController.abort).
In `@core/agent-runtime/test/AgentRuntime.test.ts`:
- Around line 771-928: The timeout test should assert that the persisted run
records the failure details produced by the runtime's error path; after the
runtime.cancelRun(...) rejection in the failing-commit-timeout test, fetch the
persisted run via store.getRun(result.id) and assert that persisted.failedAt is
set and persisted.lastError?.code === AgentErrorCode.ExecError (the shape
produced when rb.fail(AgentTimeoutError) is used), so update the test around the
timeout branch to include these assertions referencing runtime.cancelRun,
AgentTimeoutError, rb.fail, persisted.failedAt and persisted.lastError?.code.
In `@core/agent-runtime/test/RunBuilder.test.ts`:
- Around line 157-161: Add an assertion to the Cancelling → Failed test that the
error payload shape matches other failure tests: after creating the RunBuilder
with RunBuilder.create(..., 'thread_1') and calling rb.fail(new Error('commit
timeout')), assert that update.lastError.code === AgentErrorCode.ExecError (and
optionally that update.lastError.message contains 'commit timeout') so the test
mirrors the existing in_progress → failed check and prevents regressions to the
error payload created by RunBuilder.fail.
In `@core/controller-decorator/src/decorator/agent/AgentHandler.ts`:
- Around line 16-27: Update the isSessionCommitted? hook documentation in
AgentHandler.ts to state explicitly that the provided history array always
includes the current msg as its last element (this matches how
markCommittedIfNeeded passes streamMessages after push(msg)); reference the
isSessionCommitted signature and note it is shape-compatible with
AgentExecutor.isSessionCommitted in AgentRuntime.ts so handler authors can rely
on history[history.length-1] === msg when implementing stateful heuristics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9802f7c6-e798-495d-9db8-740815c1a8c3
📒 Files selected for processing (6)
core/agent-runtime/src/AgentRuntime.tscore/agent-runtime/src/RunBuilder.tscore/agent-runtime/test/AgentRuntime.test.tscore/agent-runtime/test/RunBuilder.test.tscore/controller-decorator/src/decorator/agent/AgentHandler.tscore/types/agent-runtime/errors.ts
There was a problem hiding this comment.
Code Review
This pull request implements a commitment-aware cancellation process in AgentRuntime to ensure that run state is only persisted after an executor's session is committed to storage. It introduces a new isSessionCommitted hook, a RunTaskState interface for tracking commitment status, and a configurable timeout for cancellation. The cancelRun method now blocks until commitment or timeout, with the latter resulting in an AgentTimeoutError and a transition to the Failed status. Feedback was provided regarding the history parameter in the commitment check, noting that it includes the current message and suggesting the check be performed before the message is added to the history array.
30s of silent hold makes it hard to tell from the logs whether a cancel request is waiting for the executor to commit or genuinely stuck. Emit info on entering the hold and error when the watchdog fires.
Arrow-function handlers that reference each other via a shared cleanup() tripped @typescript-eslint/no-use-before-define in CI. Stash them on a container object so cleanup reads them by property access and the source order stays linear.
…rite
The cancelRun commit-timeout path writes Failed to the store and then
aborts. When the executor does not respect the abort signal and
finishes naturally, syncRun / asyncRun / executeStreamBackground exit
their for-await loops without hitting the in-loop abort branch. The
existing post-loop status check only treated Cancelling/Cancelled as
terminal-ish, so rb.complete(usage) would race through and overwrite
the watchdog-set Failed with Completed.
Introduce POST_LOOP_TERMINAL_STATUSES ({Cancelling, Cancelled, Failed,
Expired}; Completed is intentionally excluded so the normal success
path is untouched) and route all three execution paths through it.
syncRun, which previously had no such check, gains one mirroring its
abort branch. streamRun's SSE error event now reports the actual run
status instead of a hardcoded 'cancelled' string.
Adds a regression test that yields only a system message and then
finishes naturally after the 50ms watchdog has already written Failed
— the run must remain Failed, not flip to Completed, and the thread
must stay empty because the executor never committed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
cancelRunused to abort and persist the current turn's user message the instant it was called, even when the executor's underlying session (e.g. the Claude Code SDK jsonl) had not yet been written to disk. The next resume request on the same thread would then diverge from a session that was never created and fail at executor startup.cancelRunnow blocks until the executor reports a committed message (default heuristic: anytype !== 'system'), then aborts and persists as before. If the executor never commits withincancelCommitTimeoutMs(default 30s), the run is markedfailed— notcancelled— and the thread is left untouched so the next request starts with a fresh session.AgentHandler/AgentExecutorgain an optionalisSessionCommitted(msg, history)hook for executors that want a stricter policy (e.g. a filesystem probe for the jsonl file) than the default heuristic.RunBuilder.failnow accepts thecancelling → failedtransition, covering the watchdog-timeout path above.Behaviour changes
cancelRunresponse time grows from ~ms to up tocancelCommitTimeoutMs(default 30s) in the cold-start window. Callers should be prepared for the longer HTTP hold.persistMessagesOnAbortis now gated ontask.committedin all three execution paths (syncRun,asyncRun,streamRun). When the executor never committed, the thread is not appended with the turn's input — this is the whole point of the fix, and reverses the previous behaviour encoded by the now-updatedsyncRun: should persist only the user message when aborted before any chunkregression test.AgentTimeoutError(HTTP 408 via thestatusproperty) instead of a silent successful cancel.The interface additions are fully backward compatible:
isSessionCommittedandcancelCommitTimeoutMsare optional.Test plan
core/agent-runtimetest suite passes (130 tests) including 4 new cases covering cancel-hold-until-committed, cancel-after-committed-is-immediate, cancel-timeout-fails-run, and customisSessionCommittedhook.core/controller-decoratortest suite passes (74 tests).RunBuilderhas a new unit test forcancelling → failed.Summary by CodeRabbit
New Features
Bug Fixes
Tests