Skip to content

fix(agent-runtime): hold cancelRun until executor session is committed#441

Merged
jerryliang64 merged 4 commits intomasterfrom
fix/cancel-hold-until-committed
Apr 23, 2026
Merged

fix(agent-runtime): hold cancelRun until executor session is committed#441
jerryliang64 merged 4 commits intomasterfrom
fix/cancel-hold-until-committed

Conversation

@jerryliang64
Copy link
Copy Markdown
Contributor

@jerryliang64 jerryliang64 commented Apr 23, 2026

Summary

  • cancelRun used to abort and persist the current turn's user message the instant it was called, even when the executor's underlying session (e.g. the Claude Code SDK jsonl) had not yet been written to disk. The next resume request on the same thread would then diverge from a session that was never created and fail at executor startup.
  • cancelRun now blocks until the executor reports a committed message (default heuristic: any type !== 'system'), then aborts and persists as before. If the executor never commits within cancelCommitTimeoutMs (default 30s), the run is marked failednot cancelled — and the thread is left untouched so the next request starts with a fresh session.
  • AgentHandler / AgentExecutor gain an optional isSessionCommitted(msg, history) hook for executors that want a stricter policy (e.g. a filesystem probe for the jsonl file) than the default heuristic.
  • RunBuilder.fail now accepts the cancelling → failed transition, covering the watchdog-timeout path above.

Behaviour changes

  • cancelRun response time grows from ~ms to up to cancelCommitTimeoutMs (default 30s) in the cold-start window. Callers should be prepared for the longer HTTP hold.
  • persistMessagesOnAbort is now gated on task.committed in all three execution paths (syncRun, asyncRun, streamRun). When the executor never committed, the thread is not appended with the turn's input — this is the whole point of the fix, and reverses the previous behaviour encoded by the now-updated syncRun: should persist only the user message when aborted before any chunk regression test.
  • Timeout waiting for commit produces AgentTimeoutError (HTTP 408 via the status property) instead of a silent successful cancel.

The interface additions are fully backward compatible: isSessionCommitted and cancelCommitTimeoutMs are optional.

Test plan

  • core/agent-runtime test suite passes (130 tests) including 4 new cases covering cancel-hold-until-committed, cancel-after-committed-is-immediate, cancel-timeout-fails-run, and custom isSessionCommitted hook.
  • core/controller-decorator test suite passes (74 tests).
  • RunBuilder has a new unit test for cancelling → failed.
  • Downstream smoke test via sandbox-ai-use (planned after this lands and a canary is published).

Summary by CodeRabbit

  • New Features

    • Added optional session-commit hook to let the runtime detect when an executor/session is persisted.
    • Runs now emit a commit event when persistence is detected; cancellation waits for commit before aborting.
  • Bug Fixes

    • Cancellation now persists thread messages only after commit; abort-before-commit leaves thread empty.
    • Added configurable cancel timeout (cancelCommitTimeoutMs) and new AgentTimeoutError (HTTP 408) if commit times out.
    • Allow cancelling -> failed state transition without error.
  • Tests

    • Expanded tests covering cancelRun commit behavior, timeout watchdog, and state transitions.

cancelRun used to abort and persist the user message the moment it was
called, even when the executor's underlying session (e.g. Claude Code
SDK jsonl) had not yet been written to disk. The next resume on the
same thread would then diverge from a session that was never created
and fail at startup.

Now cancelRun blocks until the executor yields a message that marks
the session as committed (default heuristic: any non-system message)
or the cancelCommitTimeoutMs watchdog (default 30s) fires, in which
case the run is marked failed and the thread is left untouched so the
next request starts with a fresh session. Adds an optional
isSessionCommitted hook on AgentHandler for stricter executor policies.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5484795a-dd37-4608-84a7-a49e22f54b70

📥 Commits

Reviewing files that changed from the base of the PR and between 05a57e6 and 25d1e77.

📒 Files selected for processing (2)
  • core/agent-runtime/src/AgentRuntime.ts
  • core/agent-runtime/test/AgentRuntime.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/agent-runtime/src/AgentRuntime.ts

📝 Walkthrough

Walkthrough

The runtime now tracks per-run RunTaskState (including a committed flag and events) and consults an optional executor.isSessionCommitted hook (fallback: non-system message) to detect when a session is persisted. cancelRun waits for commit (or times out with AgentTimeoutError) before aborting and conditionally persisting abort-thread messages.

Changes

Cohort / File(s) Summary
Core Runtime State Management
core/agent-runtime/src/AgentRuntime.ts
Introduced per-run RunTaskState with committed flag and EventEmitter. Added commit-detection via optional executor.isSessionCommitted(msg, history) with msg.type !== 'system' fallback. cancelRun blocks until commit or timeout, gates abort/persistence on task.committed, expands post-loop terminal-status checks, and emits commit/end events.
Run State Transitions
core/agent-runtime/src/RunBuilder.ts
RunBuilder.fail() now permits transitioning from CancellingFailed; updated validation and JSDoc.
Interface Definitions
core/controller-decorator/src/decorator/agent/AgentHandler.ts
Added optional `isSessionCommitted?(msg: AgentMessage, history: AgentMessage[]): boolean
Error Types
core/types/agent-runtime/errors.ts
Added AgentTimeoutError (extends Error, status = 408) for commit-wait timeouts in cancelRun.
Tests
core/agent-runtime/test/AgentRuntime.test.ts, core/agent-runtime/test/RunBuilder.test.ts
Expanded tests: cancelRun commit-wait behavior, commit-timeout -> AgentTimeoutError and failed persistence, custom isSessionCommitted hook, quick return when already committed, and CancellingFailed RunBuilder test. Adjusted abort persistence expectations for pre-commit aborts.

Sequence Diagram(s)

sequenceDiagram
    actor Caller as Caller / API
    participant AgentRuntime
    participant Executor
    participant Watchdog as Watchdog/Timer

    Caller->>AgentRuntime: cancelRun(runId)
    AgentRuntime->>AgentRuntime: Inspect task.committed
    alt not committed
        AgentRuntime->>Watchdog: start cancelCommitTimeout
        Note over AgentRuntime,Executor: Wait for committing message or timeout
        Executor->>AgentRuntime: yield AgentMessage (msg)
        AgentRuntime->>AgentRuntime: call isSessionCommitted(msg, history)?  
        alt returns true
            AgentRuntime->>AgentRuntime: mark task.committed = true
            AgentRuntime->>AgentRuntime: emit "commit"
            Watchdog->>AgentRuntime: cancel timeout
        else returns false
            AgentRuntime->>AgentRuntime: continue waiting
        end
        alt timeout fires
            Watchdog->>AgentRuntime: timeout
            AgentRuntime->>AgentRuntime: persist run as Failed
            AgentRuntime->>Executor: abort execution
            AgentRuntime->>Caller: reject with AgentTimeoutError
        end
    else already committed
        AgentRuntime->>AgentRuntime: proceed immediately
    end
    alt committed before abort
        AgentRuntime->>Executor: abort execution
        AgentRuntime->>AgentRuntime: persist abort-thread messages
        AgentRuntime->>Caller: return cancelled result
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • akitaSummer

Poem

🐰 I waited for the commit to show,
A little hop through async flow,
If time runs out I sound the bell,
Timeout warns — all's saved or well.
🥕 Commit, abort, and off I go!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main change: making cancelRun wait until the executor session is committed before aborting, which is the core objective of this PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cancel-hold-until-committed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (6)
core/agent-runtime/test/RunBuilder.test.ts (1)

157-161: Good coverage for the watchdog timeout transition.

Test correctly pins down the new Cancelling → Failed transition added in RunBuilder.fail(). Consider also asserting update.lastError.code === AgentErrorCode.ExecError to parallel the existing in_progress → failed test and catch regressions in the error-payload shape.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/agent-runtime/test/RunBuilder.test.ts` around lines 157 - 161, Add an
assertion to the Cancelling → Failed test that the error payload shape matches
other failure tests: after creating the RunBuilder with RunBuilder.create(...,
'thread_1') and calling rb.fail(new Error('commit timeout')), assert that
update.lastError.code === AgentErrorCode.ExecError (and optionally that
update.lastError.message contains 'commit timeout') so the test mirrors the
existing in_progress → failed check and prevents regressions to the error
payload created by RunBuilder.fail.
core/controller-decorator/src/decorator/agent/AgentHandler.ts (1)

16-27: Hook signature is clear and backward compatible.

Shape matches AgentExecutor.isSessionCommitted in core/agent-runtime/src/AgentRuntime.ts (line 55), and the history array lets handlers implement stateful heuristics (e.g., “commit once the second non-system message lands”). One nit: documenting the guarantee that history includes msg as its last element (matches markCommittedIfNeeded passing streamMessages after push(msg)) would remove ambiguity for handler authors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/controller-decorator/src/decorator/agent/AgentHandler.ts` around lines
16 - 27, Update the isSessionCommitted? hook documentation in AgentHandler.ts to
state explicitly that the provided history array always includes the current msg
as its last element (this matches how markCommittedIfNeeded passes
streamMessages after push(msg)); reference the isSessionCommitted signature and
note it is shape-compatible with AgentExecutor.isSessionCommitted in
AgentRuntime.ts so handler authors can rely on history[history.length-1] === msg
when implementing stateful heuristics.
core/agent-runtime/src/AgentRuntime.ts (3)

62-68: Operational consideration: default 30s cancel latency.

The default cancelCommitTimeoutMs = 30_000 means a client-initiated cancel during an executor cold-start can now block the HTTP request for up to 30s before returning 408. Two suggestions:

  1. Log a warn (not just error) when the watchdog fires in cancelRun so operators have a signal for repeated cold-start commit timeouts — the current code path only logs if store.updateRun itself fails.
  2. Consider exposing a metric or structured log line at waitForCommitted entry with the configured timeout and the run's elapsed time, to make commit-latency distributions observable in production.

This is also worth surfacing in the package README / changelog since existing deployments with short client-side timeouts may now see 408s where they previously saw 200s.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/agent-runtime/src/AgentRuntime.ts` around lines 62 - 68, The
cancelCommitTimeoutMs default can block client cancels during cold-starts;
update cancelRun to emit a processLogger.warn (not just error) when the watchdog
times out before marking the run failed so operators see repeated commit
timeouts (refer to cancelRun and store.updateRun paths), and add an observable
log/metric at the start of waitForCommitted that records the configured
cancelCommitTimeoutMs and the run's elapsed time (include runId and elapsedMs)
so commit-latency distributions are measurable in production.

478-497: Heuristic fallback is reasonable; minor resilience note.

Default msg.type !== 'system' correctly treats Claude Code SDK's system/init as uncommitted while counting user/assistant/result as committed (consistent with core/types/agent-runtime/AgentMessage.ts). The try/catch around the hook (line 489) preventing a throwing handler from killing the run loop is a good defensive choice.

One nit: committed is reassigned in both branches; consider initializing let committed = false; so the catch branch doesn't need an explicit assignment, slightly reducing cognitive load. Non-blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/agent-runtime/src/AgentRuntime.ts` around lines 478 - 497, In
markCommittedIfNeeded, initialize committed to false (e.g., let committed =
false;) before the try so the catch doesn't need to reassign; then keep the
existing branch that sets committed via this.executor.isSessionCommitted(...) or
the heuristic (msg.type !== 'system'), and leave the existing behavior that sets
task.committed = true and calls task.emitter.emit('commit') when committed is
truthy; this simplifies the control flow around executor.isSessionCommitted and
preserves current semantics.

696-747: cancelRun semantics are correct, but worth calling out two edge cases.

  1. When waitForCommitted resolves because the task's end event fired (not commit), task.committed may still be false. The subsequent task.abortController.abort() is a no-op on an already-ended task, and the freshRun terminal check (line 734) correctly avoids the rb.cancel() transition if the task finished as Completed/Failed. This path looks sound but is subtle — a comment near line 727 noting "task may have already ended before committing; freshRun check covers it" would help future readers.

  2. In the timeout branch (lines 710–725), after rb.fail() the local rb is in state Failed, and task.abortController.abort() then drives the executor's aborted branch, which calls finaliseAbortedRun. That helper short-circuits on terminal states (line 580), so the Failed state is preserved. Good — but this relies on ordering: the store write must complete before the task observes the abort and races through its finally. Since you await store.updateRun(..., rb.fail(...)) before abort(), the ordering holds. Recommend preserving that order explicitly with a comment (already hinted at in the existing comment, but tying finaliseAbortedRun's terminal-skip to it would be clearer).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/agent-runtime/src/AgentRuntime.ts` around lines 696 - 747, cancelRun has
two subtle race-related edge cases that need inline comments: add a brief
comment in cancelRun near the waitForCommitted resolution path noting that
waitForCommitted can return because the task's 'end' event fired (so
task.committed may still be false), making task.abortController.abort() a no-op
for an already-ended task and explaining that the subsequent freshRun
terminal-check (using AgentRuntime.TERMINAL_RUN_STATUSES and
RunBuilder.fromRecord) intentionally prevents applying rb.cancel() to a run that
already finished; and add a short comment in the timeout/error branch where you
call this.store.updateRun(runId, rb.fail(...)) before
task.abortController.abort() explaining the required ordering—store write must
complete before abort so finaliseAbortedRun will observe the terminal state and
skip transitioning (reference finaliseAbortedRun, rb.fail, and
abortController.abort).
core/agent-runtime/test/AgentRuntime.test.ts (1)

771-928: Comprehensive coverage of the new commit-gating paths.

Scenarios cover the four important branches: (a) hold-until-commit via default heuristic, (b) timeout → Failed + empty thread + AgentTimeoutError, (c) fast-path when already committed, and (d) custom isSessionCommitted hook. The deterministic gating (blocking executor + promise gate) avoids timing flake in principle.

One optional hardening: in the timeout test (line 826–862), after runtime.cancelRun rejects, asserting persisted.failedAt and persisted.lastError?.code === AgentErrorCode.ExecError would pin down the exact shape produced by rb.fail(AgentTimeoutError) and catch any future change that silently swaps error propagation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/agent-runtime/test/AgentRuntime.test.ts` around lines 771 - 928, The
timeout test should assert that the persisted run records the failure details
produced by the runtime's error path; after the runtime.cancelRun(...) rejection
in the failing-commit-timeout test, fetch the persisted run via
store.getRun(result.id) and assert that persisted.failedAt is set and
persisted.lastError?.code === AgentErrorCode.ExecError (the shape produced when
rb.fail(AgentTimeoutError) is used), so update the test around the timeout
branch to include these assertions referencing runtime.cancelRun,
AgentTimeoutError, rb.fail, persisted.failedAt and persisted.lastError?.code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/agent-runtime/src/AgentRuntime.ts`:
- Around line 506-531: In waitForCommitted, ESLint flags cleanup as used before
it's declared; fix by forward-declaring timer and defining cleanup before
creating the timeout/handlers: declare let timer: ReturnType<typeof setTimeout>;
then define const cleanup = (): void => { if (timer) clearTimeout(timer);
task.emitter.off('commit', onCommit); task.emitter.off('end', onEnd); }; then
create onCommit/onEnd and assign timer = globalThis.setTimeout(..., timeoutMs)
(using AgentTimeoutError in the reject). Update uses of task.emitter.once/off
and the onCommit/onEnd handlers to reference the already-declared cleanup.

---

Nitpick comments:
In `@core/agent-runtime/src/AgentRuntime.ts`:
- Around line 62-68: The cancelCommitTimeoutMs default can block client cancels
during cold-starts; update cancelRun to emit a processLogger.warn (not just
error) when the watchdog times out before marking the run failed so operators
see repeated commit timeouts (refer to cancelRun and store.updateRun paths), and
add an observable log/metric at the start of waitForCommitted that records the
configured cancelCommitTimeoutMs and the run's elapsed time (include runId and
elapsedMs) so commit-latency distributions are measurable in production.
- Around line 478-497: In markCommittedIfNeeded, initialize committed to false
(e.g., let committed = false;) before the try so the catch doesn't need to
reassign; then keep the existing branch that sets committed via
this.executor.isSessionCommitted(...) or the heuristic (msg.type !== 'system'),
and leave the existing behavior that sets task.committed = true and calls
task.emitter.emit('commit') when committed is truthy; this simplifies the
control flow around executor.isSessionCommitted and preserves current semantics.
- Around line 696-747: cancelRun has two subtle race-related edge cases that
need inline comments: add a brief comment in cancelRun near the waitForCommitted
resolution path noting that waitForCommitted can return because the task's 'end'
event fired (so task.committed may still be false), making
task.abortController.abort() a no-op for an already-ended task and explaining
that the subsequent freshRun terminal-check (using
AgentRuntime.TERMINAL_RUN_STATUSES and RunBuilder.fromRecord) intentionally
prevents applying rb.cancel() to a run that already finished; and add a short
comment in the timeout/error branch where you call this.store.updateRun(runId,
rb.fail(...)) before task.abortController.abort() explaining the required
ordering—store write must complete before abort so finaliseAbortedRun will
observe the terminal state and skip transitioning (reference finaliseAbortedRun,
rb.fail, and abortController.abort).

In `@core/agent-runtime/test/AgentRuntime.test.ts`:
- Around line 771-928: The timeout test should assert that the persisted run
records the failure details produced by the runtime's error path; after the
runtime.cancelRun(...) rejection in the failing-commit-timeout test, fetch the
persisted run via store.getRun(result.id) and assert that persisted.failedAt is
set and persisted.lastError?.code === AgentErrorCode.ExecError (the shape
produced when rb.fail(AgentTimeoutError) is used), so update the test around the
timeout branch to include these assertions referencing runtime.cancelRun,
AgentTimeoutError, rb.fail, persisted.failedAt and persisted.lastError?.code.

In `@core/agent-runtime/test/RunBuilder.test.ts`:
- Around line 157-161: Add an assertion to the Cancelling → Failed test that the
error payload shape matches other failure tests: after creating the RunBuilder
with RunBuilder.create(..., 'thread_1') and calling rb.fail(new Error('commit
timeout')), assert that update.lastError.code === AgentErrorCode.ExecError (and
optionally that update.lastError.message contains 'commit timeout') so the test
mirrors the existing in_progress → failed check and prevents regressions to the
error payload created by RunBuilder.fail.

In `@core/controller-decorator/src/decorator/agent/AgentHandler.ts`:
- Around line 16-27: Update the isSessionCommitted? hook documentation in
AgentHandler.ts to state explicitly that the provided history array always
includes the current msg as its last element (this matches how
markCommittedIfNeeded passes streamMessages after push(msg)); reference the
isSessionCommitted signature and note it is shape-compatible with
AgentExecutor.isSessionCommitted in AgentRuntime.ts so handler authors can rely
on history[history.length-1] === msg when implementing stateful heuristics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9802f7c6-e798-495d-9db8-740815c1a8c3

📥 Commits

Reviewing files that changed from the base of the PR and between 9e8c9f3 and 05a57e6.

📒 Files selected for processing (6)
  • core/agent-runtime/src/AgentRuntime.ts
  • core/agent-runtime/src/RunBuilder.ts
  • core/agent-runtime/test/AgentRuntime.test.ts
  • core/agent-runtime/test/RunBuilder.test.ts
  • core/controller-decorator/src/decorator/agent/AgentHandler.ts
  • core/types/agent-runtime/errors.ts

Comment thread core/agent-runtime/src/AgentRuntime.ts
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a commitment-aware cancellation process in AgentRuntime to ensure that run state is only persisted after an executor's session is committed to storage. It introduces a new isSessionCommitted hook, a RunTaskState interface for tracking commitment status, and a configurable timeout for cancellation. The cancelRun method now blocks until commitment or timeout, with the latter resulting in an AgentTimeoutError and a transition to the Failed status. Feedback was provided regarding the history parameter in the commitment check, noting that it includes the current message and suggesting the check be performed before the message is added to the history array.

Comment thread core/agent-runtime/src/AgentRuntime.ts
jerryliang64 and others added 3 commits April 23, 2026 23:20
30s of silent hold makes it hard to tell from the logs whether a cancel
request is waiting for the executor to commit or genuinely stuck.
Emit info on entering the hold and error when the watchdog fires.
Arrow-function handlers that reference each other via a shared cleanup()
tripped @typescript-eslint/no-use-before-define in CI. Stash them on a
container object so cleanup reads them by property access and the source
order stays linear.
…rite

The cancelRun commit-timeout path writes Failed to the store and then
aborts. When the executor does not respect the abort signal and
finishes naturally, syncRun / asyncRun / executeStreamBackground exit
their for-await loops without hitting the in-loop abort branch. The
existing post-loop status check only treated Cancelling/Cancelled as
terminal-ish, so rb.complete(usage) would race through and overwrite
the watchdog-set Failed with Completed.

Introduce POST_LOOP_TERMINAL_STATUSES ({Cancelling, Cancelled, Failed,
Expired}; Completed is intentionally excluded so the normal success
path is untouched) and route all three execution paths through it.
syncRun, which previously had no such check, gains one mirroring its
abort branch. streamRun's SSE error event now reports the actual run
status instead of a hardcoded 'cancelled' string.

Adds a regression test that yields only a system message and then
finishes naturally after the 50ms watchdog has already written Failed
— the run must remain Failed, not flip to Completed, and the thread
must stay empty because the executor never committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jerryliang64 jerryliang64 merged commit 4e02a28 into master Apr 23, 2026
12 checks passed
@jerryliang64 jerryliang64 deleted the fix/cancel-hold-until-committed branch April 23, 2026 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant