[Bug] Per-task failures escape existing isolation and become worker-fatal (0.4.0) #1254

@kollektiv

workflow_future.rs:669 already wraps the workflow body in catch_unwind and converts a panic into a workflow-task failure — so the SDK's intent is clearly that one workflow's error doesn't take down the worker. We hit a production crash-loop that traced to two gaps in that isolation on the workflow side, plus a similar gap on the activity side, all in temporalio-sdk 0.4.0. In each case a single task's error propagates out of Worker::run() and the process exits. Go/Python/Java SDKs treat all three as task-level failures.

1. Cancelling an already-fired timer is worker-fatal

Symptom: Worker::run() returns Err with cause chain Workflow futures encountered an error: Command Timer(<seq>) not found to unblock!

Path: src/workflow_future.rs:699. The RustWfCmd::Cancel(CancellableID::Timer(seq)) arm calls self.unblock(UnblockEvent::Timer(seq, TimerResult::Cancelled))?. If FireTimer has already removed that seq from command_status in the same activation, unblock returns Err("Command Timer(<seq>) not found to unblock!") (:198), the ? surfaces it from WorkflowFuture::poll, and it reaches the joiner (see item 2 below).

Minimal repro (pure SDK, no signals):

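// Two timers with the same duration, so both fire in one activation.
let t1 = ctx.timer(Duration::from_secs(5));
let t2 = ctx.timer(Duration::from_secs(5));
futures_util::pin_mut!(t1, t2);
// select_biased! polls arms in order, so t1 wins and t2 is never polled to Ready.
futures_util::select_biased! {
    _ = t1 => {}
    _ = t2 => {}
}
// t2 has already fired but still reports not-terminated, so this cancel is emitted.
if !t2.is_terminated() { t2.cancel(); }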

Both timers fire in the same activation. handle_job processes both FireTimers and removes both seqs from command_status. select_biased! polls t1 first → takes that branch; t2 was never polled to Ready, so is_terminated() is false; t2.cancel() emits RustWfCmd::Cancel(Timer(seq_t2)); :699 misses → worker exits.

A variant that's closer to what we hit: a wait_condition whose predicate becomes true in the same activation as a FireTimer, raced under select_biased! with the condition arm first, followed by timer.cancel().

Suggested fix: in the RustWfCmd::Cancel(Timer) arm, treat a missing seq as a no-op, since the timer has already resolved and there is nothing left to cancel. E.g., drop the ? and debug!-log the miss.
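To make the proposal concrete, here is a minimal sketch using a simplified stand-in for the command table. CommandTable, cancel_timer_strict and cancel_timer_lenient are hypothetical names invented for illustration, and a HashSet stands in for command_status; only the error message and the idea of swallowing the miss come from the report above. It is not the SDK's actual code.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the SDK's `command_status`: the set of
/// timer seqs that are still pending.
#[derive(Default)]
pub struct CommandTable {
    pending_timers: HashSet<u32>,
}

impl CommandTable {
    /// Registers a timer as pending.
    pub fn start_timer(&mut self, seq: u32) {
        self.pending_timers.insert(seq);
    }

    /// Models `unblock`: resolving a timer removes it from the table,
    /// and a seq that is not present is reported as an error.
    pub fn unblock_timer(&mut self, seq: u32) -> Result<(), String> {
        if self.pending_timers.remove(&seq) {
            Ok(())
        } else {
            Err(format!("Command Timer({seq}) not found to unblock!"))
        }
    }

    /// Behaviour described in the report: the miss propagates to the caller
    /// (in the SDK, via `?`, all the way out of `Worker::run()`).
    pub fn cancel_timer_strict(&mut self, seq: u32) -> Result<(), String> {
        self.unblock_timer(seq)
    }

    /// Proposed behaviour: a miss means the timer already resolved, so the
    /// cancel is logged and ignored. Returns true if a timer was cancelled.
    pub fn cancel_timer_lenient(&mut self, seq: u32) -> bool {
        match self.unblock_timer(seq) {
            Ok(()) => true,
            Err(msg) => {
                // Stand-in for a `debug!` log line.
                eprintln!("ignoring cancel of resolved timer: {msg}");
                false
            }
        }
    }
}
```

With both timers fired (unblock_timer called for seq 1 and 2), cancel_timer_strict(2) returns the "not found to unblock" error, while cancel_timer_lenient(2) returns false and the caller carries on.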

2. wf_future_joiner propagates per-workflow errors to Worker::run()

Path: src/lib.rs:621–637. Inside try_for_each_concurrent:

let result = join_handle.await.map_err(anyhow::Error::new)?;
if let Err(e) = result && !matches!(e, WorkflowTermination::Evicted) {
    return Err(anyhow::Error::new(e));
}

This sits in Worker::run()'s tokio::try_join!, so either of the following kills the whole worker: a JoinError (a panic outside the :669 catch, e.g., in sub-future processing), or WorkflowFuture::poll returning Err (item 1 above, or the :495 activation-channel-lost path).

Suggested fix: on per-workflow Err/JoinError, send RespondWorkflowTaskFailed for that run with the cause, evict it from the cache, log at error!, and continue the stream. Reserve worker-fatal for genuinely unrecoverable state (completions channel closed, core shutdown).
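The split between per-run failures and worker-fatal conditions could look roughly like the sketch below. It is a synchronous, std-only model: RunOutcome, Action, on_outcome and drain are hypothetical names, and the real joiner is async and would send RespondWorkflowTaskFailed rather than return a value. The point is only that a per-run failure yields an action and the loop continues, while a closed completions channel is the sole early exit.

```rust
/// Hypothetical result of awaiting one workflow run's join handle.
#[derive(Debug)]
pub enum RunOutcome {
    Completed,
    Evicted,
    /// `WorkflowFuture::poll` returned Err, or the task panicked (JoinError).
    Failed(String),
    /// Genuinely unrecoverable: nothing can be reported back to core.
    CompletionsChannelClosed,
}

/// What the joiner should do for a single run.
#[derive(Debug, PartialEq)]
pub enum Action {
    Continue,
    /// Fail the workflow task with `cause`, evict the run, keep going.
    FailTaskAndEvict { run_id: String, cause: String },
}

/// Classifies one outcome. Only unrecoverable state becomes `Err`.
pub fn on_outcome(run_id: &str, outcome: RunOutcome) -> Result<Action, String> {
    match outcome {
        RunOutcome::Completed | RunOutcome::Evicted => Ok(Action::Continue),
        RunOutcome::Failed(cause) => Ok(Action::FailTaskAndEvict {
            run_id: run_id.to_string(),
            cause,
        }),
        RunOutcome::CompletionsChannelClosed => {
            Err("completions channel closed".to_string())
        }
    }
}

/// Processes a batch of outcomes, collecting the per-run failure actions.
/// A failed run does not stop the loop; a fatal condition does.
pub fn drain(outcomes: Vec<(String, RunOutcome)>) -> Result<Vec<Action>, String> {
    let mut actions = Vec::new();
    for (run_id, outcome) in outcomes {
        match on_outcome(&run_id, outcome)? {
            Action::Continue => {}
            action => {
                // Stand-in for an `error!` log line.
                eprintln!("workflow run failed: {action:?}");
                actions.push(action);
            }
        }
    }
    Ok(actions)
}
```

Feeding drain a completed run, a failed run and an evicted run yields a single FailTaskAndEvict action for the failed run, and the other two are unaffected.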

3. Unregistered activity type is worker-fatal

Path: src/lib.rs:924–929 in ActivityHalf::activity_task_handler:

let act_fn = self.activities.get(&start.activity_type).ok_or_else(|| {
    anyhow!("No function registered for activity type {}", start.activity_type)
})?;

The error propagates via ? at :735 in the activity poll loop → try_join! → Worker::run() returns Err.

This makes rolling deploys that add or remove activity types hazardous: during the overlap, old-image workers receive new activity types (or vice versa) and exit. Go's SDK responds with a NotFoundError for the task and lets the server retry it elsewhere.

Suggested fix: complete the activity task with ActivityExecutionResult::failed(...) carrying a not-registered ApplicationFailure, and return Ok(()) from activity_task_handler.
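As a rough illustration of that shape, the sketch below turns a registry miss into a failed task result instead of an error that escapes the handler. Activities, ActivityResult, ActivityFn and shout are hypothetical, std-only stand-ins; the real handler would build an ActivityExecutionResult and complete the task through core.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the result reported back for one activity task.
#[derive(Debug, PartialEq)]
pub enum ActivityResult {
    Completed(String),
    /// Task-level failure; the worker itself keeps running.
    Failed(String),
}

/// Simplified activity signature, for illustration only.
pub type ActivityFn = fn(&str) -> Result<String, String>;

/// Hypothetical registry mapping activity type names to functions.
pub struct Activities {
    registry: HashMap<String, ActivityFn>,
}

impl Activities {
    pub fn new() -> Self {
        Activities { registry: HashMap::new() }
    }

    pub fn register(&mut self, activity_type: &str, f: ActivityFn) {
        self.registry.insert(activity_type.to_string(), f);
    }

    /// Always returns a task result. An unregistered type becomes a
    /// `Failed` result rather than an error that escapes the handler.
    pub fn dispatch(&self, activity_type: &str, input: &str) -> ActivityResult {
        match self.registry.get(activity_type) {
            None => ActivityResult::Failed(format!(
                "No function registered for activity type {activity_type}"
            )),
            Some(f) => match f(input) {
                Ok(output) => ActivityResult::Completed(output),
                Err(e) => ActivityResult::Failed(e),
            },
        }
    }
}

/// Sample activity used to exercise the registry.
pub fn shout(input: &str) -> Result<String, String> {
    Ok(input.to_uppercase())
}
```

After registering shout, dispatch("shout", "hi") completes with "HI", and dispatch("missing", "hi") returns a Failed result carrying the not-registered message, leaving the caller free to poll the next task.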

Version

temporalio-sdk = "0.4.0" (crates.io). If any of these are already fixed on main or in a later release, a pointer would be great — happy to verify a patch.
