Per-task failures escape existing isolation and become worker-fatal (0.4.0)
workflow_future.rs:669 already wraps the workflow body in catch_unwind and converts a panic into a workflow-task failure — so the SDK's intent is clearly that one workflow's error doesn't take down the worker. We hit a production crash-loop that traced to two gaps in that isolation on the workflow side, plus a similar gap on the activity side, all in temporalio-sdk 0.4.0. In each case a single task's error propagates out of Worker::run() and the process exits. Go/Python/Java SDKs treat all three as task-level failures.
1. Cancelling an already-fired timer is worker-fatal
Symptom: Worker::run() returns Err with cause chain Workflow futures encountered an error: Command Timer(<seq>) not found to unblock!
Path: src/workflow_future.rs:699 — the RustWfCmd::Cancel(CancellableID::Timer(seq)) arm calls self.unblock(UnblockEvent::Timer(seq, TimerResult::Cancelled))?. If FireTimer has already removed that seq from command_status in the same activation, unblock returns Err("Command Timer(<seq>) not found to unblock!") (:198), the ? surfaces it from WorkflowFuture::poll, and it reaches the joiner (see #2).
Minimal repro (pure SDK, no signals):
let t1 = ctx.timer(Duration::from_secs(5));
let t2 = ctx.timer(Duration::from_secs(5));
futures_util::pin_mut!(t1, t2);
futures_util::select_biased! {
    _ = t1 => {}
    _ = t2 => {}
}
if !t2.is_terminated() {
    t2.cancel();
}
Both timers fire in the same activation. handle_job processes both FireTimers and removes both seqs from command_status. select_biased! polls t1 first → takes that branch; t2 was never polled to Ready, so is_terminated() is false; t2.cancel() emits RustWfCmd::Cancel(Timer(seq_t2)); :699 misses → worker exits.
A variant that's closer to what we hit: a wait_condition whose predicate becomes true in the same activation as a FireTimer, raced under select_biased! with the condition arm first, followed by timer.cancel().
Suggested fix: in the RustWfCmd::Cancel(Timer) arm, treat a missing seq as a no-op. The timer has already resolved, so cancelling it should have no effect. E.g., drop the ? and debug!-log the miss.
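To make the proposed behaviour concrete, here is a minimal std-only sketch of the idea. It is not the SDK's code: TimerTable, CancelOutcome, start_timer and cancel_timer are hypothetical names, and the real command_status holds more than a set of timer seqs. The strict unblock mirrors the error described above; the tolerant cancel_timer maps a miss to an "already resolved" outcome instead of an Err.

```rust
use std::collections::HashSet;

/// Hypothetical result of a cancel request (not an SDK type).
#[derive(Debug, PartialEq)]
enum CancelOutcome {
    /// The timer was still pending and has now been unblocked as cancelled.
    Cancelled,
    /// The timer had already fired (or been cancelled); nothing to do.
    AlreadyResolved,
}

/// Hypothetical stand-in for the timer part of `command_status`.
#[derive(Default)]
struct TimerTable {
    pending: HashSet<u32>,
}

impl TimerTable {
    fn start_timer(&mut self, seq: u32) {
        self.pending.insert(seq);
    }

    /// Strict unblock: a missing seq is an error, as in the reported path.
    fn unblock(&mut self, seq: u32) -> Result<(), String> {
        if self.pending.remove(&seq) {
            Ok(())
        } else {
            Err(format!("Command Timer({}) not found to unblock!", seq))
        }
    }

    /// Tolerant cancel: a miss means the timer already resolved, so the
    /// cancel is reported as a no-op instead of propagating the error.
    fn cancel_timer(&mut self, seq: u32) -> CancelOutcome {
        match self.unblock(seq) {
            Ok(()) => CancelOutcome::Cancelled,
            // A real implementation would debug!-log the miss here.
            Err(_) => CancelOutcome::AlreadyResolved,
        }
    }
}
```

With this shape, a cancel issued after FireTimer has already removed the seq yields AlreadyResolved, and the workflow task carries on.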
2. wf_future_joiner propagates per-workflow errors to Worker::run()
Path: src/lib.rs:621–637. Inside try_for_each_concurrent:
let result = join_handle.await.map_err(anyhow::Error::new)?;
if let Err(e) = result && !matches!(e, WorkflowTermination::Evicted) {
    return Err(anyhow::Error::new(e));
}
This sits in Worker::run()'s tokio::try_join!, so either of the following kills the whole worker: a JoinError (a panic outside the :669 catch, e.g., in sub-future processing), or WorkflowFuture::poll returning Err (#1 above, or the :495 activation-channel-lost path).
Suggested fix: on per-workflow Err/JoinError, send RespondWorkflowTaskFailed for that run with the cause, evict it from the cache, log at error!, and continue the stream. Reserve worker-fatal for genuinely unrecoverable state (completions channel closed, core shutdown).
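A rough std-only sketch of the classification we have in mind follows. The names are illustrative only: the Failed variant, JoinerAction, RunOutcome and classify are hypothetical, and the JoinError is modelled as a plain String. Only WorkflowTermination::Evicted is taken from the snippet above.

```rust
/// Simplified model of how a workflow future can end.
/// `Evicted` appears in the SDK snippet; `Failed` is a hypothetical catch-all.
#[derive(Debug)]
enum WorkflowTermination {
    Evicted,
    Failed(String),
}

/// What the joiner should do after one workflow future finishes (hypothetical).
#[derive(Debug, PartialEq)]
enum JoinerAction {
    /// Nothing to report; keep draining the stream.
    Continue,
    /// Fail this run's workflow task with `cause`, evict the run, keep going.
    FailTaskAndEvict { run_id: String, cause: String },
}

/// Outer `Err` models a JoinError (panic in the task); inner `Err` models
/// `WorkflowFuture::poll` returning an error.
type RunOutcome = Result<Result<(), WorkflowTermination>, String>;

/// Map a single run's outcome to a per-run action. No branch returns a
/// worker-level error, so one bad run cannot end the joiner.
fn classify(run_id: &str, outcome: RunOutcome) -> JoinerAction {
    match outcome {
        Ok(Ok(())) | Ok(Err(WorkflowTermination::Evicted)) => JoinerAction::Continue,
        Ok(Err(WorkflowTermination::Failed(cause))) => JoinerAction::FailTaskAndEvict {
            run_id: run_id.to_string(),
            cause,
        },
        Err(join_error) => JoinerAction::FailTaskAndEvict {
            run_id: run_id.to_string(),
            cause: format!("workflow task join error: {}", join_error),
        },
    }
}
```

In the real joiner, the FailTaskAndEvict branch would be where RespondWorkflowTaskFailed is sent and the error! log is emitted; genuinely unrecoverable conditions would stay on a separate, worker-fatal path.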
3. Unregistered activity type is worker-fatal
Path: src/lib.rs:924–929 in ActivityHalf::activity_task_handler:
let act_fn = self.activities.get(&start.activity_type).ok_or_else(|| {
    anyhow!("No function registered for activity type {}", start.activity_type)
})?;
The resulting Err propagates via ? at :735 in the activity poll loop, then through try_join!, so Worker::run() returns Err.
This makes rolling deploys that add or remove activity types hazardous: during the overlap, old-image workers receive new activity types (or vice versa) and exit. Go's SDK responds with a NotFoundError for the task and lets the server retry it elsewhere.
Suggested fix: complete the activity task with ActivityExecutionResult::failed(...) carrying a not-registered ApplicationFailure, and return Ok(()) from activity_task_handler.
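As an illustration of the shape of that change, here is a small std-only model. ActivityRegistry, ActivityOutcome, ActivityFn and handle_task are hypothetical names, not SDK API, and the real fix would build an ActivityExecutionResult rather than this enum. The point is only that a lookup miss becomes a failed task result while the handler still returns Ok.

```rust
use std::collections::HashMap;

/// Hypothetical activity signature: string in, string out.
type ActivityFn = fn(&str) -> Result<String, String>;

/// Simplified stand-in for a per-task activity result.
#[derive(Debug, PartialEq)]
enum ActivityOutcome {
    Completed(String),
    Failed { message: String },
}

/// Hypothetical registry keyed by activity type name.
#[derive(Default)]
struct ActivityRegistry {
    activities: HashMap<String, ActivityFn>,
}

impl ActivityRegistry {
    fn register(&mut self, activity_type: &str, f: ActivityFn) {
        self.activities.insert(activity_type.to_string(), f);
    }

    /// `Err` is reserved for worker-fatal conditions; an unregistered
    /// activity type is reported as a failed task inside `Ok`.
    fn handle_task(&self, activity_type: &str, input: &str) -> Result<ActivityOutcome, String> {
        let outcome = match self.activities.get(activity_type) {
            Some(act_fn) => match act_fn(input) {
                Ok(output) => ActivityOutcome::Completed(output),
                Err(message) => ActivityOutcome::Failed { message },
            },
            None => ActivityOutcome::Failed {
                message: format!("No function registered for activity type {}", activity_type),
            },
        };
        Ok(outcome)
    }
}

/// Example activity used to exercise the registry.
fn shout(input: &str) -> Result<String, String> {
    Ok(input.to_uppercase())
}
```

During a rolling deploy, an old-image worker that receives a new activity type would then report a task failure and keep polling, which is the behaviour we would expect.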
Version
temporalio-sdk = "0.4.0" (crates.io). If any of these are already fixed on main or in a later release, a pointer would be great — happy to verify a patch.