[Bug] Per-task failures escape existing isolation and become worker-fatal (0.4.0)

## Per-task failures escape existing isolation and become worker-fatal (0.4.0)

`workflow_future.rs:669` already wraps the workflow body in `catch_unwind` and converts a panic into a workflow-task failure — so the SDK's intent is clearly that one workflow's error doesn't take down the worker. We hit a production crash-loop that traced to two gaps in that isolation on the workflow side, plus a similar gap on the activity side, all in `temporalio-sdk` 0.4.0. In each case a single task's error propagates out of `Worker::run()` and the process exits. Go/Python/Java SDKs treat all three as task-level failures.

### 1. Cancelling an already-fired timer is worker-fatal

**Symptom:** `Worker::run()` returns `Err` with cause chain `Workflow futures encountered an error: Command Timer(<seq>) not found to unblock!`

**Path:** `src/workflow_future.rs:699` — the `RustWfCmd::Cancel(CancellableID::Timer(seq))` arm calls `self.unblock(UnblockEvent::Timer(seq, TimerResult::Cancelled))?`. If `FireTimer` has already removed that seq from `command_status` in the same activation, `unblock` returns `Err("Command Timer(<seq>) not found to unblock!")` (`:198`), the `?` surfaces it from `WorkflowFuture::poll`, and it reaches the joiner (see #2).

**Minimal repro** (pure SDK, no signals):
```rust
let t1 = ctx.timer(Duration::from_secs(5));
let t2 = ctx.timer(Duration::from_secs(5));
futures_util::pin_mut!(t1, t2);
futures_util::select_biased! {
    _ = t1 => {}
    _ = t2 => {}
}
if !t2.is_terminated() { t2.cancel(); }
```
Both timers fire in the same activation. `handle_job` processes both `FireTimer`s and removes both seqs from `command_status`. `select_biased!` polls `t1` first → takes that branch; `t2` was never polled to `Ready`, so `is_terminated()` is false; `t2.cancel()` emits `RustWfCmd::Cancel(Timer(seq_t2))`; `:699` misses → worker exits.

A variant that's closer to what we hit: a `wait_condition` whose predicate becomes true in the same activation as a `FireTimer`, raced under `select_biased!` with the condition arm first, followed by `timer.cancel()`.

**Suggested fix:** in the `RustWfCmd::Cancel(Timer)` arm, treat a missing seq as a no-op: the timer has already resolved, so cancelling it has nothing left to do. For example, drop the `?` and `debug!`-log the miss instead of propagating it.
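
A minimal, self-contained sketch of that tolerant-cancel behaviour. It does not use the SDK's real types: `Commands`, `start_timer`, `cancel_timer`, and the `HashSet`-based `command_status` are hypothetical stand-ins, and only the error message is taken from the report above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for how a timer was resolved.
#[derive(Debug, PartialEq)]
enum TimerResult {
    Fired,
    Cancelled,
}

/// Hypothetical stand-in for the workflow future's command bookkeeping.
#[derive(Default)]
struct Commands {
    /// Seqs of timers that are still pending.
    command_status: HashSet<u32>,
    /// Record of every successful resolution, in order.
    resolved: Vec<(u32, TimerResult)>,
}

impl Commands {
    fn start_timer(&mut self, seq: u32) {
        self.command_status.insert(seq);
    }

    /// Strict behaviour: resolving a seq that is no longer pending is an error.
    fn unblock(&mut self, seq: u32, result: TimerResult) -> Result<(), String> {
        if self.command_status.remove(&seq) {
            self.resolved.push((seq, result));
            Ok(())
        } else {
            Err(format!("Command Timer({seq}) not found to unblock!"))
        }
    }

    /// Tolerant cancel: a miss means the timer already resolved, so it is
    /// swallowed. Returns true only if a pending timer was actually cancelled.
    fn cancel_timer(&mut self, seq: u32) -> bool {
        match self.unblock(seq, TimerResult::Cancelled) {
            Ok(()) => true,
            // Real code would `debug!`-log the miss here instead of using `?`.
            Err(_miss) => false,
        }
    }
}
```

With this shape, firing a timer and then cancelling it in the same activation leaves the bookkeeping unchanged, while cancelling a still-pending timer behaves as before.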

### 2. `wf_future_joiner` propagates per-workflow errors to `Worker::run()`

**Path:** `src/lib.rs:621–637`. Inside `try_for_each_concurrent`:
```rust
let result = join_handle.await.map_err(anyhow::Error::new)?;
if let Err(e) = result && !matches!(e, WorkflowTermination::Evicted) {
    return Err(anyhow::Error::new(e));
}
```
This sits in `Worker::run()`'s `tokio::try_join!`, so either of two per-workflow failures kills the whole worker: a `JoinError` (a panic outside the `:669` catch, e.g., in sub-future processing), or `WorkflowFuture::poll` returning `Err` (#1 above, or the `:495` activation-channel-lost path).

**Suggested fix:** on per-workflow `Err`/`JoinError`, send `RespondWorkflowTaskFailed` for that run with the cause, evict it from the cache, log at `error!`, and continue the stream. Reserve worker-fatal for genuinely unrecoverable state (completions channel closed, core shutdown).
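
To illustrate the difference in control flow, here is a synchronous sketch that contrasts a fail-fast join loop with an isolating one. Everything in it is hypothetical (`RunOutcome`, `join_fail_fast`, `join_isolated`); the real joiner is async and would also fail the workflow task and evict the run, which is only indicated by a comment here.

```rust
/// Hypothetical outcome of one workflow run's future.
#[derive(Debug, Clone, PartialEq)]
enum RunOutcome {
    Completed,
    Evicted,
    Failed(String),
}

/// Mirrors the current shape: the first non-eviction failure aborts the
/// whole loop, and later runs are never processed.
fn join_fail_fast(outcomes: &[RunOutcome]) -> Result<usize, String> {
    let mut processed = 0;
    for outcome in outcomes {
        if let RunOutcome::Failed(cause) = outcome {
            return Err(cause.clone());
        }
        processed += 1;
    }
    Ok(processed)
}

/// Isolating variant: a failed run is recorded and the loop continues.
/// Returns the number of healthy runs and the causes of the failed ones.
fn join_isolated(outcomes: &[RunOutcome]) -> (usize, Vec<String>) {
    let mut processed = 0;
    let mut failed = Vec::new();
    for outcome in outcomes {
        match outcome {
            // Real code would fail the workflow task, evict the run,
            // and log at `error!` here.
            RunOutcome::Failed(cause) => failed.push(cause.clone()),
            RunOutcome::Completed | RunOutcome::Evicted => processed += 1,
        }
    }
    (processed, failed)
}
```

In the isolating variant a single bad run costs only that run; a separate error path would still be needed for the genuinely unrecoverable cases listed above.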

### 3. Unregistered activity type is worker-fatal

**Path:** `src/lib.rs:924–929` in `ActivityHalf::activity_task_handler`:
```rust
let act_fn = self.activities.get(&start.activity_type).ok_or_else(|| {
    anyhow!("No function registered for activity type {}", start.activity_type)
})?;
```
This error propagates via `?` at `:735` in the activity poll loop, then through `try_join!`, and `Worker::run()` returns `Err`.

This makes rolling deploys that add or remove activity types hazardous: during the overlap, old-image workers receive new activity types (or vice versa) and exit. Go's SDK responds with a `NotFoundError` for the task and lets the server retry it elsewhere.

**Suggested fix:** complete the activity task with `ActivityExecutionResult::failed(...)` carrying a not-registered `ApplicationFailure`, and return `Ok(())` from `activity_task_handler`.
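
A small sketch of that dispatch shape, using stand-in types rather than the SDK's: `ActivityOutcome`, `ActivityFn`, `register`, `handle`, and the sample `shout` activity are all hypothetical, and this `ActivityHalf` is a simplified model that only shares the name. The point is that a lookup miss becomes a per-task failure value, so the caller has nothing to propagate with `?`.

```rust
use std::collections::HashMap;

/// Hypothetical per-task result reported back to the server.
#[derive(Debug, PartialEq)]
enum ActivityOutcome {
    Completed(String),
    Failed(String),
}

/// Hypothetical activity signature: string in, string out.
type ActivityFn = fn(&str) -> String;

/// Sample activity used for illustration only.
fn shout(input: &str) -> String {
    input.to_uppercase()
}

/// Simplified model of the activity side of a worker.
struct ActivityHalf {
    activities: HashMap<String, ActivityFn>,
}

impl ActivityHalf {
    fn new() -> Self {
        Self { activities: HashMap::new() }
    }

    fn register(&mut self, name: &str, f: ActivityFn) {
        self.activities.insert(name.to_string(), f);
    }

    /// Never returns an error for an unknown type: the miss is turned into
    /// a task-level failure and the poll loop can keep running.
    fn handle(&self, activity_type: &str, input: &str) -> ActivityOutcome {
        match self.activities.get(activity_type) {
            Some(f) => ActivityOutcome::Completed(f(input)),
            None => ActivityOutcome::Failed(format!(
                "No function registered for activity type {activity_type}"
            )),
        }
    }
}
```

Whether the resulting failure should be marked retryable is a separate decision; leaving it retryable matches the rolling-deploy case, where another worker may have the type registered.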

### Version

`temporalio-sdk = "0.4.0"` (crates.io). If any of these are already fixed on `main` or in a later release, a pointer would be great — happy to verify a patch.
