
fix(ic-agent): replace async-watch with tokio::sync::watch on non-WASM (0.46.1 release) #708

Merged
lwshang merged 3 commits into main from lwshang/fix-async-watch-panic-698
Mar 6, 2026

Conversation

@lwshang
Contributor

@lwshang lwshang commented Mar 5, 2026

Summary

Fixes #698: the "[bug] failed to observe change after notificaton." panic, reported against ic-agent 0.45 and still present in 0.46.0.

Root cause

async-watch v0.3.1 has a race condition in its changed() implementation: after the internal event-listener future resolves, maybe_changed() can return None if the executor context switches before the version is read, causing an unconditional expect(...) to panic. This race only manifests on multi-threaded tokio runtimes.

Fix

  • Replace async-watch with tokio::sync::watch on non-WASM targets. tokio::sync::watch is widely used on multi-threaded runtimes and does not exhibit this race. async-watch is kept for WASM, where the single-threaded environment avoids the race.
  • Change continue → break when fetch_receiver_recv returns Err (sender dropped). The old continue caused HealthManagerActor::run to spin-loop forever, starving the cancellation token arm in select!.
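
The second bullet can be illustrated with a standard-library analog. This is a minimal sketch, not the ic-agent code: it swaps the tokio watch channel and select! for std::sync::mpsc and a plain loop, and the names OnClosed, run_actor, and demo are hypothetical. The max_spins cap exists only so the continue variant terminates in the demo; without it that loop would never exit.

```rust
use std::sync::mpsc::{channel, Receiver};

/// What the loop does once the sender has been dropped (hypothetical enum).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum OnClosed {
    /// Old behaviour: go around the loop again.
    Continue,
    /// Fixed behaviour: leave the loop.
    Break,
}

/// Drains `rx` and returns (messages seen, iterations spent on a closed channel).
/// `max_spins` is a demo-only safety cap for the `Continue` variant.
pub fn run_actor(rx: Receiver<u32>, on_closed: OnClosed, max_spins: usize) -> (usize, usize) {
    let mut seen = 0;
    let mut spins = 0;
    loop {
        match rx.recv() {
            Ok(_) => seen += 1,
            // Once every sender is gone, `recv` returns Err immediately, every time.
            Err(_) => {
                spins += 1;
                if on_closed == OnClosed::Break || spins >= max_spins {
                    break;
                }
                continue;
            }
        }
    }
    (seen, spins)
}

/// Sends two values, drops the sender, then runs the loop.
pub fn demo(on_closed: OnClosed) -> (usize, usize) {
    let (tx, rx) = channel();
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    drop(tx);
    run_actor(rx, on_closed, 1_000)
}
```

With OnClosed::Break the loop leaves after a single failed receive. With OnClosed::Continue it burns every iteration the cap allows, which is the busy loop described above.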

Changes

  • ic-agent/Cargo.toml: move async-watch to WASM-only deps; add "sync" to non-WASM tokio features
  • type_aliases.rs: platform-specific SenderWatch/ReceiverWatch type aliases
  • dynamic_route_provider.rs: platform-specific watch channel construction
  • health_check.rs: platform-specific fetch_receiver_recv helper; continue → break on closed channel; two regression tests
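
The platform-specific aliases listed above follow a common cfg pattern, sketched here under stated assumptions. The two backend modules are stand-ins so the sketch builds without external crates (in the PR those roles are played by tokio::sync::watch and async-watch), and the exact cfg predicate and the backend_name helper are assumptions, not taken from the diff.

```rust
// Stand-in backends so this sketch builds without external crates.
// In the PR these roles are played by tokio::sync::watch and async-watch.
#[allow(dead_code)]
mod native_backend {
    pub const NAME: &str = "tokio::sync::watch";
    pub struct Sender<T>(pub T);
    pub struct Receiver<T>(pub T);
}

#[allow(dead_code)]
mod wasm_backend {
    pub const NAME: &str = "async-watch";
    pub struct Sender<T>(pub T);
    pub struct Receiver<T>(pub T);
}

// Exactly one of these imports is compiled in, depending on the target.
// The predicate is an assumption; the PR may use a different one.
#[cfg(not(target_family = "wasm"))]
use native_backend as backend;
#[cfg(target_family = "wasm")]
use wasm_backend as backend;

// Callers name only the aliases, so the rest of the code stays target-agnostic.
pub type SenderWatch<T> = backend::Sender<T>;
pub type ReceiverWatch<T> = backend::Receiver<T>;

/// Hypothetical helper that reports which backend was selected at compile time.
pub fn backend_name() -> &'static str {
    backend::NAME
}
```

Keeping the choice behind aliases means only the channel construction and receive helper need per-target code, which matches the file list above.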

Regression tests

Two tests added to health_check.rs (non-WASM only):

  1. test_health_manager_no_panic_on_rapid_updates_and_shutdown — 50-iteration stress test with multi-threaded tokio, flooding the watch channel 200× per iteration while health-check actors run at 1 ms intervals. Captures JoinHandle so any spawned-task panic surfaces as a test failure.

  2. test_health_manager_exits_when_fetch_sender_dropped — deterministic test: drops fetch_sender while keeping the cancellation token alive, expects the actor to exit within 2 s. With the old continue bug this test times out (confirmed by temporarily reverting the fix and running the test).
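
Both test ideas can be sketched with standard-library threads. This is an analog under stated assumptions, not the tests from health_check.rs: std::thread stands in for tokio tasks, and task_panicked and exits_after_sender_dropped are hypothetical names. In tokio the corresponding check is awaiting the JoinHandle, which yields an error if the task panicked.

```rust
use std::sync::mpsc::channel;
use std::thread;
use std::time::Duration;

/// Runs `task` on a new thread and reports whether it panicked.
/// Holding on to the JoinHandle is what makes the panic observable.
pub fn task_panicked<F: FnOnce() + Send + 'static>(task: F) -> bool {
    thread::spawn(task).join().is_err()
}

/// Spawns a worker that loops until its input channel closes, drops the
/// sender, and returns true if the worker signals exit within `deadline`.
pub fn exits_after_sender_dropped(deadline: Duration) -> bool {
    let (tx, rx) = channel::<u32>();
    let (done_tx, done_rx) = channel::<()>();

    let worker = thread::spawn(move || {
        // Leaves the loop on the first Err, i.e. once the sender is gone.
        while rx.recv().is_ok() {}
        let _ = done_tx.send(());
    });

    drop(tx);

    // A worker that never left its loop would make this time out.
    let exited = done_rx.recv_timeout(deadline).is_ok();
    if exited {
        worker.join().unwrap();
    }
    exited
}
```

A caller would assert that task_panicked returns false for a healthy task and that exits_after_sender_dropped returns true for a 2 s deadline, mirroring the two regression tests.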

lwshang and others added 2 commits March 5, 2026 11:15
…M to fix panic

async-watch v0.3.1 has a race condition in its changed() implementation that
panics with "[bug] failed to observe change after notificaton." under the
multi-threaded tokio runtime. Replace it with tokio::sync::watch on non-WASM
targets, keeping async-watch only for WASM where single-threaded execution
avoids the race.

Also fix a secondary bug where Err (channel closed) in HealthManagerActor::run
did `continue` instead of `break`, causing the actor to spin forever starving
the cancellation token arm.

Fixes #698.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lwshang lwshang changed the title from "fix(ic-agent): replace async-watch with tokio::sync::watch on non-WASM to fix panic (#698)" to "fix(ic-agent): replace async-watch with tokio::sync::watch on non-WASM" Mar 5, 2026
@lwshang lwshang closed this Mar 5, 2026
@lwshang lwshang reopened this Mar 5, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lwshang lwshang changed the title from "fix(ic-agent): replace async-watch with tokio::sync::watch on non-WASM" to "fix(ic-agent): replace async-watch with tokio::sync::watch on non-WASM (0.46.1 release)" Mar 6, 2026
@lwshang lwshang marked this pull request as ready for review March 6, 2026 16:57
@lwshang lwshang requested a review from a team as a code owner March 6, 2026 16:57
@lwshang lwshang enabled auto-merge (squash) March 6, 2026 16:57
@lwshang lwshang merged commit 37dcc29 into main Mar 6, 2026
17 checks passed
@lwshang lwshang deleted the lwshang/fix-async-watch-panic-698 branch March 6, 2026 17:00

Development

Successfully merging this pull request may close these issues.

ic-agent 0.45 panic - Failed to send node's health state: SendError(..)