
fix(cloud): surface worker param-update errors via WebRTC data channel #725

Open
livepeer-tessa wants to merge 4 commits into main from fix/724-ip-adapter-url-timeout-handling

Conversation

@livepeer-tessa
Contributor

Problem

Closes #724

When a cloud worker fails to fetch ip_adapter_style_image_url (or any other URL-type param) during a parameter update, it sends an error response over the WebRTC data channel:

{"last_error": "Error updating params: Request timeout while fetching image from URL"}

The on_dc_message handler in CloudWebRTCClient only logged this message at DEBUG level, so the error was effectively dropped. No Kafka event was published, and the UI received no notification. The pipeline kept running, but with a stale or missing style image.

Fix

cloud_webrtc_client.py

  • Parse incoming data channel messages as JSON
  • If an error or last_error field is present, call cloud_manager._on_worker_error(error_text, raw_payload)
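
A minimal sketch of the parsing step described above, not the code in this PR. The helper name extract_worker_error is hypothetical, and the handling of bytes and non-JSON messages is an assumption; only the error / last_error field names come from the description.

```python
import json
from typing import Any, Dict, Optional, Tuple


def extract_worker_error(message: Any) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (error_text, payload) if the message reports an error, else None.

    Hypothetical helper: the real handler lives in on_dc_message.
    """
    # Data channel messages may arrive as bytes; decode them defensively.
    if isinstance(message, (bytes, bytearray)):
        try:
            message = message.decode("utf-8")
        except UnicodeDecodeError:
            return None
    if not isinstance(message, str):
        return None

    # Non-JSON messages are not errors in this sketch; they are just ignored.
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None

    # Either field name signals a worker-side error.
    error_text = payload.get("error") or payload.get("last_error")
    if not error_text:
        return None
    return str(error_text), payload
```

A handler built on this could call cloud_manager._on_worker_error(error_text, raw_payload) whenever the helper returns a non-None result and keep the DEBUG log for everything else.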

cloud_connection.py

  • Add _worker_error_callbacks list + add_worker_error_callback / remove_worker_error_callback helpers
  • Add _on_worker_error(error_message, raw_payload) which:
    • Logs at WARNING
    • Publishes a Kafka error event (error_type=cloud_worker_param_update_error)
    • Notifies all registered callbacks
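
The callback plumbing above could look roughly like the sketch below. It is an illustration under assumptions: the class name WorkerErrorDispatcher, the injected publish_event callable (standing in for the Kafka publisher), and the event dictionary layout are all hypothetical. The method names and the error_type value are taken from the description.

```python
import logging
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger(__name__)

WorkerErrorCallback = Callable[[str, Dict[str, Any]], None]


class WorkerErrorDispatcher:
    """Hypothetical stand-in for the error-handling part of cloud_connection.py."""

    def __init__(
        self, publish_event: Optional[Callable[[Dict[str, Any]], None]] = None
    ) -> None:
        self._worker_error_callbacks: List[WorkerErrorCallback] = []
        # Assumption: event publishing is injected so the sketch needs no Kafka.
        self._publish_event = publish_event

    def add_worker_error_callback(self, callback: WorkerErrorCallback) -> None:
        if callback not in self._worker_error_callbacks:
            self._worker_error_callbacks.append(callback)

    def remove_worker_error_callback(self, callback: WorkerErrorCallback) -> None:
        if callback in self._worker_error_callbacks:
            self._worker_error_callbacks.remove(callback)

    def _on_worker_error(self, error_message: str, raw_payload: Dict[str, Any]) -> None:
        # 1. Log at WARNING so the error is no longer hidden at DEBUG.
        logger.warning("Cloud worker error: %s", error_message)

        # 2. Publish an error event; a publisher failure must not block callbacks.
        if self._publish_event is not None:
            try:
                self._publish_event(
                    {
                        "error_type": "cloud_worker_param_update_error",
                        "message": error_message,
                        "payload": raw_payload,
                    }
                )
            except Exception:
                logger.exception("Failed to publish worker error event")

        # 3. Notify callbacks; iterate over a copy so callbacks may deregister.
        for callback in list(self._worker_error_callbacks):
            try:
                callback(error_message, raw_payload)
            except Exception:
                logger.exception("Worker error callback raised")
```

Isolating each callback in its own try block is a design choice of this sketch: one misbehaving listener does not prevent the others from seeing the error.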

cloud_track.py

  • Register an _on_worker_error callback after WebRTC starts (only when a notification_callback is set)
  • Forward errors to notification_callback as type=worker_param_update_error for frontend visibility
  • Deregister the callback on stop() to avoid memory leaks / stale references
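
The register / forward / deregister lifecycle can be sketched as follows. StubCloudManager and TrackSketch are hypothetical names used only to keep the example self-contained, and the notification dictionary's "message" key is an assumption; the type value and the helper names come from the description.

```python
from typing import Any, Callable, Dict, List, Optional

Notification = Dict[str, Any]


class StubCloudManager:
    """Stand-in exposing only the callback helpers named in this PR."""

    def __init__(self) -> None:
        self.callbacks: List[Callable[[str, Dict[str, Any]], None]] = []

    def add_worker_error_callback(self, callback) -> None:
        self.callbacks.append(callback)

    def remove_worker_error_callback(self, callback) -> None:
        if callback in self.callbacks:
            self.callbacks.remove(callback)


class TrackSketch:
    """Hypothetical reduction of the cloud_track.py behaviour."""

    def __init__(
        self,
        manager: StubCloudManager,
        notification_callback: Optional[Callable[[Notification], None]] = None,
    ) -> None:
        self._manager = manager
        self._notification_callback = notification_callback
        self._registered = False

    def start(self) -> None:
        # Register only when there is somewhere to forward the error.
        if self._notification_callback is not None and not self._registered:
            self._manager.add_worker_error_callback(self._on_worker_error)
            self._registered = True

    def _on_worker_error(self, error_message: str, raw_payload: Dict[str, Any]) -> None:
        if self._notification_callback is None:
            return
        self._notification_callback(
            {"type": "worker_param_update_error", "message": error_message}
        )

    def stop(self) -> None:
        # Deregister so the manager does not keep a reference to a dead track.
        if self._registered:
            self._manager.remove_worker_error_callback(self._on_worker_error)
            self._registered = False
```

Tracking registration with a flag makes stop() safe to call more than once, which matters if teardown can be triggered from several paths.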

Testing

All 335 existing tests pass (pytest tests/ -x -q).

Notes

This PR does not add retry logic for URL fetches. That would require changes on the fal.ai worker side (tracked in #724). At a minimum, this PR ensures the error is:

  1. Visible in logs at WARNING level (not silently dropped)
  2. Published as a Kafka event for monitoring
  3. Surfaced to the frontend via the existing notification_callback chain

leszko and others added 4 commits March 20, 2026 08:19
…ly outputs

Signed-off-by: Rafal Leszko <rafal@livepeer.org>
…eprocessVideoBlock

On the first chunk (current_start_frame == 0), target_num_frames is
num_frame_per_block * vae_temporal_downsample_factor + 1 (e.g. 13 for
default config). PreprocessVideoBlock already resamples 'video' and
'vace_input_frames' to this count, but 'vace_input_masks' was never
adjusted. When masks arrive from a queue or client parameter they have
the base chunk size (e.g. 12 frames), causing VaceEncodingBlock to
raise:

  ValueError: vace_input_masks shape mismatch: expected [B, 1, 13, ...]
              got [B, 1, 12, ...]

Fix: add vace_input_masks to PreprocessVideoBlock inputs/outputs and
resample its temporal dimension to target_num_frames whenever it does
not already match, using the same linear-interpolation index strategy
used for video/vace_input_frames.

Fixes #721

Signed-off-by: livepeer-robot <robot@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
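
For readers following the vace_input_masks commit above, the frame-count arithmetic and the index-based resampling can be illustrated with the sketch below. It is not the PreprocessVideoBlock code: it works on a plain Python sequence rather than a tensor's temporal dimension, the function names are hypothetical, and the rounding rule is an assumption about what "linear-interpolation index strategy" means. The values 3 and 4 are chosen only because they reproduce the 13-frame example and are not confirmed defaults.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def first_chunk_target_frames(
    num_frame_per_block: int, vae_temporal_downsample_factor: int
) -> int:
    """Frame count expected on the first chunk, per the commit message formula."""
    return num_frame_per_block * vae_temporal_downsample_factor + 1


def resample_indices(num_frames: int, target_num_frames: int) -> List[int]:
    """Evenly spaced source indices covering [0, num_frames - 1].

    Assumption: indices are linearly spaced and rounded to the nearest frame.
    """
    if num_frames <= 0 or target_num_frames <= 0:
        raise ValueError("frame counts must be positive")
    if target_num_frames == 1:
        return [0]
    step = (num_frames - 1) / (target_num_frames - 1)
    return [min(num_frames - 1, round(i * step)) for i in range(target_num_frames)]


def resample_temporal(frames: Sequence[T], target_num_frames: int) -> List[T]:
    """Resample a frame sequence to the target length, leaving matches untouched."""
    if len(frames) == target_num_frames:
        return list(frames)
    return [frames[i] for i in resample_indices(len(frames), target_num_frames)]


# Illustrative values only: 3 * 4 + 1 gives the 13 frames from the example.
target = first_chunk_target_frames(3, 4)
masks = [f"mask_{i}" for i in range(12)]  # base chunk size from the example
resampled = resample_temporal(masks, target)
```

Going from 12 to 13 frames necessarily repeats one source mask, so the result has the expected length while still starting at the first mask and ending at the last.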
Previously the WebRTC data channel on_message handler silently dropped
all messages from the cloud worker at debug log level, including error
responses like:

  {"last_error": "Error updating params: Request timeout while fetching image from URL"}

This meant IP adapter URL fetch failures (and similar param-update errors)
were completely invisible to the UI and not published as Kafka events.

Changes:
- cloud_webrtc_client: parse incoming data channel messages as JSON;
  if 'error' or 'last_error' is present, call cloud_manager._on_worker_error
- cloud_connection: add _worker_error_callbacks list, add/remove helpers,
  and _on_worker_error() which logs at WARNING, publishes a Kafka error
  event (type=cloud_worker_param_update_error), and notifies callbacks
- cloud_track: register _on_worker_error callback after WebRTC starts;
  forwards errors to notification_callback as type=worker_param_update_error
  so the frontend can surface them; deregisters on stop() to avoid leaks

Fixes #724

Signed-off-by: livepeer-robot <robot@livepeer.org>
@coderabbitai

coderabbitai bot commented Mar 20, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-725--preview
WebSocket wss://fal.run/daydream/scope-pr-725--preview/ws
Commit e88440e

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-725--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-725--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.


Development

Successfully merging this pull request may close these issues.

[streamdiffusion] Request timeout while fetching IP adapter style image URL — unhandled in param update path

2 participants