fix(cloud): surface worker param-update errors via WebRTC data channel#725
fix(cloud): surface worker param-update errors via WebRTC data channel#725livepeer-tessa wants to merge 4 commits intomainfrom
Conversation
…ly outputs Signed-off-by: Rafal Leszko <rafal@livepeer.org>
…eprocessVideoBlock
On the first chunk (current_start_frame == 0), target_num_frames is
num_frame_per_block * vae_temporal_downsample_factor + 1 (e.g. 13 for
default config). PreprocessVideoBlock already resamples 'video' and
'vace_input_frames' to this count, but 'vace_input_masks' was never
adjusted. When masks arrive from a queue or client parameter they have
the base chunk size (e.g. 12 frames), causing VaceEncodingBlock to
raise:
ValueError: vace_input_masks shape mismatch: expected [B, 1, 13, ...]
got [B, 1, 12, ...]
Fix: add vace_input_masks to PreprocessVideoBlock inputs/outputs and
resample its temporal dimension to target_num_frames whenever it does
not already match, using the same linear-interpolation index strategy
used for video/vace_input_frames.
Fixes #721
Signed-off-by: livepeer-robot <robot@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
Previously the WebRTC data channel on_message handler silently dropped
all messages from the cloud worker at debug log level, including error
responses like:
{"last_error": "Error updating params: Request timeout while fetching image from URL"}
This meant IP adapter URL fetch failures (and similar param-update errors)
were completely invisible to the UI and not published as Kafka events.
Changes:
- cloud_webrtc_client: parse incoming data channel messages as JSON;
if 'error' or 'last_error' is present, call cloud_manager._on_worker_error
- cloud_connection: add _worker_error_callbacks list, add/remove helpers,
and _on_worker_error() which logs at WARNING, publishes a Kafka error
event (type=cloud_worker_param_update_error), and notifies callbacks
- cloud_track: register _on_worker_error callback after WebRTC starts;
forwards errors to notification_callback as type=worker_param_update_error
so the frontend can surface them; deregisters on stop() to avoid leaks
Fixes #724
Signed-off-by: livepeer-robot <robot@livepeer.org>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can disable sequence diagrams in the walkthrough.Disable the |
🚀 fal.ai Preview Deployment
TestingConnect to this preview deployment by running this on your branch: 🧪 E2E tests will run automatically against this deployment. |
✅ E2E Tests passed
Test ArtifactsCheck the workflow run for screenshots. |
Problem
Closes #724
When a cloud worker fails to fetch
ip_adapter_style_image_url(or any other URL-type param) during a parameter update, it sends an error response over the WebRTC data channel:{"last_error": "Error updating params: Request timeout while fetching image from URL"}The
on_dc_messagehandler inCloudWebRTCClientwas just logging this atDEBUGlevel — silently dropping the error. No Kafka event was published, and the UI received no notification. The pipeline continued running but with a stale/missing style image.Fix
cloud_webrtc_client.pyerrororlast_errorfield is present, callcloud_manager._on_worker_error(error_text, raw_payload)cloud_connection.py_worker_error_callbackslist +add_worker_error_callback/remove_worker_error_callbackhelpers_on_worker_error(error_message, raw_payload)which:WARNINGerror_type=cloud_worker_param_update_error)cloud_track.py_on_worker_errorcallback (when anotification_callbackis set)notification_callbackastype=worker_param_update_errorfor frontend visibilitystop()to avoid memory leaks / stale referencesTesting
All 335 existing tests pass (
pytest tests/ -x -q).Notes
This PR does not add retry logic for URL fetches — that would require changes on the fal.ai worker side (tracked in #724). This PR ensures the error is at minimum:
notification_callbackchain