fix: show elapsed time during cloud cold start and increase retries #708
livepeer-tessa wants to merge 3 commits into main
Conversation
…derflow

The WAN VAE encoder contains a 3×3 spatial convolution kernel. When the input chunk has spatial dimensions < 3 on either axis, the forward pass raises:

RuntimeError: Calculated padded input size per channel: (2 x 513). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-15, 10:48–10:59 UTC) on the krea-realtime-video pipeline, fal.ai job 5193400c-da0f-4eef-8bdd-dd0fdd26c1db: 2,372 errors over 11 minutes (~4 errors/second) from an input with height=2 pixels.

Fix: in _encode_with_conditioning, detect when height or width < 3 and pad to the minimum safe size using F.pad. The corresponding masks tensor is also padded to keep shapes consistent. block_state.height/width are updated so the downstream resolution check still passes. A WARNING is emitted so the unusual input remains visible in logs without a crash.

This is the spatial analogue of the 3×1×1 temporal kernel guard (issue #673, PR #674).

Fixes #557

Signed-off-by: livepeer-robot <robot@livepeer.org>
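The padding computation described in the fix can be sketched as follows. This is a minimal illustration, not the actual patch: the helper name spatial_pad_amounts and the MIN_SPATIAL constant are hypothetical; only F.pad and the 3×3 kernel constraint come from the commit message.

```python
MIN_SPATIAL = 3  # a 3x3 spatial conv cannot run on inputs smaller than 3x3

def spatial_pad_amounts(height: int, width: int) -> tuple[int, int, int, int]:
    """Return a (left, right, top, bottom) padding tuple for torch.nn.functional.pad.

    F.pad pads the last dimension first, so for an (..., H, W) tensor the
    tuple order is (W-left, W-right, H-top, H-bottom).
    """
    pad_w = max(0, MIN_SPATIAL - width)
    pad_h = max(0, MIN_SPATIAL - height)
    return (0, pad_w, 0, pad_h)

# Applied roughly as:
#   frames = F.pad(frames, spatial_pad_amounts(h, w))
# with the same padding applied to the masks tensor to keep shapes consistent.
```

For the logged failure case (height=2, width=513) this yields one row of bottom padding and no width padding, bringing the input up to the kernel's minimum size.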
When remote inference cold-starts, the background connect task can time out waiting for the 'ready' signal even though the cloud runner is already starting up. This left users with a failed connection and no automatic recovery — they had to manually retry.

Add retry logic to connect_background (up to 3 attempts, 5 s delay):
- On each failure, check whether the error is transient (timeout, network, connection refused, reset). If so, wait and retry.
- Non-transient errors (auth, config, bad app_id) bail immediately.
- The connect_stage field is updated during the retry delay so the UI can show "Retrying connection (attempt N/3)..." instead of going silent.

Fixes #704 — users no longer need to manually retry when the cloud runner cold-starts and the first connection attempt times out.

Signed-off-by: livepeer-robot <robot@livepeer.org>
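The retry loop this commit describes might look like the sketch below. It is an illustration under assumptions, not the repository's code: the function signature, the set_stage callback, and the delay_s parameter are invented for the example, and the transient-error tuple stands in for whatever classification the real code performs.

```python
import asyncio

# Errors treated as transient: timeouts, refused/reset connections, and
# other socket-level failures. Auth/config errors fall outside this tuple
# and therefore propagate on the first occurrence.
TRANSIENT_ERRORS = (asyncio.TimeoutError, ConnectionError, OSError)

async def connect_background(connect_once, set_stage, max_attempts=3, delay_s=5.0):
    """Attempt connect_once up to max_attempts times, retrying transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect_once()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of retries: surface the last transient error
            # Keep the UI informed instead of going silent during the delay.
            set_stage(f"Retrying connection (attempt {attempt + 1}/{max_attempts})...")
            await asyncio.sleep(delay_s)
```

A non-transient exception (e.g. a bad app_id raising ValueError) is not caught, so it bails out of the loop immediately, matching the behavior described above.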
When connecting to remote inference, the UI showed 'Starting cloud server...' for up to 3 minutes with no progress feedback. Users had no way to tell if the connection was alive or stuck.

Changes:
- Update _connect_stage with elapsed seconds during the 'ready' wait (e.g. 'Starting cloud server... (45s)') so users can see progress
- Increase max connection attempts from 3 → 5 to give cold-starting cloud runners more chances to become available
- Use progressive retry delays (5s, 10s, 15s, 20s) instead of a fixed 5s so consecutive timeouts space out without overwhelming the endpoint

Fixes #704

Signed-off-by: livepeer-robot <robot@livepeer.org>
🚀 fal.ai Preview Deployment
Testing

Connect to this preview deployment by running this on your branch. 🧪 E2E tests will run automatically against this deployment.
✅ E2E Tests passed
Test Artifacts

Check the workflow run for screenshots.
Fixes #704
Problem
When enabling remote inference, the UI shows "Starting cloud server..." for up to 3 minutes with no feedback. Users cannot tell whether the connection is working or stuck. After all 3 retry attempts time out, they get a generic error and have to retry manually.
Changes
- connect_stage now updates every second with elapsed time (e.g. Starting cloud server... (45s)) so users can see the connection is alive and roughly how long it has been waiting
- max_attempts increased from 3 → 5, giving cold-starting cloud runners more chances to become available before reporting failure

Testing