fix: show elapsed time during cloud cold start and increase retries#708

Open
livepeer-tessa wants to merge 3 commits into main from fix/cloud-connection-progress-feedback
Conversation

@livepeer-tessa
Contributor

Fixes #704

Problem

When enabling remote inference, the UI shows "Starting cloud server..." for up to 3 minutes with zero feedback. Users have no idea if the connection is working or stuck. After all 3 retry attempts time out, they get a generic error and have to manually retry.

Changes

  • Progress feedback during cold start: The connect_stage now updates every second with elapsed time (e.g. Starting cloud server... (45s)) so users can see the connection is alive and roughly how long it's been waiting
  • More retries: Increased max_attempts from 3 → 5, giving cold-starting cloud runners more chances to become available before reporting failure
  • Progressive retry delays: Instead of a fixed 5s between retries, delays now increase (5s, 10s, 15s, 20s) to avoid hammering the endpoint during prolonged unavailability
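The progress-feedback change above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `wait_for_ready` and `set_stage` are hypothetical names standing in for the task that waits on the runner's 'ready' signal and whatever updates the `connect_stage` field.

```python
import asyncio
import time

async def wait_for_ready(ready: asyncio.Event, set_stage, timeout: float = 180.0) -> None:
    """Wait for the runner's 'ready' signal, refreshing the status text every second."""
    start = time.monotonic()
    while not ready.is_set():
        elapsed = int(time.monotonic() - start)
        if elapsed >= timeout:
            raise TimeoutError(f"cloud server not ready after {elapsed}s")
        set_stage(f"Starting cloud server... ({elapsed}s)")
        try:
            # Wake up at least once per second to refresh the elapsed counter,
            # but return immediately if the runner becomes ready in between.
            await asyncio.wait_for(ready.wait(), timeout=1.0)
        except asyncio.TimeoutError:
            pass
```

The key design point is that the 1-second `wait_for` both paces the UI updates and keeps the task responsive to the 'ready' event, so a runner that comes up mid-tick is not made to wait out the full second.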

Testing

  • Enable remote inference with a freshly cold cloud runner
  • Observe that the status text updates each second (e.g. "Starting cloud server... (30s)")
  • If timeout occurs, verify up to 5 retry attempts are made with increasing delays

livepeer-robot added 3 commits March 15, 2026 18:19
…derflow

The WAN VAE encoder contains a 3×3 spatial convolution kernel.  When
the input chunk has spatial dimensions < 3 on either axis the forward
pass raises:

  RuntimeError: Calculated padded input size per channel: (2 x 513).
  Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-15, 10:48–10:59 UTC) on krea-realtime-video
pipeline, fal.ai job 5193400c-da0f-4eef-8bdd-dd0fdd26c1db: 2,372 errors
over 11 minutes (~4 errors/second) from an input with height=2 pixels.

Fix: in _encode_with_conditioning, detect when height or width < 3 and
pad to the minimum safe size using F.pad.  The corresponding masks tensor
is also padded to keep shapes consistent.  block_state.height/width are
updated so the downstream resolution check still passes.  A WARNING is
emitted so the unusual input remains visible in logs without a crash.

This is the spatial analogue of the 3×1×1 temporal kernel guard (issue #673,
PR #674).

Fixes #557
Signed-off-by: livepeer-robot <robot@livepeer.org>
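The guard described in this commit could look roughly like the following. This is a NumPy sketch under assumed names (`pad_to_min_spatial` is illustrative); the actual fix lives in `_encode_with_conditioning` and uses `torch.nn.functional.pad` on the latent and masks tensors.

```python
import numpy as np

MIN_SPATIAL = 3  # a 3x3 kernel needs at least 3 px on each spatial axis

def pad_to_min_spatial(chunk: np.ndarray) -> np.ndarray:
    """Pad the trailing (H, W) axes of a (..., H, W) array up to MIN_SPATIAL."""
    h, w = chunk.shape[-2], chunk.shape[-1]
    pad_h = max(0, MIN_SPATIAL - h)
    pad_w = max(0, MIN_SPATIAL - w)
    if pad_h == 0 and pad_w == 0:
        return chunk  # already large enough; nothing to do
    pad = [(0, 0)] * (chunk.ndim - 2) + [(0, pad_h), (0, pad_w)]
    # 'edge' replicates border pixels so the padded rows are not pure zeros
    return np.pad(chunk, pad, mode="edge")
```

Padding only on the right/bottom keeps the original pixels at their existing indices, which is what lets `block_state.height/width` be bumped without remapping anything downstream.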
When remote inference cold-starts, the background connect task can time
out waiting for the 'ready' signal even though the cloud runner is
already starting up. This left users with a failed connection and no
automatic recovery — they had to manually retry.

Add retry logic to connect_background (up to 3 attempts, 5 s delay):
- On each failure, check if the error is transient (timeout, network,
  connection refused, reset). If so, wait and retry.
- Non-transient errors (auth, config, bad app_id) bail immediately.
- The connect_stage field is updated during the retry delay so the UI
  can show "Retrying connection (attempt N/3)..." instead of going
  silent.

Fixes #704 — users no longer need to manually retry when the cloud
runner cold-starts and the first connection attempt times out.

Signed-off-by: livepeer-robot <robot@livepeer.org>
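The transient-vs-fatal split and retry loop described in this commit can be sketched as below. Names are illustrative (`connect_once` stands in for a single connection attempt); the real error classification in the codebase may inspect more than exception types.

```python
import asyncio

def is_transient(exc: BaseException) -> bool:
    """Transient failures are worth retrying; anything else bails immediately."""
    # ConnectionError covers refused/reset/aborted; TimeoutError covers the
    # cold-start case. Auth/config errors (e.g. a bad app_id) fall through.
    return isinstance(exc, (TimeoutError, ConnectionError))

async def connect_background(connect_once, set_stage, attempts: int = 3, delay: float = 5.0):
    for attempt in range(1, attempts + 1):
        try:
            return await connect_once()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts:
                raise  # non-transient, or out of retries
            # Keep the UI informed during the retry delay instead of going silent
            set_stage(f"Retrying connection (attempt {attempt + 1}/{attempts})...")
            await asyncio.sleep(delay)
```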
When connecting to remote inference, the UI showed 'Starting cloud
server...' for up to 3 minutes with no progress feedback. Users had no
way to tell if the connection was alive or stuck.

Changes:
- Update _connect_stage with elapsed seconds during the 'ready' wait
  (e.g. 'Starting cloud server... (45s)') so users can see progress
- Increase max connection attempts from 3 → 5 to give cold-starting
  cloud runners more chances to become available
- Use progressive retry delays (5s, 10s, 15s, 20s) instead of a fixed 5s
  so consecutive timeouts space out without overwhelming the endpoint

Fixes #704

Signed-off-by: livepeer-robot <robot@livepeer.org>
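The progressive delay schedule from this commit amounts to a small lookup, sketched here with assumed names (`retry_delay` is illustrative):

```python
# Seconds to wait before attempts 2, 3, 4, 5 respectively
RETRY_DELAYS = [5, 10, 15, 20]

def retry_delay(failed_attempt: int) -> int:
    """Delay after the Nth failed attempt (1-based), clamped to the last entry."""
    return RETRY_DELAYS[min(failed_attempt - 1, len(RETRY_DELAYS) - 1)]
```

With `max_attempts = 5` the worst case adds 5 + 10 + 15 + 20 = 50 s of delay on top of the per-attempt timeouts before a failure is reported.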
@coderabbitai

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-708--preview
WebSocket wss://fal.run/daydream/scope-pr-708--preview/ws
Commit eaa3a12

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-708--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-708--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

Development

Successfully merging this pull request may close these issues.

Cannot connect to Scope remote inference