fix: auto-delete stale checkpoint on tensor shape mismatch (#693) by livepeer-tessa · Pull Request #695 · daydreamlive/scope

livepeer-tessa · 2026-03-15T06:20:54Z

Problem

krea-realtime-video fails to load on fal.ai workers with:

The size of tensor a (5120) must match the size of tensor b (1536) at non-singleton dimension 1.
If this error persists, consider removing the models directory '/data/models' and re-downloading models.

The cause is a stale or incompatible checkpoint cached on the worker, likely downloaded for an older model version with a different hidden dimension. Because the file persists on disk, every subsequent job hits the same error (4 occurrences across 2 jobs before this was caught).

Root Cause

In WanDiffusionWrapper.__init__, load_state_dict(..., assign=True, strict=False) ignores missing/unexpected keys but still raises RuntimeError when tensor shapes don't match. The error is logged and surfaced, but the stale file is left on disk, so retries always fail the same way.

Fix

Catch RuntimeError from load_state_dict, detect shape-mismatch phrasing ("size of tensor" / "size mismatch"), delete the stale checkpoint file, then re-raise with a clear actionable message. On the next job attempt fal.ai will re-download the checkpoint fresh and the load will succeed.

This fix applies to all pipelines that use WanDiffusionWrapper (krea-realtime-video, streamdiffusionv2, longlive, memflow, reward-forcing).
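
The catch, detect, delete, re-raise flow described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in this PR: `load_or_purge`, `SHAPE_MISMATCH_MARKERS`, and the `load_fn` callable (a stand-in for the `load_state_dict(...)` call) are hypothetical names, and the wording of the re-raised message is an assumption.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

# Phrases the PR description uses to recognise a shape mismatch.
SHAPE_MISMATCH_MARKERS = ("size of tensor", "size mismatch")


def load_or_purge(load_fn: Callable[[], None], checkpoint_path: Path) -> None:
    """Run load_fn; on a shape-mismatch RuntimeError, delete the checkpoint and re-raise.

    load_fn is a hypothetical stand-in for the load_state_dict(...) call.
    """
    try:
        load_fn()
    except RuntimeError as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in SHAPE_MISMATCH_MARKERS):
            raise  # unrelated failure (e.g. CUDA OOM): propagate unchanged

        removed = False
        try:
            checkpoint_path.unlink()
            removed = True
        except OSError as unlink_exc:
            # Deletion is best effort; a failure here must not mask the load error.
            logger.warning("Could not delete %s: %s", checkpoint_path, unlink_exc)

        if not removed:
            raise  # re-raises the original RuntimeError

        raise RuntimeError(
            f"Checkpoint {checkpoint_path} does not match the model architecture. "
            "It was deleted and will be re-downloaded on the next attempt. "
            f"Original error: {exc}"
        ) from exc
```

Matching on the message text is brittle if the error wording changes, but a shape mismatch arrives as a plain RuntimeError with no dedicated exception type, so string matching is the straightforward way to separate it from other runtime failures.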

Testing

Shape-mismatch RuntimeError → stale file deleted, clear message re-raised ✓
Other RuntimeError (e.g. CUDA OOM) → re-raised unchanged ✓
File deletion failure (permissions) → warning logged, original error still re-raised ✓
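
The three cases above could be exercised with tests along these lines. This is a sketch under assumptions, not the tests shipped with this PR: `load_checkpoint` is a compact hypothetical stand-in for the guarded load, the logger name `checkpoint_loader` is invented for the example, and the permission failure is simulated by patching `Path.unlink`.

```python
import logging
import tempfile
import unittest
from pathlib import Path
from unittest import mock

logger = logging.getLogger("checkpoint_loader")  # hypothetical logger name


def load_checkpoint(load_fn, path: Path) -> None:
    """Hypothetical stand-in for the guarded load described in the Fix section."""
    try:
        load_fn()
    except RuntimeError as exc:
        text = str(exc).lower()
        if "size of tensor" not in text and "size mismatch" not in text:
            raise  # not a shape mismatch: leave the file and the error alone
        try:
            path.unlink()
        except OSError as err:
            logger.warning("could not delete %s: %s", path, err)
            raise exc from None  # surface the original load error
        raise RuntimeError(f"stale checkpoint {path} deleted; retry to re-download") from exc


def raising(message):
    """Build a fake loader that fails with the given RuntimeError message."""
    def _load():
        raise RuntimeError(message)
    return _load


class GuardedLoadTests(unittest.TestCase):
    def setUp(self):
        self._dir = tempfile.TemporaryDirectory()
        self.addCleanup(self._dir.cleanup)
        self.ckpt = Path(self._dir.name) / "model.safetensors"
        self.ckpt.write_bytes(b"stale")

    def test_shape_mismatch_deletes_file(self):
        with self.assertRaisesRegex(RuntimeError, "stale checkpoint"):
            load_checkpoint(raising("size mismatch for blocks.0.weight"), self.ckpt)
        self.assertFalse(self.ckpt.exists())

    def test_other_runtime_error_is_unchanged(self):
        with self.assertRaisesRegex(RuntimeError, "CUDA out of memory"):
            load_checkpoint(raising("CUDA out of memory"), self.ckpt)
        self.assertTrue(self.ckpt.exists())

    def test_delete_failure_logs_and_reraises_original(self):
        failing_unlink = mock.patch.object(
            Path, "unlink", side_effect=PermissionError("denied")
        )
        with failing_unlink, self.assertLogs("checkpoint_loader", level="WARNING"):
            with self.assertRaisesRegex(RuntimeError, "size of tensor"):
                load_checkpoint(raising("The size of tensor a (5120) must match"), self.ckpt)
        self.assertTrue(self.ckpt.exists())
```

Saved as a test module, this runs with `python -m unittest`. Patching `Path.unlink` avoids depending on real file permissions, which behave differently across platforms and when running as root.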

Fixes #693

When WanDiffusionWrapper loads a safetensors checkpoint and the tensor shapes don't match the current model architecture, PyTorch raises a RuntimeError (e.g. 'size of tensor a (5120) must match tensor b (1536)'). This silently recurs on every subsequent job because the bad file stays on the worker's model cache.

Fix: catch RuntimeError from load_state_dict, check for shape-mismatch phrasing, delete the stale file, and re-raise with a clear message. On next attempt fal.ai will re-download the checkpoint and load will succeed.

Fixes #693

Signed-off-by: livepeer-robot <robot@livepeer.org>

coderabbitai · 2026-03-15T06:21:00Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions · 2026-03-15T06:27:47Z

🚀 fal.ai Preview Deployment


App ID	`daydream/scope-pr-695--preview`
WebSocket	`wss://fal.run/daydream/scope-pr-695--preview/ws`
Commit	`b671646`

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-695--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

github-actions · 2026-03-15T06:30:54Z

✅ E2E Tests passed


Status	passed
fal App	`daydream/scope-pr-695--preview`
Run	View logs

Test Artifacts

Check the workflow run for screenshots.

livepeer-tessa requested review from emranemran and mjh1 March 15, 2026 06:20

livepeer-tessa wants to merge 1 commit into main from tessa/693-krea-checkpoint-mismatch-auto-remediation
