fix: auto-delete stale checkpoint on tensor shape mismatch (#693) #695

Open
livepeer-tessa wants to merge 1 commit into main from
tessa/693-krea-checkpoint-mismatch-auto-remediation

Conversation

@livepeer-tessa
Contributor

Problem

krea-realtime-video fails to load on fal.ai workers with:

The size of tensor a (5120) must match the size of tensor b (1536) at non-singleton dimension 1.
If this error persists, consider removing the models directory '/data/models' and re-downloading models.

The cause is a stale or incompatible checkpoint cached on the worker, likely downloaded against an older model version with a different hidden dimension. Because the file persists on disk, every subsequent job hits the same error (4 occurrences across 2 jobs before this was caught).

Root Cause

In WanDiffusionWrapper.__init__, load_state_dict(..., assign=True, strict=False) ignores missing/unexpected keys but still raises RuntimeError when tensor shapes don't match. The error is logged and surfaced, but the stale file is left on disk, so retries always fail the same way.

Fix

Catch RuntimeError from load_state_dict, detect shape-mismatch phrasing ("size of tensor" / "size mismatch"), delete the stale checkpoint file, then re-raise with a clear, actionable message. On the next job attempt, fal.ai will re-download the checkpoint and the load will succeed.

This fix applies to all pipelines that use WanDiffusionWrapper (krea-realtime-video, streamdiffusionv2, longlive, memflow, reward-forcing).
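
The catch, detect, delete, re-raise flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `load_checkpoint_or_purge`, `is_shape_mismatch`, `SHAPE_MISMATCH_MARKERS`, and the `load_fn` callable are hypothetical names standing in for the `load_state_dict` call inside `WanDiffusionWrapper.__init__`. Only the two marker phrases come from the description, and the wording of the re-raised message is invented for the example.

```python
import logging
import os
from typing import Callable

logger = logging.getLogger(__name__)

# The two phrases named in the PR description. Matching case-insensitively
# is an assumption of this sketch.
SHAPE_MISMATCH_MARKERS = ("size of tensor", "size mismatch")


def is_shape_mismatch(error: BaseException) -> bool:
    """Return True if the error text looks like a tensor shape mismatch."""
    message = str(error).lower()
    return any(marker in message for marker in SHAPE_MISMATCH_MARKERS)


def load_checkpoint_or_purge(load_fn: Callable[[], None], checkpoint_path: str) -> None:
    """Hypothetical wrapper: call load_fn; on a shape mismatch, delete the file and re-raise."""
    try:
        load_fn()
    except RuntimeError as error:
        if not is_shape_mismatch(error):
            raise  # unrelated failures (e.g. CUDA OOM) propagate unchanged
        try:
            os.remove(checkpoint_path)
            hint = "the stale file was deleted; the next attempt should re-download it"
        except OSError as delete_error:
            # Deletion is best effort: log a warning and still surface the load failure.
            logger.warning("Could not delete %s: %s", checkpoint_path, delete_error)
            hint = "the stale file could not be deleted; remove it manually"
        raise RuntimeError(
            f"Checkpoint '{checkpoint_path}' does not match the model architecture "
            f"({hint}). Original error: {error}"
        ) from error
```

Chaining with `from error` keeps the original PyTorch traceback attached to the new exception, so the actionable message does not hide the underlying shape details.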

Testing

  • Shape-mismatch RuntimeError → stale file deleted, clear message re-raised ✓
  • Other RuntimeError (e.g. CUDA OOM) → re-raised unchanged ✓
  • File deletion failure (permissions) → warning logged, original error still re-raised ✓
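
One way the three cases above might be exercised without a GPU or a real checkpoint is sketched below. It is not the test code from this PR: `purge_on_shape_mismatch` is a terse hypothetical stand-in for the fix, the logger name `checkpoint_cleanup` and the state-dict key `blocks.0.weight` are made up for the example, and the permission failure is simulated by patching `os.remove`.

```python
import logging
import os
import tempfile
import unittest
from unittest import mock

logger = logging.getLogger("checkpoint_cleanup")  # hypothetical logger name


def purge_on_shape_mismatch(load_fn, path):
    """Terse stand-in for the fix (assumption, not the PR's actual code)."""
    try:
        load_fn()
    except RuntimeError as error:
        text = str(error).lower()
        if "size of tensor" not in text and "size mismatch" not in text:
            raise  # not a shape mismatch: leave the error and the file alone
        try:
            os.remove(path)
        except OSError as delete_error:
            logger.warning("could not delete %s: %s", path, delete_error)
        raise RuntimeError(f"stale checkpoint at {path}: {error}") from error


def failing_loader(message):
    """Build a loader that always fails with the given RuntimeError message."""
    def load():
        raise RuntimeError(message)
    return load


class PurgeOnShapeMismatchTest(unittest.TestCase):
    def setUp(self):
        handle, self.path = tempfile.mkstemp(suffix=".safetensors")
        os.close(handle)
        self.addCleanup(lambda: os.path.exists(self.path) and os.remove(self.path))

    def test_shape_mismatch_deletes_file(self):
        with self.assertRaisesRegex(RuntimeError, "stale checkpoint"):
            purge_on_shape_mismatch(failing_loader("size mismatch for blocks.0.weight"), self.path)
        self.assertFalse(os.path.exists(self.path))

    def test_other_runtime_error_is_untouched(self):
        with self.assertRaisesRegex(RuntimeError, "^CUDA out of memory$"):
            purge_on_shape_mismatch(failing_loader("CUDA out of memory"), self.path)
        self.assertTrue(os.path.exists(self.path))

    def test_delete_failure_logs_warning_and_still_raises(self):
        with mock.patch("os.remove", side_effect=PermissionError("read-only")):
            with self.assertLogs("checkpoint_cleanup", level="WARNING"):
                with self.assertRaises(RuntimeError):
                    purge_on_shape_mismatch(failing_loader("size of tensor a (5120)"), self.path)
        self.assertTrue(os.path.exists(self.path))
```

Saved as a module, the cases can be run with `python -m unittest`.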

Fixes #693

When WanDiffusionWrapper loads a safetensors checkpoint and the tensor
shapes don't match the current model architecture, PyTorch raises a
RuntimeError (e.g. 'size of tensor a (5120) must match tensor b (1536)').
This recurs on every subsequent job because the bad file stays in
the worker's model cache.

Fix: catch RuntimeError from load_state_dict, check for shape-mismatch
phrasing, delete the stale file, and re-raise with a clear message.
On next attempt fal.ai will re-download the checkpoint and load will
succeed.

Fixes #693

Signed-off-by: livepeer-robot <robot@livepeer.org>
@coderabbitai

coderabbitai bot commented Mar 15, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-695--preview
WebSocket wss://fal.run/daydream/scope-pr-695--preview/ws
Commit b671646

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-695--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-695--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

Development

Successfully merging this pull request may close these issues.

krea-realtime-video fails to load: tensor size mismatch (5120 vs 1536) at non-singleton dimension 1
