fix: auto-delete stale checkpoint on tensor shape mismatch (#693) #695
Open
livepeer-tessa wants to merge 1 commit into main from
When WanDiffusionWrapper loads a safetensors checkpoint and the tensor shapes don't match the current model architecture, PyTorch raises a RuntimeError (e.g. 'size of tensor a (5120) must match tensor b (1536)'). This silently recurs on every subsequent job because the bad file stays on the worker's model cache.
Fix: catch RuntimeError from load_state_dict, check for shape-mismatch phrasing, delete the stale file, and re-raise with a clear message. On next attempt fal.ai will re-download the checkpoint and load will succeed.
Fixes #693
Signed-off-by: livepeer-robot <robot@livepeer.org>
Contributor
🚀 fal.ai Preview Deployment
Testing
Connect to this preview deployment by running this on your branch:
🧪 E2E tests will run automatically against this deployment.
Contributor
✅ E2E Tests passed
Test Artifacts
Check the workflow run for screenshots.
Problem
krea-realtime-video fails to load on fal.ai workers with a tensor shape-mismatch RuntimeError. The cause is a stale or incompatible checkpoint cached on the worker, likely downloaded against an older model version where the hidden dim was different. Because the file persists on disk, every subsequent job hits the same error (4 occurrences across 2 jobs before this was caught).
Root Cause
In WanDiffusionWrapper.__init__, load_state_dict(..., assign=True, strict=False) ignores missing/unexpected keys but still raises RuntimeError when tensor shapes don't match. The error is logged and surfaced, but the stale file is left on disk, so retries always fail the same way.
Fix
Catch RuntimeError from load_state_dict, detect shape-mismatch phrasing ("size of tensor" / "size mismatch"), delete the stale checkpoint file, then re-raise with a clear, actionable message. On the next job attempt, fal.ai will re-download the checkpoint and the load will succeed.
This fix applies to all pipelines that use WanDiffusionWrapper (krea-realtime-video, streamdiffusionv2, longlive, memflow, reward-forcing).
Testing
Shape-mismatch RuntimeError → stale file deleted, clear message re-raised ✓
Other RuntimeError (e.g. CUDA OOM) → re-raised unchanged ✓
Fixes #693
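For reviewers who want the control flow at a glance, the catch / detect / delete / re-raise sequence described under Fix can be sketched roughly as below. This is an illustration only, not the code in this PR: the names load_or_purge, load_fn and SHAPE_MISMATCH_PHRASES are hypothetical, the real change lives inside WanDiffusionWrapper.__init__, and the wording of the re-raised message is assumed. The two matched phrases are the ones quoted in the description.

```python
import os

# Phrases quoted in the PR description. Matching on message text is a
# heuristic, since PyTorch uses plain RuntimeError for these failures.
SHAPE_MISMATCH_PHRASES = ("size of tensor", "size mismatch")


def load_or_purge(load_fn, checkpoint_path):
    """Hypothetical helper: run load_fn(checkpoint_path) and, if it fails
    with a shape-mismatch RuntimeError, delete the checkpoint and re-raise."""
    try:
        return load_fn(checkpoint_path)
    except RuntimeError as err:
        message = str(err).lower()
        if not any(phrase in message for phrase in SHAPE_MISMATCH_PHRASES):
            # Unrelated failure (e.g. CUDA OOM): propagate unchanged.
            raise
        try:
            os.remove(checkpoint_path)
        except OSError:
            # Best effort: the file may already be gone or not removable.
            pass
        raise RuntimeError(
            f"Checkpoint {checkpoint_path} does not match the current model "
            f"architecture and was removed so it can be re-downloaded. "
            f"Original error: {err}"
        ) from err
```

In this sketch, load_fn stands in for whatever reads the safetensors file and calls load_state_dict. Chaining with "from err" keeps the original PyTorch traceback available in logs, and the bare raise in the non-matching branch is what keeps unrelated errors untouched, mirroring the two cases listed under Testing.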