krea-realtime-video fails to load: tensor size mismatch (5120 vs 1536) at non-singleton dimension 1 #693

@livepeer-tessa

Summary

The krea-realtime-video pipeline fails to load on fal.ai workers with a tensor size mismatch error, which suggests a stale or incompatible model cache on the worker. The error message itself advises clearing /data/models, but that advice is not surfaced to the user in a useful way.

Error Details

From Grafana fal.ai logs (2026-03-14 19:57 – 2026-03-15 06:09 UTC, 4 occurrences across 2 jobs):

scope.server.pipeline_manager - ERROR - [1ee1c374] Failed to load pipeline krea-realtime-video: The size of tensor a (5120) must match the size of tensor b (1536) at non-singleton dimension 1. If this error persists, consider removing the models directory '/data/models' and re-downloading models.
scope.server.pipeline_manager - ERROR - [1ee1c374] Failed to load pipeline: krea-realtime-video
scope.server.pipeline_manager - ERROR - [1ee1c374] Some pipelines failed to load

Jobs affected:

  • f1d3920e-30ce-4da5-a596-82ea9dffbcc4 (1 occurrence, 2026-03-14 19:57 UTC)
  • 4ed8373d-9714-4f3e-9db6-7c862a4be208 (3 occurrences, 2026-03-14 21:04–21:06 UTC)

App: github_f1lhgmk5v76a0ev1w0u378by-scope-app--prod

Root Cause

The tensor sizes 5120 and 1536 suggest a mismatch between the weight shapes in the model checkpoint and the shapes the current code expects. The likely cause is a stale cached model file on the fal.ai worker that was built against an older version of krea-realtime-video, or a partial/corrupt download.

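The sizes factor as 5120 = 5 × 1024 and 1536 = 3 × 512, so neither is a simple multiple of the other (5120 / 1536 = 10/3). This looks like two different layer widths, for example a projection layer or embedding whose shape changed between model versions.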
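For context, this message is what PyTorch reports when two tensors cannot be broadcast together in an element-wise operation. The sketch below re-implements the broadcasting rule in plain Python to show why sizes 5120 and 1536 at dimension 1 are rejected. The shapes (1, 5120) and (1, 1536) are illustrative assumptions; the actual tensor shapes involved are not shown in the logs.

```python
from itertools import zip_longest


def broadcast_shape(a: tuple, b: tuple) -> tuple:
    """Return the broadcast of shapes a and b, or raise a PyTorch-style error.

    Shapes are compared from the trailing dimension backwards. Two sizes are
    compatible when they are equal or when one of them is 1.
    """
    ndim = max(len(a), len(b))
    result = []
    # Walk both shapes right to left, padding the shorter one with 1s.
    for i, (da, db) in enumerate(zip_longest(a[::-1], b[::-1], fillvalue=1)):
        if da != db and da != 1 and db != 1:
            dim = ndim - 1 - i  # index counted from the left
            raise RuntimeError(
                f"The size of tensor a ({da}) must match the size of "
                f"tensor b ({db}) at non-singleton dimension {dim}"
            )
        # A size-1 dimension stretches to match the other operand.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


if __name__ == "__main__":
    print(broadcast_shape((2, 1, 4), (3, 1)))  # compatible shapes
    try:
        broadcast_shape((1, 5120), (1, 1536))  # illustrative shapes only
    except RuntimeError as exc:
        print(exc)
```

Because neither size is 1, no broadcasting is possible, which is consistent with weights of one width being combined with code or tensors built for another.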

Expected Behaviour

  • The pipeline manager should detect this class of checkpoint-incompatibility error, automatically clear the stale model cache, and re-download the models rather than failing repeatedly
  • Alternatively, a version hash / manifest check on startup would catch the mismatch before attempting to load
  • If manual intervention is required, the error message should surface to the user as "model cache incompatible, please re-download" rather than a raw tensor error
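The three behaviours above could be sketched roughly as follows. This is a minimal illustration, not the pipeline manager's actual code. The names `load`, `redownload`, `ModelCacheIncompatibleError`, and the `manifest.json` format are all hypothetical, and the sketch assumes the shape mismatch reaches the caller as an exception whose message contains the text seen in the logs.

```python
import hashlib
import json
import re
import shutil
from pathlib import Path

# Matches the broadcast error text seen in the logs.
_SHAPE_MISMATCH = re.compile(
    r"size of tensor a \((\d+)\) must match the size of tensor b \((\d+)\)"
)


class ModelCacheIncompatibleError(RuntimeError):
    """Hypothetical user-facing error that replaces the raw tensor error."""


def is_shape_mismatch(exc: BaseException) -> bool:
    return _SHAPE_MISMATCH.search(str(exc)) is not None


def _sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash in chunks so large checkpoint files are not read into memory at once.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_matches_manifest(models_dir: Path, name: str = "manifest.json") -> bool:
    """Startup check against an assumed manifest: {"relative/path": "<sha256>"}."""
    manifest_path = models_dir / name
    if not manifest_path.is_file():
        return False
    expected = json.loads(manifest_path.read_text())
    for rel_path, digest in expected.items():
        file_path = models_dir / rel_path
        if not file_path.is_file() or _sha256(file_path) != digest:
            return False
    return True


def load_with_cache_recovery(load, redownload, models_dir: Path):
    """Call load(); on a shape mismatch, clear the cache, re-download, retry once."""
    try:
        return load()
    except Exception as exc:
        if not is_shape_mismatch(exc):
            raise
    # Only reached after a shape mismatch: drop the cache and fetch it again.
    shutil.rmtree(models_dir, ignore_errors=True)
    redownload(models_dir)
    try:
        return load()
    except Exception as retry_exc:
        if is_shape_mismatch(retry_exc):
            raise ModelCacheIncompatibleError(
                f"Model cache incompatible, please re-download ({models_dir})"
            ) from retry_exc
        raise
```

Retrying only once keeps a genuinely incompatible checkpoint from causing a download loop. Note that removing the whole models directory also discards models for other pipelines, so a real implementation would probably scope the cleanup to the files of the failing pipeline.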

Notes
