fix: latch worker flag when torch._dynamo.reset() fails to prevent stale-cache recompile#671

Open
livepeer-tessa wants to merge 4 commits into main from fix/dynamo-reset-failure-guard

Conversation

@livepeer-tessa
Contributor

@livepeer-tessa livepeer-tessa commented Mar 11, 2026

Addresses CodeRabbit's review comment on #670.

Problem

If torch._dynamo.reset() raises during _unload_pipeline_by_id_unsafe, the exception was silently swallowed and pipeline_unloaded was published unconditionally. Stale Dynamo/FP8 compile caches remained live in the worker process, so the next krea-realtime-video load would attempt torch.compile against those caches — re-entering the warmup crash from the FP8→Krea conflict that #669 was meant to fix.

Fix

Introduces self._dynamo_reset_failed: bool = False on PipelineManager. When torch._dynamo.reset() raises:

  1. The flag is latched True (persists for the worker process lifetime)
  2. The unload still completes and pipeline_unloaded is published (memory is freed)
  3. Any subsequent krea-realtime-video load sees the flag and forces compile=False, with a warning advising the operator to restart the worker to re-enable compilation

This is safer than failing the unload entirely (which would strand the pipeline in a broken state) while still preventing the stale-cache recompile crash.

Changes

  • __init__: adds self._dynamo_reset_failed = False
  • _unload_pipeline_by_id_unsafe: latches flag on reset failure
  • _load_pipeline_implementation (krea branch): checks flag before deciding compile=
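The latch pattern described above can be sketched as follows. This is a minimal illustration, not the actual pipeline_manager.py code: the class name, the injected reset callable, and the should_compile helper are hypothetical stand-ins.

```python
import logging

logger = logging.getLogger(__name__)


class PipelineManagerSketch:
    """Illustrative sketch of the reset-failure latch, not the real PipelineManager."""

    def __init__(self, dynamo_reset=None):
        # Latched True for the worker process lifetime once the reset fails.
        self._dynamo_reset_failed = False
        # Injected so the sketch is testable without torch; stands in for
        # torch._dynamo.reset() in the real code.
        self._dynamo_reset = dynamo_reset or (lambda: None)

    def unload(self):
        try:
            self._dynamo_reset()
        except Exception:
            # Latch the flag but still complete the unload so memory is freed.
            self._dynamo_reset_failed = True
            logger.warning(
                "torch._dynamo.reset() failed; forcing compile=False for "
                "subsequent krea-realtime-video loads. Restart the worker "
                "to re-enable compilation."
            )
        # The real code still publishes pipeline_unloaded here.

    def should_compile(self, gpu_is_hopper: bool) -> bool:
        # Compile only on Hopper-capable GPUs and only if no reset has failed.
        return gpu_is_hopper and not self._dynamo_reset_failed
```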

Summary by CodeRabbit

  • Bug Fixes

    • Prevented FP8-related compiled kernels from running when incompatible; now skipped with clear warnings.
    • Improved cache reset on pipeline unload to avoid stale FP8 state leaking into future runs.
    • Added safeguard to disable compilation after a failed cache reset to prevent repeated failures.
  • Improvements

    • Smarter compilation enablement based on GPU capability and prior reset state, with user-facing guidance to restart if needed.
    • More stable attention/KV behavior when compilation is turned off; warmup of compiled kernels is skipped with informational messaging.

livepeer-robot added 2 commits March 11, 2026 18:37
Float8DynamicActivationFloat8WeightConfig is not compatible with
torch.compile(fullgraph=False). During warmup on H100 (where compile=True),
AOT autograd's gen_alias_from_base calls aten.as_strided on Float8Tensor
outputs, which is not implemented in torchao:

  NotImplementedError: Float8Tensor dispatch: attempting to run unimplemented
  operator/function: func=<OpOverload(op='aten.as_strided', overload='default')>

The crash manifests specifically after longlive (also FP8) because
torch._dynamo's compile cache is never reset between pipeline switches,
allowing longlive's Float8 dispatch state to persist and influence Krea's
subsequent compile attempt.

Two fixes:

1. krea_realtime_video/pipeline.py: when FP8 quantization is active, skip
   block.compile() — the two optimizations are currently mutually exclusive
   with fullgraph=False. FP8 alone still provides meaningful memory/compute
   savings on H100 without compile.

2. pipeline_manager.py: call torch._dynamo.reset() on every pipeline unload
   to clear stale compiled graphs and Float8 dispatch state, preventing
   cross-pipeline cache pollution.
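Fix 1 boils down to gating block compilation on the quantization mode. A rough sketch under stated assumptions (the enum values and function name are illustrative, not the real pipeline.py API):

```python
from enum import Enum, auto


class Quantization(Enum):
    # Hypothetical enum mirroring the pipeline's quantization modes.
    NONE = auto()
    FP8_E4M3FN = auto()


def should_compile_blocks(compile_requested: bool, quantization: Quantization) -> bool:
    """FP8 dynamic quantization and torch.compile(fullgraph=False) are
    mutually exclusive here, so FP8 wins and compilation is skipped."""
    if not compile_requested:
        return False
    if quantization is Quantization.FP8_E4M3FN:
        # Float8Tensor does not implement aten.as_strided, which AOT
        # autograd's gen_alias_from_base hits; skip block.compile() entirely.
        return False
    return True
```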

Fixes #669

Signed-off-by: livepeer-robot <robot@livepeer.org>
…ale-cache recompile

If torch._dynamo.reset() raises during pipeline unload, stale Dynamo/FP8
compile caches remain active in the worker process. Previously the code
swallowed the exception and published pipeline_unloaded unconditionally,
leaving the next krea-realtime-video load free to torch.compile against
those stale caches — re-entering the warmup crash from the FP8→Krea
conflict.

Fix: set self._dynamo_reset_failed = True on reset failure. The Krea load
path now checks this flag and forces compile=False for the lifetime of the
worker, with a clear log warning to restart the process to re-enable
compilation.

Addresses CodeRabbit review comment on PR #670.

Signed-off-by: livepeer-robot <robot@livepeer.org>
@coderabbitai

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Pipeline now avoids compiling attention blocks and skips warmup when FP8 quantization is active; KV-cache attention bias is set based on compile flag. Pipeline manager tracks torch._dynamo.reset() failures on unload, disables compilation for subsequent loads (with logs) until restart, and only enables compile on Hopper-capable GPUs.

Changes

  • Realtime video pipeline: src/scope/core/pipelines/krea_realtime_video/pipeline.py
    Added FP8-aware gating: when FP8 quantization is detected, log a warning and skip attention block compilation and warmup. Warmup computations (local_attn_size, num_frame_per_block, warmup_runs) and their execution run only when compile=True. Initialize kv_cache_attention_bias to DEFAULT_KV_CACHE_ATTENTION_BIAS when compiling, otherwise KV_CACHE_ATTENTION_BIAS_DISABLED.
  • Pipeline manager & dynamo handling: src/scope/server/pipeline_manager.py
    Introduces a private _dynamo_reset_failed flag; on unload, attempts torch._dynamo.reset() and sets the flag on failure. Loading computes _should_compile only if the GPU is Hopper-capable and _dynamo_reset_failed is false; propagates the compile decision into KreaRealtimeVideoPipeline construction and logs guidance when compilation is disabled due to a prior reset failure.

Sequence Diagram

sequenceDiagram
    participant User as User/Caller
    participant PM as PipelineManager
    participant Dynamo as torch._dynamo
    participant GPU as GPU
    participant Pipeline as KreaRealtimeVideoPipeline

    User->>PM: load_pipeline()
    activate PM

    rect rgba(100, 149, 237, 0.5)
    Note over PM,GPU: Determine compilation capability
    PM->>GPU: Check Hopper capability
    GPU-->>PM: Hopper? (yes/no)
    PM->>PM: Check _dynamo_reset_failed flag
    end

    alt _dynamo_reset_failed OR not Hopper
        PM->>PM: _should_compile = False
        PM->>PM: Log compilation disabled (warning)
    else
        PM->>PM: _should_compile = True
    end

    rect rgba(152, 251, 152, 0.5)
    Note over PM,Pipeline: Construct pipeline with compile decision
    PM->>Pipeline: KreaRealtimeVideoPipeline(compile=_should_compile)
    activate Pipeline
    Pipeline->>Pipeline: Initialize public state (KV bias depends on compile)
    Pipeline->>Pipeline: Check FP8 quantization config
    alt FP8 enabled
        Pipeline->>Pipeline: Log warning about FP8 + compile incompatibility
        Pipeline->>Pipeline: Skip attention block compilation & warmup
    else
        Pipeline->>Pipeline: Compile attention blocks
        Pipeline->>Pipeline: Perform warmup priming compiled kernels
    end
    Pipeline-->>PM: Pipeline ready
    deactivate Pipeline
    end

    PM-->>User: Pipeline loaded
    deactivate PM

    User->>PM: unload_pipeline()
    activate PM
    rect rgba(240, 128, 128, 0.5)
    Note over PM,Dynamo: Reset torch._dynamo cache to avoid FP8 cache leakage
    PM->>Dynamo: reset()
    alt Reset succeeds
        Dynamo-->>PM: Success
        PM->>PM: Clear _dynamo_reset_failed flag
    else Reset fails
        Dynamo-->>PM: Error
        PM->>PM: Set _dynamo_reset_failed flag
        PM->>PM: Log warning advising restart to re-enable compilation
    end
    end
    PM-->>User: Pipeline unloaded
    deactivate PM

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through kernels, sniffed the FP8 breeze,
Told the compiler "pause" when floats did tease.
I set the cache bias, left a tiny flag,
If Dynamo trips, I wag my debugging bag. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title directly and clearly describes the primary change, latching a worker flag when torch._dynamo.reset() fails to prevent stale-cache recompilation.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/scope/server/pipeline_manager.py (1)

699-713: Narrow the warning to the path that actually consumes the latch.

Lines 709-710 say this forces compile=False for all later pipeline loads, but _dynamo_reset_failed is only checked in the krea-realtime-video branch at Lines 982-999. Tightening the message will keep operators from assuming other loaders are protected.

✏️ Suggested wording
-                "forcing compile=False for all subsequent pipeline loads."
+                "forcing compile=False for subsequent krea-realtime-video loads."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scope/server/pipeline_manager.py` around lines 699 - 713, The warning
message after catching exception from torch._dynamo.reset() is too broad; update
the logged text to indicate that only the krea-realtime-video loader currently
consumes _dynamo_reset_failed and will force compile=False for subsequent
pipeline loads handled by that branch (the logic that checks
_dynamo_reset_failed in the krea-realtime-video branch). Edit the message
emitted in the except block where _dynamo_reset_failed is set so it explicitly
mentions the krea-realtime-video path and that other loaders may not be
affected, and keep the existing assignment to self._dynamo_reset_failed so the
existing krea-realtime-video check continues to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/scope/server/pipeline_manager.py`:
- Around line 982-999: The compile flag passed into KreaRealtimeVideoPipeline is
not propagated to the warmup path, so even when compile=False the warmup loop
still sets kv_cache_attention_bias to DEFAULT_KV_CACHE_ATTENTION_BIAS and
triggers torch.compile; update the pipeline code in KreaRealtimeVideoPipeline so
the warmup routine checks the instance's compile flag (or accept a compile
parameter) and when compile is False: (a) avoid assigning
DEFAULT_KV_CACHE_ATTENTION_BIAS (use None or a non-compiling sentinel) and (b)
skip calling block.compile(fullgraph=False) inside the warmup loop; ensure you
reference and use the pipeline attribute (compile) and symbols
kv_cache_attention_bias, DEFAULT_KV_CACHE_ATTENTION_BIAS, and the warmup loop
where block.compile is invoked.

---

Nitpick comments:
In `@src/scope/server/pipeline_manager.py`:
- Around line 699-713: The warning message after catching exception from
torch._dynamo.reset() is too broad; update the logged text to indicate that only
the krea-realtime-video loader currently consumes _dynamo_reset_failed and will
force compile=False for subsequent pipeline loads handled by that branch (the
logic that checks _dynamo_reset_failed in the krea-realtime-video branch). Edit
the message emitted in the except block where _dynamo_reset_failed is set so it
explicitly mentions the krea-realtime-video path and that other loaders may not
be affected, and keep the existing assignment to self._dynamo_reset_failed so
the existing krea-realtime-video check continues to work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 052a352d-7d44-43da-82f5-f7f5d0673f5e

📥 Commits

Reviewing files that changed from the base of the PR and between 5f6ee61 and 2797c1c.

📒 Files selected for processing (2)
  • src/scope/core/pipelines/krea_realtime_video/pipeline.py
  • src/scope/server/pipeline_manager.py

…ompile=False

When compile=False, kv_cache_attention_bias was still being set to
DEFAULT_KV_CACHE_ATTENTION_BIAS (0.3), which causes the warmup loop to enter
the flex_attention code path and trigger torch._dynamo tracing even though no
block.compile() call was ever made. This meant the _dynamo_reset_failed guard
in pipeline_manager.py had no effect on the warmup-induced recompilation.

Fix:
- Import KV_CACHE_ATTENTION_BIAS_DISABLED (1.0) from causal_model and use it
  as the initial kv_cache_attention_bias when compile=False. This sentinel
  makes causal_model.py take the standard attention branch and skip the
  flex_attention/torch.compile path entirely.
- Guard the warmup loop behind 'if compile:' — warmup exists solely to prime
  the compiled flex_attention kernel, so it is a no-op (and harmful) when
  compilation is disabled. Log a message when skipped for observability.

Addresses CodeRabbit review comment on PR #671.

Signed-off-by: livepeer-robot <robot@livepeer.org>
@github-actions
Contributor

github-actions bot commented Mar 11, 2026

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-671--preview
WebSocket wss://fal.run/daydream/scope-pr-671--preview/ws
Commit 0c360be

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-671--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py`:
- Around line 233-235: The warmup_runs calculation currently uses floor division
and should use ceiling to ensure warmup covers the full number of blocks;
replace the expression for warmup_runs (which references local_attn_size and
num_frame_per_block) with ceil(local_attn_size / num_frame_per_block) + 1 and
add the necessary import (e.g., import math or from math import ceil) so the
warmup count rounds up when values aren’t evenly divisible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 636a55cb-9cf6-4300-8d23-8ab320a824f5

📥 Commits

Reviewing files that changed from the base of the PR and between 2797c1c and 3ee5402.

📒 Files selected for processing (1)
  • src/scope/core/pipelines/krea_realtime_video/pipeline.py

The comment at line 230 already specifies ceil(local_attn_size / num_frame_per_block) + 1,
but the implementation was using floor division (//). When local_attn_size is not
evenly divisible by num_frame_per_block, this meant warmup stopped one iteration early,
leaving the cache short of the steady-state shape and triggering a recompile on the
first live request.

Replace with the ceiling equivalent: (a + b - 1) // b to avoid importing math.
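The integer-ceiling identity used here, as a quick self-contained sketch (the warmup_runs wrapper mirrors the formula in the code comment; its name here is illustrative):

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers, without importing math.
    Equivalent to math.ceil(a / b) but stays in exact integer arithmetic."""
    return (a + b - 1) // b


# Warmup count per the code comment: ceil(local_attn_size / num_frame_per_block) + 1
def warmup_runs(local_attn_size: int, num_frame_per_block: int) -> int:
    return ceil_div(local_attn_size, num_frame_per_block) + 1
```

With floor division, warmup_runs(7, 3) would give 3 instead of 4, stopping one iteration short of the steady-state cache shape.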

Fixes coderabbitai suggestion on PR #671.

Signed-off-by: livepeer-robot <robot@livepeer.org>
@livepeer-tessa
Contributor Author

Good catch @coderabbitai — the comment already specified ceil but the implementation used floor division. Fixed with the integer ceiling equivalent (a + b - 1) // b (no extra import needed). Pushed in 0c360be.

@github-actions
Contributor

github-actions bot commented Mar 11, 2026

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-671--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

@coderabbitai

coderabbitai bot commented Mar 11, 2026


@livepeer-tessa: Thanks for the quick fix! The integer ceiling formula (a + b - 1) // b is exactly right — it avoids any floating-point conversion and keeps the logic consistent with the ceil comment. Looks good! 🎉


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/scope/core/pipelines/krea_realtime_video/pipeline.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Fix Ruff formatting.

The pipeline indicates a formatting violation. Run ruff format on this file to resolve.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py` at line 1, Run the
code formatter (ruff format) on this module to fix the formatting violation;
specifically ensure the top-level import line ("import logging") and the rest of
pipeline.py follow ruff/black formatting rules (correct import/newline spacing,
trailing newline, and any minor whitespace issues) and then re-run ruff to
confirm no remaining violations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py`:
- Around line 141-152: The current logic logs skipping block.compile() when FP8
is active but leaves the `compile` flag true, causing `initial_kv_bias` and the
warmup path to still run and trigger flex_attention tracing; update the flow so
you track whether compilation actually happened (e.g., set a new boolean like
`compiled` or flip `compile` to False after skipping) immediately after the
Float8 check where `block.compile()` is skipped, and then use that actual
compilation indicator in the subsequent logic that sets `initial_kv_bias` and
controls the warmup/flex_attention branch (the code around `block.compile()`,
`initial_kv_bias`, and the warmup lines) so warmup does not run when compilation
was skipped due to FP8.

---

Outside diff comments:
In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py`:
- Line 1: Run the code formatter (ruff format) on this module to fix the
formatting violation; specifically ensure the top-level import line ("import
logging") and the rest of pipeline.py follow ruff/black formatting rules
(correct import/newline spacing, trailing newline, and any minor whitespace
issues) and then re-run ruff to confirm no remaining violations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7efd0c2d-363f-41b4-9d07-bf3dd528cc1d

📥 Commits

Reviewing files that changed from the base of the PR and between 3ee5402 and 0c360be.

📒 Files selected for processing (1)
  • src/scope/core/pipelines/krea_realtime_video/pipeline.py

Comment on lines +141 to +152
    if compile:
        # Float8DynamicActivationFloat8WeightConfig is incompatible with
        # torch.compile(fullgraph=False): AOT autograd's gen_alias_from_base
        # calls aten.as_strided on Float8Tensor outputs, which is not
        # implemented. Skip block compilation when FP8 is active.
        # See: https://github.com/daydreamlive/scope/issues/669
        logger.warning(
            "Skipping torch.compile for attention blocks: "
            "Float8DynamicActivationFloat8WeightConfig is not compatible "
            "with fullgraph=False compilation (aten.as_strided unsupported "
            "on Float8Tensor). FP8 quantization is still active."
        )

⚠️ Potential issue | 🟠 Major

FP8 + compile=True still triggers flex_attention warmup.

When FP8 quantization is active and compile=True, the code:

  1. Logs the warning and skips block.compile() (correct)
  2. But compile remains True, so initial_kv_bias is set to 0.3 (line 212)
  3. Warmup runs (line 232) with bias < 1.0, which per lines 223-226 "would otherwise enter the flex_attention code path... and trigger torch._dynamo tracing"

This defeats the purpose of skipping compilation for FP8. Consider tracking whether compilation actually occurred:

Proposed fix
+        # Track whether block compilation actually happens (FP8 is incompatible)
+        did_compile = False
+
         if quantization == Quantization.FP8_E4M3FN:
             # Cast before optional quantization
             generator = generator.to(dtype=dtype)
@@ -140,6 +143,7 @@
         else:
             generator = generator.to(device=device, dtype=dtype)

             if compile:
                 # Only compile the attention blocks
                 for block in generator.model.blocks:
                     # Disable fullgraph right now due to issues with RoPE
                     block.compile(fullgraph=False)
+                did_compile = True

         # ... later ...

         initial_kv_bias = (
-            DEFAULT_KV_CACHE_ATTENTION_BIAS if compile else KV_CACHE_ATTENTION_BIAS_DISABLED
+            DEFAULT_KV_CACHE_ATTENTION_BIAS if did_compile else KV_CACHE_ATTENTION_BIAS_DISABLED
         )

         # ... and ...

-        if compile:
+        if did_compile:
             local_attn_size = getattr(model_config, "local_attn_size", 6)

Also applies to: 211-214, 232-250

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py` around lines 141 -
152, The current logic logs skipping block.compile() when FP8 is active but
leaves the `compile` flag true, causing `initial_kv_bias` and the warmup path to
still run and trigger flex_attention tracing; update the flow so you track
whether compilation actually happened (e.g., set a new boolean like `compiled`
or flip `compile` to False after skipping) immediately after the Float8 check
where `block.compile()` is skipped, and then use that actual compilation
indicator in the subsequent logic that sets `initial_kv_bias` and controls the
warmup/flex_attention branch (the code around `block.compile()`,
`initial_kv_bias`, and the warmup lines) so warmup does not run when compilation
was skipped due to FP8.

