fix: latch worker flag when torch._dynamo.reset() fails to prevent stale-cache recompile#671
livepeer-tessa wants to merge 4 commits into `main`
Conversation
`Float8DynamicActivationFloat8WeightConfig` is not compatible with `torch.compile(fullgraph=False)`. During warmup on H100 (where `compile=True`), AOT autograd's `gen_alias_from_base` calls `aten.as_strided` on `Float8Tensor` outputs, which is not implemented in torchao:

```
NotImplementedError: Float8Tensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.as_strided', overload='default')>
```

The crash manifests specifically after longlive (also FP8) because `torch._dynamo`'s compile cache is never reset between pipeline switches, allowing longlive's Float8 dispatch state to persist and influence Krea's subsequent compile attempt.

Two fixes:

1. `krea_realtime_video/pipeline.py`: when FP8 quantization is active, skip `block.compile()` — the two optimizations are currently mutually exclusive with `fullgraph=False`. FP8 alone still provides meaningful memory/compute savings on H100 without compile.
2. `pipeline_manager.py`: call `torch._dynamo.reset()` on every pipeline unload to clear stale compiled graphs and Float8 dispatch state, preventing cross-pipeline cache pollution.

Fixes #669

Signed-off-by: livepeer-robot <robot@livepeer.org>
…ale-cache recompile

If `torch._dynamo.reset()` raises during pipeline unload, stale Dynamo/FP8 compile caches remain active in the worker process. Previously the code swallowed the exception and published `pipeline_unloaded` unconditionally, leaving the next krea-realtime-video load free to `torch.compile` against those stale caches — re-entering the warmup crash from the FP8→Krea conflict.

Fix: set `self._dynamo_reset_failed = True` on reset failure. The Krea load path now checks this flag and forces `compile=False` for the lifetime of the worker, with a clear log warning to restart the process to re-enable compilation.

Addresses CodeRabbit review comment on PR #670.

Signed-off-by: livepeer-robot <robot@livepeer.org>
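The latch-and-check flow described in this commit can be sketched as follows. This is a minimal, torch-free sketch, not the actual implementation: the real code calls `torch._dynamo.reset()` directly inside the unload path, and the simplified method names (`unload_pipeline`, `should_compile`) and the injected `dynamo_reset` callable are hypothetical stand-ins for illustration.

```python
import logging

logger = logging.getLogger(__name__)


class PipelineManager:
    """Minimal sketch of the worker-lifetime latch pattern from the PR."""

    def __init__(self, dynamo_reset=None):
        # Injected so the sketch runs without torch; the real code calls
        # torch._dynamo.reset() directly.
        self._dynamo_reset = dynamo_reset or (lambda: None)
        self._dynamo_reset_failed = False  # latched for the worker's lifetime

    def unload_pipeline(self):
        try:
            self._dynamo_reset()
        except Exception:
            # Latch rather than re-raise: failing the unload would strand the
            # pipeline, but we must not compile against stale caches later.
            self._dynamo_reset_failed = True
            logger.warning(
                "torch._dynamo.reset() failed; forcing compile=False for "
                "subsequent krea-realtime-video loads. Restart the worker "
                "to re-enable compilation."
            )
        # Unload still completes either way, so GPU memory is freed.

    def should_compile(self, hopper: bool) -> bool:
        # Mirrors the load-path check: compile only on Hopper and only while
        # the dynamo caches are known to be clean.
        return hopper and not self._dynamo_reset_failed
```

A failed reset permanently disables compilation for that worker, which is the safer trade-off: the stale-cache recompile crash is prevented at the cost of slower inference until the process restarts.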
📝 Walkthrough

Pipeline now avoids compiling attention blocks and skips warmup when FP8 quantization is active; the KV-cache attention bias is set based on the compile flag. Pipeline manager tracks `torch._dynamo.reset()` failures and forces `compile=False` on subsequent krea-realtime-video loads.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Caller
    participant PM as PipelineManager
    participant Dynamo as torch._dynamo
    participant GPU as GPU
    participant Pipeline as KreaRealtimeVideoPipeline

    User->>PM: load_pipeline()
    activate PM
    rect rgba(100, 149, 237, 0.5)
        Note over PM,GPU: Determine compilation capability
        PM->>GPU: Check Hopper capability
        GPU-->>PM: Hopper? (yes/no)
        PM->>PM: Check _dynamo_reset_failed flag
    end
    alt _dynamo_reset_failed OR not Hopper
        PM->>PM: _should_compile = False
        PM->>PM: Log compilation disabled (warning)
    else
        PM->>PM: _should_compile = True
    end
    rect rgba(152, 251, 152, 0.5)
        Note over PM,Pipeline: Construct pipeline with compile decision
        PM->>Pipeline: KreaRealtimeVideoPipeline(compile=_should_compile)
        activate Pipeline
        Pipeline->>Pipeline: Initialize public state (KV bias depends on compile)
        Pipeline->>Pipeline: Check FP8 quantization config
        alt FP8 enabled
            Pipeline->>Pipeline: Log warning about FP8 + compile incompatibility
            Pipeline->>Pipeline: Skip attention block compilation & warmup
        else
            Pipeline->>Pipeline: Compile attention blocks
            Pipeline->>Pipeline: Perform warmup priming compiled kernels
        end
        Pipeline-->>PM: Pipeline ready
        deactivate Pipeline
    end
    PM-->>User: Pipeline loaded
    deactivate PM

    User->>PM: unload_pipeline()
    activate PM
    rect rgba(240, 128, 128, 0.5)
        Note over PM,Dynamo: Reset torch._dynamo cache to avoid FP8 cache leakage
        PM->>Dynamo: reset()
        alt Reset succeeds
            Dynamo-->>PM: Success
            PM->>PM: Clear _dynamo_reset_failed flag
        else Reset fails
            Dynamo-->>PM: Error
            PM->>PM: Set _dynamo_reset_failed flag
            PM->>PM: Log warning advising restart to re-enable compilation
        end
    end
    PM-->>User: Pipeline unloaded
    deactivate PM
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/scope/server/pipeline_manager.py (1)
699-713: Narrow the warning to the path that actually consumes the latch.

Lines 709-710 say this forces `compile=False` for all later pipeline loads, but `_dynamo_reset_failed` is only checked in the krea-realtime-video branch at lines 982-999. Tightening the message will keep operators from assuming other loaders are protected.

✏️ Suggested wording

```diff
- "forcing compile=False for all subsequent pipeline loads."
+ "forcing compile=False for subsequent krea-realtime-video loads."
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/scope/server/pipeline_manager.py` around lines 699-713: the warning message after catching an exception from torch._dynamo.reset() is too broad; update the logged text to indicate that only the krea-realtime-video loader currently consumes _dynamo_reset_failed and will force compile=False for subsequent pipeline loads handled by that branch. Edit the message emitted in the except block where _dynamo_reset_failed is set so it explicitly mentions the krea-realtime-video path and that other loaders may not be affected, and keep the existing assignment to self._dynamo_reset_failed so the existing krea-realtime-video check continues to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/pipeline_manager.py`:
- Around line 982-999: The compile flag passed into KreaRealtimeVideoPipeline is
not propagated to the warmup path, so even when compile=False the warmup loop
still sets kv_cache_attention_bias to DEFAULT_KV_CACHE_ATTENTION_BIAS and
triggers torch.compile; update the pipeline code in KreaRealtimeVideoPipeline so
the warmup routine checks the instance's compile flag (or accept a compile
parameter) and when compile is False: (a) avoid assigning
DEFAULT_KV_CACHE_ATTENTION_BIAS (use None or a non-compiling sentinel) and (b)
skip calling block.compile(fullgraph=False) inside the warmup loop; ensure you
reference and use the pipeline attribute (compile) and symbols
kv_cache_attention_bias, DEFAULT_KV_CACHE_ATTENTION_BIAS, and the warmup loop
where block.compile is invoked.
📒 Files selected for processing (2)

- src/scope/core/pipelines/krea_realtime_video/pipeline.py
- src/scope/server/pipeline_manager.py
…ompile=False

When `compile=False`, `kv_cache_attention_bias` was still being set to `DEFAULT_KV_CACHE_ATTENTION_BIAS` (0.3), which causes the warmup loop to enter the flex_attention code path and trigger `torch._dynamo` tracing even though no `block.compile()` call was ever made. This meant the `_dynamo_reset_failed` guard in pipeline_manager.py had no effect on the warmup-induced recompilation.

Fix:

- Import `KV_CACHE_ATTENTION_BIAS_DISABLED` (1.0) from causal_model and use it as the initial `kv_cache_attention_bias` when `compile=False`. This sentinel makes causal_model.py take the standard attention branch and skip the flex_attention/torch.compile path entirely.
- Guard the warmup loop behind `if compile:` — warmup exists solely to prime the compiled flex_attention kernel, so it is a no-op (and harmful) when compilation is disabled. Log a message when skipped for observability.

Addresses CodeRabbit review comment on PR #671.

Signed-off-by: livepeer-robot <robot@livepeer.org>
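The bias-sentinel selection described in this commit can be sketched as a small helper. The constant values (0.3 and 1.0) come from the commit message itself; the function name `select_initial_kv_bias` is a hypothetical stand-in, since the real code sets the attribute inline in the pipeline constructor.

```python
# Values from the commit message: a bias below 1.0 routes causal_model into
# the flex_attention (compiled) path; exactly 1.0 is the disabled sentinel.
DEFAULT_KV_CACHE_ATTENTION_BIAS = 0.3
KV_CACHE_ATTENTION_BIAS_DISABLED = 1.0


def select_initial_kv_bias(compile_enabled: bool) -> float:
    """Pick the initial KV-cache attention bias to match the compile decision."""
    if compile_enabled:
        return DEFAULT_KV_CACHE_ATTENTION_BIAS
    # The 1.0 sentinel keeps causal_model on the standard attention branch,
    # so no torch._dynamo tracing is triggered when compilation is off.
    return KV_CACHE_ATTENTION_BIAS_DISABLED
```

The key property is that the warmup path and the bias agree: with compilation off, neither the bias nor the warmup loop can pull the model into a code path that traces graphs.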
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py`:
- Around line 233-235: The warmup_runs calculation currently uses floor division
and should use ceiling to ensure warmup covers the full number of blocks;
replace the expression for warmup_runs (which references local_attn_size and
num_frame_per_block) with ceil(local_attn_size / num_frame_per_block) + 1 and
add the necessary import (e.g., import math or from math import ceil) so the
warmup count rounds up when values aren’t evenly divisible.
📒 Files selected for processing (1)

- src/scope/core/pipelines/krea_realtime_video/pipeline.py
The comment at line 230 already specifies `ceil(local_attn_size / num_frame_per_block) + 1`, but the implementation was using floor division (`//`). When `local_attn_size` is not evenly divisible by `num_frame_per_block`, this meant warmup stopped one iteration early, leaving the cache short of the steady-state shape and triggering a recompile on the first live request.

Replace with the ceiling equivalent, `(a + b - 1) // b`, to avoid importing math.

Fixes coderabbitai suggestion on PR #671.

Signed-off-by: livepeer-robot <robot@livepeer.org>
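The floor-versus-ceiling difference this commit fixes can be checked directly. `ceil_div` is a hypothetical helper name, and the example operand values are illustrative rather than taken from the real config:

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without importing math: (a + b - 1) // b."""
    return (a + b - 1) // b


# Matches math.ceil for positive operands, including evenly divisible cases.
for local_attn_size in range(1, 25):
    for num_frame_per_block in range(1, 9):
        assert ceil_div(local_attn_size, num_frame_per_block) == math.ceil(
            local_attn_size / num_frame_per_block
        )

# Illustrative mismatch: with a=6, b=4, floor gives 1 but ceiling gives 2,
# so a warmup count of floor + 1 would stop one iteration short of ceil + 1.
assert 6 // 4 + 1 == 2
assert ceil_div(6, 4) + 1 == 3
```

The identity only holds for positive integers, which is safe here since both operands are positive block/window sizes.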
Good catch @coderabbitai — the comment already specified `ceil(local_attn_size / num_frame_per_block) + 1`.
✅ E2E Tests passed
Test Artifacts: check the workflow run for screenshots.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/scope/core/pipelines/krea_realtime_video/pipeline.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Fix Ruff formatting.

The pipeline indicates a formatting violation. Run `ruff format` on this file to resolve.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py` at line 1, Run the code formatter (ruff format) on this module to fix the formatting violation; specifically ensure the top-level import line ("import logging") and the rest of pipeline.py follow ruff/black formatting rules (correct import/newline spacing, trailing newline, and any minor whitespace issues) and then re-run ruff to confirm no remaining violations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/core/pipelines/krea_realtime_video/pipeline.py`:
- Around line 141-152: The current logic logs skipping block.compile() when FP8
is active but leaves the `compile` flag true, causing `initial_kv_bias` and the
warmup path to still run and trigger flex_attention tracing; update the flow so
you track whether compilation actually happened (e.g., set a new boolean like
`compiled` or flip `compile` to False after skipping) immediately after the
Float8 check where `block.compile()` is skipped, and then use that actual
compilation indicator in the subsequent logic that sets `initial_kv_bias` and
controls the warmup/flex_attention branch (the code around `block.compile()`,
`initial_kv_bias`, and the warmup lines) so warmup does not run when compilation
was skipped due to FP8.
📒 Files selected for processing (1)

- src/scope/core/pipelines/krea_realtime_video/pipeline.py
```python
if compile:
    # Float8DynamicActivationFloat8WeightConfig is incompatible with
    # torch.compile(fullgraph=False): AOT autograd's gen_alias_from_base
    # calls aten.as_strided on Float8Tensor outputs, which is not
    # implemented. Skip block compilation when FP8 is active.
    # See: https://github.com/daydreamlive/scope/issues/669
    logger.warning(
        "Skipping torch.compile for attention blocks: "
        "Float8DynamicActivationFloat8WeightConfig is not compatible "
        "with fullgraph=False compilation (aten.as_strided unsupported "
        "on Float8Tensor). FP8 quantization is still active."
    )
```
FP8 + compile=True still triggers flex_attention warmup.

When FP8 quantization is active and `compile=True`, the code:

- Logs the warning and skips `block.compile()` (correct)
- But `compile` remains `True`, so `initial_kv_bias` is set to 0.3 (line 212)
- Warmup runs (line 232) with bias < 1.0, which per lines 223-226 "would otherwise enter the flex_attention code path... and trigger torch._dynamo tracing"

This defeats the purpose of skipping compilation for FP8. Consider tracking whether compilation actually occurred:
Proposed fix

```diff
+    # Track whether block compilation actually happens (FP8 is incompatible)
+    did_compile = False
+
     if quantization == Quantization.FP8_E4M3FN:
         # Cast before optional quantization
         generator = generator.to(dtype=dtype)
@@ -140,6 +143,7 @@
     else:
         generator = generator.to(device=device, dtype=dtype)
     if compile:
         # Only compile the attention blocks
         for block in generator.model.blocks:
             # Disable fullgraph right now due to issues with RoPE
             block.compile(fullgraph=False)
+            did_compile = True
     # ... later ...
     initial_kv_bias = (
-        DEFAULT_KV_CACHE_ATTENTION_BIAS if compile else KV_CACHE_ATTENTION_BIAS_DISABLED
+        DEFAULT_KV_CACHE_ATTENTION_BIAS if did_compile else KV_CACHE_ATTENTION_BIAS_DISABLED
     )
     # ... and ...
-    if compile:
+    if did_compile:
         local_attn_size = getattr(model_config, "local_attn_size", 6)
```

Also applies to: 211-214, 232-250
Addresses CodeRabbit's review comment on #670.
Problem

If `torch._dynamo.reset()` raises during `_unload_pipeline_by_id_unsafe`, the exception was silently swallowed and `pipeline_unloaded` was published unconditionally. Stale Dynamo/FP8 compile caches remained live in the worker process, so the next krea-realtime-video load would attempt `torch.compile` against those caches — re-entering the warmup crash from the FP8→Krea conflict that #669 was meant to fix.

Fix

Introduces `self._dynamo_reset_failed: bool = False` on `PipelineManager`. When `torch._dynamo.reset()` raises:

- The flag is latched to `True` (persists for the worker process lifetime)
- `pipeline_unloaded` is still published (memory is freed)
- The next krea-realtime-video load sees the flag and forces `compile=False`, with a warning log to restart the worker

This is safer than failing the unload entirely (which would strand the pipeline in a broken state) while still preventing the stale-cache recompile crash.

Changes

- `__init__`: adds `self._dynamo_reset_failed = False`
- `_unload_pipeline_by_id_unsafe`: latches flag on reset failure
- `_load_pipeline_implementation` (krea branch): checks flag before deciding `compile`