streamdiffusion-sdxl: CUDA OOM on pipeline reload crashes worker after 3 restarts — GPU memory not fully freed between loads #723

@livepeer-tessa

Description

Summary

The streamdiffusion-sdxl pipeline on fal.ai workers enters a terminal ERROR state when pipeline parameters are updated. The cleanup_gpu_memory function fails to fully reclaim VRAM, leaving ~0.84 GB allocated plus large amounts of non-PyTorch memory (TensorRT/CUDA context). Subsequent reload attempts fail immediately with torch.OutOfMemoryError, and after 3 restarts the process guardian gives up.

First observed: 2026-03-20 ~18:12 UTC
Source: Grafana/Loki {container="live-video-to-video_streamdiffusion-sdxl_*"}
Affected streams: multiple (e.g. str_NiGmm2AkNBisfMQo, str_kM9rT9Sk69LsQU8o)

Error Chain

timestamp=2026-03-20 18:12:37 level=ERROR location=pipeline.py:142:_reload_pipeline
[update_params] Error reloading pipeline, falling back to previous params:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB.
GPU 0 has a total capacity of 23.53 GiB of which 14.12 MiB is free.
Including non-PyTorch memory, this process has 1.34 GiB memory in use.
Of the allocated memory 862.59 MiB is allocated by PyTorch, and 61.41 MiB is reserved by PyTorch but unallocated.

When the fallback also fails:

RuntimeError: Failed to reload pipeline with previous params: CUDA out of memory.
Tried to allocate 98.00 MiB. ...only 14.12 MiB is free.

After 3 restarts:

timestamp=2026-03-20 18:13:12 level=ERROR location=process_guardian.py:334:_monitor_loop
Failed to stop streamer and restart process. Moving to ERROR state
Exception: Pipeline process max restarts reached (3)

Root Cause Analysis

The cleanup_gpu_memory call reports success but leaves 0.84 GB allocated / 0.90 GB cached in PyTorch:

GPU Memory after cleanup: 0.84GB allocated, 0.90GB cached

Meanwhile the GPU reports only 14.12 MiB free of 23.53 GiB total, while PyTorch accounts for under 1 GiB. Roughly 22.7 GiB is therefore held outside PyTorch's allocator (TensorRT engines, CUDA contexts, etc.), none of which cleanup_gpu_memory releases.
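The gap is easy to surface in the logs by printing the driver-level numbers next to PyTorch's allocator stats. A minimal diagnostic sketch (report_gpu_memory is a hypothetical helper, not an existing function in the codebase):

import torch

def report_gpu_memory(device: int = 0) -> None:
    # Driver-level view: counts every process on the GPU, not just ours.
    free_b, total_b = torch.cuda.mem_get_info(device)
    # PyTorch's own accounting for this process.
    allocated_b = torch.cuda.memory_allocated(device)
    reserved_b = torch.cuda.memory_reserved(device)
    gib = 1024 ** 3
    non_torch_b = (total_b - free_b) - reserved_b
    print(
        f"GPU {device}: total={total_b / gib:.2f} GiB, free={free_b / gib:.2f} GiB, "
        f"torch allocated={allocated_b / gib:.2f} GiB, "
        f"torch reserved={reserved_b / gib:.2f} GiB, "
        f"non-PyTorch={non_torch_b / gib:.2f} GiB"
    )

Note that mem_get_info reflects the whole device, so the non-PyTorch figure also includes memory held by any other process on the GPU.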

The crash occurs in wrapper.py:_load_model while moving text_encoder_2 to CUDA:

pipe.text_encoder_2 = pipe.text_encoder_2.to(device=self.device)  # line 1179

The model runs with acceleration='tensorrt'; the compiled TensorRT engines are likely not being destroyed or released as part of cleanup.

Stack Trace

File "/app/runner/src/runner/live/process/process.py", line 201, in process_loop
  asyncio.run(self._run_pipeline_loops())
...
File "/app/runner/src/runner/live/process/process.py", line 244, in _run_pipeline_loops
  pipeline = await self._initialize_pipeline()
...
File "/app/live/streamdiffusion/pipeline/pipeline.py", line 152, in _reload_pipeline
  raise RuntimeError(f"Failed to reload pipeline with previous params: {e}") from e
RuntimeError: Failed to reload pipeline with previous params: CUDA out of memory. Tried to allocate 98.00 MiB...

Impact

  • Streams fail permanently after param update, requiring manual container restart
  • Affects multiple concurrent streams (at least two observed in a single 12-hour window)
  • Pipeline enters ERROR state with no automatic recovery

Suggested Fix

  1. TensorRT engine cleanup: Explicitly destroy TensorRT engines in cleanup_gpu_memory before reloading: pop the CUDA context (.cuda_context.pop()) and drop references to the compiled TensorRT modules (see the cleanup sketch after this list).
  2. CUDA context reset: After dropping those references, call gc.collect() first, then torch.cuda.synchronize() and torch.cuda.empty_cache(), so freed tensors are actually returned to the driver instead of staying in PyTorch's cache.
  3. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as an env-var workaround to reduce fragmentation (already suggested in the error message).
  4. Add a VRAM check before reload attempts: if free VRAM is below the estimated model footprint, force a harder cleanup or a restart instead of retrying (see the preflight sketch below).
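For fixes 1 and 2, a hardened cleanup_gpu_memory might look like the sketch below. This is a sketch under assumptions: the attribute names trt_context, trt_engine, and cuda_context are placeholders for whatever the wrapper actually stores, and it assumes the wrapper pushed its own PyCUDA context for TensorRT execution.

import gc
import torch

def cleanup_gpu_memory(pipe) -> None:
    # Drop TensorRT objects first; TensorRT frees engine device memory
    # once the last Python reference disappears. The attribute names
    # here are illustrative, not the wrapper's actual fields.
    for name in ("trt_context", "trt_engine"):
        if hasattr(pipe, name):
            setattr(pipe, name, None)

    # If the wrapper pushed a PyCUDA context for TensorRT, pop it.
    ctx = getattr(pipe, "cuda_context", None)
    if ctx is not None:
        ctx.pop()
        pipe.cuda_context = None

    # Collect the dropped references, then flush PyTorch's caching
    # allocator back to the driver. gc.collect() must come first, or
    # empty_cache() has nothing new to return.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()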
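For fix 4, a minimal preflight check could gate each reload attempt. ESTIMATED_MODEL_BYTES and has_room_for_reload are illustrative names, not existing code:

import torch

# Illustrative constant; in practice derive it from the engine and
# checkpoint sizes of the configured pipeline.
ESTIMATED_MODEL_BYTES = 8 * 1024 ** 3

def has_room_for_reload(device: int = 0) -> bool:
    # mem_get_info reflects the whole device, so this also guards
    # against memory held by other processes.
    free_b, _total_b = torch.cuda.mem_get_info(device)
    return free_b >= ESTIMATED_MODEL_BYTES

The reload path would call has_room_for_reload() before each attempt and, on failure, escalate to the harder cleanup above or a full process restart rather than spending one of the guardian's three restarts on a guaranteed OOM. For fix 3, note that PYTORCH_CUDA_ALLOC_CONF must be set in the container environment before the process initializes CUDA; setting it from Python after the first allocation has no effect.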
