Summary
The streamdiffusion-sdxl pipeline on fal.ai workers enters a terminal ERROR state when pipeline parameters are updated. The cleanup_gpu_memory function fails to fully reclaim VRAM, leaving ~0.84 GB allocated plus large amounts of non-PyTorch memory (TensorRT/CUDA context). Subsequent reload attempts fail immediately with torch.OutOfMemoryError, and after 3 restarts the process guardian gives up.
First observed: 2026-03-20 ~18:12 UTC
Source: Grafana/Loki {container="live-video-to-video_streamdiffusion-sdxl_*"}
Affected streams: multiple (e.g. str_NiGmm2AkNBisfMQo, str_kM9rT9Sk69LsQU8o)
Error Chain
timestamp=2026-03-20 18:12:37 level=ERROR location=pipeline.py:142:_reload_pipeline
[update_params] Error reloading pipeline, falling back to previous params:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB.
GPU 0 has a total capacity of 23.53 GiB of which 14.12 MiB is free.
Including non-PyTorch memory, this process has 1.34 GiB memory in use.
Of the allocated memory 862.59 MiB is allocated by PyTorch, and 61.41 MiB is reserved by PyTorch but unallocated.
The fallback to the previous params then also fails:
RuntimeError: Failed to reload pipeline with previous params: CUDA out of memory.
Tried to allocate 98.00 MiB. ...only 14.12 MiB is free.
After 3 restarts:
timestamp=2026-03-20 18:13:12 level=ERROR location=process_guardian.py:334:_monitor_loop
Failed to stop streamer and restart process. Moving to ERROR state
Exception: Pipeline process max restarts reached (3)
Root Cause Analysis
The cleanup_gpu_memory call reports success but leaves 0.84 GB allocated / 0.90 GB cached in PyTorch:
GPU Memory after cleanup: 0.84GB allocated, 0.90GB cached
Meanwhile the GPU shows only 14.12 MiB free out of 23.53 GiB total, even though the OOM message attributes just 1.34 GiB to this process. That leaves ~22.7 GiB consumed by memory PyTorch does not account for (TensorRT engines, CUDA contexts, etc.), which cleanup_gpu_memory does not release.
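For future debugging, a minimal logging helper (standard torch.cuda APIs only) can print the PyTorch-managed view next to the whole-device view, so the non-PyTorch share shows up directly in the logs:

import torch

def log_gpu_memory(tag: str) -> None:
    # Memory tracked by the PyTorch caching allocator (what empty_cache() can return)
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    reserved_gb = torch.cuda.memory_reserved() / 1024**3
    # Whole-device view from the CUDA driver, including TensorRT engines and CUDA contexts
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    outside_torch_gb = (total_bytes - free_bytes) / 1024**3 - reserved_gb
    print(f"[{tag}] torch allocated={allocated_gb:.2f}GB reserved={reserved_gb:.2f}GB "
          f"device free={free_bytes / 1024**3:.2f}GB of {total_bytes / 1024**3:.2f}GB "
          f"(~{outside_torch_gb:.2f}GB held outside the PyTorch allocator)")

Calling this before and after cleanup_gpu_memory would make it obvious whether the missing VRAM sits inside or outside the PyTorch allocator.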
The crash occurs in wrapper.py:_load_model while moving text_encoder_2 to CUDA:
pipe.text_encoder_2 = pipe.text_encoder_2.to(device=self.device) # line 1179
The model uses acceleration='tensorrt' — TensorRT-compiled engines are likely not being destroyed/released as part of cleanup.
Stack Trace
File "/app/runner/src/runner/live/process/process.py", line 201, in process_loop
asyncio.run(self._run_pipeline_loops())
...
File "/app/runner/src/runner/live/process/process.py", line 244, in _run_pipeline_loops
pipeline = await self._initialize_pipeline()
...
File "/app/live/streamdiffusion/pipeline/pipeline.py", line 152, in _reload_pipeline
raise RuntimeError(f"Failed to reload pipeline with previous params: {e}") from e
RuntimeError: Failed to reload pipeline with previous params: CUDA out of memory. Tried to allocate 98.00 MiB...
Impact
- Streams fail permanently after param update, requiring manual container restart
- Affects multiple concurrent streams (at least two observed in a single 12-hour window)
- Pipeline enters ERROR state with no automatic recovery
Suggested Fix
- TensorRT engine cleanup: explicitly destroy the compiled TensorRT engines in cleanup_gpu_memory before reloading, and pop the pushed CUDA context via .cuda_context.pop() so it can be torn down (see the hard_cleanup sketch after this list).
- CUDA context reset: after the TensorRT teardown, run gc.collect() to drop lingering Python references, then torch.cuda.synchronize() and torch.cuda.empty_cache() to ensure all PyTorch-side resources are actually released.
- Allocator workaround: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as an environment variable to reduce fragmentation (already suggested in the error message).
- Pre-reload VRAM check: before each reload attempt, check free VRAM; if it is below the estimated model footprint, force a harder cleanup or a full process restart instead of retrying in place (see the pre-check sketch below).
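A rough sketch of the first two items combined into a harder cleanup path. The attribute names engine and cuda_context are assumptions about how the StreamDiffusion TensorRT wrapper stores its compiled engines and pushed context; adapt them to the actual wrapper:

import gc
import torch

def hard_cleanup(pipe) -> None:
    # Drop references to compiled TensorRT engines so their device memory can be freed.
    # NOTE: "engine" is an assumed attribute name on the accelerated submodules.
    for name in ("unet", "vae", "text_encoder", "text_encoder_2"):
        module = getattr(pipe, name, None)
        if module is not None and hasattr(module, "engine"):
            module.engine = None
    # If the wrapper pushed its own CUDA context (assumption), pop it so it can be torn down.
    ctx = getattr(pipe, "cuda_context", None)
    if ctx is not None:
        ctx.pop()
    # PyTorch-side cleanup: drop lingering Python references, wait for pending kernels,
    # then return cached blocks to the driver.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()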
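And a sketch of the pre-reload VRAM check; the threshold is a placeholder, not a measured footprint for SDXL + TensorRT:

import torch

# Placeholder estimate of the VRAM a full reload needs (assumption; replace with a measured value).
ESTIMATED_RELOAD_BYTES = 8 * 1024**3

def enough_vram_for_reload() -> bool:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes >= ESTIMATED_RELOAD_BYTES

If the check fails, escalating straight to a full process restart rather than retrying the in-place reload avoids burning through the three allowed restarts on attempts that cannot succeed.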
Related