streamdiffusion-sdxl: CUDA OOM on pipeline reload crashes worker after 3 restarts — GPU memory not fully freed between loads #723

@livepeer-tessa

Description

Summary

The streamdiffusion-sdxl pipeline on fal.ai workers enters a terminal ERROR state when pipeline parameters are updated. The cleanup_gpu_memory function fails to fully reclaim VRAM, leaving ~0.84 GB allocated plus large amounts of non-PyTorch memory (TensorRT/CUDA context). Subsequent reload attempts fail immediately with torch.OutOfMemoryError, and after 3 restarts the process guardian gives up.

First observed: 2026-03-20 ~18:12 UTC
Source: Grafana/Loki {container="live-video-to-video_streamdiffusion-sdxl_*"}
Affected streams: multiple (e.g. str_NiGmm2AkNBisfMQo, str_kM9rT9Sk69LsQU8o)

Error Chain

timestamp=2026-03-20 18:12:37 level=ERROR location=pipeline.py:142:_reload_pipeline
[update_params] Error reloading pipeline, falling back to previous params:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB.
GPU 0 has a total capacity of 23.53 GiB of which 14.12 MiB is free.
Including non-PyTorch memory, this process has 1.34 GiB memory in use.
Of the allocated memory 862.59 MiB is allocated by PyTorch, and 61.41 MiB is reserved by PyTorch but unallocated.

When the fallback also fails:

RuntimeError: Failed to reload pipeline with previous params: CUDA out of memory.
Tried to allocate 98.00 MiB. ...only 14.12 MiB is free.

After 3 restarts:

timestamp=2026-03-20 18:13:12 level=ERROR location=process_guardian.py:334:_monitor_loop
Failed to stop streamer and restart process. Moving to ERROR state
Exception: Pipeline process max restarts reached (3)

Root Cause Analysis

The cleanup_gpu_memory call reports success but leaves 0.84 GB allocated / 0.90 GB cached in PyTorch:

GPU Memory after cleanup: 0.84GB allocated, 0.90GB cached

Meanwhile the GPU reports only 14.12 MiB free of 23.53 GiB total, while PyTorch accounts for under 1 GiB. Roughly 22.7 GiB is therefore held outside PyTorch's allocator (TensorRT engines, CUDA contexts, etc.), none of which cleanup_gpu_memory releases.
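The gap is easy to surface in the logs by printing the driver-level numbers next to PyTorch's allocator stats. A minimal diagnostic sketch (report_gpu_memory is a hypothetical helper, not an existing function in the codebase):

import torch

def report_gpu_memory(device: int = 0) -> None:
    # Driver-level view: counts every process on the GPU, not just ours.
    free_b, total_b = torch.cuda.mem_get_info(device)
    # PyTorch's own accounting for this process.
    allocated_b = torch.cuda.memory_allocated(device)
    reserved_b = torch.cuda.memory_reserved(device)
    gib = 1024 ** 3
    non_torch_b = (total_b - free_b) - reserved_b
    print(
        f"GPU {device}: total={total_b / gib:.2f} GiB, free={free_b / gib:.2f} GiB, "
        f"torch allocated={allocated_b / gib:.2f} GiB, "
        f"torch reserved={reserved_b / gib:.2f} GiB, "
        f"non-PyTorch={non_torch_b / gib:.2f} GiB"
    )

Note that mem_get_info reflects the whole device, so the non-PyTorch figure also includes memory held by any other process on the GPU.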

The crash occurs in wrapper.py:_load_model while moving text_encoder_2 to CUDA:

pipe.text_encoder_2 = pipe.text_encoder_2.to(device=self.device)  # line 1179

The model runs with acceleration='tensorrt'; the compiled TensorRT engines are likely not being destroyed or released as part of cleanup.

Stack Trace

File "/app/runner/src/runner/live/process/process.py", line 201, in process_loop
  asyncio.run(self._run_pipeline_loops())
...
File "/app/runner/src/runner/live/process/process.py", line 244, in _run_pipeline_loops
  pipeline = await self._initialize_pipeline()
...
File "/app/live/streamdiffusion/pipeline/pipeline.py", line 152, in _reload_pipeline
  raise RuntimeError(f"Failed to reload pipeline with previous params: {e}") from e
RuntimeError: Failed to reload pipeline with previous params: CUDA out of memory. Tried to allocate 98.00 MiB...

Impact

  • Streams fail permanently after param update, requiring manual container restart
  • Affects multiple concurrent streams (at least two observed in a single 12-hour window)
  • Pipeline enters ERROR state with no automatic recovery

Suggested Fix

  1. TensorRT engine cleanup: Explicitly destroy TensorRT engines in cleanup_gpu_memory before reloading: pop the CUDA context (.cuda_context.pop()) and drop references to the compiled TensorRT modules (see the cleanup sketch after this list).
  2. CUDA context reset: After dropping those references, call gc.collect() first, then torch.cuda.synchronize() and torch.cuda.empty_cache(), so freed tensors are actually returned to the driver instead of staying in PyTorch's cache.
  3. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as an env-var workaround to reduce fragmentation (already suggested in the error message).
  4. Add a VRAM check before reload attempts: if free VRAM is below the estimated model footprint, force a harder cleanup or a restart instead of retrying (see the preflight sketch below).
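For fixes 1 and 2, a hardened cleanup_gpu_memory might look like the sketch below. This is a sketch under assumptions: the attribute names trt_context, trt_engine, and cuda_context are placeholders for whatever the wrapper actually stores, and it assumes the wrapper pushed its own PyCUDA context for TensorRT execution.

import gc
import torch

def cleanup_gpu_memory(pipe) -> None:
    # Drop TensorRT objects first; TensorRT frees engine device memory
    # once the last Python reference disappears. The attribute names
    # here are illustrative, not the wrapper's actual fields.
    for name in ("trt_context", "trt_engine"):
        if hasattr(pipe, name):
            setattr(pipe, name, None)

    # If the wrapper pushed a PyCUDA context for TensorRT, pop it.
    ctx = getattr(pipe, "cuda_context", None)
    if ctx is not None:
        ctx.pop()
        pipe.cuda_context = None

    # Collect the dropped references, then flush PyTorch's caching
    # allocator back to the driver. gc.collect() must come first, or
    # empty_cache() has nothing new to return.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()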
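For fix 4, a minimal preflight check could gate each reload attempt. ESTIMATED_MODEL_BYTES and has_room_for_reload are illustrative names, not existing code:

import torch

# Illustrative constant; in practice derive it from the engine and
# checkpoint sizes of the configured pipeline.
ESTIMATED_MODEL_BYTES = 8 * 1024 ** 3

def has_room_for_reload(device: int = 0) -> bool:
    # mem_get_info reflects the whole device, so this also guards
    # against memory held by other processes.
    free_b, _total_b = torch.cuda.mem_get_info(device)
    return free_b >= ESTIMATED_MODEL_BYTES

The reload path would call has_room_for_reload() before each attempt and, on failure, escalate to the harder cleanup above or a full process restart rather than spending one of the guardian's three restarts on a guaranteed OOM. For fix 3, note that PYTORCH_CUDA_ALLOC_CONF must be set in the container environment before the process initializes CUDA; setting it from Python after the first allocation has no effect.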
