
chore(installer): overhaul install scripts — portable paths, TRT verification, pin alignment #9

Closed
forkni wants to merge 43 commits into SDTD_031_dev from SDTD_v3-dev-Alex-synced

Conversation


@forkni forkni commented Apr 7, 2026

Summary

  • Portable batch scripts: Replaced all hardcoded D:\Users\alexk\... paths with %~dp0 in Install_StreamDiffusion.bat, Install_TensorRT.bat, and Start_StreamDiffusion.bat. Added prerequisite checks (Python 3.11, Git, optional cl.exe warning).
  • Version pin alignment: Synchronized all version pins across setup.py, sd_installer/installer.py, sd_installer/tensorrt.py, src/streamdiffusion/tools/install-tensorrt.py, and StreamDiffusionTD/install_tensorrt.py to match verified installed state after deps audit.
  • TensorRT verification: Added TENSORRT_CHECKS to verifier.py and wired it into cli.py so install-tensorrt automatically verifies TRT, cuDNN, and polygraphy after install.
  • Deps audit & safe updates: Applied 23 safe patch/minor package updates. 7 CVEs in 2 packages (onnx, protobuf) remain blocked by upstream constraints — documented as accepted risk.
  • New set_env.bat: Runtime GPU tuning vars (PYTORCH_CUDA_ALLOC_CONF, CUDA_MODULE_LOADING=LAZY, L2 persist settings).
  • Installer fixes: Pinned previously-unpinned python-osc to 1.10.2, fixed timm install syntax bug, fixed protobuf phase7 downgrade below setup.py minimum.

Key version changes

| Package | Old Pin | New Pin |
| --- | --- | --- |
| Pillow | >=12.1.1 | >=12.2.0 |
| onnxruntime-gpu | 1.24.3 | 1.24.4 |
| polygraphy | 0.49.24 | 0.49.26 |
| colored | 2.3.1 | 2.3.2 |
| pywin32 | 306 | 311 |
| peft | 0.17.1 | 0.18.1 |
| protobuf | 4.25.3 | 4.25.8 |
| timm | >=1.0.24 (fuzzy) | 1.0.26 (concrete) |
| python-osc | (unpinned) | 1.10.2 |

Test plan

  • Run py -3.11 -m sd_installer --base-folder "..." diagnose — all 14 checks pass (11 core + 3 TensorRT)
  • Run Install_StreamDiffusion.bat from a clean directory — verify portable paths resolve correctly
  • Run Install_TensorRT.bat — verify TRT install + verification step succeeds
  • Verify import tensorrt; import polygraphy; import onnxruntime all succeed in venv
  • Run /deps-audit — confirm no new unblocked vulnerabilities
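The pin verification in the test plan can also be scripted. A minimal sketch (the `check_pins` helper is hypothetical, not part of `sd_installer`; the pins shown come from the table above):

```python
import importlib.metadata as md

def check_pins(pins):
    """Return {package: (wanted, installed-or-None)} for every pin
    that is missing or does not match the installed version exactly."""
    problems = {}
    for pkg, want in pins.items():
        try:
            have = md.version(pkg)
        except md.PackageNotFoundError:
            have = None  # not installed at all
        if have != want:
            problems[pkg] = (want, have)
    return problems

# Concrete pins from the table above
PINS = {"onnxruntime-gpu": "1.24.4", "polygraphy": "0.49.26",
        "pywin32": "311", "python-osc": "1.10.2"}
```

Running `check_pins(PINS)` inside the venv returns an empty dict when every pin matches, which makes the check easy to wire into CI.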

🤖 Generated with Claude Code

dotsimulate and others added 30 commits March 29, 2026 16:33
- Fix external data detection in optimize_onnx to check .data/.onnx.data extensions (not just .pb)
- Handle torch.onnx.export creating external sidecar files with non-.pb names for >2GB SDXL models
- Normalize all external data to weights.pb for consistent downstream handling
- Add ByteSize check before single-file ONNX save to prevent silent >2GB serialization failure
- Add pre-build verification: check .opt.onnx exists and is non-empty before TRT engine build
- Tolerate Windows file-lock failures during post-build ONNX cleanup instead of crashing
- Add diagnostic logging for file sizes throughout export/optimize/build pipeline
Adds .clone() immediately after VAE decode in __call__ (img2img) and txt2img
inference paths. Prevents TRT VAE buffer being silently reused on the next
decode call when prev_image_result is read downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- utilities.py: clean allocate_buffers, simplified ONNX external data
  handling with ByteSize() check, simplified optimize_onnx with .pb
  extension detection
- postprocessing_orchestrator.py: preserve HEAD docstring for
  _should_use_sync_processing (correctly describes temporal coherence
  and feedback loop behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add resolve_ipadapter_paths() to ipadapter_module.py with a mapping
of known h94/IP-Adapter model/encoder paths keyed by (model_type,
IPAdapterType). Wire into wrapper.py:_load_model() after model
detection so both pre-TRT and post-TRT installation paths see the
resolved config.

- SD-Turbo (SD2.1, dim=1024) + sd15 adapter → auto-resolves to sd21
- SDXL-Turbo + sd15 adapter → auto-resolves to sdxl + sdxl encoder
- SD2.1 + plus/faceid → falls back to regular with warning
- Custom/local paths are never overridden
- Updated hardcoded "SD-Turbo is SD2.1-based" warning to generic msg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p-adapter_sd21.bin

The h94/IP-Adapter repo never released an SD2.1 adapter. The auto-resolution
logic was mapping SD2.1 to a non-existent HuggingFace path, causing a 404
that crashed the entire pipeline. Now gracefully disables IP-Adapter for
unsupported architectures and continues without it.

Changes:
- ipadapter_module.py: Set SD2.1 REGULAR map entry to None (file never existed)
- ipadapter_module.py: resolve_ipadapter_paths() sets cfg["enabled"]=False when
  no adapter exists for the detected architecture
- wrapper.py: Early guard skips install if auto-resolution disabled IP-Adapter
- wrapper.py: Generic except handler now gracefully skips instead of re-raising

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
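The "map entry is None, so disable instead of 404" pattern can be sketched as follows. This is an illustrative reduction, not the actual `ipadapter_module.py` code: the map keys, `resolve_ipadapter` name, and `cfg` shape are simplified stand-ins, though the two HuggingFace paths shown do exist in the h94/IP-Adapter repo:

```python
# Known adapter paths keyed by architecture; None marks "no adapter exists",
# so resolution disables IP-Adapter instead of requesting a 404 path.
ADAPTER_MAP = {
    ("sd15", "regular"): "h94/IP-Adapter/models/ip-adapter_sd15.bin",
    ("sdxl", "regular"): "h94/IP-Adapter/sdxl_models/ip-adapter_sdxl.bin",
    ("sd21", "regular"): None,  # the repo never shipped an SD2.1 adapter
}

def resolve_ipadapter(cfg, arch, kind="regular"):
    """Fill cfg['path'] for known architectures; never override custom paths."""
    if cfg.get("path"):             # custom/local path: leave untouched
        return cfg
    path = ADAPTER_MAP.get((arch, kind))
    if path is None:
        cfg["enabled"] = False      # graceful skip instead of a hard 404 crash
    else:
        cfg["path"] = path
    return cfg
```

The key design point is that an explicit `None` entry is distinct from a missing key: it records "we checked, and this combination has no published weights."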
In cuda-python 13.x, the 'cudart' module was moved to 'cuda.bindings.runtime'.
Add try/except import that prefers the new location and falls back to the
legacy 'cuda.cudart' path for cuda-python 12.x compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
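The prefer-new-location-with-fallback import can be generalized into a small helper. A sketch (the `import_first` name is hypothetical; the real fix is a plain try/except around the two `cuda` module paths):

```python
import importlib

def import_first(*names):
    """Import and return the first module name that resolves.
    Used to prefer cuda-python 13.x's 'cuda.bindings.runtime'
    with a fallback to the legacy 'cuda.cudart' for 12.x."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {names!r} could be imported")

# cudart = import_first("cuda.bindings.runtime", "cuda.cudart")
```

Keeping the candidate list ordered newest-first means the code silently tracks whichever cuda-python major version is installed.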
…tence

Pre-allocate latent and noise buffers to eliminate per-frame CUDA malloc:
- Replace prev_latent_result = x_0_pred_out.clone() with lazy-allocated
  _latent_cache buffer + copy_() in __call__, txt2img, and txt2img_sd_turbo
- Replace torch.randn_like() in TCD non-batched noise loop with lazy-allocated
  _noise_buf + .normal_() — eliminates per-step allocation on TCD path
- Both buffers allocate on first use (shape is fixed per pipeline instance)

Port cuda_l2_cache.py from CUDA 0.2.99 fork (PLAN_5 Feature 2):
- New file: src/streamdiffusion/tools/cuda_l2_cache.py
- Reserves GPU L2 cache for UNet attention weight tensors (mid_block, up_blocks.1)
- Gated by SDTD_L2_PERSIST=1 env var (default on), requires Ampere+ GPU
- Integrated at end of wrapper._load_model() with silent fallback on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
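The lazy-allocated buffer pattern above can be illustrated with a numpy stand-in (torch's `copy_()` corresponds to `np.copyto`; the `LatentCache` class name is hypothetical, and the real code stores the buffer as a pipeline attribute rather than a separate object):

```python
import numpy as np

class LatentCache:
    """Allocate once on first use, then reuse via in-place copy —
    a numpy analog of the lazy _latent_cache + copy_() pattern that
    replaces a fresh .clone() allocation every frame."""
    def __init__(self):
        self._buf = None

    def store(self, x):
        if self._buf is None:           # shape is fixed per pipeline instance
            self._buf = np.empty_like(x)
        np.copyto(self._buf, x)         # in-place; no per-frame allocation
        return self._buf
```

Because the shape never changes after `prepare()`, the one-time allocation is safe; the same idea backs the `_noise_buf` + `.normal_()` replacement for `torch.randn_like()`.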
…division

- encode_image / decode_image: replace hardcoded torch.float16 autocast with
  self.dtype so the pipeline correctly honors the torch_dtype constructor param
  (e.g. bfloat16 would still get fp16 VAE without this fix)

- scheduler_step_batch: upcast numerator and alpha_prod_t_sqrt to float32
  before the F_theta division, then cast back to original dtype. When
  alpha_prod_t_sqrt is small (early timesteps), fp16 division can accumulate
  rounding error; fp32 upcast eliminates this at negligible cost (~1-3us/call).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
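Why the fp32 upcast matters can be shown with a numpy toy: a chain of fp16 divide/multiply steps rounds twice per step, while fp32 intermediates round only once at the end. This is an illustration of the rounding behavior, not the actual `scheduler_step_batch` code:

```python
import numpy as np

def roundtrip_fp16(x, alpha, iters=25):
    """Divide then multiply by alpha repeatedly, rounding to fp16
    after every operation (two roundings per step)."""
    y = np.float16(x)
    a = np.float16(alpha)
    for _ in range(iters):
        y = np.float16(np.float16(y / a) * a)
    return float(y)

def roundtrip_upcast(x, alpha, iters=25):
    """Same chain, but with fp32 intermediates and a single
    final round back to fp16 — the pattern used in the fix."""
    y = np.float32(np.float16(x))    # same fp16 starting point
    a = np.float32(np.float16(alpha))
    for _ in range(iters):
        y = (y / a) * a              # fp32 keeps intermediates precise
    return float(np.float16(y))
```

With a small `alpha`, the fp16 chain can drift by an ulp or more while the upcast version returns to the original value; the cost is a couple of casts per call, consistent with the ~1-3us figure above.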
setup_l2_persistence() calls pin_hot_unet_weights(persist_mb=0) after
reserving L2. But pin_hot_unet_weights unconditionally called
reserve_l2_persisting_cache(0), which set the persisting L2 size to
0 bytes — undoing the first reservation entirely.

Fix: skip the Tier 1 reserve call in pin_hot_unet_weights when persist_mb=0,
since the caller (setup_l2_persistence) has already handled it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
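The interaction can be reduced to a few lines. A stubbed sketch in which a plain dict stands in for the CUDA device state and `reserve_l2_persisting_cache` stands in for the real driver call — the function names follow the commit, the bodies are illustrative:

```python
def reserve_l2_persisting_cache(nbytes, device):
    device["persisting_l2_bytes"] = nbytes    # stand-in for the CUDA call

def pin_hot_unet_weights(device, persist_mb):
    # persist_mb == 0 means the caller already reserved L2 (Tier 1);
    # unconditionally re-reserving 0 bytes here was the bug — it undid
    # the caller's reservation entirely.
    if persist_mb > 0:
        reserve_l2_persisting_cache(persist_mb * 2**20, device)
    # ... Tier 2 weight-pinning would follow here ...

def setup_l2_persistence(device, persist_mb=8):
    reserve_l2_persisting_cache(persist_mb * 2**20, device)   # Tier 1
    pin_hot_unet_weights(device, persist_mb=0)                # must not reset it
```

With the guard in place, the Tier 1 reservation made by `setup_l2_persistence` survives the nested call.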
…ge filter is active

The FPS counter was inflated because skipped frames (cached results from the
similar image filter) returned in ~1ms instead of ~30ms, but were still counted
as processed frames. This caused reported FPS to be ~2x actual GPU inference
rate (e.g., 60 FPS reported while GPU at 50% utilization).

Added `last_frame_was_skipped` flag to pipeline and `inference_fps` tracking
to td_manager. Status line now shows: "FPS: 28.3 (out: 57.1)" separating
real inference rate from output rate. OSC now sends inference FPS as the
primary metric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
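The dual-rate accounting can be sketched as a tiny meter (the `FpsMeter` class is hypothetical; the real change adds a `last_frame_was_skipped` flag to the pipeline and `inference_fps` tracking to td_manager):

```python
class FpsMeter:
    """Track inference FPS and output FPS separately: skipped frames
    (cached results from the similar-image filter) count toward output
    only, so ~1ms cache hits no longer inflate the inference rate."""
    def __init__(self):
        self.inference_frames = 0
        self.output_frames = 0

    def record(self, was_skipped):
        self.output_frames += 1
        if not was_skipped:
            self.inference_frames += 1

    def rates(self, elapsed_s):
        """Return (inference_fps, output_fps) over the elapsed window."""
        return (self.inference_frames / elapsed_s,
                self.output_frames / elapsed_s)
```

Formatting the pair as `f"FPS: {inf:.1f} (out: {out:.1f})"` reproduces the status-line style described above, with the inference rate as the primary metric.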
Strategy 1 — Text encoder CPU offload (~1.6 GB VRAM saved, no rebuild):
- Add _offload_text_encoders() / _reload_text_encoders() helpers on wrapper
- Offload CLIP-L + OpenCLIP-G to CPU after initial prepare() in TRT mode
- Reload on-demand before prompt re-encoding in prepare(), update_prompt(),
  update_stream_params(); always offload back via try/finally

Strategy 2 — max_batch_size 4→2 (requires engine rebuild, ~0.5-1.5 GB saved):
- Default max_batch_size 4→2 in StreamDiffusionWrapper.__init__ and _load_model
- Runtime trt_unet_batch_size=2 with cfg_type="self" + t_index_list=[12,29];
  max=4 was always wasted capacity in the TRT optimization profile
- Reduces KVO cache max dim2 from 4 to 2, shrinks TRT activation workspace

Note: cache_maxframes and max_cache_maxframes remain at 4 to preserve V2V
temporal coherence. Delete existing unet.engine to trigger rebuild.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
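The "reload on demand, always offload back via try/finally" discipline fits naturally into a context manager. A sketch — the `_reload_text_encoders` / `_offload_text_encoders` method names come from the commit, but the context-manager wrapper itself is hypothetical and the wrapper object is stubbed here:

```python
from contextlib import contextmanager

@contextmanager
def encoders_on_gpu(wrapper):
    """Reload text encoders for prompt re-encoding, then always
    offload back to CPU — even if encoding raises."""
    wrapper._reload_text_encoders()
    try:
        yield
    finally:
        wrapper._offload_text_encoders()
```

Call sites in `prepare()`, `update_prompt()`, and `update_stream_params()` would then become `with encoders_on_gpu(self): ...`, making it impossible to forget the offload on an error path.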
With cfg_type="full" and 2 denoising steps, trt_unet_batch_size=4.
With cfg_type="initialize", batch=3. max_batch=2 would crash both, so
only Strategy 1 (text encoder offloading) remains active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ngine build

- Bug 1: KVO cache batch dim mismatch (kvo_calib_batch=2 vs sample=4)
  Set kvo_calib_batch=effective_batch to match ONNX shared axis '2B'
- Bug 2: BuilderFlag.STRONGLY_TYPED removed in TRT 10.12
  Guard with hasattr() fallback
- Bug 3: Precision flags (FP8/FP16/TF32) incompatible with STRONGLY_TYPED
  Skip precision flags when STRONGLY_TYPED is network-level only
- Bug 4: ModelOpt override_shapes bakes static dims into FP8 ONNX
  Add _restore_dynamic_axes() to restore dim_param after quantization
- Fix IHostMemory.nbytes (no len()) in TRT 10.12 engine save logging
- Default disable_mha_qdq=True (MHA stays FP16, 17min vs 3hr+ build)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move build_stats.json write before cleanup to prevent accidental deletion.
Add two-pass cleanup with gc.collect() between passes to release Python-held
file handles that cause Windows lock failures. Delete onnx__ tensor files
immediately after repacking into weights.pb during ONNX export (~4 GB freed
before quantize stage starts). Adds actionable warning with manual cleanup
instructions when file locks persist.

Root cause: builder.py cleanup ran os.remove() once with silent except OSError,
leaving ~14.5 GB of intermediates (onnx_data, weights.pb, onnx__* tensors,
model weight dumps) when Windows file locks prevented deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
UNet2DConditionModelEngine has no named_parameters() — the previous code
crashed with AttributeError when TRT acceleration was enabled.

Two-path dispatch based on UNet type:
- PyTorch nn.Module: existing Tier 2 cudaStreamSetAttribute weight-pinning path
- TRT engine wrapper: new set_trt_persistent_cache() using
  IExecutionContext.persistent_cache_limit for activation caching in L2

TRT's persistent_cache_limit checks cudaLimitPersistingL2CacheSize at
assignment time (not context-creation time), so Tier 1 reservation must
precede the set call — which is the existing execution order.

Adds hasattr guard in pin_hot_unet_weights() so TRT engines short-circuit
cleanly without attempting named_parameters() iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…safe

Remove `direct_io_types=True` from ModelOpt quantize_kwargs — it caused
engine I/O tensors to be typed as FLOAT8E4M3FN, which crashes at runtime
because `trt.nptype()` has no numpy equivalent for FP8.

Remove `simplify=True` — always fails with protobuf >2GB parse error on
our external-data-format ONNX (graceful fallback, but wastes ~1 min).

Make `Engine.allocate_buffers` and `TensorRTEngine.allocate_buffers` FP8-
resilient: catch TypeError from `trt.nptype()` and fall back to
`torch.float8_e4m3fn` directly, bypassing the numpy intermediate.

FP8 ONNX must be regenerated (delete unet.engine.fp8.onnx* + unet.engine,
keep timing.cache). Entropy calibration and calibrate_per_node are retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
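The FP8-resilient buffer allocation reduces to a try/except dispatch. A generic sketch — `nptype` and the fallback table are passed in as stand-ins for `trt.nptype` and a `{trt.DataType.FP8: torch.float8_e4m3fn}` mapping, since neither library is imported here:

```python
def buffer_dtype(trt_dtype, nptype, fallbacks):
    """Map an engine tensor dtype to an allocatable dtype: prefer the
    numpy route via nptype(), but fall back to a framework-native dtype
    (e.g. torch.float8_e4m3fn for FP8) when nptype() raises TypeError
    because numpy has no equivalent type."""
    try:
        return nptype(trt_dtype)
    except TypeError:
        return fallbacks[trt_dtype]
```

The point is to keep the common path untouched and pay the fallback only for dtypes numpy cannot represent, rather than special-casing FP8 up front.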
Resolution is always known before inference and never changes, so all
four engine types (UNet, VAE encoder, VAE decoder, ControlNet) now
build with static spatial profiles (min=opt=max at the exact resolution).

Static shapes unlock:
- tiling_optimization_level (FAST/MODERATE/FULL by GPU tier) — was
  skipped for all dynamic builds with 'symbolic shape, l2tc doesn't
  take effect' warning
- l2_limit_for_tiling — now applied for full L2 cache budget
- Geometry-specific kernel selection instead of range-covering kernels
- Tighter CUDA graph buffer allocation (exact dims vs worst-case 1024²)
- Faster builds: single-point tactic search vs 4× spatial range

Key fixes:
- get_minmax_dims(): static_shape flag was dead code — hardcoded to
  always return 256-1024 range regardless of the flag
- UNet.get_input_profile(): separation logic (opt != min padding) now
  guarded behind `if not static_shape` — was incorrectly padding opt
  away from min for static engines where min==opt==max is correct
- ControlNetTRT.get_input_profile(): had its own hardcoded 384-1024
  range that bypassed get_minmax_dims() entirely; now respects
  static_shape flag
- ControlNet residual scaling: max(min+1,...) guard now bypassed for
  static shapes where min==max; exact dims used directly
- Engine paths: add --res-{H}x{W} suffix for static builds to prevent
  cache collisions between different resolutions

Dead code removal:
- build_all_tactics / enable_all_tactics parameter excised from entire
  call chain (wrapper → builder → utilities → Engine.build/_build_fp8)
  TRT 10.12 defaults already enable EDGE_MASK_CONVOLUTIONS +
  JIT_CONVOLUTIONS; CUBLAS/CUBLAS_LT/CUDNN all deprecated and disabled

Tactic tuning:
- avg_timing_iterations=4 added to _apply_gpu_profile_to_config()
  Default 1 produces noisy single-sample measurements; 4 iterations
  give stable tactic rankings with negligible extra build time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
INTER-NYC and others added 13 commits April 4, 2026 20:47
- pipeline.py: pre-compute _alpha_next/_beta_next/_init_noise_rotated in prepare()
- pipeline.py: pre-allocate _combined_latent_buf, _cfg_latent_buf/_cfg_t_buf, _unet_kwargs
- pipeline.py: in-place stock_noise[0:1].copy_() eliminates torch.concat malloc
- attention_processors.py: lazy-init per-layer _curr_key_buf/_curr_value_buf/_kv_out_buf
- stream_parameter_updater.py: keep _init_noise_rotated in sync on seed change
- unet_engine.py: cache dummy ControlNet zero tensors in _cached_dummy_controlnet_tensors
- td_manager.py: async GPU->CPU via pinned memory + CUDA event (eliminates 1-3ms sync stall)

Eliminates ~300+ per-frame CUDA allocations (SDXL 4-step), saves ~1.5-4ms/frame
Even with static spatial shapes (512x512), TRT's l2tc (L2 tiling cache
optimization) was still disabled for UNet because the batch dimension
remained dynamic (min=1, max=4). TRT checks that ALL dimensions are
concrete before enabling l2tc — a single symbolic dimension disables it
for the entire graph.

Fix: set build_static_batch=True for all three engine types (UNet, VAE
decoder, VAE encoder) and ControlNet. Since t_index_list is fixed and
cfg_type='self' (never 'full') is always used, the UNet batch is always
exactly len(t_index_list)=2 — never changes at runtime.

Also fix get_minmax_dims() static_batch path: was setting
min_batch = max(1, batch_size-1) which still created a range (1-2).
Now sets min_batch = max_batch = batch_size for a true single-point
profile that TRT treats as fully concrete.

With all dimensions concrete (batch + spatial), the next UNet build
should show tiling_optimization_level=MODERATE and l2_limit_for_tiling
applied without the '[l2tc] VALIDATE FAIL - symbolic shape' warning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
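The get_minmax_dims() fix amounts to collapsing the batch range to a single point. An illustrative sketch (the `batch_profile` name and the non-static defaults are hypothetical; only the static branch mirrors the described fix):

```python
def batch_profile(batch_size, static_batch, min_floor=1, max_cap=4):
    """Return (min, opt, max) batch dims for a TRT optimization profile.
    Static: a true single-point profile (min == opt == max) so every
    dimension is concrete and l2tc can engage. The old code used
    max(1, batch_size - 1) for min, which still left a symbolic
    range (1-2) that disabled l2tc for the whole graph."""
    if static_batch:
        return batch_size, batch_size, batch_size
    return min_floor, batch_size, max_cap
```

With t_index_list fixed and cfg_type='self', `batch_size` is always `len(t_index_list)`, so the single-point profile loses no runtime flexibility.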
…PROFILE_TRT

- Add TRTProfiler class (IProfiler impl): per-layer timing with start/end_run,
  get_summary() aggregating median times across last N runs
- Set profiling_verbosity=DETAILED in both FP16 and FP8 build paths so new
  engines embed layer names + tactic IDs for meaningful profiling output
- Engine.activate(): attach TRTProfiler + log when env var is set
- Engine.infer(): disable CUDA graphs when profiler is attached (IProfiler
  cannot report per-layer times through graph replay); wrap execution with
  start_run/end_run; sync stream before end_run to ensure all callbacks fired
- Engine.dump_profile(): log per-layer summary, no-op when profiler is None
- UNet2DConditionModelEngine, AutoencoderKLEngine, ControlNetModelEngine:
  add dump_profile() delegation to underlying Engine

Zero overhead in production (env var not set = no profiler created, CUDA graphs
work normally). Enable with: set STREAMDIFFUSION_PROFILE_TRT=1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hapes

Level 4 compiles dynamic kernels — unnecessary with fully static profiles
(build_dynamic_shape=False + build_static_batch=True) and triggers
tactic 0x3e9 'Assertion g.nodes.size() == 0' failures in TRT 10.12.
Level 3 heuristic selection produces equivalent results for static builds.
Level 5 still avoided (OOM during tactic profiling, 160 GiB requests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TRT 10.12 bug

Level 3 does not avoid the tactic 0x3e9 assertion errors — they appear at
all optimization levels. Reverting to 4 for better dynamic kernel selection.
Added comment documenting the benign TRT 10.12 bug.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t with use_cached_attn

CachedSTAttnProcessor2_0 unconditionally used .copy_() which produces
aten::copy during torch.onnx.export() tracing — no ONNX symbolic exists
for this op, crashing UNet export with use_cached_attn=True.

Added _use_prealloc=False flag (default):
- False: ONNX-safe .clone() / torch.stack() path used during tracing
- True: zero-alloc .copy_() path for non-TRT runtime (set externally)

For TRT builds processors don't run at inference time (engine handles
KV cache internally), so _use_prealloc=True is only relevant for
non-TRT acceleration paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_or_load_controlnet_engine() defaulted to opt_image_height/width=704,
causing a static shape mismatch at runtime when the pipeline runs at 512×512
(latent 64×64 vs expected 88×88). Pass self.height / self.width so the engine
is built at the actual inference resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ControlNet ran with use_cuda_graph=False despite the Engine.infer() and
allocate_buffers() infrastructure supporting graph capture. Since shapes
are fixed at runtime (same resolution every frame), enabling CUDA graphs
eliminates CPU kernel launch overhead per denoising step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erification

Batch scripts:
- Replace hardcoded D:\Users\alexk paths with %~dp0 (fully portable)
- Add prerequisite checks: Python 3.11, Git, cl.exe (warn-only)
- Install_TensorRT.bat: check venv exists before activation attempt
- Start_StreamDiffusion.bat: call set_env.bat if present; fix td_main.py path casing
- New set_env.bat: documents and sets PYTORCH_CUDA_ALLOC_CONF, CUDA_MODULE_LOADING,
  SDTD_L2_PERSIST for runtime GPU tuning

Version pin alignment (post deps-audit, 14/14 checks verified):
- setup.py: Pillow>=12.2.0, onnxruntime-gpu==1.24.4, polygraphy==0.49.26, colored==2.3.2
- StreamDiffusionTD/install_tensorrt.py: full TENSORRT_PINS dict, pywin32==311
- src/tools/install-tensorrt.py: polygraphy==0.49.26, pywin32==311
- StreamDiffusion-installer: see submodule commit 24a5693

Audit:
- audit_reports/2026-04-07-1122-audit-summary.md: 7 CVEs in 2 blocked packages
  (onnx: graphsurgeon blocks upgrade; protobuf: mediapipe blocks upgrade), 23 safe
  package updates applied

forkni commented Apr 7, 2026

Closing in favor of a targeted installer-only PR for clean merge

@forkni forkni closed this Apr 7, 2026
