chore(installer): overhaul install scripts — portable paths, TRT verification, pin alignment#9
Closed
forkni wants to merge 43 commits into SDTD_031_dev from …
Conversation
- Fix external data detection in optimize_onnx to check .data/.onnx.data extensions (not just .pb)
- Handle torch.onnx.export creating external sidecar files with non-.pb names for >2GB SDXL models
- Normalize all external data to weights.pb for consistent downstream handling
- Add ByteSize check before single-file ONNX save to prevent silent >2GB serialization failure
- Add pre-build verification: check .opt.onnx exists and is non-empty before TRT engine build
- Tolerate Windows file-lock failures during post-build ONNX cleanup instead of crashing
- Add diagnostic logging for file sizes throughout export/optimize/build pipeline
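The size check and sidecar detection described above can be sketched in isolation. This is a hypothetical stand-in, not the project's actual API: the constant, function names, and the "external:weights.pb" marker are illustrative, but the 2 GiB figure is the real protobuf serialized-message cap that forces ONNX external data.

```python
# Illustrative sketch (names are hypothetical, not the project's API):
# protobuf cannot serialize a message at or above 2 GiB, so models that
# large must use the ONNX external-data format.
ONNX_PROTOBUF_LIMIT = 2**31 - 1

# torch.onnx.export can emit sidecars with several extensions, not just .pb
EXTERNAL_DATA_EXTS = (".pb", ".data", ".onnx.data")

def save_strategy(model_byte_size: int) -> str:
    """Pick single-file vs external-data save from proto.ByteSize()."""
    if model_byte_size >= ONNX_PROTOBUF_LIMIT:
        # a single-file save would fail silently; normalize external
        # data to weights.pb for consistent downstream handling
        return "external:weights.pb"
    return "single-file"

def is_external_sidecar(filename: str) -> bool:
    """Detect sidecars by any known extension, not just .pb."""
    return filename.endswith(EXTERNAL_DATA_EXTS)
```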
Adds .clone() immediately after VAE decode in __call__ (img2img) and txt2img inference paths. Prevents the TRT VAE buffer from being silently reused on the next decode call when prev_image_result is read downstream.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- utilities.py: clean allocate_buffers; simplified ONNX external data handling with ByteSize() check; simplified optimize_onnx with .pb extension detection
- postprocessing_orchestrator.py: preserve HEAD docstring for _should_use_sync_processing (correctly describes temporal coherence and feedback loop behavior)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add resolve_ipadapter_paths() to ipadapter_module.py with a mapping of known h94/IP-Adapter model/encoder paths keyed by (model_type, IPAdapterType). Wire into wrapper.py:_load_model() after model detection so both pre-TRT and post-TRT installation paths see the resolved config.
- SD-Turbo (SD2.1, dim=1024) + sd15 adapter → auto-resolves to sd21
- SDXL-Turbo + sd15 adapter → auto-resolves to sdxl + sdxl encoder
- SD2.1 + plus/faceid → falls back to regular with warning
- Custom/local paths are never overridden
- Updated hardcoded "SD-Turbo is SD2.1-based" warning to generic msg
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p-adapter_sd21.bin

The h94/IP-Adapter repo never released an SD2.1 adapter. The auto-resolution logic was mapping SD2.1 to a non-existent HuggingFace path, causing a 404 that crashed the entire pipeline. Now gracefully disables IP-Adapter for unsupported architectures and continues without it.
Changes:
- ipadapter_module.py: set SD2.1 REGULAR map entry to None (file never existed)
- ipadapter_module.py: resolve_ipadapter_paths() sets cfg["enabled"]=False when no adapter exists for the detected architecture
- wrapper.py: early guard skips install if auto-resolution disabled IP-Adapter
- wrapper.py: generic except handler now gracefully skips instead of re-raising
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In cuda-python 13.x, the 'cudart' module was moved to 'cuda.bindings.runtime'. Add try/except import that prefers the new location and falls back to the legacy 'cuda.cudart' path for cuda-python 12.x compatibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
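The import pattern described above generalizes to any module that moved between releases. A minimal sketch of the try/except fallback, written as a generic helper so it can be demonstrated with stdlib modules (the function name is illustrative; the actual commit does the try/except inline):

```python
import importlib

def import_with_fallback(preferred: str, legacy: str):
    """Import the preferred (newer) module path, falling back to the
    legacy path when the new one does not exist."""
    try:
        return importlib.import_module(preferred)
    except ImportError:
        return importlib.import_module(legacy)

# For the cudart case (requires cuda-python installed) this would be:
#   cudart = import_with_fallback("cuda.bindings.runtime", "cuda.cudart")
```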
…tence

Pre-allocate latent and noise buffers to eliminate per-frame CUDA malloc:
- Replace prev_latent_result = x_0_pred_out.clone() with lazy-allocated _latent_cache buffer + copy_() in __call__, txt2img, and txt2img_sd_turbo
- Replace torch.randn_like() in TCD non-batched noise loop with lazy-allocated _noise_buf + .normal_() — eliminates per-step allocation on TCD path
- Both buffers allocate on first use (shape is fixed per pipeline instance)
Port cuda_l2_cache.py from CUDA 0.2.99 fork (PLAN_5 Feature 2):
- New file: src/streamdiffusion/tools/cuda_l2_cache.py
- Reserves GPU L2 cache for UNet attention weight tensors (mid_block, up_blocks.1)
- Gated by SDTD_L2_PERSIST=1 env var (default on), requires Ampere+ GPU
- Integrated at end of wrapper._load_model() with silent fallback on failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…division

- encode_image / decode_image: replace hardcoded torch.float16 autocast with self.dtype so the pipeline correctly honors the torch_dtype constructor param (e.g. bfloat16 would still get an fp16 VAE without this fix)
- scheduler_step_batch: upcast numerator and alpha_prod_t_sqrt to float32 before the F_theta division, then cast back to the original dtype. When alpha_prod_t_sqrt is small (early timesteps), fp16 division can accumulate rounding error; the fp32 upcast eliminates this at negligible cost (~1-3us/call).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setup_l2_persistence() calls pin_hot_unet_weights(persist_mb=0) after reserving L2. But pin_hot_unet_weights unconditionally called reserve_l2_persisting_cache(0), which set the persisting L2 size to 0 bytes — undoing the first reservation entirely. Fix: skip the Tier 1 reserve call in pin_hot_unet_weights when persist_mb=0, since the caller (setup_l2_persistence) has already handled it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge filter is active

The FPS counter was inflated because skipped frames (cached results from the similar image filter) returned in ~1ms instead of ~30ms, but were still counted as processed frames. This caused reported FPS to be ~2x the actual GPU inference rate (e.g., 60 FPS reported while the GPU sat at 50% utilization).
Added a `last_frame_was_skipped` flag to the pipeline and `inference_fps` tracking to td_manager. The status line now shows "FPS: 28.3 (out: 57.1)", separating real inference rate from output rate. OSC now sends inference FPS as the primary metric.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
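The split metric can be sketched as a small tracker. This is an illustrative class, not the td_manager implementation: output FPS counts every returned frame, while inference FPS counts only frames that actually ran the GPU.

```python
import time

class FpsTracker:
    """Hypothetical sketch: separate inference FPS from output FPS so
    cached (skipped) frames no longer inflate the reported rate."""
    def __init__(self):
        self.output_frames = 0
        self.inference_frames = 0
        self.start = time.perf_counter()

    def record(self, was_skipped: bool) -> None:
        self.output_frames += 1          # every frame handed downstream
        if not was_skipped:
            self.inference_frames += 1   # only real GPU inference passes

    def rates(self):
        elapsed = max(time.perf_counter() - self.start, 1e-9)
        return (self.inference_frames / elapsed,   # inference FPS
                self.output_frames / elapsed)      # output FPS
```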
Strategy 1 — Text encoder CPU offload (~1.6 GB VRAM saved, no rebuild):
- Add _offload_text_encoders() / _reload_text_encoders() helpers on wrapper
- Offload CLIP-L + OpenCLIP-G to CPU after initial prepare() in TRT mode
- Reload on-demand before prompt re-encoding in prepare(), update_prompt(), update_stream_params(); always offload back via try/finally
Strategy 2 — max_batch_size 4→2 (requires engine rebuild, ~0.5-1.5 GB saved):
- Default max_batch_size 4→2 in StreamDiffusionWrapper.__init__ and _load_model
- Runtime trt_unet_batch_size=2 with cfg_type="self" + t_index_list=[12,29]; max=4 was always wasted capacity in the TRT optimization profile
- Reduces KVO cache max dim2 from 4 to 2, shrinks TRT activation workspace
Note: cache_maxframes and max_cache_maxframes remain at 4 to preserve V2V temporal coherence. Delete the existing unet.engine to trigger a rebuild.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With cfg_type="full" and 2 denoising steps, trt_unet_batch_size=4; with cfg_type="initialize", batch=3. A max_batch of 2 would crash both configurations, so the max_batch_size reduction is reverted. Only Strategy 1 (text encoder offloading) remains active. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…() in pipeline hot path
…alibrationDataProvider
… cleanup intermediates on retry
…ngine build

- Bug 1: KVO cache batch dim mismatch (kvo_calib_batch=2 vs sample=4). Set kvo_calib_batch=effective_batch to match ONNX shared axis '2B'
- Bug 2: BuilderFlag.STRONGLY_TYPED removed in TRT 10.12. Guard with hasattr() fallback
- Bug 3: Precision flags (FP8/FP16/TF32) incompatible with STRONGLY_TYPED. Skip precision flags when STRONGLY_TYPED is network-level only
- Bug 4: ModelOpt override_shapes bakes static dims into FP8 ONNX. Add _restore_dynamic_axes() to restore dim_param after quantization
- Fix IHostMemory.nbytes (no len()) in TRT 10.12 engine save logging
- Default disable_mha_qdq=True (MHA stays FP16, 17min vs 3hr+ build)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move the build_stats.json write before cleanup to prevent accidental deletion. Add two-pass cleanup with gc.collect() between passes to release Python-held file handles that cause Windows lock failures. Delete onnx__ tensor files immediately after repacking into weights.pb during ONNX export (~4 GB freed before the quantize stage starts). Adds an actionable warning with manual cleanup instructions when file locks persist.
Root cause: builder.py cleanup ran os.remove() once with a silent except OSError, leaving ~14.5 GB of intermediates (onnx_data, weights.pb, onnx__* tensors, model weight dumps) when Windows file locks prevented deletion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
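The two-pass cleanup can be sketched as a small helper. This is a minimal reconstruction of the pattern described in the commit, with a hypothetical function name; the real builder.py version operates on its own intermediate-file list:

```python
import gc
import os
import tempfile

def two_pass_cleanup(paths):
    """Try deleting each path; on failure, gc.collect() to release
    Python-held file handles (a common source of Windows locks) and
    retry once. Returns the paths that still could not be removed."""
    survivors = []
    for _attempt in range(2):
        survivors = []
        for p in paths:
            try:
                if os.path.exists(p):
                    os.remove(p)
            except OSError:
                survivors.append(p)
        if not survivors:
            break
        gc.collect()        # drop lingering handles, then retry
        paths = survivors
    for p in survivors:
        # actionable warning instead of a silent except OSError
        print(f"WARNING: could not delete {p}; remove it manually")
    return survivors
```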
UNet2DConditionModelEngine has no named_parameters() — the previous code crashed with AttributeError when TRT acceleration was enabled.
Two-path dispatch based on UNet type:
- PyTorch nn.Module: existing Tier 2 cudaStreamSetAttribute weight-pinning path
- TRT engine wrapper: new set_trt_persistent_cache() using IExecutionContext.persistent_cache_limit for activation caching in L2
TRT's persistent_cache_limit checks cudaLimitPersistingL2CacheSize at assignment time (not context-creation time), so the Tier 1 reservation must precede the set call — which is the existing execution order.
Adds a hasattr guard in pin_hot_unet_weights() so TRT engines short-circuit cleanly without attempting named_parameters() iteration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ceeding hardware max
…safe

Remove direct_io_types=True from ModelOpt quantize_kwargs — it caused engine I/O tensors to be typed as FLOAT8E4M3FN, which crashes at runtime because trt.nptype() has no numpy equivalent for FP8.
Remove simplify=True — it always fails with a protobuf >2GB parse error on our external-data-format ONNX (graceful fallback, but wastes ~1 min).
Make Engine.allocate_buffers and TensorRTEngine.allocate_buffers FP8-resilient: catch TypeError from trt.nptype() and fall back to torch.float8_e4m3fn directly, bypassing the numpy intermediate.
FP8 ONNX must be regenerated (delete unet.engine.fp8.onnx* + unet.engine, keep timing.cache). Entropy calibration and calibrate_per_node are retained.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
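The FP8-resilient dtype lookup reduces to a try/except around the numpy conversion. In this sketch the nptype function is injected so it can be faked without TensorRT installed; the real code calls trt.nptype() directly and allocates the fallback buffer as torch.float8_e4m3fn:

```python
def resolve_buffer_dtype(trt_dtype, nptype):
    """Map a TRT tensor dtype to an allocation dtype, tolerating FP8
    types for which trt.nptype() raises TypeError (no numpy equivalent)."""
    try:
        return nptype(trt_dtype)            # normal numpy-backed path
    except TypeError:
        # bypass the numpy intermediate and allocate FP8 directly
        return "torch.float8_e4m3fn"

def fake_nptype(dt):
    """Stand-in for trt.nptype in this sketch only."""
    if dt == "FP8":
        raise TypeError("no numpy equivalent for FP8")
    return {"FP16": "float16", "FP32": "float32"}[dt]
```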
Resolution is always known before inference and never changes, so all
four engine types (UNet, VAE encoder, VAE decoder, ControlNet) now
build with static spatial profiles (min=opt=max at the exact resolution).
Static shapes unlock:
- tiling_optimization_level (FAST/MODERATE/FULL by GPU tier) — was
skipped for all dynamic builds with 'symbolic shape, l2tc doesn't
take effect' warning
- l2_limit_for_tiling — now applied for full L2 cache budget
- Geometry-specific kernel selection instead of range-covering kernels
- Tighter CUDA graph buffer allocation (exact dims vs worst-case 1024²)
- Faster builds: single-point tactic search vs 4× spatial range
Key fixes:
- get_minmax_dims(): static_shape flag was dead code — hardcoded to
always return 256-1024 range regardless of the flag
- UNet.get_input_profile(): separation logic (opt != min padding) now
guarded behind `if not static_shape` — was incorrectly padding opt
away from min for static engines where min==opt==max is correct
- ControlNetTRT.get_input_profile(): had its own hardcoded 384-1024
range that bypassed get_minmax_dims() entirely; now respects
static_shape flag
- ControlNet residual scaling: max(min+1,...) guard now bypassed for
static shapes where min==max; exact dims used directly
- Engine paths: add --res-{H}x{W} suffix for static builds to prevent
cache collisions between different resolutions
Dead code removal:
- build_all_tactics / enable_all_tactics parameter excised from entire
call chain (wrapper → builder → utilities → Engine.build/_build_fp8)
TRT 10.12 defaults already enable EDGE_MASK_CONVOLUTIONS +
JIT_CONVOLUTIONS; CUBLAS/CUBLAS_LT/CUDNN all deprecated and disabled
Tactic tuning:
- avg_timing_iterations=4 added to _apply_gpu_profile_to_config()
Default 1 produces noisy single-sample measurements; 4 iterations
give stable tactic rankings with negligible extra build time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
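The get_minmax_dims() fix above can be illustrated with a simplified signature. The real function takes more parameters and returns latent-space dims; this sketch only shows the behavioral change, with the 256-1024 defaults taken from the commit text:

```python
def get_minmax_dims(height, width, static_shape,
                    min_dim=256, max_dim=1024):
    """Spatial profile bounds (simplified sketch of the fixed logic).
    Previously the static_shape flag was dead code and the 256-1024
    range was always returned; now static builds collapse the profile
    to a single point at the exact resolution."""
    if static_shape:
        # min == opt == max: exact dims, no range
        return (height, height, width, width)
    return (min_dim, max_dim, min_dim, max_dim)
```

With min=opt=max, TRT can select geometry-specific kernels and skip the spatial range sweep during tactic search, which is what enables the build-time and tiling wins listed above.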
- pipeline.py: pre-compute _alpha_next/_beta_next/_init_noise_rotated in prepare()
- pipeline.py: pre-allocate _combined_latent_buf, _cfg_latent_buf/_cfg_t_buf, _unet_kwargs
- pipeline.py: in-place stock_noise[0:1].copy_() eliminates torch.concat malloc
- attention_processors.py: lazy-init per-layer _curr_key_buf/_curr_value_buf/_kv_out_buf
- stream_parameter_updater.py: keep _init_noise_rotated in sync on seed change
- unet_engine.py: cache dummy ControlNet zero tensors in _cached_dummy_controlnet_tensors
- td_manager.py: async GPU->CPU via pinned memory + CUDA event (eliminates 1-3ms sync stall)
Eliminates ~300+ per-frame CUDA allocations (SDXL 4-step), saves ~1.5-4ms/frame
Even with static spatial shapes (512x512), TRT's l2tc (L2 tiling cache optimization) was still disabled for UNet because the batch dimension remained dynamic (min=1, max=4). TRT checks that ALL dimensions are concrete before enabling l2tc — a single symbolic dimension disables it for the entire graph.
Fix: set build_static_batch=True for all three engine types (UNet, VAE decoder, VAE encoder) and ControlNet. Since t_index_list is fixed and cfg_type='self' (never 'full') is always used, the UNet batch is always exactly len(t_index_list)=2 — it never changes at runtime.
Also fix the get_minmax_dims() static_batch path: it was setting min_batch = max(1, batch_size-1), which still created a range (1-2). Now it sets min_batch = max_batch = batch_size for a true single-point profile that TRT treats as fully concrete.
With all dimensions concrete (batch + spatial), the next UNet build should show tiling_optimization_level=MODERATE and l2_limit_for_tiling applied without the '[l2tc] VALIDATE FAIL - symbolic shape' warning.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
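The batch-profile fix reduces to a single-point (min, opt, max) tuple. A sketch with a hypothetical signature, showing both the fixed static path and the old buggy behavior it replaces:

```python
def batch_profile(batch_size, max_batch, static_batch):
    """(min, opt, max) for the TRT optimization profile's batch dim
    (illustrative sketch, not the project's actual function)."""
    if static_batch:
        # true single-point profile: TRT treats the dim as concrete,
        # so l2tc no longer sees a symbolic shape.
        # The old bug used max(1, batch_size - 1) for min, which still
        # produced a range (e.g. 1-2) and kept the dim symbolic.
        return (batch_size, batch_size, batch_size)
    # dynamic path: a genuine range for variable-batch engines
    return (1, batch_size, max_batch)
```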
…PROFILE_TRT

- Add TRTProfiler class (IProfiler impl): per-layer timing with start/end_run, get_summary() aggregating median times across last N runs
- Set profiling_verbosity=DETAILED in both FP16 and FP8 build paths so new engines embed layer names + tactic IDs for meaningful profiling output
- Engine.activate(): attach TRTProfiler + log when the env var is set
- Engine.infer(): disable CUDA graphs when a profiler is attached (IProfiler cannot report per-layer times through graph replay); wrap execution with start_run/end_run; sync stream before end_run to ensure all callbacks fired
- Engine.dump_profile(): log per-layer summary, no-op when profiler is None
- UNet2DConditionModelEngine, AutoencoderKLEngine, ControlNetModelEngine: add dump_profile() delegation to the underlying Engine
Zero overhead in production (env var not set = no profiler created, CUDA graphs work normally). Enable with: set STREAMDIFFUSION_PROFILE_TRT=1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hapes

Level 4 compiles dynamic kernels — unnecessary with fully static profiles (build_dynamic_shape=False + build_static_batch=True) — and triggers tactic 0x3e9 'Assertion g.nodes.size() == 0' failures in TRT 10.12. Level 3 heuristic selection produces equivalent results for static builds. Level 5 is still avoided (OOM during tactic profiling, 160 GiB requests). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TRT 10.12 bug

Level 3 does not avoid the tactic 0x3e9 assertion errors — they appear at all optimization levels. Reverting to 4 for better dynamic kernel selection. Added a comment documenting the benign TRT 10.12 bug. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t with use_cached_attn

CachedSTAttnProcessor2_0 unconditionally used .copy_(), which produces aten::copy during torch.onnx.export() tracing — no ONNX symbolic exists for this op, crashing UNet export with use_cached_attn=True.
Added _use_prealloc=False flag (default):
- False: ONNX-safe .clone() / torch.stack() path used during tracing
- True: zero-alloc .copy_() path for non-TRT runtime (set externally)
For TRT builds the processors don't run at inference time (the engine handles the KV cache internally), so _use_prealloc=True is only relevant for non-TRT acceleration paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
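The two paths behave differently in a way that matters for tracing: the safe path allocates a fresh buffer each call (every op is traceable), while the prealloc path mutates one buffer in place (zero allocations, but emits aten::copy). A pure-Python stand-in using lists instead of tensors, with hypothetical names:

```python
class KVCache:
    """Sketch of the dual-path flag; lists stand in for tensors."""
    def __init__(self, use_prealloc=False):
        self._use_prealloc = use_prealloc  # False = ONNX-export-safe default
        self._buf = None

    def store(self, key):
        if not self._use_prealloc:
            # clone-style path: fresh allocation every call, so the
            # exporter traces a plain construction op
            self._buf = list(key)
        else:
            # copy_-style path: reuse one preallocated buffer in place
            if self._buf is None:
                self._buf = [0] * len(key)   # lazy first-use allocation
            self._buf[:] = key
        return self._buf
```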
get_or_load_controlnet_engine() defaulted to opt_image_height/width=704, causing a static shape mismatch at runtime when the pipeline runs at 512×512 (latent 64×64 vs expected 88×88). Pass self.height / self.width so the engine is built at the actual inference resolution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ControlNet ran with use_cuda_graph=False despite the Engine.infer() and allocate_buffers() infrastructure supporting graph capture. Since shapes are fixed at runtime (same resolution every frame), enabling CUDA graphs eliminates CPU kernel launch overhead per denoising step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erification

Batch scripts:
- Replace hardcoded D:\Users\alexk paths with %~dp0 (fully portable)
- Add prerequisite checks: Python 3.11, Git, cl.exe (warn-only)
- Install_TensorRT.bat: check venv exists before activation attempt
- Start_StreamDiffusion.bat: call set_env.bat if present; fix td_main.py path casing
- New set_env.bat: documents and sets PYTORCH_CUDA_ALLOC_CONF, CUDA_MODULE_LOADING, SDTD_L2_PERSIST for runtime GPU tuning
Version pin alignment (post deps-audit, 14/14 checks verified):
- setup.py: Pillow>=12.2.0, onnxruntime-gpu==1.24.4, polygraphy==0.49.26, colored==2.3.2
- StreamDiffusionTD/install_tensorrt.py: full TENSORRT_PINS dict, pywin32==311
- src/tools/install-tensorrt.py: polygraphy==0.49.26, pywin32==311
- StreamDiffusion-installer: see submodule commit 24a5693
Audit:
- audit_reports/2026-04-07-1122-audit-summary.md: 7 CVEs in 2 blocked packages (onnx: graphsurgeon blocks upgrade; protobuf: mediapipe blocks upgrade), 23 safe package updates applied
Author comment: Closing in favor of a targeted installer-only PR for clean merge
Summary
- Replaced hardcoded D:\Users\alexk\... paths with %~dp0 in Install_StreamDiffusion.bat, Install_TensorRT.bat, and Start_StreamDiffusion.bat. Added prerequisite checks (Python 3.11, Git, optional cl.exe warning).
- Aligned version pins across setup.py, sd_installer/installer.py, sd_installer/tensorrt.py, src/streamdiffusion/tools/install-tensorrt.py, and StreamDiffusionTD/install_tensorrt.py to match the verified installed state after the deps audit.
- Added TENSORRT_CHECKS to verifier.py and wired it into cli.py so install-tensorrt automatically verifies TRT, cuDNN, and polygraphy after install.
- New set_env.bat: runtime GPU tuning vars (PYTORCH_CUDA_ALLOC_CONF, CUDA_MODULE_LOADING=LAZY, L2 persist settings).
- Bumped python-osc to 1.10.2, fixed timm install syntax bug, fixed protobuf phase7 downgrade below the setup.py minimum.
Key version changes
Test plan
- py -3.11 -m sd_installer --base-folder "..." diagnose — all 14 checks pass (11 core + 3 TensorRT)
- Run Install_StreamDiffusion.bat from a clean directory — verify portable paths resolve correctly
- Run Install_TensorRT.bat — verify TRT install + verification step succeeds
- import tensorrt; import polygraphy; import onnxruntime all succeed in venv
- /deps-audit — confirm no new unblocked vulnerabilities
🤖 Generated with Claude Code