
chore(installer): overhaul install scripts — portable paths, TRT verification, pin alignment #9

Closed
forkni wants to merge 43 commits into SDTD_031_dev from SDTD_v3-dev-Alex-synced

Conversation


@forkni forkni commented Apr 7, 2026

Summary

  • Portable batch scripts: Replaced all hardcoded D:\Users\alexk\... paths with %~dp0 in Install_StreamDiffusion.bat, Install_TensorRT.bat, and Start_StreamDiffusion.bat. Added prerequisite checks (Python 3.11, Git, optional cl.exe warning).
  • Version pin alignment: Synchronized all version pins across setup.py, sd_installer/installer.py, sd_installer/tensorrt.py, src/streamdiffusion/tools/install-tensorrt.py, and StreamDiffusionTD/install_tensorrt.py to match verified installed state after deps audit.
  • TensorRT verification: Added TENSORRT_CHECKS to verifier.py and wired it into cli.py so install-tensorrt automatically verifies TRT, cuDNN, and polygraphy after install.
  • Deps audit & safe updates: Applied 23 safe patch/minor package updates. 7 CVEs in 2 packages (onnx, protobuf) remain blocked by upstream constraints — documented as accepted risk.
  • New set_env.bat: Runtime GPU tuning vars (PYTORCH_CUDA_ALLOC_CONF, CUDA_MODULE_LOADING=LAZY, L2 persist settings).
  • Installer fixes: Pinned previously-unpinned python-osc to 1.10.2, fixed timm install syntax bug, fixed protobuf phase7 downgrade below setup.py minimum.

Key version changes

| Package | Old Pin | New Pin |
| --- | --- | --- |
| Pillow | >=12.1.1 | >=12.2.0 |
| onnxruntime-gpu | 1.24.3 | 1.24.4 |
| polygraphy | 0.49.24 | 0.49.26 |
| colored | 2.3.1 | 2.3.2 |
| pywin32 | 306 | 311 |
| peft | 0.17.1 | 0.18.1 |
| protobuf | 4.25.3 | 4.25.8 |
| timm | >=1.0.24 (fuzzy) | 1.0.26 (concrete) |
| python-osc | (unpinned) | 1.10.2 |

Test plan

  • Run py -3.11 -m sd_installer --base-folder "..." diagnose — all 14 checks pass (11 core + 3 TensorRT)
  • Run Install_StreamDiffusion.bat from a clean directory — verify portable paths resolve correctly
  • Run Install_TensorRT.bat — verify TRT install + verification step succeeds
  • Verify import tensorrt; import polygraphy; import onnxruntime all succeed in venv
  • Run /deps-audit — confirm no new unblocked vulnerabilities
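The pin verification in the test plan can also be scripted. A minimal sketch (the `check_pins` helper is hypothetical, not part of `sd_installer`; the pins shown come from the table above):

```python
import importlib.metadata as md

def check_pins(pins):
    """Return {package: (wanted, installed-or-None)} for every pin
    that is missing or does not match the installed version exactly."""
    problems = {}
    for pkg, want in pins.items():
        try:
            have = md.version(pkg)
        except md.PackageNotFoundError:
            have = None  # not installed at all
        if have != want:
            problems[pkg] = (want, have)
    return problems

# Concrete pins from the table above
PINS = {"onnxruntime-gpu": "1.24.4", "polygraphy": "0.49.26",
        "pywin32": "311", "python-osc": "1.10.2"}
```

Running `check_pins(PINS)` inside the venv returns an empty dict when every pin matches, which makes the check easy to wire into CI.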

🤖 Generated with Claude Code

dotsimulate and others added 30 commits March 29, 2026 16:33
- Fix external data detection in optimize_onnx to check .data/.onnx.data extensions (not just .pb)
- Handle torch.onnx.export creating external sidecar files with non-.pb names for >2GB SDXL models
- Normalize all external data to weights.pb for consistent downstream handling
- Add ByteSize check before single-file ONNX save to prevent silent >2GB serialization failure
- Add pre-build verification: check .opt.onnx exists and is non-empty before TRT engine build
- Tolerate Windows file-lock failures during post-build ONNX cleanup instead of crashing
- Add diagnostic logging for file sizes throughout export/optimize/build pipeline
Adds .clone() immediately after VAE decode in __call__ (img2img) and txt2img
inference paths. Prevents TRT VAE buffer being silently reused on the next
decode call when prev_image_result is read downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- utilities.py: clean allocate_buffers, simplified ONNX external data
  handling with ByteSize() check, simplified optimize_onnx with .pb
  extension detection
- postprocessing_orchestrator.py: preserve HEAD docstring for
  _should_use_sync_processing (correctly describes temporal coherence
  and feedback loop behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add resolve_ipadapter_paths() to ipadapter_module.py with a mapping
of known h94/IP-Adapter model/encoder paths keyed by (model_type,
IPAdapterType). Wire into wrapper.py:_load_model() after model
detection so both pre-TRT and post-TRT installation paths see the
resolved config.

- SD-Turbo (SD2.1, dim=1024) + sd15 adapter → auto-resolves to sd21
- SDXL-Turbo + sd15 adapter → auto-resolves to sdxl + sdxl encoder
- SD2.1 + plus/faceid → falls back to regular with warning
- Custom/local paths are never overridden
- Updated hardcoded "SD-Turbo is SD2.1-based" warning to generic msg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p-adapter_sd21.bin

The h94/IP-Adapter repo never released an SD2.1 adapter. The auto-resolution
logic was mapping SD2.1 to a non-existent HuggingFace path, causing a 404
that crashed the entire pipeline. Now gracefully disables IP-Adapter for
unsupported architectures and continues without it.

Changes:
- ipadapter_module.py: Set SD2.1 REGULAR map entry to None (file never existed)
- ipadapter_module.py: resolve_ipadapter_paths() sets cfg["enabled"]=False when
  no adapter exists for the detected architecture
- wrapper.py: Early guard skips install if auto-resolution disabled IP-Adapter
- wrapper.py: Generic except handler now gracefully skips instead of re-raising

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
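The "map entry is None, so disable instead of 404" pattern can be sketched as follows. This is an illustrative reduction, not the actual `ipadapter_module.py` code: the map keys, `resolve_ipadapter` name, and `cfg` shape are simplified stand-ins, though the two HuggingFace paths shown do exist in the h94/IP-Adapter repo:

```python
# Known adapter paths keyed by architecture; None marks "no adapter exists",
# so resolution disables IP-Adapter instead of requesting a 404 path.
ADAPTER_MAP = {
    ("sd15", "regular"): "h94/IP-Adapter/models/ip-adapter_sd15.bin",
    ("sdxl", "regular"): "h94/IP-Adapter/sdxl_models/ip-adapter_sdxl.bin",
    ("sd21", "regular"): None,  # the repo never shipped an SD2.1 adapter
}

def resolve_ipadapter(cfg, arch, kind="regular"):
    """Fill cfg['path'] for known architectures; never override custom paths."""
    if cfg.get("path"):             # custom/local path: leave untouched
        return cfg
    path = ADAPTER_MAP.get((arch, kind))
    if path is None:
        cfg["enabled"] = False      # graceful skip instead of a hard 404 crash
    else:
        cfg["path"] = path
    return cfg
```

The key design point is that an explicit `None` entry is distinct from a missing key: it records "we checked, and this combination has no published weights."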
In cuda-python 13.x, the 'cudart' module was moved to 'cuda.bindings.runtime'.
Add try/except import that prefers the new location and falls back to the
legacy 'cuda.cudart' path for cuda-python 12.x compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
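The prefer-new-location-with-fallback import can be generalized into a small helper. A sketch (the `import_first` name is hypothetical; the real fix is a plain try/except around the two `cuda` module paths):

```python
import importlib

def import_first(*names):
    """Import and return the first module name that resolves.
    Used to prefer cuda-python 13.x's 'cuda.bindings.runtime'
    with a fallback to the legacy 'cuda.cudart' for 12.x."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {names!r} could be imported")

# cudart = import_first("cuda.bindings.runtime", "cuda.cudart")
```

Keeping the candidate list ordered newest-first means the code silently tracks whichever cuda-python major version is installed.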
…tence

Pre-allocate latent and noise buffers to eliminate per-frame CUDA malloc:
- Replace prev_latent_result = x_0_pred_out.clone() with lazy-allocated
  _latent_cache buffer + copy_() in __call__, txt2img, and txt2img_sd_turbo
- Replace torch.randn_like() in TCD non-batched noise loop with lazy-allocated
  _noise_buf + .normal_() — eliminates per-step allocation on TCD path
- Both buffers allocate on first use (shape is fixed per pipeline instance)

Port cuda_l2_cache.py from CUDA 0.2.99 fork (PLAN_5 Feature 2):
- New file: src/streamdiffusion/tools/cuda_l2_cache.py
- Reserves GPU L2 cache for UNet attention weight tensors (mid_block, up_blocks.1)
- Gated by SDTD_L2_PERSIST=1 env var (default on), requires Ampere+ GPU
- Integrated at end of wrapper._load_model() with silent fallback on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
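The lazy-allocated buffer pattern above can be illustrated with a numpy stand-in (torch's `copy_()` corresponds to `np.copyto`; the `LatentCache` class name is hypothetical, and the real code stores the buffer as a pipeline attribute rather than a separate object):

```python
import numpy as np

class LatentCache:
    """Allocate once on first use, then reuse via in-place copy —
    a numpy analog of the lazy _latent_cache + copy_() pattern that
    replaces a fresh .clone() allocation every frame."""
    def __init__(self):
        self._buf = None

    def store(self, x):
        if self._buf is None:           # shape is fixed per pipeline instance
            self._buf = np.empty_like(x)
        np.copyto(self._buf, x)         # in-place; no per-frame allocation
        return self._buf
```

Because the shape never changes after `prepare()`, the one-time allocation is safe; the same idea backs the `_noise_buf` + `.normal_()` replacement for `torch.randn_like()`.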
…division

- encode_image / decode_image: replace hardcoded torch.float16 autocast with
  self.dtype so the pipeline correctly honors the torch_dtype constructor param
  (e.g. bfloat16 would still get fp16 VAE without this fix)

- scheduler_step_batch: upcast numerator and alpha_prod_t_sqrt to float32
  before the F_theta division, then cast back to original dtype. When
  alpha_prod_t_sqrt is small (early timesteps), fp16 division can accumulate
  rounding error; fp32 upcast eliminates this at negligible cost (~1-3us/call).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
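Why the fp32 upcast matters can be shown with a numpy toy: a chain of fp16 divide/multiply steps rounds twice per step, while fp32 intermediates round only once at the end. This is an illustration of the rounding behavior, not the actual `scheduler_step_batch` code:

```python
import numpy as np

def roundtrip_fp16(x, alpha, iters=25):
    """Divide then multiply by alpha repeatedly, rounding to fp16
    after every operation (two roundings per step)."""
    y = np.float16(x)
    a = np.float16(alpha)
    for _ in range(iters):
        y = np.float16(np.float16(y / a) * a)
    return float(y)

def roundtrip_upcast(x, alpha, iters=25):
    """Same chain, but with fp32 intermediates and a single
    final round back to fp16 — the pattern used in the fix."""
    y = np.float32(np.float16(x))    # same fp16 starting point
    a = np.float32(np.float16(alpha))
    for _ in range(iters):
        y = (y / a) * a              # fp32 keeps intermediates precise
    return float(np.float16(y))
```

With a small `alpha`, the fp16 chain can drift by an ulp or more while the upcast version returns to the original value; the cost is a couple of casts per call, consistent with the ~1-3us figure above.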
setup_l2_persistence() calls pin_hot_unet_weights(persist_mb=0) after
reserving L2. But pin_hot_unet_weights unconditionally called
reserve_l2_persisting_cache(0), which set the persisting L2 size to
0 bytes — undoing the first reservation entirely.

Fix: skip the Tier 1 reserve call in pin_hot_unet_weights when persist_mb=0,
since the caller (setup_l2_persistence) has already handled it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
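The interaction can be reduced to a few lines. A stubbed sketch in which a plain dict stands in for the CUDA device state and `reserve_l2_persisting_cache` stands in for the real driver call — the function names follow the commit, the bodies are illustrative:

```python
def reserve_l2_persisting_cache(nbytes, device):
    device["persisting_l2_bytes"] = nbytes    # stand-in for the CUDA call

def pin_hot_unet_weights(device, persist_mb):
    # persist_mb == 0 means the caller already reserved L2 (Tier 1);
    # unconditionally re-reserving 0 bytes here was the bug — it undid
    # the caller's reservation entirely.
    if persist_mb > 0:
        reserve_l2_persisting_cache(persist_mb * 2**20, device)
    # ... Tier 2 weight-pinning would follow here ...

def setup_l2_persistence(device, persist_mb=8):
    reserve_l2_persisting_cache(persist_mb * 2**20, device)   # Tier 1
    pin_hot_unet_weights(device, persist_mb=0)                # must not reset it
```

With the guard in place, the Tier 1 reservation made by `setup_l2_persistence` survives the nested call.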
…ge filter is active

The FPS counter was inflated because skipped frames (cached results from the
similar image filter) returned in ~1ms instead of ~30ms, but were still counted
as processed frames. This caused reported FPS to be ~2x actual GPU inference
rate (e.g., 60 FPS reported while GPU at 50% utilization).

Added `last_frame_was_skipped` flag to pipeline and `inference_fps` tracking
to td_manager. Status line now shows: "FPS: 28.3 (out: 57.1)" separating
real inference rate from output rate. OSC now sends inference FPS as the
primary metric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
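The dual-rate accounting can be sketched as a tiny meter (the `FpsMeter` class is hypothetical; the real change adds a `last_frame_was_skipped` flag to the pipeline and `inference_fps` tracking to td_manager):

```python
class FpsMeter:
    """Track inference FPS and output FPS separately: skipped frames
    (cached results from the similar-image filter) count toward output
    only, so ~1ms cache hits no longer inflate the inference rate."""
    def __init__(self):
        self.inference_frames = 0
        self.output_frames = 0

    def record(self, was_skipped):
        self.output_frames += 1
        if not was_skipped:
            self.inference_frames += 1

    def rates(self, elapsed_s):
        """Return (inference_fps, output_fps) over the elapsed window."""
        return (self.inference_frames / elapsed_s,
                self.output_frames / elapsed_s)
```

Formatting the pair as `f"FPS: {inf:.1f} (out: {out:.1f})"` reproduces the status-line style described above, with the inference rate as the primary metric.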
Strategy 1 — Text encoder CPU offload (~1.6 GB VRAM saved, no rebuild):
- Add _offload_text_encoders() / _reload_text_encoders() helpers on wrapper
- Offload CLIP-L + OpenCLIP-G to CPU after initial prepare() in TRT mode
- Reload on-demand before prompt re-encoding in prepare(), update_prompt(),
  update_stream_params(); always offload back via try/finally

Strategy 2 — max_batch_size 4→2 (requires engine rebuild, ~0.5-1.5 GB saved):
- Default max_batch_size 4→2 in StreamDiffusionWrapper.__init__ and _load_model
- Runtime trt_unet_batch_size=2 with cfg_type="self" + t_index_list=[12,29];
  max=4 was always wasted capacity in the TRT optimization profile
- Reduces KVO cache max dim2 from 4 to 2, shrinks TRT activation workspace

Note: cache_maxframes and max_cache_maxframes remain at 4 to preserve V2V
temporal coherence. Delete existing unet.engine to trigger rebuild.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
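The "reload on demand, always offload back via try/finally" discipline fits naturally into a context manager. A sketch — the `_reload_text_encoders` / `_offload_text_encoders` method names come from the commit, but the context-manager wrapper itself is hypothetical and the wrapper object is stubbed here:

```python
from contextlib import contextmanager

@contextmanager
def encoders_on_gpu(wrapper):
    """Reload text encoders for prompt re-encoding, then always
    offload back to CPU — even if encoding raises."""
    wrapper._reload_text_encoders()
    try:
        yield
    finally:
        wrapper._offload_text_encoders()
```

Call sites in `prepare()`, `update_prompt()`, and `update_stream_params()` would then become `with encoders_on_gpu(self): ...`, making it impossible to forget the offload on an error path.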
With cfg_type="full" and 2 denoising steps, trt_unet_batch_size=4.
With cfg_type="initialize", batch=3. max_batch=2 would crash both, so
only Strategy 1 (text encoder offloading) remains active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ngine build

- Bug 1: KVO cache batch dim mismatch (kvo_calib_batch=2 vs sample=4)
  Set kvo_calib_batch=effective_batch to match ONNX shared axis '2B'
- Bug 2: BuilderFlag.STRONGLY_TYPED removed in TRT 10.12
  Guard with hasattr() fallback
- Bug 3: Precision flags (FP8/FP16/TF32) incompatible with STRONGLY_TYPED
  Skip precision flags when STRONGLY_TYPED is network-level only
- Bug 4: ModelOpt override_shapes bakes static dims into FP8 ONNX
  Add _restore_dynamic_axes() to restore dim_param after quantization
- Fix IHostMemory.nbytes (no len()) in TRT 10.12 engine save logging
- Default disable_mha_qdq=True (MHA stays FP16, 17min vs 3hr+ build)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move build_stats.json write before cleanup to prevent accidental deletion.
Add two-pass cleanup with gc.collect() between passes to release Python-held
file handles that cause Windows lock failures. Delete onnx__ tensor files
immediately after repacking into weights.pb during ONNX export (~4 GB freed
before quantize stage starts). Adds actionable warning with manual cleanup
instructions when file locks persist.

Root cause: builder.py cleanup ran os.remove() once with silent except OSError,
leaving ~14.5 GB of intermediates (onnx_data, weights.pb, onnx__* tensors,
model weight dumps) when Windows file locks prevented deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
UNet2DConditionModelEngine has no named_parameters() — the previous code
crashed with AttributeError when TRT acceleration was enabled.

Two-path dispatch based on UNet type:
- PyTorch nn.Module: existing Tier 2 cudaStreamSetAttribute weight-pinning path
- TRT engine wrapper: new set_trt_persistent_cache() using
  IExecutionContext.persistent_cache_limit for activation caching in L2

TRT's persistent_cache_limit checks cudaLimitPersistingL2CacheSize at
assignment time (not context-creation time), so Tier 1 reservation must
precede the set call — which is the existing execution order.

Adds hasattr guard in pin_hot_unet_weights() so TRT engines short-circuit
cleanly without attempting named_parameters() iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…safe

Remove `direct_io_types=True` from ModelOpt quantize_kwargs — it caused
engine I/O tensors to be typed as FLOAT8E4M3FN, which crashes at runtime
because `trt.nptype()` has no numpy equivalent for FP8.

Remove `simplify=True` — always fails with protobuf >2GB parse error on
our external-data-format ONNX (graceful fallback, but wastes ~1 min).

Make `Engine.allocate_buffers` and `TensorRTEngine.allocate_buffers` FP8-
resilient: catch TypeError from `trt.nptype()` and fall back to
`torch.float8_e4m3fn` directly, bypassing the numpy intermediate.

FP8 ONNX must be regenerated (delete unet.engine.fp8.onnx* + unet.engine,
keep timing.cache). Entropy calibration and calibrate_per_node are retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
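The FP8-resilient buffer allocation reduces to a try/except dispatch. A generic sketch — `nptype` and the fallback table are passed in as stand-ins for `trt.nptype` and a `{trt.DataType.FP8: torch.float8_e4m3fn}` mapping, since neither library is imported here:

```python
def buffer_dtype(trt_dtype, nptype, fallbacks):
    """Map an engine tensor dtype to an allocatable dtype: prefer the
    numpy route via nptype(), but fall back to a framework-native dtype
    (e.g. torch.float8_e4m3fn for FP8) when nptype() raises TypeError
    because numpy has no equivalent type."""
    try:
        return nptype(trt_dtype)
    except TypeError:
        return fallbacks[trt_dtype]
```

The point is to keep the common path untouched and pay the fallback only for dtypes numpy cannot represent, rather than special-casing FP8 up front.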
Resolution is always known before inference and never changes, so all
four engine types (UNet, VAE encoder, VAE decoder, ControlNet) now
build with static spatial profiles (min=opt=max at the exact resolution).

Static shapes unlock:
- tiling_optimization_level (FAST/MODERATE/FULL by GPU tier) — was
  skipped for all dynamic builds with 'symbolic shape, l2tc doesn't
  take effect' warning
- l2_limit_for_tiling — now applied for full L2 cache budget
- Geometry-specific kernel selection instead of range-covering kernels
- Tighter CUDA graph buffer allocation (exact dims vs worst-case 1024²)
- Faster builds: single-point tactic search vs 4× spatial range

Key fixes:
- get_minmax_dims(): static_shape flag was dead code — hardcoded to
  always return 256-1024 range regardless of the flag
- UNet.get_input_profile(): separation logic (opt != min padding) now
  guarded behind `if not static_shape` — was incorrectly padding opt
  away from min for static engines where min==opt==max is correct
- ControlNetTRT.get_input_profile(): had its own hardcoded 384-1024
  range that bypassed get_minmax_dims() entirely; now respects
  static_shape flag
- ControlNet residual scaling: max(min+1,...) guard now bypassed for
  static shapes where min==max; exact dims used directly
- Engine paths: add --res-{H}x{W} suffix for static builds to prevent
  cache collisions between different resolutions

Dead code removal:
- build_all_tactics / enable_all_tactics parameter excised from entire
  call chain (wrapper → builder → utilities → Engine.build/_build_fp8)
  TRT 10.12 defaults already enable EDGE_MASK_CONVOLUTIONS +
  JIT_CONVOLUTIONS; CUBLAS/CUBLAS_LT/CUDNN all deprecated and disabled

Tactic tuning:
- avg_timing_iterations=4 added to _apply_gpu_profile_to_config()
  Default 1 produces noisy single-sample measurements; 4 iterations
  give stable tactic rankings with negligible extra build time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
INTER-NYC and others added 13 commits April 4, 2026 20:47
- pipeline.py: pre-compute _alpha_next/_beta_next/_init_noise_rotated in prepare()
- pipeline.py: pre-allocate _combined_latent_buf, _cfg_latent_buf/_cfg_t_buf, _unet_kwargs
- pipeline.py: in-place stock_noise[0:1].copy_() eliminates torch.concat malloc
- attention_processors.py: lazy-init per-layer _curr_key_buf/_curr_value_buf/_kv_out_buf
- stream_parameter_updater.py: keep _init_noise_rotated in sync on seed change
- unet_engine.py: cache dummy ControlNet zero tensors in _cached_dummy_controlnet_tensors
- td_manager.py: async GPU->CPU via pinned memory + CUDA event (eliminates 1-3ms sync stall)

Eliminates ~300+ per-frame CUDA allocations (SDXL 4-step), saves ~1.5-4ms/frame
Even with static spatial shapes (512x512), TRT's l2tc (L2 tiling cache
optimization) was still disabled for UNet because the batch dimension
remained dynamic (min=1, max=4). TRT checks that ALL dimensions are
concrete before enabling l2tc — a single symbolic dimension disables it
for the entire graph.

Fix: set build_static_batch=True for all three engine types (UNet, VAE
decoder, VAE encoder) and ControlNet. Since t_index_list is fixed and
cfg_type='self' (never 'full') is always used, the UNet batch is always
exactly len(t_index_list)=2 — never changes at runtime.

Also fix get_minmax_dims() static_batch path: was setting
min_batch = max(1, batch_size-1) which still created a range (1-2).
Now sets min_batch = max_batch = batch_size for a true single-point
profile that TRT treats as fully concrete.

With all dimensions concrete (batch + spatial), the next UNet build
should show tiling_optimization_level=MODERATE and l2_limit_for_tiling
applied without the '[l2tc] VALIDATE FAIL - symbolic shape' warning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
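The get_minmax_dims() fix amounts to collapsing the batch range to a single point. An illustrative sketch (the `batch_profile` name and the non-static defaults are hypothetical; only the static branch mirrors the described fix):

```python
def batch_profile(batch_size, static_batch, min_floor=1, max_cap=4):
    """Return (min, opt, max) batch dims for a TRT optimization profile.
    Static: a true single-point profile (min == opt == max) so every
    dimension is concrete and l2tc can engage. The old code used
    max(1, batch_size - 1) for min, which still left a symbolic
    range (1-2) that disabled l2tc for the whole graph."""
    if static_batch:
        return batch_size, batch_size, batch_size
    return min_floor, batch_size, max_cap
```

With t_index_list fixed and cfg_type='self', `batch_size` is always `len(t_index_list)`, so the single-point profile loses no runtime flexibility.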
…PROFILE_TRT

- Add TRTProfiler class (IProfiler impl): per-layer timing with start/end_run,
  get_summary() aggregating median times across last N runs
- Set profiling_verbosity=DETAILED in both FP16 and FP8 build paths so new
  engines embed layer names + tactic IDs for meaningful profiling output
- Engine.activate(): attach TRTProfiler + log when env var is set
- Engine.infer(): disable CUDA graphs when profiler is attached (IProfiler
  cannot report per-layer times through graph replay); wrap execution with
  start_run/end_run; sync stream before end_run to ensure all callbacks fired
- Engine.dump_profile(): log per-layer summary, no-op when profiler is None
- UNet2DConditionModelEngine, AutoencoderKLEngine, ControlNetModelEngine:
  add dump_profile() delegation to underlying Engine

Zero overhead in production (env var not set = no profiler created, CUDA graphs
work normally). Enable with: set STREAMDIFFUSION_PROFILE_TRT=1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hapes

Level 4 compiles dynamic kernels — unnecessary with fully static profiles
(build_dynamic_shape=False + build_static_batch=True) and triggers
tactic 0x3e9 'Assertion g.nodes.size() == 0' failures in TRT 10.12.
Level 3 heuristic selection produces equivalent results for static builds.
Level 5 still avoided (OOM during tactic profiling, 160 GiB requests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TRT 10.12 bug

Level 3 does not avoid the tactic 0x3e9 assertion errors — they appear at
all optimization levels. Reverting to 4 for better dynamic kernel selection.
Added comment documenting the benign TRT 10.12 bug.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t with use_cached_attn

CachedSTAttnProcessor2_0 unconditionally used .copy_() which produces
aten::copy during torch.onnx.export() tracing — no ONNX symbolic exists
for this op, crashing UNet export with use_cached_attn=True.

Added _use_prealloc=False flag (default):
- False: ONNX-safe .clone() / torch.stack() path used during tracing
- True: zero-alloc .copy_() path for non-TRT runtime (set externally)

For TRT builds processors don't run at inference time (engine handles
KV cache internally), so _use_prealloc=True is only relevant for
non-TRT acceleration paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_or_load_controlnet_engine() defaulted to opt_image_height/width=704,
causing a static shape mismatch at runtime when the pipeline runs at 512×512
(latent 64×64 vs expected 88×88). Pass self.height / self.width so the engine
is built at the actual inference resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ControlNet ran with use_cuda_graph=False despite the Engine.infer() and
allocate_buffers() infrastructure supporting graph capture. Since shapes
are fixed at runtime (same resolution every frame), enabling CUDA graphs
eliminates CPU kernel launch overhead per denoising step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erification

Batch scripts:
- Replace hardcoded D:\Users\alexk paths with %~dp0 (fully portable)
- Add prerequisite checks: Python 3.11, Git, cl.exe (warn-only)
- Install_TensorRT.bat: check venv exists before activation attempt
- Start_StreamDiffusion.bat: call set_env.bat if present; fix td_main.py path casing
- New set_env.bat: documents and sets PYTORCH_CUDA_ALLOC_CONF, CUDA_MODULE_LOADING,
  SDTD_L2_PERSIST for runtime GPU tuning

Version pin alignment (post deps-audit, 14/14 checks verified):
- setup.py: Pillow>=12.2.0, onnxruntime-gpu==1.24.4, polygraphy==0.49.26, colored==2.3.2
- StreamDiffusionTD/install_tensorrt.py: full TENSORRT_PINS dict, pywin32==311
- src/tools/install-tensorrt.py: polygraphy==0.49.26, pywin32==311
- StreamDiffusion-installer: see submodule commit 24a5693

Audit:
- audit_reports/2026-04-07-1122-audit-summary.md: 7 CVEs in 2 blocked packages
  (onnx: graphsurgeon blocks upgrade; protobuf: mediapipe blocks upgrade), 23 safe
  package updates applied

forkni commented Apr 7, 2026

Closing in favor of a targeted installer-only PR for clean merge

@forkni forkni closed this Apr 7, 2026
