
perf: TensorRT engine builder — static shapes, profiling, CUDA graphs #8

Open

forkni wants to merge 10 commits into pr7/tier1-optimizations from pr8/trt-engine-builder

Conversation


@forkni forkni commented Apr 6, 2026

Summary

Stacked on #7 — merge that PR first.

TensorRT engine builder modernization: static shapes for L2TC, profiling infrastructure, CUDA graphs for ControlNet, deprecated API cleanup, and FP8 buffer safety.

Key Changes

Static Shape Engine Building

  • engine_manager.py: Engine paths now encode height×width dimensions — separate cache per resolution
  • models.py: Fully static batch + spatial profiles to unlock L2 tiling cache (l2tc) on UNet
  • Passes resolution=(height, width) through engine path helpers so pre-built engines at one resolution are never reused at another
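As a rough sketch of the path-encoding idea (names hypothetical — not the actual engine_manager.py API):

```python
from pathlib import Path
from typing import Optional, Tuple

def engine_path(cache_dir: str, model: str, precision: str,
                resolution: Optional[Tuple[int, int]]) -> Path:
    """Build an engine cache path; static builds append a --res-{H}x{W}
    suffix so an engine built at one resolution is never reused at another."""
    name = f"{model}--{precision}"
    if resolution is not None:
        height, width = resolution
        name += f"--res-{height}x{width}"
    return Path(cache_dir) / f"{name}.engine"
```

With a scheme like this, a 512×512 UNet engine lands at `unet--fp16--res-512x512.engine`, so a later 768×768 build gets its own cache entry instead of silently reusing the wrong engine.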

TRT Profiling Infrastructure

  • New StreamDiffusionProfiler(trt.IProfiler) in utilities.py — per-layer median timing with top-25 slowest layers report
  • Gated by STREAMDIFFUSION_PROFILE_TRT=1 env var (zero overhead when disabled)
  • Attached to UNet and ControlNet execution contexts
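The timing-aggregation core of such a profiler could be sketched like this (class name hypothetical; the real StreamDiffusionProfiler subclasses trt.IProfiler, whose report_layer_time callback TensorRT invokes once per layer per run):

```python
import os
from collections import defaultdict
from statistics import median

class LayerTimingAggregator:
    """Collects per-layer timings across runs and reports median times,
    slowest layers first — the shape of data trt.IProfiler delivers."""

    def __init__(self):
        self.times_ms = defaultdict(list)  # layer name -> timings per run

    def report_layer_time(self, layer_name: str, ms: float) -> None:
        # Same signature as trt.IProfiler.report_layer_time.
        self.times_ms[layer_name].append(ms)

    def top_slowest(self, n: int = 25):
        """Median time per layer across recorded runs, slowest first."""
        medians = {name: median(ts) for name, ts in self.times_ms.items()}
        return sorted(medians.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Gated on the env var so production runs create no profiler at all:
# profiler = LayerTimingAggregator() if os.environ.get(
#     "STREAMDIFFUSION_PROFILE_TRT") == "1" else None
```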

CUDA Graphs for ControlNet

  • controlnet_engine.py: CUDA graph capture/replay enabled on ControlNet TRT engine
  • Resolution passed from wrapper → ControlNet builder (previously missing, caused shape mismatch)
  • ControlNet TRT compilation guarded behind acceleration == "tensorrt" check

Deprecated TRT 10.x API Cleanup

  • Removed polygraphy builder config usage
  • Updated deprecated DataType.HALF → DataType.FP16 and similar throughout engine builder and preprocessing TRT engines

FP8 Buffer Safety

  • allocate_buffers() now handles float8_e4m3fn dtype (maps via try/except, skips numpy for FP8 tensors)
  • Removed direct_io_types/simplify passes that broke FP8 ONNX graphs
  • Applied consistently in utilities.py and temporal_net_tensorrt.py
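The dtype fallback can be illustrated with a stand-in for trt.nptype() (names hypothetical; the real allocate_buffers() calls tensorrt and falls back to torch.float8_e4m3fn):

```python
import numpy as np

def fake_nptype(trt_dtype: str):
    """Stand-in for trt.nptype(): raises TypeError when numpy has no
    equivalent dtype, which is the case for FP8 (float8_e4m3fn)."""
    table = {"FLOAT": np.float32, "HALF": np.float16, "INT8": np.int8}
    if trt_dtype not in table:
        raise TypeError(f"no numpy equivalent for {trt_dtype}")
    return table[trt_dtype]

def resolve_buffer_dtype(trt_dtype: str, nptype=fake_nptype):
    """FP8-safe mapping: try the numpy route first; on TypeError skip the
    numpy intermediate (the real code allocates the torch tensor directly
    with torch.float8_e4m3fn at this point)."""
    try:
        return nptype(trt_dtype)
    except TypeError:
        return "float8_e4m3fn"
```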

ONNX Export Compatibility

  • aten::copy guarded behind _use_prealloc flag in CachedSTAttnProcessor2_0
  • ONNX export tracing uses .clone() path (has symbolic); inference uses .copy_() (zero-alloc)
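A minimal sketch of that flag dispatch (hypothetical class; the real guard lives in CachedSTAttnProcessor2_0):

```python
import torch

class CachedKV:
    """Toy KV-cache write with the same two paths as the attention processor:
    clone() is ONNX-traceable, copy_() (aten::copy) is not but allocates nothing."""

    def __init__(self, use_prealloc: bool = False):
        self._use_prealloc = use_prealloc
        self._buffer = None

    def store(self, kv: torch.Tensor) -> torch.Tensor:
        if self._use_prealloc and self._buffer is not None:
            # Zero-alloc runtime path: aten::copy has no ONNX symbolic,
            # so this branch must never run under torch.onnx.export().
            self._buffer.copy_(kv)
        else:
            # ONNX-safe path: clone() has a symbolic and traces cleanly.
            self._buffer = kv.clone()
        return self._buffer
```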

Files Modified

| File | Changes |
| --- | --- |
| builder.py | Deprecated API cleanup, optimization level comment |
| utilities.py | allocate_buffers FP8-safe, StreamDiffusionProfiler, CUDA graphs |
| engine_manager.py | Resolution-encoded engine paths, static shapes |
| models/models.py | Static batch + spatial profiles |
| models/controlnet_models.py | Static shapes for ControlNet |
| attention_processors.py | _use_prealloc ONNX guard |
| runtime_engines/unet_engine.py | Profiler attachment |
| runtime_engines/controlnet_engine.py | Profiler attachment, CUDA graphs |
| wrapper.py | ControlNet compilation guard, resolution passthrough |
| preprocessing/temporal_net_tensorrt.py | FP8-safe buffer allocation |
| tools/compile_raft_tensorrt.py | Deprecated TRT API cleanup |
| preprocessing/realesrgan_trt.py | Deprecated TRT API cleanup |

Impact

| Feature | Before | After |
| --- | --- | --- |
| Engine path | No resolution encoding | --res-512x512 in path |
| UNet profiles | Dynamic batch dims | Fully static |
| ControlNet TRT | No CUDA graph | Graph capture enabled |
| FP8 buffers | TypeError on float8_e4m3fn | Handled correctly |
| ONNX export w/ use_cached_attn | aten::copy tracing error | Clean export |
| TRT profiling | No per-layer timing | STREAMDIFFUSION_PROFILE_TRT=1 |

Test plan

  • Build UNet TRT engine: verify resolution encoded in engine path (build_stats.json)
  • STREAMDIFFUSION_PROFILE_TRT=1: verify per-layer timing report in logs
  • ControlNet + TRT inference: verify CUDA graph capture message in logs
  • FP8 engine build: verify allocate_buffers with float8_e4m3fn dtype
  • ONNX export with use_cached_attn=True: confirm no aten::copy tracing error
  • FP16 baseline engine unaffected (no --fp8 suffix regression)

🤖 Generated with Claude Code

INTER-NYC and others added 10 commits April 6, 2026 13:59
…safe

Remove `direct_io_types=True` from ModelOpt quantize_kwargs — it caused
engine I/O tensors to be typed as FLOAT8E4M3FN, which crashes at runtime
because `trt.nptype()` has no numpy equivalent for FP8.

Remove `simplify=True` — always fails with protobuf >2GB parse error on
our external-data-format ONNX (graceful fallback, but wastes ~1 min).

Make `Engine.allocate_buffers` and `TensorRTEngine.allocate_buffers` FP8-
resilient: catch TypeError from `trt.nptype()` and fall back to
`torch.float8_e4m3fn` directly, bypassing the numpy intermediate.

FP8 ONNX must be regenerated (delete unet.engine.fp8.onnx* + unet.engine,
keep timing.cache). Entropy calibration and calibrate_per_node are retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolution is always known before inference and never changes, so all four
engine types (UNet, VAE encoder, VAE decoder, ControlNet) now build with
static spatial profiles (min=opt=max at the exact resolution).

Static shapes unlock:
- tiling_optimization_level (FAST/MODERATE/FULL by GPU tier) — was
  skipped for all dynamic builds with 'symbolic shape, l2tc doesn't
  take effect' warning
- l2_limit_for_tiling — now applied for full L2 cache budget
- Geometry-specific kernel selection instead of range-covering kernels
- Tighter CUDA graph buffer allocation (exact dims vs worst-case 1024²)
- Faster builds: single-point tactic search vs 4× spatial range
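A static profile in this sense is simply min == opt == max for every input. For the UNet latent input it might look like this (hypothetical helper, assuming 4 latent channels and the 8× VAE downscale):

```python
def static_input_profile(batch: int, height: int, width: int) -> dict:
    """Single-point (min, opt, max) profile for the UNet 'sample' input.
    With all three points identical, TRT treats every dim as concrete."""
    dims = (batch, 4, height // 8, width // 8)  # latent = pixel dims // 8
    return {"sample": {"min": dims, "opt": dims, "max": dims}}
```

In the real builder these tuples would be fed to TensorRT's optimization profile (set_shape per input) rather than returned as a dict.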

Key fixes:
- get_minmax_dims(): static_shape flag was dead code — hardcoded to
  always return 256-1024 range regardless of the flag
- UNet.get_input_profile(): separation logic (opt != min padding) now
  guarded behind `if not static_shape` — was incorrectly padding opt
  away from min for static engines where min==opt==max is correct
- ControlNetTRT.get_input_profile(): had its own hardcoded 384-1024
  range that bypassed get_minmax_dims() entirely; now respects
  static_shape flag
- ControlNet residual scaling: max(min+1,...) guard now bypassed for
  static shapes where min==max; exact dims used directly
- Engine paths: add --res-{H}x{W} suffix for static builds to prevent
  cache collisions between different resolutions

Dead code removal:
- build_all_tactics / enable_all_tactics parameter excised from entire
  call chain (wrapper → builder → utilities → Engine.build/_build_fp8)
  TRT 10.12 defaults already enable EDGE_MASK_CONVOLUTIONS +
  JIT_CONVOLUTIONS; CUBLAS/CUBLAS_LT/CUDNN all deprecated and disabled

Tactic tuning:
- avg_timing_iterations=4 added to _apply_gpu_profile_to_config()
  Default 1 produces noisy single-sample measurements; 4 iterations
  give stable tactic rankings with negligible extra build time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Even with static spatial shapes (512x512), TRT's l2tc (L2 tiling cache
optimization) was still disabled for UNet because the batch dimension
remained dynamic (min=1, max=4). TRT checks that ALL dimensions are
concrete before enabling l2tc — a single symbolic dimension disables it
for the entire graph.

Fix: set build_static_batch=True for all three engine types (UNet, VAE
decoder, VAE encoder) and ControlNet. Since t_index_list is fixed and
cfg_type='self' (never 'full') is always used, the UNet batch is always
exactly len(t_index_list)=2 — never changes at runtime.

Also fix get_minmax_dims() static_batch path: was setting
min_batch = max(1, batch_size-1) which still created a range (1-2).
Now sets min_batch = max_batch = batch_size for a true single-point
profile that TRT treats as fully concrete.

With all dimensions concrete (batch + spatial), the next UNet build
should show tiling_optimization_level=MODERATE and l2_limit_for_tiling
applied without the '[l2tc] VALIDATE FAIL - symbolic shape' warning.
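The corrected single-point logic described above might look like this (hypothetical signature; the real get_minmax_dims() covers more inputs):

```python
def get_minmax_dims(batch_size: int, height: int, width: int,
                    static_batch: bool, static_shape: bool):
    """Return (min, max) pairs for batch, height, width."""
    if static_batch:
        # Fix: was max(1, batch_size - 1), which still left a 1..N range.
        min_batch = max_batch = batch_size
    else:
        min_batch, max_batch = 1, batch_size
    if static_shape:
        # Fix: was hardcoded to the 256-1024 range regardless of the flag.
        min_h = max_h = height
        min_w = max_w = width
    else:
        min_h, max_h = 256, 1024
        min_w, max_w = 256, 1024
    return (min_batch, max_batch), (min_h, max_h), (min_w, max_w)
```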

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…PROFILE_TRT

- Add TRTProfiler class (IProfiler impl): per-layer timing with start/end_run,
  get_summary() aggregating median times across last N runs
- Set profiling_verbosity=DETAILED in both FP16 and FP8 build paths so new
  engines embed layer names + tactic IDs for meaningful profiling output
- Engine.activate(): attach TRTProfiler + log when env var is set
- Engine.infer(): disable CUDA graphs when profiler is attached (IProfiler
  cannot report per-layer times through graph replay); wrap execution with
  start_run/end_run; sync stream before end_run to ensure all callbacks fired
- Engine.dump_profile(): log per-layer summary, no-op when profiler is None
- UNet2DConditionModelEngine, AutoencoderKLEngine, ControlNetModelEngine:
  add dump_profile() delegation to underlying Engine

Zero overhead in production (env var not set = no profiler created, CUDA graphs
work normally). Enable with: set STREAMDIFFUSION_PROFILE_TRT=1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t with use_cached_attn

CachedSTAttnProcessor2_0 unconditionally used .copy_() which produces
aten::copy during torch.onnx.export() tracing — no ONNX symbolic exists
for this op, crashing UNet export with use_cached_attn=True.

Added _use_prealloc=False flag (default):
- False: ONNX-safe .clone() / torch.stack() path used during tracing
- True: zero-alloc .copy_() path for non-TRT runtime (set externally)

For TRT builds processors don't run at inference time (engine handles
KV cache internally), so _use_prealloc=True is only relevant for
non-TRT acceleration paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_or_load_controlnet_engine() defaulted to opt_image_height/width=704,
causing a static shape mismatch at runtime when the pipeline runs at 512×512
(latent 64×64 vs expected 88×88). Pass self.height / self.width so the engine
is built at the actual inference resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ControlNet ran with use_cuda_graph=False despite the Engine.infer() and
allocate_buffers() infrastructure supporting graph capture. Since shapes
are fixed at runtime (same resolution every frame), enabling CUDA graphs
eliminates CPU kernel launch overhead per denoising step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>