perf: TensorRT engine builder — static shapes, profiling, CUDA graphs #8
Open
forkni wants to merge 10 commits into pr7/tier1-optimizations from
Conversation
…safe

Remove `direct_io_types=True` from ModelOpt quantize_kwargs — it caused engine I/O tensors to be typed as FLOAT8E4M3FN, which crashes at runtime because `trt.nptype()` has no numpy equivalent for FP8.

Remove `simplify=True` — always fails with a protobuf >2GB parse error on our external-data-format ONNX (graceful fallback, but wastes ~1 min).

Make `Engine.allocate_buffers` and `TensorRTEngine.allocate_buffers` FP8-resilient: catch TypeError from `trt.nptype()` and fall back to `torch.float8_e4m3fn` directly, bypassing the numpy intermediate.

FP8 ONNX must be regenerated (delete unet.engine.fp8.onnx* + unet.engine, keep timing.cache). Entropy calibration and calibrate_per_node are retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
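The FP8 fallback described above can be sketched as follows. This is a minimal sketch: `_nptype` is a stand-in for `trt.nptype()` (which raises TypeError for FP8), and the real code allocates a torch tensor with `torch.float8_e4m3fn` instead of returning a name.

```python
import numpy as np

# Stand-in for trt.nptype(): raises TypeError for dtypes that have no
# numpy equivalent, mirroring TensorRT's behaviour for FP8.
_NP_EQUIV = {"FLOAT": np.float32, "HALF": np.float16}

def _nptype(name):
    try:
        return _NP_EQUIV[name]
    except KeyError:
        raise TypeError(f"no numpy equivalent for {name}")

def resolve_buffer_dtype(trt_dtype_name, nptype=_nptype):
    """Resolve an engine I/O dtype to something a buffer can be allocated with."""
    try:
        # Normal path: go through the numpy intermediate.
        return nptype(trt_dtype_name)
    except TypeError:
        # FP8 path: no numpy equivalent exists, so skip numpy entirely
        # (the real code uses torch.float8_e4m3fn directly here).
        return "float8_e4m3fn"
```

The try/except keeps every pre-existing dtype on its old numpy path; only the dtypes `trt.nptype()` rejects take the fallback.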
Resolution is always known before inference and never changes, so all
three engine types (UNet, VAE encoder, VAE decoder, ControlNet) now
build with static spatial profiles (min=opt=max at exact resolution).
Static shapes unlock:
- tiling_optimization_level (FAST/MODERATE/FULL by GPU tier) — was
skipped for all dynamic builds with 'symbolic shape, l2tc doesn't
take effect' warning
- l2_limit_for_tiling — now applied for full L2 cache budget
- Geometry-specific kernel selection instead of range-covering kernels
- Tighter CUDA graph buffer allocation (exact dims vs worst-case 1024²)
- Faster builds: single-point tactic search vs 4× spatial range
Key fixes:
- get_minmax_dims(): static_shape flag was dead code — hardcoded to
always return 256-1024 range regardless of the flag
- UNet.get_input_profile(): separation logic (opt != min padding) now
guarded behind `if not static_shape` — was incorrectly padding opt
away from min for static engines where min==opt==max is correct
- ControlNetTRT.get_input_profile(): had its own hardcoded 384-1024
range that bypassed get_minmax_dims() entirely; now respects
static_shape flag
- ControlNet residual scaling: max(min+1,...) guard now bypassed for
static shapes where min==max; exact dims used directly
- Engine paths: add --res-{H}x{W} suffix for static builds to prevent
cache collisions between different resolutions
Dead code removal:
- build_all_tactics / enable_all_tactics parameter excised from entire
call chain (wrapper → builder → utilities → Engine.build/_build_fp8)
TRT 10.12 defaults already enable EDGE_MASK_CONVOLUTIONS +
JIT_CONVOLUTIONS; CUBLAS/CUBLAS_LT/CUDNN all deprecated and disabled
Tactic tuning:
- avg_timing_iterations=4 added to _apply_gpu_profile_to_config()
Default 1 produces noisy single-sample measurements; 4 iterations
give stable tactic rankings with negligible extra build time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
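The get_minmax_dims() fix above amounts to actually honoring the flag. A minimal sketch (the real helper also covers batch and embedding dims; 256/1024 are the hardcoded bounds the dead flag used to always return):

```python
def get_minmax_dims(height, width, static_shape):
    """Return (min, max) bounds for each spatial dimension.

    Previously static_shape was dead code and the 256-1024 range was
    returned unconditionally; a static build must collapse the range to
    a single point so min == opt == max at the exact resolution.
    """
    if static_shape:
        return (height, height), (width, width)
    # Dynamic build: keep the range-covering profile.
    return (256, 1024), (256, 1024)
```

With the single-point profile, TRT can pick geometry-specific kernels and skip the 4× spatial tactic search mentioned above.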
Even with static spatial shapes (512x512), TRT's l2tc (L2 tiling cache optimization) was still disabled for UNet because the batch dimension remained dynamic (min=1, max=4). TRT checks that ALL dimensions are concrete before enabling l2tc — a single symbolic dimension disables it for the entire graph.

Fix: set build_static_batch=True for all three engine types (UNet, VAE decoder, VAE encoder) and ControlNet. Since t_index_list is fixed and cfg_type='self' (never 'full') is always used, the UNet batch is always exactly len(t_index_list)=2 — it never changes at runtime.

Also fix the get_minmax_dims() static_batch path: it was setting min_batch = max(1, batch_size-1), which still created a range (1-2). It now sets min_batch = max_batch = batch_size for a true single-point profile that TRT treats as fully concrete.

With all dimensions concrete (batch + spatial), the next UNet build should show tiling_optimization_level=MODERATE and l2_limit_for_tiling applied without the '[l2tc] VALIDATE FAIL - symbolic shape' warning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
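The single-point batch profile described above, as a sketch (the function name is illustrative):

```python
def batch_bounds(batch_size, static_batch):
    """Batch (min, max) bounds for the optimization profile.

    The old static path used max(1, batch_size - 1) for min, which still
    left a symbolic range (e.g. 1-2) and kept l2tc disabled. A truly
    concrete profile needs min == max.
    """
    if static_batch:
        return batch_size, batch_size
    # Old/dynamic behaviour: still a range, so TRT sees a symbolic
    # batch dimension and disables l2tc for the whole graph.
    return max(1, batch_size - 1), batch_size
```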
…PROFILE_TRT

- Add TRTProfiler class (IProfiler impl): per-layer timing with start/end_run, get_summary() aggregating median times across the last N runs
- Set profiling_verbosity=DETAILED in both FP16 and FP8 build paths so new engines embed layer names + tactic IDs for meaningful profiling output
- Engine.activate(): attach TRTProfiler + log when the env var is set
- Engine.infer(): disable CUDA graphs when a profiler is attached (IProfiler cannot report per-layer times through graph replay); wrap execution with start_run/end_run; sync stream before end_run to ensure all callbacks fired
- Engine.dump_profile(): log per-layer summary, no-op when profiler is None
- UNet2DConditionModelEngine, AutoencoderKLEngine, ControlNetModelEngine: add dump_profile() delegation to the underlying Engine

Zero overhead in production (env var not set = no profiler created, CUDA graphs work normally). Enable with: set STREAMDIFFUSION_PROFILE_TRT=1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
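The aggregation side of such a profiler might look like this. It is a sketch of the bookkeeping only: the real class subclasses trt.IProfiler and receives report_layer_time() callbacks from the runtime, and the class and parameter names here are illustrative.

```python
import statistics
from collections import defaultdict, deque

class LayerTimingAggregator:
    """Per-layer timing with median aggregation over the last N runs."""

    def __init__(self, keep_runs=20):
        # Each layer keeps only its most recent `keep_runs` samples.
        self._runs = defaultdict(lambda: deque(maxlen=keep_runs))
        self._current = {}

    def start_run(self):
        self._current = {}

    def report_layer_time(self, layer_name, ms):
        # TRT may invoke a layer more than once per run; accumulate.
        self._current[layer_name] = self._current.get(layer_name, 0.0) + ms

    def end_run(self):
        # Called after a stream sync, so all callbacks have fired.
        for name, ms in self._current.items():
            self._runs[name].append(ms)

    def get_summary(self, top_n=25):
        """Slowest layers first, by median time across recorded runs."""
        medians = {n: statistics.median(t) for n, t in self._runs.items()}
        return sorted(medians.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Using the median rather than the mean keeps the summary stable against occasional outlier runs (e.g. the first, cold run).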
… 10.12 bug, level 3 unsafe
…t with use_cached_attn

CachedSTAttnProcessor2_0 unconditionally used .copy_(), which produces aten::copy during torch.onnx.export() tracing — no ONNX symbolic exists for this op, crashing UNet export with use_cached_attn=True.

Added _use_prealloc=False flag (default):
- False: ONNX-safe .clone() / torch.stack() path used during tracing
- True: zero-alloc .copy_() path for non-TRT runtime (set externally)

For TRT builds, processors don't run at inference time (the engine handles the KV cache internally), so _use_prealloc=True is only relevant for non-TRT acceleration paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
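The switch can be reduced to the following sketch. Numpy stands in for torch tensors, and the class name is illustrative; in the real processor the in-place write is `.copy_()`, which traces to aten::copy (no ONNX symbolic), while `.clone()` traces cleanly.

```python
import numpy as np

class CacheWriter:
    """Illustrative reduction of the _use_prealloc switch."""

    def __init__(self, use_prealloc=False):
        self._use_prealloc = use_prealloc
        self.cache = None

    def write(self, kv):
        if self._use_prealloc and self.cache is not None:
            # Runtime path: in-place copy into the preallocated buffer.
            # Zero-alloc, but not safe under torch.onnx.export tracing.
            np.copyto(self.cache, kv)
        else:
            # Export path: fresh allocation; the clone-style op has an
            # ONNX symbolic, so tracing succeeds.
            self.cache = kv.copy()
        return self.cache
```

Keeping the flag default-False means ONNX export needs no special setup; callers opt into the zero-alloc path only for non-TRT inference.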
get_or_load_controlnet_engine() defaulted to opt_image_height/width=704, causing a static shape mismatch at runtime when the pipeline runs at 512×512 (latent 64×64 vs expected 88×88). Pass self.height / self.width so the engine is built at the actual inference resolution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
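The mismatch falls straight out of the VAE downsampling arithmetic, sketched here (the ×8 factor is the usual Stable Diffusion latent scale; the function name is illustrative):

```python
def latent_dims(height, width, vae_scale_factor=8):
    """Spatial dims of the latent tensor a static engine expects."""
    return height // vae_scale_factor, width // vae_scale_factor
```

A pipeline running at 512×512 produces 64×64 latents, while an engine built at the 704 default expects 88×88 — hence the runtime shape error unless the engine is built at the actual inference resolution.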
ControlNet ran with use_cuda_graph=False despite the Engine.infer() and allocate_buffers() infrastructure supporting graph capture. Since shapes are fixed at runtime (same resolution every frame), enabling CUDA graphs eliminates CPU kernel launch overhead per denoising step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
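The enablement logic reduces to a capture-once/replay-after state machine. This toy sketch only tracks when capture vs replay happens; the real path records the engine's kernel launches with CUDA graph capture (e.g. via torch.cuda.CUDAGraph) and replays the whole graph with a single launch.

```python
class GraphedEngine:
    """Toy model: static shapes make it safe to capture on the first
    inference and replay on every subsequent one."""

    def __init__(self):
        self.captured = False
        self.eager_runs = 0  # stands in for per-kernel CPU launch work

    def infer(self, use_cuda_graph=True):
        if use_cuda_graph and self.captured:
            # One graph launch; per-kernel CPU launch overhead is gone.
            return "replay"
        self.eager_runs += 1
        if use_cuda_graph:
            # Shapes are fixed every frame, so capture is valid.
            self.captured = True
        return "eager"
```

Because the resolution (and hence every tensor shape) is identical each frame, capture never needs to be invalidated.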
Summary
TensorRT engine builder modernization: static shapes for L2TC, profiling infrastructure, CUDA graphs for ControlNet, deprecated API cleanup, and FP8 buffer safety.
Key Changes
Static Shape Engine Building
- `engine_manager.py`: Engine paths now encode height×width dimensions — separate cache per resolution
- `models.py`: Fully static batch + spatial profiles to unlock L2 tiling cache (l2tc) on UNet
- `resolution=(height, width)` passed through engine path helpers so pre-built engines at one resolution are never reused at another

TRT Profiling Infrastructure
- `StreamDiffusionProfiler(trt.IProfiler)` in `utilities.py` — per-layer median timing with top-25 slowest layers report
- Enabled via `STREAMDIFFUSION_PROFILE_TRT=1` env var (zero overhead when disabled)

CUDA Graphs for ControlNet
- `controlnet_engine.py`: CUDA graph capture/replay enabled on the ControlNet TRT engine
- Gated on the `acceleration == "tensorrt"` check

Deprecated TRT 10.x API Cleanup
- `DataType.HALF` → `DataType.FP16` and similar throughout the engine builder and preprocessing TRT engines

FP8 Buffer Safety
- `allocate_buffers()` now handles `float8_e4m3fn` dtype (maps via try/except, skips numpy for FP8 tensors)
- Removed `direct_io_types` / `simplify` passes that broke FP8 ONNX graphs
- Applied in `utilities.py` and `temporal_net_tensorrt.py`

ONNX Export Compatibility
- `aten::copy` guarded behind the `_use_prealloc` flag in `CachedSTAttnProcessor2_0`
- Export uses the `.clone()` path (has a symbolic); inference uses `.copy_()` (zero-alloc)

Files Modified
- `builder.py`
- `utilities.py`: `allocate_buffers` FP8-safe, `StreamDiffusionProfiler`, CUDA graphs
- `engine_manager.py`
- `models/models.py`
- `models/controlnet_models.py`
- `attention_processors.py`: `_use_prealloc` ONNX guard
- `runtime_engines/unet_engine.py`
- `runtime_engines/controlnet_engine.py`
- `wrapper.py`
- `preprocessing/temporal_net_tensorrt.py`
- `tools/compile_raft_tensorrt.py`
- `preprocessing/realesrgan_trt.py`

Impact
- Engine cache paths now carry the resolution (e.g. `--512x512--` in path)
- Per-layer profiling available via `STREAMDIFFUSION_PROFILE_TRT=1`

Test plan
- `STREAMDIFFUSION_PROFILE_TRT=1`: verify per-layer timing report in logs
- `allocate_buffers` with float8_e4m3fn dtype
- `use_cached_attn=True`: confirm no aten::copy tracing error
- `--fp8` suffix regression check

🤖 Generated with Claude Code