
perf: TensorRT engine builder — static shapes, profiling, CUDA graphs #8

Open

forkni wants to merge 10 commits into pr7/tier1-optimizations from pr8/trt-engine-builder

Conversation


@forkni forkni commented Apr 6, 2026

Summary

Stacked on #7 — merge that PR first.

TensorRT engine builder modernization: static shapes for L2TC, profiling infrastructure, CUDA graphs for ControlNet, deprecated API cleanup, and FP8 buffer safety.

Key Changes

Static Shape Engine Building

  • engine_manager.py: Engine paths now encode height×width dimensions — separate cache per resolution
  • models.py: Fully static batch + spatial profiles to unlock L2 tiling cache (l2tc) on UNet
  • Passes resolution=(height, width) through engine path helpers so pre-built engines at one resolution are never reused at another
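As a rough sketch of the path-encoding idea (names hypothetical — not the actual engine_manager.py API):

```python
from pathlib import Path
from typing import Optional, Tuple

def engine_path(cache_dir: str, model: str, precision: str,
                resolution: Optional[Tuple[int, int]]) -> Path:
    """Build an engine cache path; static builds append a --res-{H}x{W}
    suffix so an engine built at one resolution is never reused at another."""
    name = f"{model}--{precision}"
    if resolution is not None:
        height, width = resolution
        name += f"--res-{height}x{width}"
    return Path(cache_dir) / f"{name}.engine"
```

With a scheme like this, a 512×512 UNet engine lands at `unet--fp16--res-512x512.engine`, so a later 768×768 build gets its own cache entry instead of silently reusing the wrong engine.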

TRT Profiling Infrastructure

  • New StreamDiffusionProfiler(trt.IProfiler) in utilities.py — per-layer median timing with top-25 slowest layers report
  • Gated by STREAMDIFFUSION_PROFILE_TRT=1 env var (zero overhead when disabled)
  • Attached to UNet and ControlNet execution contexts
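The timing-aggregation core of such a profiler could be sketched like this (class name hypothetical; the real StreamDiffusionProfiler subclasses trt.IProfiler, whose report_layer_time callback TensorRT invokes once per layer per run):

```python
import os
from collections import defaultdict
from statistics import median

class LayerTimingAggregator:
    """Collects per-layer timings across runs and reports median times,
    slowest layers first — the shape of data trt.IProfiler delivers."""

    def __init__(self):
        self.times_ms = defaultdict(list)  # layer name -> timings per run

    def report_layer_time(self, layer_name: str, ms: float) -> None:
        # Same signature as trt.IProfiler.report_layer_time.
        self.times_ms[layer_name].append(ms)

    def top_slowest(self, n: int = 25):
        """Median time per layer across recorded runs, slowest first."""
        medians = {name: median(ts) for name, ts in self.times_ms.items()}
        return sorted(medians.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Gated on the env var so production runs create no profiler at all:
# profiler = LayerTimingAggregator() if os.environ.get(
#     "STREAMDIFFUSION_PROFILE_TRT") == "1" else None
```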

CUDA Graphs for ControlNet

  • controlnet_engine.py: CUDA graph capture/replay enabled on ControlNet TRT engine
  • Resolution passed from wrapper → ControlNet builder (previously missing, caused shape mismatch)
  • ControlNet TRT compilation guarded behind acceleration == "tensorrt" check

Deprecated TRT 10.x API Cleanup

  • Removed polygraphy builder config usage
  • Updated deprecated DataType.HALF → DataType.FP16 and similar throughout engine builder and preprocessing TRT engines

FP8 Buffer Safety

  • allocate_buffers() now handles float8_e4m3fn dtype (maps via try/except, skips numpy for FP8 tensors)
  • Removed direct_io_types/simplify passes that broke FP8 ONNX graphs
  • Applied consistently in utilities.py and temporal_net_tensorrt.py
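The dtype fallback can be illustrated with a stand-in for trt.nptype() (names hypothetical; the real allocate_buffers() calls tensorrt and falls back to torch.float8_e4m3fn):

```python
import numpy as np

def fake_nptype(trt_dtype: str):
    """Stand-in for trt.nptype(): raises TypeError when numpy has no
    equivalent dtype, which is the case for FP8 (float8_e4m3fn)."""
    table = {"FLOAT": np.float32, "HALF": np.float16, "INT8": np.int8}
    if trt_dtype not in table:
        raise TypeError(f"no numpy equivalent for {trt_dtype}")
    return table[trt_dtype]

def resolve_buffer_dtype(trt_dtype: str, nptype=fake_nptype):
    """FP8-safe mapping: try the numpy route first; on TypeError skip the
    numpy intermediate (the real code allocates the torch tensor directly
    with torch.float8_e4m3fn at this point)."""
    try:
        return nptype(trt_dtype)
    except TypeError:
        return "float8_e4m3fn"
```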

ONNX Export Compatibility

  • aten::copy guarded behind _use_prealloc flag in CachedSTAttnProcessor2_0
  • ONNX export tracing uses .clone() path (has symbolic); inference uses .copy_() (zero-alloc)
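A minimal sketch of that flag dispatch (hypothetical class; the real guard lives in CachedSTAttnProcessor2_0):

```python
import torch

class CachedKV:
    """Toy KV-cache write with the same two paths as the attention processor:
    clone() is ONNX-traceable, copy_() (aten::copy) is not but allocates nothing."""

    def __init__(self, use_prealloc: bool = False):
        self._use_prealloc = use_prealloc
        self._buffer = None

    def store(self, kv: torch.Tensor) -> torch.Tensor:
        if self._use_prealloc and self._buffer is not None:
            # Zero-alloc runtime path: aten::copy has no ONNX symbolic,
            # so this branch must never run under torch.onnx.export().
            self._buffer.copy_(kv)
        else:
            # ONNX-safe path: clone() has a symbolic and traces cleanly.
            self._buffer = kv.clone()
        return self._buffer
```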

Files Modified

| File | Changes |
| --- | --- |
| builder.py | Deprecated API cleanup, optimization level comment |
| utilities.py | allocate_buffers FP8-safe, StreamDiffusionProfiler, CUDA graphs |
| engine_manager.py | Resolution-encoded engine paths, static shapes |
| models/models.py | Static batch + spatial profiles |
| models/controlnet_models.py | Static shapes for ControlNet |
| attention_processors.py | _use_prealloc ONNX guard |
| runtime_engines/unet_engine.py | Profiler attachment |
| runtime_engines/controlnet_engine.py | Profiler attachment, CUDA graphs |
| wrapper.py | ControlNet compilation guard, resolution passthrough |
| preprocessing/temporal_net_tensorrt.py | FP8-safe buffer allocation |
| tools/compile_raft_tensorrt.py | Deprecated TRT API cleanup |
| preprocessing/realesrgan_trt.py | Deprecated TRT API cleanup |

Impact

| Feature | Before | After |
| --- | --- | --- |
| Engine path | No resolution encoding | --res-512x512 in path |
| UNet profiles | Dynamic batch dims | Fully static |
| ControlNet TRT | No CUDA graph | Graph capture enabled |
| FP8 buffers | TypeError on float8_e4m3fn | Handled correctly |
| ONNX export w/ use_cached_attn | aten::copy tracing error | Clean export |
| TRT profiling | No per-layer timing | STREAMDIFFUSION_PROFILE_TRT=1 |

Test plan

  • Build UNet TRT engine: verify resolution encoded in engine path (build_stats.json)
  • STREAMDIFFUSION_PROFILE_TRT=1: verify per-layer timing report in logs
  • ControlNet + TRT inference: verify CUDA graph capture message in logs
  • FP8 engine build: verify allocate_buffers with float8_e4m3fn dtype
  • ONNX export with use_cached_attn=True: confirm no aten::copy tracing error
  • FP16 baseline engine unaffected (no --fp8 suffix regression)

🤖 Generated with Claude Code

INTER-NYC and others added 10 commits April 6, 2026 13:59
…safe

Remove `direct_io_types=True` from ModelOpt quantize_kwargs — it caused
engine I/O tensors to be typed as FLOAT8E4M3FN, which crashes at runtime
because `trt.nptype()` has no numpy equivalent for FP8.

Remove `simplify=True` — always fails with protobuf >2GB parse error on
our external-data-format ONNX (graceful fallback, but wastes ~1 min).

Make `Engine.allocate_buffers` and `TensorRTEngine.allocate_buffers` FP8-
resilient: catch TypeError from `trt.nptype()` and fall back to
`torch.float8_e4m3fn` directly, bypassing the numpy intermediate.

FP8 ONNX must be regenerated (delete unet.engine.fp8.onnx* + unet.engine,
keep timing.cache). Entropy calibration and calibrate_per_node are retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolution is always known before inference and never changes, so all four
engine types (UNet, VAE encoder, VAE decoder, ControlNet) now build with
static spatial profiles (min=opt=max at the exact resolution).

Static shapes unlock:
- tiling_optimization_level (FAST/MODERATE/FULL by GPU tier) — was
  skipped for all dynamic builds with 'symbolic shape, l2tc doesn't
  take effect' warning
- l2_limit_for_tiling — now applied for full L2 cache budget
- Geometry-specific kernel selection instead of range-covering kernels
- Tighter CUDA graph buffer allocation (exact dims vs worst-case 1024²)
- Faster builds: single-point tactic search vs 4× spatial range
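A static profile in this sense is simply min == opt == max for every input. For the UNet latent input it might look like this (hypothetical helper, assuming 4 latent channels and the 8× VAE downscale):

```python
def static_input_profile(batch: int, height: int, width: int) -> dict:
    """Single-point (min, opt, max) profile for the UNet 'sample' input.
    With all three points identical, TRT treats every dim as concrete."""
    dims = (batch, 4, height // 8, width // 8)  # latent = pixel dims // 8
    return {"sample": {"min": dims, "opt": dims, "max": dims}}
```

In the real builder these tuples would be fed to TensorRT's optimization profile (set_shape per input) rather than returned as a dict.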

Key fixes:
- get_minmax_dims(): static_shape flag was dead code — hardcoded to
  always return 256-1024 range regardless of the flag
- UNet.get_input_profile(): separation logic (opt != min padding) now
  guarded behind `if not static_shape` — was incorrectly padding opt
  away from min for static engines where min==opt==max is correct
- ControlNetTRT.get_input_profile(): had its own hardcoded 384-1024
  range that bypassed get_minmax_dims() entirely; now respects
  static_shape flag
- ControlNet residual scaling: max(min+1,...) guard now bypassed for
  static shapes where min==max; exact dims used directly
- Engine paths: add --res-{H}x{W} suffix for static builds to prevent
  cache collisions between different resolutions

Dead code removal:
- build_all_tactics / enable_all_tactics parameter excised from entire
  call chain (wrapper → builder → utilities → Engine.build/_build_fp8)
  TRT 10.12 defaults already enable EDGE_MASK_CONVOLUTIONS +
  JIT_CONVOLUTIONS; CUBLAS/CUBLAS_LT/CUDNN all deprecated and disabled

Tactic tuning:
- avg_timing_iterations=4 added to _apply_gpu_profile_to_config()
  Default 1 produces noisy single-sample measurements; 4 iterations
  give stable tactic rankings with negligible extra build time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Even with static spatial shapes (512x512), TRT's l2tc (L2 tiling cache
optimization) was still disabled for UNet because the batch dimension
remained dynamic (min=1, max=4). TRT checks that ALL dimensions are
concrete before enabling l2tc — a single symbolic dimension disables it
for the entire graph.

Fix: set build_static_batch=True for all three engine types (UNet, VAE
decoder, VAE encoder) and ControlNet. Since t_index_list is fixed and
cfg_type='self' (never 'full') is always used, the UNet batch is always
exactly len(t_index_list)=2 — never changes at runtime.

Also fix get_minmax_dims() static_batch path: was setting
min_batch = max(1, batch_size-1) which still created a range (1-2).
Now sets min_batch = max_batch = batch_size for a true single-point
profile that TRT treats as fully concrete.

With all dimensions concrete (batch + spatial), the next UNet build
should show tiling_optimization_level=MODERATE and l2_limit_for_tiling
applied without the '[l2tc] VALIDATE FAIL - symbolic shape' warning.
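The corrected single-point logic described above might look like this (hypothetical signature; the real get_minmax_dims() covers more inputs):

```python
def get_minmax_dims(batch_size: int, height: int, width: int,
                    static_batch: bool, static_shape: bool):
    """Return (min, max) pairs for batch, height, width."""
    if static_batch:
        # Fix: was max(1, batch_size - 1), which still left a 1..N range.
        min_batch = max_batch = batch_size
    else:
        min_batch, max_batch = 1, batch_size
    if static_shape:
        # Fix: was hardcoded to the 256-1024 range regardless of the flag.
        min_h = max_h = height
        min_w = max_w = width
    else:
        min_h, max_h = 256, 1024
        min_w, max_w = 256, 1024
    return (min_batch, max_batch), (min_h, max_h), (min_w, max_w)
```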

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…PROFILE_TRT

- Add TRTProfiler class (IProfiler impl): per-layer timing with start/end_run,
  get_summary() aggregating median times across last N runs
- Set profiling_verbosity=DETAILED in both FP16 and FP8 build paths so new
  engines embed layer names + tactic IDs for meaningful profiling output
- Engine.activate(): attach TRTProfiler + log when env var is set
- Engine.infer(): disable CUDA graphs when profiler is attached (IProfiler
  cannot report per-layer times through graph replay); wrap execution with
  start_run/end_run; sync stream before end_run to ensure all callbacks fired
- Engine.dump_profile(): log per-layer summary, no-op when profiler is None
- UNet2DConditionModelEngine, AutoencoderKLEngine, ControlNetModelEngine:
  add dump_profile() delegation to underlying Engine

Zero overhead in production (env var not set = no profiler created, CUDA graphs
work normally). Enable with: set STREAMDIFFUSION_PROFILE_TRT=1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t with use_cached_attn

CachedSTAttnProcessor2_0 unconditionally used .copy_() which produces
aten::copy during torch.onnx.export() tracing — no ONNX symbolic exists
for this op, crashing UNet export with use_cached_attn=True.

Added _use_prealloc=False flag (default):
- False: ONNX-safe .clone() / torch.stack() path used during tracing
- True: zero-alloc .copy_() path for non-TRT runtime (set externally)

For TRT builds processors don't run at inference time (engine handles
KV cache internally), so _use_prealloc=True is only relevant for
non-TRT acceleration paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_or_load_controlnet_engine() defaulted to opt_image_height/width=704,
causing a static shape mismatch at runtime when the pipeline runs at 512×512
(latent 64×64 vs expected 88×88). Pass self.height / self.width so the engine
is built at the actual inference resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ControlNet ran with use_cuda_graph=False despite the Engine.infer() and
allocate_buffers() infrastructure supporting graph capture. Since shapes
are fixed at runtime (same resolution every frame), enabling CUDA graphs
eliminates CPU kernel launch overhead per denoising step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>