feat: FP8 quantization & TensorRT build infrastructure #6
Open
forkni wants to merge 9 commits into pr3/ipadapter-vram-deps from
Conversation
…alibrationDataProvider
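The calibration plumbing referenced in the commits (commit 402d619 below merges a calibration list-of-dicts into a stacked dict for modelopt) could be sketched as follows. This is a minimal illustration: `merge_calibration_batches` is a hypothetical name, and plain Python lists stand in for whatever array stacking the real code performs.

```python
def merge_calibration_batches(batches):
    """Merge a list of per-batch feed dicts ({input_name: tensor}) into a
    single dict mapping each input name to the list of its batch tensors.

    The real code presumably stacks arrays along the batch axis; plain
    lists are used here to keep the sketch dependency-free.
    """
    merged = {}
    for feed in batches:
        for name, tensor in feed.items():
            # group every batch's tensor under its input name
            merged.setdefault(name, []).append(tensor)
    return merged
```

A calibrator can then iterate one dict of stacked inputs instead of a list of per-batch dicts, which matches the interface modelopt reportedly expects.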
… cleanup intermediates on retry
…ngine build

- Bug 1: KVO cache batch dim mismatch (`kvo_calib_batch=2` vs sample=4). Set `kvo_calib_batch=effective_batch` to match ONNX shared axis '2B'.
- Bug 2: `BuilderFlag.STRONGLY_TYPED` removed in TRT 10.12. Guard with `hasattr()` fallback.
- Bug 3: Precision flags (FP8/FP16/TF32) incompatible with STRONGLY_TYPED. Skip precision flags when STRONGLY_TYPED is network-level only.
- Bug 4: ModelOpt `override_shapes` bakes static dims into FP8 ONNX. Add `_restore_dynamic_axes()` to restore `dim_param` after quantization.
- Fix `IHostMemory.nbytes` (no `len()`) in TRT 10.12 engine save logging.
- Default `disable_mha_qdq=True` (MHA stays FP16; 17 min vs 3 hr+ build).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
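The version-aware flag guard from Bugs 2 and 3 can be sketched like this. The stub `BuilderFlag` enum stands in for `tensorrt.BuilderFlag` on a TRT build where `STRONGLY_TYPED` was removed, and `select_builder_flags` is a hypothetical helper, not the actual function in this PR.

```python
from enum import Enum, auto

class BuilderFlag(Enum):
    """Stand-in for tensorrt.BuilderFlag on a build without STRONGLY_TYPED."""
    FP8 = auto()
    FP16 = auto()
    TF32 = auto()

def select_builder_flags(want_strongly_typed=True):
    """Pick only the builder flags the installed TRT version supports."""
    strongly_typed = getattr(BuilderFlag, "STRONGLY_TYPED", None)
    if want_strongly_typed and strongly_typed is not None:
        # A strongly typed network derives precision from the graph itself,
        # so explicit precision flags must NOT be combined with it (Bug 3).
        return [strongly_typed]
    # Fallback for versions where the flag is absent (Bug 2): request
    # the precisions explicitly instead.
    return [BuilderFlag.FP8, BuilderFlag.FP16, BuilderFlag.TF32]
```

The same `getattr`/`hasattr` probe pattern generalizes to any enum member that comes and goes between TensorRT releases.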
Move the build_stats.json write before cleanup to prevent accidental deletion. Add two-pass cleanup with gc.collect() between passes to release Python-held file handles that cause Windows lock failures. Delete `onnx__*` tensor files immediately after repacking into weights.pb during ONNX export (~4 GB freed before the quantize stage starts). Add an actionable warning with manual cleanup instructions when file locks persist.

Root cause: builder.py cleanup ran os.remove() once with a silent `except OSError`, leaving ~14.5 GB of intermediates (onnx_data, weights.pb, `onnx__*` tensors, model weight dumps) when Windows file locks prevented deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
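The two-pass cleanup described above could look roughly like this. `two_pass_cleanup` is a hypothetical name; the real builder.py logic may differ in detail.

```python
import gc
import os

def two_pass_cleanup(paths):
    """Delete intermediate files, retrying once after gc.collect().

    On Windows a file still referenced by a live Python object cannot be
    removed; collecting garbage between passes releases those handles.
    Returns the paths that could not be deleted.
    """
    for _ in range(2):
        remaining = []
        for path in paths:
            try:
                if os.path.exists(path):
                    os.remove(path)
            except OSError:
                remaining.append(path)
        if not remaining:
            return []
        gc.collect()  # drop Python-held file handles before the retry
        paths = remaining
    # Locks persisted: surface an actionable warning instead of failing silently.
    for path in paths:
        print(f"WARNING: could not delete {path}; remove it manually after the build exits")
    return paths
```

Writing build_stats.json before calling such a routine (rather than after) is what prevents the stats file from being swept up with the intermediates.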
…ne path

Extract FP8-specific changes from 00cf0c7 (without reformatting). Add the `fp8` parameter flow from StreamDiffusionWrapper through to engine compilation, with a calibration data callback and a `--fp8` engine path suffix for cache separation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
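The cache-separation suffix could be assembled along these lines. `engine_dir_name` and its parameters are illustrative only; the actual path logic lives in engine_manager.py and `get_engine_path()`.

```python
def engine_dir_name(model_id, mode, use_controlnet=False, fp8=False):
    """Build a cache directory name whose segments encode the build options,
    so an FP8 engine never collides with a non-FP8 one for the same model."""
    parts = [model_id]
    if use_controlnet:
        parts.append("controlnet")
    if fp8:
        parts.append("fp8")  # the `--fp8` suffix that separates the caches
    parts.append(f"mode-{mode}")
    return "--".join(parts)
```

With `use_controlnet=True` and `fp8=True` this yields a name shaped like the `...--controlnet--fp8--mode-img2img` example in the PR summary.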
Summary
- nvidia-modelopt pipeline: ONNX export → FP16 optimize → FP8 Q/DQ annotation → TRT STRONGLY_TYPED engine
- `fp8=True` parameter flow from `StreamDiffusionWrapper.__init__()` through `_load_model()`, `compile_unet()`, and `EngineBuilder.build()`
- `--fp8` suffix for a separate engine cache (e.g. `...--controlnet--fp8--mode-img2img/unet.engine`)
- `builder.py` cleanup prevents ~14 GB of intermediate file bloat on Windows (ONNX weights + `onnx__*` tensor files)

Stacking
Stacks on `pr3/ipadapter-vram-deps` (which stacks on `pr1/inference-performance`). PR3 provides the onnx/onnxruntime/modelopt dependency pins that the FP8 imports require.

Commits (8)
- 61bcf86 — patch ByteSize() for >2 GB ONNX in modelopt FP8 quantization
- 7a02ae2 — reduce FP8 calibration batches 128→8 (KVO cache OOM)
- 402d619 — merge calibration list-of-dicts into stacked dict for modelopt
- 7c05f59 — add NVIDIA DLLs to PATH and retry without quantize_mha on ORT EP failure
- 5ba15af — use single calibration batch, cleanup intermediates on retry
- 6b0b99d — resolve 4 FP8 bugs for TRT 10.12 (STRONGLY_TYPED network, version-aware flags)
- 24da142 — prevent intermediate file bloat on Windows (two-pass cleanup, `onnx__*` early deletion)
- 670aec4 — add FP8 parameter flow: wrapper → compile_unet → engine path prefix

Files Modified
- `acceleration/tensorrt/fp8_quantize.py`: `quantize_onnx_fp8()`
- `acceleration/tensorrt/builder.py`
- `acceleration/tensorrt/utilities.py`: `_build_fp8()` raw TRT builder, `onnx__*` early cleanup, external data support
- `acceleration/tensorrt/models/models.py`: `get_dynamic_axes()` output axes for FP8 compatibility
- `acceleration/tensorrt/__init__.py`: `compile_unet()` extracts `fp8`/`calibration_data_fn` from `engine_build_options`
- `acceleration/tensorrt/engine_manager.py`: `--fp8` suffix in UNet engine path
- `wrapper.py`: `fp8=` param, `self.fp8`, passthrough to `_load_model()` and `get_engine_path()`

Test plan
- Engine path carries no `--fp8` suffix when FP8 is disabled (no regression)
- `StreamDiffusionWrapper(fp8=True)` generates an engine path with the `--fp8` suffix
- FP8 ONNX (`*.fp8.onnx`) preserved after build; intermediates cleaned
- Build outputs present: `unet.engine`, `unet.fp8.onnx`, `build_stats.json`

🤖 Generated with Claude Code