Sync with Microsoft ONNX Runtime - 18032026#980
Merged
ankitm3k merged 40 commits into ovep-develop on Mar 18, 2026
Bumps [tar](https://github.com/isaacs/node-tar) from 7.5.9 to 7.5.11. <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/isaacs/node-tar/commit/bf776f673164215074b62749e0fe80e5834588f4"><code>bf776f6</code></a> 7.5.11</li> <li><a href="https://github.com/isaacs/node-tar/commit/f48b5fa3b7985ddab96dc0f2125a4ffc9911b6ad"><code>f48b5fa</code></a> prevent escaping symlinks with drive-relative paths</li> <li><a href="https://github.com/isaacs/node-tar/commit/97cff15d3539a37a4095eb3d287147d9d77c2dc3"><code>97cff15</code></a> docs: more security info</li> <li><a href="https://github.com/isaacs/node-tar/commit/2b72abc1d47c3570e1ad95c9ab557fc4c2e6e4b1"><code>2b72abc</code></a> 7.5.10</li> <li><a href="https://github.com/isaacs/node-tar/commit/7bc755dd85e623c0279e08eb3784909e6d7e4b9f"><code>7bc755d</code></a> parse root off paths before sanitizing .. parts</li> <li><a href="https://github.com/isaacs/node-tar/commit/c8cb84629dee649feedde03f2f4ea48f2e44e778"><code>c8cb846</code></a> update deps</li> <li>See full diff in <a href="https://github.com/isaacs/node-tar/compare/v7.5.9...v7.5.11">compare view</a></li> </ul> </details> <br /> Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…soft#27587) ### Description Add support for `LoggingManager::HasDefaultLogger()`.
…icrosoft#27599) ## Summary This PR adds a new `OrtApi` entry point for reading repeated string attributes from `OrtKernelInfo`: - `KernelInfoGetAttributeArray_string` It also wires that support through the C++ wrapper so callers can use: - `Ort::ConstKernelInfo::GetAttributes<std::string>(...)` ## Problem The existing kernel info APIs already support scalar and array attribute access for numeric types, but there was no C API for reading string-array attributes from `OrtKernelInfo`. That created a gap for code paths that rely on repeated string attributes in kernel metadata, such as: - custom op / kernel consumers using the public C API - C++ wrapper callers expecting `GetAttributes<std::string>` to work end-to-end - plugin EP scenarios that need to compile existing kernels against the adapter/C API surface One concrete case is CUDA plugin EP RNN support, where the RNN kernels read the `activations` attribute via `GetAttrs<std::string>("activations", ...)`. The adapter path needed a corresponding ORT C API to expose that data. ## Changes ### C API Added `OrtApi::KernelInfoGetAttributeArray_string` to fetch repeated string attributes from `OrtKernelInfo`. Behavior: - If `out == nullptr`, the API returns the attribute count in `size`. - Otherwise, the API allocates the pointer array and each UTF-8 string with the provided `OrtAllocator`. - For empty attributes, `*out` is set to `nullptr` and `*size` is set to `0`. - The caller frees each string and the pointer array with the same allocator. ### Implementation Added the implementation in the ORT session/custom-op API layer by: - reading the underlying attribute with `OpKernelInfo::GetAttrs<std::string>` - copying the result into allocator-owned C-style string storage for the public API ### C++ wrapper Completed C++ wrapper support so `Ort::ConstKernelInfo::GetAttributes<std::string>(name)` works through the new C API. The wrapper follows the standard two-call pattern: 1. query the number of strings 2. 
allocate and fetch the returned string array 3. copy into `std::vector<std::string>` and release allocator-owned memory ### Tests Added framework tests covering: - non-empty string-array attributes - empty string-array attributes - missing attribute failure path - C++ wrapper access through `Ort::ConstKernelInfo` ## Files Changed - `include/onnxruntime/core/session/onnxruntime_c_api.h` - `include/onnxruntime/core/session/onnxruntime_cxx_api.h` - `include/onnxruntime/core/session/onnxruntime_cxx_inline.h` - `onnxruntime/core/session/custom_ops.cc` - `onnxruntime/core/session/onnxruntime_c_api.cc` - `onnxruntime/core/session/ort_apis.h` - `onnxruntime/test/framework/kernel_info_test.cc` ## Why This Change This closes a real API gap in kernel attribute access and makes the public API surface more consistent with the existing numeric attribute helpers. It also unblocks plugin/adapter-based kernel code that depends on repeated string attributes without requiring those kernels to special-case plugin builds. For example, porting rnn operator to cuda plugin EP will need this API. ## Validation Validated with new unit coverage in `kernel_info_test.cc` for: - `KernelInfoGetAttributeArray_string` with populated attributes - `KernelInfoGetAttributeArray_string` with empty attributes - missing-attribute error handling - `Ort::ConstKernelInfo::GetAttributes<std::string>` parity with the C API
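The two-call pattern the wrapper follows can be illustrated with a small, self-contained Python sketch. Here `fetch` is a hypothetical stand-in for the C API call (`out=None` queries the count; otherwise the buffer is filled), not the real binding:

```python
def get_string_attrs(fetch, name):
    """Two-call pattern: first query the element count, then fetch the strings.

    `fetch(name, out)` models the C API contract: with out=None it returns the
    attribute's element count; otherwise it fills `out` and returns the count
    actually written.
    """
    count = fetch(name, None)   # call 1: size query
    out = [None] * count
    fetch(name, out)            # call 2: fill the caller-provided buffer
    return out

# A toy backing store standing in for OrtKernelInfo attributes.
ATTRS = {"activations": ["Sigmoid", "Tanh"]}

def fake_fetch(name, out):
    values = ATTRS[name]
    if out is None:
        return len(values)
    out[:] = values
    return len(values)
```

In the real C API the second call also transfers ownership of allocator-owned memory, which the C++ wrapper copies into `std::vector<std::string>` and releases.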
…→fp32) patterns (microsoft#27614) ### Description Extends the QDQ selector-action `DQ → MatMul → MatMulNBits` fusion in two ways: **1. Support 2-bit and 8-bit quantized weights** The existing fusion only handled 4-bit (`Int4x2`/`UInt4x2`) DQ weights. This PR broadens it to also support 2-bit (`Int2x4`/`UInt2x4`) and 8-bit (`int8`/`uint8`) quantized weights. - qdq_selectors.cc: Added `Is2BitIntType`, `Is8BitIntType`, and `IsNBitsIntType` helpers. Updated `DQMatMulNodeGroupSelector::Check` to accept 2/4/8-bit weight types. - qdq_actions.cc: Added `DQWeightBits` and `IsDQWeightSigned` helpers to dispatch the correct bit-width and signedness for MLAS transpose and MatMulNBits attributes. - `q4_dq.cpp` (MLAS): Added 8-bit `GetElem`/`SetElem` specializations and an 8-bit `TransposeColumnWiseQuantized` path. Added 6 new template instantiations for 2-bit (signed/unsigned, float/float16) and 8-bit (signed/unsigned, float/float16). **2. Handle `Cast(fp16→fp32)` between DQ and MatMul (FP16 model fusion)** FP16 models often have `DQ(int4→fp16) → Cast(fp16→fp32) → MatMul(fp32)` patterns that the existing selector couldn't match. This PR adds a new `DQCastMatMulToMatMulNBitsSelector` / `DQCastMatMulToMatMulNBitsAction` pair that: - Matches the `DQ → Cast(fp16→fp32) → MatMul` pattern on input B. - Creates a `MatMulNBits` node operating in the DQ scale dtype (fp16). - Always inserts `Cast` on input A (to DQ dtype) and `Cast` on output (DQ dtype to MatMul output dtype), relying on ORT's existing `CastElimination` optimizer to remove redundant back-to-back casts in subsequent passes. - Removes the original DQ, Cast (on B), and MatMul nodes. ### Motivation and Context - Many quantized models (e.g., from Olive, AutoAWQ) use 2-bit or 8-bit quantization, but the `DQ → MatMulNBits` fusion only supported 4-bit weights, leaving these models unoptimized. 
- FP16 models produce `DQ(→fp16) → Cast(fp16→fp32) → MatMul` patterns because the DQ output type matches the scale type (fp16), but the MatMul operates in fp32. Without handling the intermediate Cast, the fusion was blocked entirely for these models.
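To make the sub-byte weight layout concrete, here is a minimal pure-Python sketch of `Int4x2`-style packing plus the DQ formula the fusion absorbs into `MatMulNBits`. The low-nibble-first layout and helper names are illustrative, not the MLAS implementation:

```python
def pack_int4x2(values):
    # Pack pairs of unsigned 4-bit values into bytes, low nibble first
    # (illustrative layout; two elements share one byte).
    if len(values) % 2:
        values = values + [0]
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))

def unpack_uint4(packed, count):
    # Inverse of pack_int4x2: split each byte back into two 4-bit elements.
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out[:count]

def dequantize(q, scale, zero_point):
    # DequantizeLinear per element: (q - zero_point) * scale.
    return [(v - zero_point) * scale for v in q]
```

The 2-bit case packs four elements per byte and the 8-bit case is one element per byte, which is what the new `Is2BitIntType`/`Is8BitIntType` dispatch distinguishes.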
--------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…t#27597) ## Description Fix out-of-bounds read in the RotaryEmbedding operator when user-provided `position_ids` values exceed the cos/sin cache bounds (`max_sequence_length`). ### Problem When `position_ids` contains values that are negative or >= `max_sequence_length`, the kernel computes `cache_offset = position_id * half_rotary_embedding_dim` and reads out-of-bounds from `cos_cache` / `sin_cache`. This can cause undefined behavior (incorrect results, crashes, or memory corruption). ### Fix **CPU (`rotary_embedding.cc`):** - Added upfront validation of all `position_ids` values before the parallel computation loop. Returns an `INVALID_ARGUMENT` error if any value is out of range `[0, max_sequence_length)`. - Validation is only applied when `position_ids_format != 0` (i.e., when position_ids are explicitly provided). When `position_ids` is not provided (format 0), the cache is shaped `(B, S, H/2)` and the index `b * S + s` is always in-bounds by construction. **CUDA (`rotary_embedding_impl.cu`):** - Plumbed the previously-unused `max_sequence_length` parameter through to the kernel. - Added a bounds check inside the `position_ids_format != 0` branch. Out-of-bounds position IDs cause the kernel to pass through the input unchanged (errors cannot be propagated from GPU kernels). - The bounds check is scoped to the `position_ids_format != 0` branch only. When format is 0 (no position_ids), the cache is `(B*S, H/2)` and `b_s_index = b * S + s` is deterministically valid — applying the check unconditionally would incorrectly reject all batches beyond the first since `max_sequence_length == sequence_length` in that case. 
### Tests Added three CPU test cases for the ONNX domain `RotaryEmbedding` op: - `RotaryEmbedding_PositionIds_ExceedsMaxSeqLen` — position_id far exceeding cache size - `RotaryEmbedding_PositionIds_Negative` — negative position_id - `RotaryEmbedding_PositionIds_OOB_InBatch` — OOB position_id in a multi-batch, multi-sequence scenario ### Motivation and Context Security hardening — prevent out-of-bounds memory access from untrusted model inputs.
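The CPU-side fix amounts to a range check over all position ids before any cache indexing happens; a minimal sketch (function name hypothetical, mirroring the `[0, max_sequence_length)` rule above):

```python
def validate_position_ids(position_ids, max_sequence_length):
    """Upfront bounds check: every position id must lie in
    [0, max_sequence_length) before cache_offset = pid * half_dim is used."""
    for pid in position_ids:
        if pid < 0 or pid >= max_sequence_length:
            raise ValueError(
                f"position_id {pid} out of range [0, {max_sequence_length})")
```

On CUDA the same predicate runs inside the kernel, but since a GPU thread cannot raise, an out-of-range id instead makes the kernel pass the input through unchanged.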
### Description Forge/config folders (e.g. `.github`) should not be ignored by tooling such as `rg`, `fd`, etc. Mark these as not-ignored via `.ignore`. ### Motivation and Context Having to put `-uu` on every `rg` search for references in pipelines/workflows defeats the purpose of ignoring hidden paths.
## Description This PR implements a device-side bounds check for `batch_indices` in the RoiAlign CUDA operator. This is a follow-up to microsoft#27543, which fixed the same vulnerability in the CPU implementation. Previously, CheckROIAlignValidInput() only validated `batch_indices` when they were accessible on the host (CPU). For the CUDA EP, `batch_indices` reside in GPU memory, so host-side validation would require an expensive GPU-to-CPU copy, which could also break CUDA graph capture. This change: 1. Passes `batch_size` from the host to the CUDA kernel. 2. Adds a check within the `RoIAlignForward` kernel to ensure `0 <= batch_index < batch_size`. 3. If an invalid `batch_index` is encountered, the kernel sets the output value for that specific RoI element to 0 and returns early for that thread. ## Impact - **Vulnerability fixed:** Heap out-of-bounds read on GPU. - **Performance:** Negligible impact as it's a simple range check within the existing kernel. - **Compatibility:** No changes to ONNX models or public APIs. ## Validation - Existing `RoiAlignTest` suite. - Added two new test cases: `BatchIndicesOutOfRange_CUDA` and `BatchIndicesNegative_CUDA` to verify that the CUDA provider correctly handles out-of-range batch indices. - Verified that the CUDA provider handles opset 10 without falling back to the CPU EP for these tests.
…rtDevice` aliases (microsoft#27594) # Summary This change replaces the Python-side PCI vendor ID constants in `onnxruntime_inference_collection.py` with a public `OrtDeviceVendorId` enum, exports that enum from `onnxruntime.__init__`, and continues using vendor-aware `OrtDevice` construction for well-known aliases like `"cuda"`, `"dml"`, and `"cann"`. The goal is to make vendor IDs reusable across Python APIs without duplicating raw integer constants while still fixing the plugin EP allocator lookup issue for `OrtValue.ortvalue_from_numpy(..., "cuda", ...)`. # Problem The Python wrapper now needs vendor IDs in more than one place. Keeping them as standalone integer constants is workable for a narrow fix, but it does not give Python callers a clear public API for vendor identity. As more APIs accept or return vendor IDs, users would have to either: - remember the raw PCI ID values - depend on private implementation details - repeat ad hoc constants in their own code That is not a good public surface for something that is now part of regular Python device construction flows. # Why We Need This Change Dynamic EP registration is intended to let a Python package gain hardware capability without requiring that hardware support be built into the package itself. That only works if the Python-side device description matches the device identity used by dynamically registered EP allocators and data transfers.
Without the underlying vendor-aware alias behavior: - registering the CUDA plugin library succeeds - sessions can use `CudaPluginExecutionProvider` - but Python cannot create CUDA `OrtValue`s with `OrtValue.ortvalue_from_numpy(..., "cuda", 0)` At the same time, without a public enum: - vendor IDs remain scattered as raw integers - Python callers do not have a clean symbolic way to specify vendor-specific `"gpu"` / `"npu"` devices - future vendor-aware APIs would keep expanding the same constant-style pattern # Example Use Case Our immediate use case is the CUDA plugin EP flow from Python. We register `libonnxruntime_providers_cuda_plugin.so` from Python and create sessions with `CudaPluginExecutionProvider`. That part works. Stage 4 of the plugin flow needs to create GPU-resident `OrtValue`s: ```python onnxruntime.OrtValue.ortvalue_from_numpy(array, "cuda", 0) ``` Before the vendor-aware alias fix, that failed in a CPU-only Python package even after the CUDA plugin was registered, because the Python wrapper constructed a generic GPU `OrtDevice` without the NVIDIA vendor ID. With this change, Python also has a public enum for vendor IDs, so callers can write explicit vendor-aware code when using generic device names: ```python onnxruntime.OrtValue.ortvalue_from_numpy( array, "gpu", 0, onnxruntime.OrtDeviceVendorId.NVIDIA, ) ``` # Fix The change does three things: 1. Replace the Python-side PCI vendor ID constants with an `IntEnum` named `OrtDeviceVendorId`. 2. Export `OrtDeviceVendorId` from `onnxruntime.__init__` so it is part of the public Python API. 3. Keep the vendor-aware alias behavior in `OrtDevice.make(...)` so that the historical shorthand aliases: - `"cuda"` -> `OrtDeviceVendorId.NVIDIA` - `"dml"` -> `OrtDeviceVendorId.MICROSOFT` - `"cann"` -> `OrtDeviceVendorId.HUAWEI` use the 4-argument `C.OrtDevice(...)` constructor with an explicit vendor ID. 
Generic device names like `"gpu"` and `"npu"` continue to behave as before unless the caller explicitly provides a vendor ID, and callers can now use either an integer or `OrtDeviceVendorId`. # Why This Approach An enum is the better public API here because it: - keeps Python aligned with the core runtime vendor ID definitions in `ortdevice.h` - preserves integer compatibility because `IntEnum` still works naturally with the pybind layer - gives users readable, discoverable names instead of undocumented raw PCI IDs - scales better as vendor-aware device APIs become more common This keeps the original plugin fix intact while improving the Python API shape instead of just adding more module constants. # Validation Validated in the Python layer by: - confirming the new enum-based implementation preserves vendor-aware alias handling for `"cuda"`, `"dml"`, and `"cann"` - exporting `OrtDeviceVendorId` from the top-level `onnxruntime` package - adding Python test coverage that checks `OrtDevice.make("cuda", 0)` resolves to the NVIDIA vendor ID via the enum - running `python -m compileall` on the updated Python files Targeted pytest execution could not be completed in this workspace because the local source tree does not provide an importable `onnxruntime.capi` module without a built package. # Notes This PR keeps backward compatibility for existing Python call sites: - shorthand aliases like `"cuda"` continue to work - explicit `vendor_id` arguments can still be passed as integers - callers now also have the option to use `onnxruntime.OrtDeviceVendorId`
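For illustration, the enum and alias mapping described above can be sketched as follows. The member names follow the PR text; the values are the well-known PCI vendor IDs for these vendors, but the authoritative definitions live in `ortdevice.h` and the real mapping in `OrtDevice.make(...)`:

```python
from enum import IntEnum

class OrtDeviceVendorId(IntEnum):
    # Well-known PCI vendor IDs; illustrative subset of the public enum.
    NVIDIA = 0x10DE
    MICROSOFT = 0x1414
    HUAWEI = 0x19E5

# Historical shorthand aliases resolve to an explicit vendor ID.
ALIAS_VENDOR = {
    "cuda": OrtDeviceVendorId.NVIDIA,
    "dml": OrtDeviceVendorId.MICROSOFT,
    "cann": OrtDeviceVendorId.HUAWEI,
}
```

Because `IntEnum` members are real integers, they pass through the pybind layer unchanged, which is what preserves compatibility with call sites that pass plain `int` vendor IDs.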
### Description Use the `_tpause` function defined in `waitpkgintrin.h` instead of calling the compiler built-in (`__builtin_ia32_tpause`) directly. ### Motivation and Context The [`_tpause`][intel-intrinsics-guide] intrinsic is compiler-independent, whereas the underlying built-in `__builtin_ia32_tpause` varies by compiler, so it is advisable not to call the built-in directly. For example, [GCC][waitpkgintrin-gcc] and [LLVM][waitpkgintrin-llvm] declare it with different arguments, leading to portability issues. [intel-intrinsics-guide]: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=tpause&techs=Other&ig_expand=6888 [waitpkgintrin-gcc]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/waitpkgintrin.h;h=42c6b0cd02866eccdfe3308f4792f17fe8c6ae38;hb=HEAD#l51 [waitpkgintrin-llvm]: https://github.com/llvm/llvm-project/blob/a682073ae7a49de4b95498ba01b9ea32e6b5f607/clang/lib/Headers/waitpkgintrin.h#L33-L38
…rite (microsoft#27544) ### Description This pull request refactors several tensor operation kernels (`GatherND`, `ScatterND`, and `GatherGrad`) to improve type safety and consistency in parallelized code execution. The main change is replacing `int` loop indices with `ptrdiff_t` to avoid overflow. ### Parallelization and Type Safety Improvements * Updated lambda functions and parallel loop indices in `gather_nd.cc` (`GatherNDBase::PrepareForCompute`, `GatherND::GatherNumber`, and `GatherND::GatherString`) to use `ptrdiff_t` instead of `int64_t`, and replaced index arithmetic with explicit casts to maintain correctness. [[1]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL96-R100) [[2]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL121-R121) [[3]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL192-R216) * Refactored `scatter_nd.cc` (`ScatterNDDispatchTarget`) to use `ptrdiff_t` for loop indices and index arithmetic in all reduction cases, ensuring consistent type usage in parallel execution. * Modified `gather_grad.cc` (`GatherGrad::ComputeImpl`) to use `ptrdiff_t` for parallel loop indices, aligning with the changes in other tensor kernels. ### Motivation and Context The same class of issue was previously fixed in microsoft#27444.
…environments (microsoft#27591) ### Description GPU device discovery on Linux relies exclusively on `/sys/class/drm/cardN` entries (DRM subsystem). In AKS/Kubernetes containers, `nvidia-drm` is typically not loaded—only the base NVIDIA driver is needed for CUDA compute. No DRM entries means no `OrtHardwareDevice` with `OrtHardwareDeviceType_GPU` is created, so `GetEpDevices` never matches the CUDA EP. Adds a fallback path in `GetGpuDevices()` that scans `/sys/bus/pci/devices/` when DRM yields zero GPUs: - **`DetectGpuPciPaths()`** — enumerates PCI devices, filters by class code `0x0300` (VGA) and `0x0302` (3D controller, used by NVIDIA datacenter GPUs) per the [PCI Code and ID Assignment Specification](https://pcisig.com/pci-code-and-id-assignment-specification-agreement) (base class 03h). Accepts an injectable sysfs root path for testability. - **`GetGpuDeviceFromPci()`** — reads `vendor`/`device` files directly from the PCI device sysfs path and populates `OrtHardwareDevice` with `pci_bus_id` and discrete GPU metadata. Note: `card_idx` is intentionally omitted from PCI-discovered devices since `directory_iterator` traversal order is unspecified and cannot be made consistent with DRM's `cardN` ordering. - **`GetGpuDevices()`** — tries DRM first; if empty, falls back to PCI scan The PCI detection functions are exposed via a new `onnxruntime::pci_device_discovery` namespace (declared in `core/platform/linux/pci_device_discovery.h`) so they can be tested hermetically with fake sysfs directories. The fallback only activates when DRM finds nothing, so no behavioral change on systems where DRM works. Also adds: - A cross-platform `GpuDevicesHaveValidProperties` test that validates GPU device type and vendor ID when GPUs are present. The test intentionally does not assert on `device_id` since some platforms (e.g., Apple Silicon) do not populate it. 
- Comprehensive hermetic Linux unit tests (`test/platform/linux/pci_device_discovery_test.cc`) that create fake sysfs directory structures to exercise the PCI fallback path, covering VGA/3D controller detection, non-GPU filtering, empty/missing paths, multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata. Tests use the `ASSERT_STATUS_OK()` macro from `test/util/include/asserts.h` and use `CreateFakePciDevice` to set up complete fake PCI device directories for both `DetectGpuPciPaths` and `GetGpuDeviceFromPci` tests. ### Motivation and Context CUDA EP registration fails on AKS (Azure Kubernetes Service) because the NVIDIA device plugin exposes GPUs via `/dev/nvidia*` and the NVIDIA driver, but does not load `nvidia-drm`. The existing `/sys/class/drm`-only detection path returns no GPU devices, blocking `GetEpDevices` from returning the CUDA EP. The same setup works on bare-metal Linux where DRM is loaded. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com> Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description Fixes multiple crash and memory-leak issues. 1. Fix an uncommon situation where `BucketCacheManager` may hold pending buffers while the WebGPU context is being cleaned up, causing a memory leak. 2. Change the WebGPU default instance from a RAII wrapper (wgpu::Instance) to a raw pointer (WGPUInstance) so that it is not destructed automatically at process exit, which could crash by invoking DXC code after dxcompiler.dll has already been unloaded. 3. Fix a crash when the default ORT logger is destructed before a WebGPU device, by guarding the device callbacks with the condition `logging::LoggingManager::HasDefaultLogger()`. Also includes a few fixes related to the Node.js binding. 1. `OrtEnv` was held in a function-local variable. This is problematic because the destruction of OrtEnv may happen too late, after some DLLs are already unloaded. (The order of DLL unloading at process exit is not fully controllable.) Changed it to: - if OrtEnv is constructed on the main thread, a cleanup hook is registered for when Node.js starts to exit. If the callback is not called (e.g. an uncaught exception is thrown), the OrtEnv will not be released. - if OrtEnv is constructed on a worker thread, just leave it and allow it to leak at exit. 2. Because of (1), if OrtEnv is already released, do not release any active sessions (they are object wraps that are destructed later than registered hooks). The changes above cover the different scenarios and ensure: - if any resource is intentionally leaked, it is only at process exit. - outside of process exit, resource lifecycles are managed correctly. - best effort (but no guarantee) to release resources safely, to be friendly to memory leak detectors. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…om .cc to headers (microsoft#27617) ## Description This PR refactors several CPU operator helper functions by moving their implementations from `.cc` files into `.h` headers, using the `#ifdef SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for the **CUDA Plugin EP** work, where CUDA kernels are built into a standalone shared library (`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the CPU provider's `.cc` object files. ### Why This Refactoring Is Needed The CUDA Plugin EP compiles CUDA operator kernels into a separate shared library that communicates with the ORT core through the ORT EP Plugin API. In this architecture, kernel source files **cannot** depend on framework-internal symbols that live in the CPU provider static library (`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base classes and call shared helper/validation methods (e.g., `SliceBase::PrepareForCompute`, `SplitBase::PrepareForCompute`, `ScatterND::ValidateShapes`, `TileOp::IsTileMemcpy`, `PadBase::ComputePads`) whose implementations currently live in CPU `.cc` files. In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are accessed through the `ProviderHostCPU` DLL-boundary virtual table bridge. However, the plugin EP does not use this bridge — it uses EP API adapters and force-included headers instead. To make these helpers available in the plugin build without duplicating code, this PR moves the implementations into headers as `inline` functions under `#ifndef SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path retains the existing declaration-only signatures that route through `ProviderHostCPU`. This pattern has already been successfully applied to other operators (e.g., `Einsum`). This PR extends it to the remaining operators that need it. 
## Summary of Changes ### Helper functions moved from `.cc` to `.h` (inline under `#ifndef SHARED_PROVIDER`) | Operator | File | Functions Moved | |----------|------|-----------------| | **Slice** | `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`, `SliceBase::PrepareForCompute` (both overloads), `SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` | | **Split** | `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` | | **ScatterND** | `cpu/tensor/scatter_nd.h` | `ScatterND::ValidateShapes` | | **Tile** | `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` | | **Pad** | `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new template method replacing `ComputePads` for cross-context compatibility) | | **BiasGelu** | `contrib_ops/cpu/bert/bias_gelu_helper.h` | `bias_gelu_helper::CheckInputs` (templatized on context type) | | **EmbedLayerNorm** | `contrib_ops/cpu/bert/embed_layer_norm_helper.h` | `embed_layer_norm::CheckInputs` (templatized on context type) | | **NonMaxSuppression** | `cpu/object_detection/non_max_suppression.h` + new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored into `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>` template for plugin compatibility | ### Deleted `.cc` files (implementations moved to headers) - `contrib_ops/cpu/bert/bias_gelu_helper.cc` - `contrib_ops/cpu/bert/embed_layer_norm_helper.cc` ### Provider bridge additions - Added `Tensor::DataAsSpan<int32_t>()` support through the shared provider interface (`provider_interfaces.h`, `provider_wrappedtypes.h`, `provider_bridge_ort.cc`). This was needed because `slice_detail::CopyInputData<int32_t>` calls `Tensor::DataAsSpan<int32_t>()`, which was not previously bridged. ### CUDA-side updates - `cuda/tensor/slice.h`: Updated `Slice` constructor to use the new `SliceBase(info, dynamic, 0)` overload (template-based constructor compatible with both adapter and real `OpKernelInfo`). 
- `cuda/tensor/pad.cc`: Updated call from `PadBase::ComputePads` to `PadBase::ComputePadsImpl`. - `cuda/tensor/scatter_nd.cc`: Templatized `InitializeElementCountsAndInputDimsSpanOrGpu` on `KernelContextType` (also fixed typo: `InitiliazeElement...` → `InitializeElement...`). - `cuda/object_detection/non_max_suppression.h`: Updated to use `NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext>` instead of `NonMaxSuppressionBase`. ### New file - `cpu/object_detection/non_max_suppression_helper.h`: Contains the template-based `NonMaxSuppressionBaseImpl` class, separating it from the CPU-specific `NonMaxSuppression` kernel registration. ## Testing - Existing unit tests cover all affected operators (Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, NonMaxSuppression). - No behavioral changes — all function logic is identical; only the location (header vs. source) and linkage (inline vs. external) changed. - The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged — declarations remain and route through the existing `ProviderHostCPU` bridge. ## Motivation and Context This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels as a standalone shared library that can be updated independently of the ORT core. The refactoring enables ~10 additional CUDA operators to compile in the plugin build by making their CPU-side validation and preparation helpers available as header-inline functions.
…sembly. (microsoft#27575) ### Description Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration. ### Motivation and Context Performance improvements observed in prompt processing: - 14% speedup (batch size 1) - 6% speedup (batch size 4) - 4% speedup (batch size 8) Tested with granite-3.1-8b --------- Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>
### Description Update all builds to C++20. Previously, C++20 was only enabled for macOS builds. ### Motivation and Context Update to the C++20 standard to enable use of new language and standard library features.
### Description Fix integer overflow in Col2Im shape calculation by using SafeInt for attacker-controlled dimension arithmetic.
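A SafeInt-style checked multiply behaves like this sketch: it raises on overflow instead of silently wrapping, which is the property the Col2Im shape math needs for attacker-controlled dimensions (illustrative, not the actual SafeInt library):

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

def checked_mul(a, b):
    """Multiply two dimension values, raising instead of wrapping if the
    mathematically exact product does not fit in a signed 64-bit integer."""
    result = a * b  # Python ints are arbitrary precision, so this is exact
    if not (INT64_MIN <= result <= INT64_MAX):
        raise OverflowError(f"{a} * {b} overflows int64")
    return result
```

In C++, SafeInt achieves the same effect at each arithmetic step, turning would-be wraparound into an error the operator can surface as an invalid-input status.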
…ft#27556) ### Description Add `qwen3` to the Python transformer optimizer's model type registry, enabling graph optimization for Qwen3 models (e.g., Qwen3-Embedding-0.6B, ranked 4th on MTEB). ### Motivation Fixes microsoft#25083 Running `optimum-cli export onnx --optimize O3` on Qwen3 models fails with: ``` ValueError: Unsupported model type: qwen3 ``` This PR resolves that by registering the model type and fixing a fusion gap that blocked normalization fusions. ### Changes **Model type registration** (`optimizer.py`): - Add `"qwen3": (Gpt2OnnxModel, "pytorch", 0)` to `MODEL_TYPES` - Uses `Gpt2OnnxModel` (not `BertOnnxModel`) because its `fuse_attention()` calls `FusionRotaryAttention`, which searches on `SkipSimplifiedLayerNormalization` anchors — needed for RMSNorm-based models **Fusion option defaults** (`fusion_options.py`): - Disable `EmbedLayerNormalization` (decoder-only, no BERT-style embedding) - Set `AttentionMaskFormat.NoMask` (causal masking is implicit) **SkipLayerNormalization fusion fallback** (`fusion_skiplayernorm.py`): - When symbolic shape inference fails (common with dynamo-exported models), the fusion previously returned early, skipping all `SkipLayerNormalization` / `SkipSimplifiedLayerNormalization` fusions - Now it falls through with the safe default `skip_index=1` (second Add input is skip), since both inputs are already verified as non-initializer dynamic tensors (lines 88-90) - This enables `SkipSimplifiedLayerNormalization` fusion on Qwen3 models where shape inference fails **Test** (`test_attention_fusion.py`, `qwen3_model_generator.py`): - Synthetic Qwen3 decoder layer graph with pre-attention RMSNorm, Q/K/V projections, QK-Norm, simplified attention, output projection, residual connection, and post-attention RMSNorm - Verifies 3× `SimplifiedLayerNormalization` (pre-attn, Q-norm, K-norm) + 1× `SkipSimplifiedLayerNormalization` (residual + post-attn RMSNorm) **Verified on real model**: Running the optimizer on an exported 
Qwen3-Embedding-0.6B (2-layer) reduces nodes from 208 → 150 (28% reduction). All 9 RMSNorm patterns fuse correctly: 5× `SimplifiedLayerNormalization` + 4× `SkipSimplifiedLayerNormalization`. **Scope note**: Full RotaryEmbedding + MultiHeadAttention fusion for Qwen3's dynamo-exported graphs requires additional pattern matching work (static Slice indices, on-the-fly sin/cos computation, QK-Norm in Q/K paths, GQA expansion). That will be addressed in a follow-up PR. ### Test Plan - [x] `test_attention_fusion.py::TestFusion::test_qwen3_normalization_fusion` passes - [x] All 14 existing tests in `test_attention_fusion.py` pass (no regressions) - [x] All 4 tests in `test_optimizer_huggingface_bert.py` pass (bert, distillbert, roberta, xlm_roberta — no regressions from the SkipLayerNorm fallback change) - [x] `lintrunner -a` clean
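The registration change above can be sketched as a minimal dispatch table. Class names are simplified strings here (the real `MODEL_TYPES` in `optimizer.py` maps to an optimizer class, a framework tag, and a default opt level); only the shape of the fix is shown.

```python
# Minimal sketch of the optimizer's model-type dispatch.
MODEL_TYPES = {
    "bert": ("BertOnnxModel", "pytorch", 0),
    "gpt2": ("Gpt2OnnxModel", "pytorch", 0),
    # The fix: route qwen3 through the GPT-2 optimizer so fuse_attention()
    # searches on SkipSimplifiedLayerNormalization anchors (RMSNorm models).
    "qwen3": ("Gpt2OnnxModel", "pytorch", 0),
}

def select_optimizer(model_type: str):
    """Mirrors the lookup that raised 'Unsupported model type: qwen3'."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"Unsupported model type: {model_type}")
    return MODEL_TYPES[model_type]
```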
…operator (microsoft#27201) ### Description 1. Supports volumetric input grid sampling in the CUDA EP `GridSample` operator, i.e., 5-D input tensors, a.k.a. 3-D spatial data 2. Registers the CUDA `GridSample` operator for opsets 20 and 22 3. Supports both NCHW and NHWC layouts for volumetric inputs 4. Does not support `cubic` mode for volumetric inputs for now; this is consistent with the CPU implementation and hence does not cause a functional regression, i.e., `cubic` mode for 3-D spatial data is not supported on CPU or CUDA before or after this change. This is a TODO for the future. 5. There are enough unit tests in `grid_sample_test.cc` to cover the volumetric input case, and they run in both NCHW (NCDHW for the volumetric case) and NHWC (NDHWC for the volumetric case) layouts for the CUDA EP ### Motivation and Context Resolve microsoft#21382 Resolve microsoft#18942 Resolve microsoft#16581 Resolve microsoft#18313 Related CPU PRs (for opset 20 and opset 22): microsoft#17744 && microsoft#23344
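The shape rule that the volumetric support generalizes can be stated compactly. The sketch below encodes the ONNX `GridSample` output-shape semantics for both the 4-D (NCHW) and the new 5-D (NCDHW) cases; it is a shape-only illustration, not the sampling kernel.

```python
def grid_sample_output_shape(input_shape, grid_shape):
    """ONNX GridSample shape rule: input is (N, C, *spatial), grid is
    (N, *out_spatial, spatial_rank), output is (N, C, *out_spatial).
    Covers 4-D spatial (rank 2) and 5-D volumetric (rank 3) inputs."""
    n, c = input_shape[0], input_shape[1]
    spatial_rank = len(input_shape) - 2          # 2 for NCHW, 3 for NCDHW
    assert grid_shape[0] == n                    # batch dims must match
    assert grid_shape[-1] == spatial_rank        # grid's last dim holds coords
    assert len(grid_shape) == spatial_rank + 2
    return [n, c, *grid_shape[1:-1]]
```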
…-build kernel safety (microsoft#27613) This pull request introduces several optimizations and safety improvements to CUDA kernels used in attention, rotary embedding, and tensor scatter operations for ONNX Runtime's LLM support. The main focus is on reducing decode overhead, improving memory safety, and ensuring correct handling of edge cases in mask and index validation. The most important changes are grouped below by theme. ### Flash Attention & KV Cache Optimization * Replaced the previous pattern of zero-filling and strided copy for KV cache updates with a single fused kernel (`LaunchConcatNewToPastKV`), eliminating redundant memory writes and reducing decode overhead in `attention.cc`. This streamlines the cache update process and improves performance. [[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL313-R350) [[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL396-R436) * Updated the documentation and comments to clarify the new fused kernel approach and its performance benefits, as well as the handling of sequence lengths for cache and mask conversion. [[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL190-R192) [[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL417-R446) ### Mask Validation & Handling * Improved mask validation in `attention_mask_impl.cu` and `attention_mask_impl.h` by clarifying that CUDA_KERNEL_ASSERT is only active in debug builds. In release builds, non-contiguous masks produce safe output by counting only leading True values, ensuring memory safety and correctness even with invalid masks. 
[[1]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L11-R22) [[2]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49R100) [[3]](diffhunk://#diff-8aa9a15a92d7dc138346dce5de055911895d940ba2183b4ba45bd95ac0e5bfc9L32-R35) ### Rotary Embedding Improvements * Switched rotary embedding kernel dispatch to use `OrtToCudaType` for BFloat16, enabling native hardware arithmetic (`__nv_bfloat16`) on SM80+ GPUs for improved performance and correctness. [[1]](diffhunk://#diff-411fdb2010086b3a0ad9b048bb0d0fd7721a0e8d33d9ad396d709254973448c2R5) [[2]](diffhunk://#diff-411fdb2010086b3a0ad9b048bb0d0fd7721a0e8d33d9ad396d709254973448c2L67-R71) [[3]](diffhunk://#diff-b0846d38debfc56c4c9fbb52ae7a201323ec1eab36853cf3627838fce4bb98feR13) [[4]](diffhunk://#diff-b0846d38debfc56c4c9fbb52ae7a201323ec1eab36853cf3627838fce4bb98feR168-R176) * Added explicit kernel instantiation for `__nv_bfloat16` in rotary embedding implementation, ensuring proper support for native CUDA types. ### TensorScatter Safety Enhancements * Enhanced validation and memory safety for `write_indices` in tensor scatter operations by adding in-kernel clamping of invalid indices and clarifying behavior in comments. This prevents out-of-bounds writes and preserves CUDA graph compatibility. [[1]](diffhunk://#diff-d69233ff3987fe3093132a31710b6b64cc0a32140e2a5a415a2f1f0907bd22d2L75-R80) [[2]](diffhunk://#diff-1694a04b8ba9963cc06d651ec6a3be8aa9cb2bcb73c2438dc251ca8cdcb2eb41R32-R40) These changes collectively improve performance, robustness, and safety for CUDA-based LLM operations in ONNX Runtime. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
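The release-build mask semantics described above ("count only leading True values") can be sketched in NumPy. This is an illustration of the described behavior, not the CUDA kernel: after the first False in a row, everything is ignored, so any index derived from the count stays in bounds even for a malformed, non-contiguous mask.

```python
import numpy as np

def leading_true_lengths(mask: np.ndarray) -> np.ndarray:
    """Per-row count of the leading run of True values in a 0/1 attention
    mask. Sketch of the safe release-build interpretation described above."""
    # cumprod turns every element after the first False into 0
    cumulative = np.cumprod(mask.astype(bool), axis=-1)
    return cumulative.sum(axis=-1)
```

A non-contiguous row such as `[1, 1, 0, 1]` yields length 2 rather than an out-of-range 3, which is the memory-safety property the kernel relies on.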
…ty (microsoft#27637) ### Description Accept pre-1.25 names `"WebGPU_Buffer"`/`"WebNN_Tensor"` as aliases in `CreateMemoryInfo` and normalize them to the current short names `"WebGPU_Buf"`/`"WebNN_Ten"`. This is the **reverse** of microsoft#27475 (which added forward compatibility in the 1.24.x patch branch). ### Motivation and Context Released onnxruntime-genai still uses the old (pre-1.25) long names when calling `CreateMemoryInfo`. Without this change, those calls fail with `ORT_INVALID_ARGUMENT` on main branch. ### Key Design Decision When an old name is detected, it is **normalized** to the current short constant (e.g., `"WebGPU_Buffer"` -> `"WebGPU_Buf"`). This is critical because downstream code (e.g., `external_data_loader.cc`, `webgpu_context.cc`) compares `OrtMemoryInfo.name` against the current constants. Simply passing through the old name would cause those comparisons to fail. ### Changes - `onnxruntime/core/framework/allocator.cc`: Accept and normalize legacy names in `CreateMemoryInfo` - `onnxruntime/test/shared_lib/test_allocator.cc`: Add test verifying legacy names are accepted and normalized ### See Also - microsoft#27207 (original rename) - microsoft#27475 (forward compat in 1.24.x) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
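The normalization step is the crux of the design decision above. A minimal sketch (the alias table uses the names from this PR; the function itself is illustrative, not the C API):

```python
# Legacy (pre-1.25) long names are normalized to the current short constants
# so that downstream comparisons against the short names keep working.
LEGACY_NAME_ALIASES = {
    "WebGPU_Buffer": "WebGPU_Buf",
    "WebNN_Tensor": "WebNN_Ten",
}

def normalize_memory_info_name(name: str) -> str:
    """Accept a legacy alias and return the canonical short name;
    pass current names through unchanged."""
    return LEGACY_NAME_ALIASES.get(name, name)
```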
### Description Added the `-Wl,-z,max-page-size=16384` linker flag to the `onnxruntimejsi` target in `js/react_native/android/CMakeLists.txt` to support **16KB memory page sizes**. ### **Motivation and Context** Starting with **Android 15**, Google Play requires apps to be compatible with 16KB page size devices. By default, the Android NDK builds shared libraries with 4KB ELF segment alignment. Without this explicit flag, `libonnxruntimejsi.so` fails the Play Store's 16KB alignment check, leading to potential crashes or installation failures on supported hardware. ### **Changes** - Updated `js/react_native/android/CMakeLists.txt`. - Applied `target_link_options` to `onnxruntimejsi` to enforce `max-page-size=16384` (16KB). ### **References** - [[Android Developer Guide: Support 16 KB page sizes](https://developer.android.com/guide/practices/page-sizes#fix-cmake)](https://developer.android.com/guide/practices/page-sizes#fix-cmake) @fs-eire
…27602) We have some warnings for unused functions on Linux and a sprintf warning on Windows that are blocking our CI.
…sion.py (microsoft#27642) # Description This PR addresses a build error and subsequent test failures related to recent changes in GridSample and the transformer optimizer. Related PRs: microsoft#27201, microsoft#27556. ## Changes ### 1. Fix GridSample Build Error - Removed an unused local variable `mode_str` in `onnxruntime/core/providers/cuda/tensor/grid_sample.cc` that was causing a warning (treated as error) about shadowing a member variable. - Ref: [`grid_sample.cc`](https://github.com/microsoft/onnxruntime/blob/c979a2407f/onnxruntime/core/providers/cuda/tensor/grid_sample.cc#L54) ### 2. Update GridSample Tests - Updated `onnxruntime/test/providers/cpu/tensor/grid_sample_test_custom.inc` to use default execution providers in `RunTests` instead of a hardcoded opset version, ensuring compatibility across different environments. ### 3. Revert Transformer Fusion Fallback - Reverted a recent change in `onnxruntime/python/tools/transformers/fusion_skiplayernorm.py` that enabled a fallback for `SkipLayerNormalization` fusion when symbolic shape inference fails. - This revert was necessary to avoid regressions in GPT-2 tests where model definitions contain typos that intentionally (or coincidentally) break shape inference. - Ref: [`fusion_skiplayernorm.py`](https://github.com/microsoft/onnxruntime/blob/c979a2407f/onnxruntime/python/tools/transformers/fusion_skiplayernorm.py#L113) ### 4. Restore Transformer Test Parity - Updated `onnxruntime/test/python/transformers/test_attention_fusion.py` specifically `test_qwen3_normalization_fusion` to match the expected node counts after reverting the fusion fallback. - Ref: [`test_attention_fusion.py`](https://github.com/microsoft/onnxruntime/blob/c979a2407f/onnxruntime/test/python/transformers/test_attention_fusion.py#L398) ## Verification - `build_cuda.sh` completed successfully. - `onnxruntime/test/python/transformers/test_attention_fusion.py` passes with "OK". - `lintrunner -a` reports no issues.
…de from .cc to headers (Part 2) (microsoft#27628) ## Description This PR continues the refactoring effort started in PR microsoft#27617, moving additional CPU operator helper function implementations from `.cc` files into `.h` headers using the `#ifdef SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for the **CUDA Plugin EP** work, where CUDA kernels are built into a standalone shared library (`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the CPU provider's `.cc` object files. ### Why This Refactoring Is Needed The CUDA Plugin EP compiles CUDA operator kernels into a separate shared library that communicates with the ORT core through the ORT EP Plugin API. In this architecture, kernel source files **cannot** depend on framework-internal symbols that live in the CPU provider static library (`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base classes and call shared helper/validation methods whose implementations currently live in CPU `.cc` files. In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are accessed through the `ProviderHostCPU` DLL-boundary virtual table bridge. However, the plugin EP does not use this bridge — it uses EP API adapters and force-included headers instead. To make these helpers available in the plugin build without duplicating code, this PR moves the implementations into headers as `inline` functions under `#ifndef SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path retains the existing declaration-only signatures that route through `ProviderHostCPU`. ### Refactoring Patterns Used 1. **Inline move**: Function body moved from `.cc` to `.h`, wrapped in `#ifndef SHARED_PROVIDER` with `inline` linkage. The `#ifdef SHARED_PROVIDER` path keeps the original declaration. 2. 
**Template-on-context**: Methods like `PrepareCompute`, `PrepareForCompute`, and `GetPresent` are templatized on `KernelContextType` so they work with both `OpKernelContext` (in-tree) and the plugin EP's adapter context. 3. **Template-on-info**: Constructors and initialization methods (e.g., `RoiAlignBase`, `CropBase`, `SpaceDepthBase`) are templatized on `KernelInfoType` with `info.template GetAttr<T>(...)` calls, making them compatible with both `OpKernelInfo` and the plugin's `OpKernelInfoAdapter`. 4. **Helper extraction**: Free helper functions (e.g., `CheckROIAlignValidInput`, `GetAxis`, `AdjustOutputSizeAsPolicy`) moved inline into headers. ## Summary of Changes ### Helper functions moved from `.cc` to `.h` (inline under `#ifndef SHARED_PROVIDER`) | Operator | Header File | Functions Moved | |----------|-------------|-----------------| | **AttentionBase** | `contrib_ops/cpu/bert/attention_base.h` | `AttentionBase::CheckInputs` (both overloads), `AttentionBase::CheckMask`, `AttentionBase::GetPresent` (templatized on `TOpKernelContext`) | | **LongformerAttentionBase** | `contrib_ops/cpu/bert/longformer_attention_base.h` | `LongformerAttentionBase::CheckInputs` | | **CumSum** | `cpu/math/cumsum.h` | `GetAxis` (free function) | | **RoiAlign** | `cpu/object_detection/roialign.h` | `CheckROIAlignValidInput` (free function), `RoiAlignBase` constructor templatized on `TKernelInfo` | | **Concat** | `cpu/tensor/concatbase.h` | `ConcatBase::PrepareForCompute` (templatized, delegates to `PrepareForComputeImpl`) | | **Gather** | `cpu/tensor/gatherbase.h` | `GatherBase::PrepareForCompute` (templatized, delegates to `PrepareForComputeImpl`) | | **Unsqueeze** | `cpu/tensor/unsqueeze.h` | `UnsqueezeBase::PrepareCompute` (templatized on `KernelContextType`) | | **Upsample** | `cpu/tensor/upsamplebase.h` | `UpsampleBase::AdjustOutputSizeAsPolicy`, `upsamplebase_helper::AdjustOutputSizeAsPolicy` (free helper) | ### Constructor templatization (for plugin EP adapter compatibility) 
| Class | Header File | Change | |-------|-------------|--------| | **CropBase** | `contrib_ops/cpu/crop.h` | Constructor templatized on `KernelInfoType`, `GetAttrsOrDefault` calls use `info.template` syntax | | **SpaceDepthBase** | `cpu/tensor/space_depth_ops.h` | Constructor templatized on `KernelInfoType`, `GetAttr` call uses `info.template` syntax | | **RoiAlignBase** | `cpu/object_detection/roialign.h` | Constructor templatized on `TKernelInfo`, all `GetAttr` calls use `info.template` syntax | ### CUDA-side updates | File | Change | |------|--------| | `cuda/tensor/upsample.cc` | Added explicit template instantiations for `Upsample<float>`, `Upsample<double>`, `Upsample<MLFloat16>`, `Upsample<int32_t>`, `Upsample<uint8_t>` (needed because `AdjustOutputSizeAsPolicy` implementation moved to header) | ### Files with code removed (moved to headers) | Source File | Lines Removed | Moved To | |-------------|---------------|----------| | `contrib_ops/cpu/bert/attention_base.cc` | ~333 | `attention_base.h` | | `contrib_ops/cpu/bert/longformer_attention_base.cc` | ~133 | `longformer_attention_base.h` | | `cpu/math/cumsum.cc` | ~23 | `cumsum.h` | | `cpu/object_detection/roialign.cc` | ~74 | `roialign.h` | | `cpu/tensor/concat.cc` | ~8 | `concatbase.h` | | `cpu/tensor/gather.cc` | ~4 | `gatherbase.h` | | `cpu/tensor/unsqueeze.cc` | ~51 | `unsqueeze.h` | | `cpu/tensor/upsample.cc` | ~44 | `upsamplebase.h` | ## Testing - Existing unit tests cover all affected operators (Attention, LongformerAttention, CumSum, RoiAlign, Concat, Gather, Unsqueeze, Upsample, Crop, SpaceToDepth/DepthToSpace). - No behavioral changes — all function logic is identical; only the location (header vs. source) and linkage (inline vs. external) changed. - The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged — declarations remain and route through the existing `ProviderHostCPU` bridge. 
## Motivation and Context This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels as a standalone shared library that can be updated independently of the ORT core. The refactoring enables additional CUDA operators to compile in the plugin build by making their CPU-side validation and preparation helpers available as header-inline functions. This PR is a direct continuation of PR microsoft#27617 which applied the same pattern to Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, and NonMaxSuppression operators.
…7626) ### Description Add a Windows VERSIONINFO resource (.rc file) for the Vitis AI provider DLL, following the same pattern used for CUDA, TensorRT, and QNN EPs (added in microsoft#24606). This embeds the ORT version into the DLL's PE header so it shows up in file properties. ### Motivation and Context Need version in onnxruntime_providers_vitisai.dll to track changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…#27520) ### Description Skip building the `custom_op_library` library if CUDA_MINIMAL is enabled ### Motivation and Context microsoft#27308 removes the cudnn include for the `custom_op_library` target in cmake if CUDA_MINIMAL is enabled. In fact, the `custom_op_library` target does not define the `USE_CUDA_MINIMAL` macro (there is no `target_compile_definitions(custom_op_library PRIVATE -DUSE_CUDA_MINIMAL)` in onnxruntime_unittests.cmake), so one of its files, i.e. cuda_context.h included by cuda_ops.cc, still includes cudnn.h, and the CI just got lucky and passed because cudnn.h is in --cuda_home. When building locally, it might fail to find cudnn.h.
### Description For FP16 models with block-quantized weights (`DQ(int4/int2/int8, fp16_scale) → MatMul(fp16)`), the `DQMatMulToMatMulNBitsSelector` failed to match on CPU EP because FP16 MatMul nodes are not claimed by CPU EP during graph partitioning, leaving their execution provider unassigned (empty string `""`). The selector's EP compatibility check rejected these nodes. This PR: - Adds `""` (empty/unassigned EP) to the compatible providers list for `DQMatMulToMatMulNBitsSelector` so it can match FP16 MatMul nodes not yet assigned to an EP. The resulting `MatMulNBits` node is assigned to `kCpuExecutionProvider` by the action (which has both `float` and `MLFloat16` CPU kernels). - Adds `""` to the `QDQSelectorActionTransformer` transformer-level compatible EPs so unassigned nodes reach individual selectors (other selectors are unaffected since their own provider lists don't include `""`). - Removes the `DQCastMatMulToMatMulNBitsSelector` and `DQCastMatMulToMatMulNBitsAction`, which handled a `DQ → Cast(fp16→fp32) → MatMul` pattern that only existed after `InsertCastTransformer` ran. That fusion only worked incidentally when `FuseInitializersTransformer` (Level 4) triggered an optimization loop repeat, giving Level 2 QDQ fusions a second pass — a behavior that didn't occur in all builds (e.g., minimal/extended-minimal builds without `FuseInitializersTransformer`). - Replaces the `DQCastMatMulConvertedToMatMulNBits` test with `DQMatMulFP16ConvertedToMatMulNBits` that tests the actual scenario: `DQ(int4, fp16_scale) → MatMul(fp16)` on CPU EP. ### Motivation and Context FP16 models with block-quantized weights were not getting `DQ → MatMulNBits` fusion when running on CPU EP in certain ORT builds. 
The fusion worked on x64 full builds by luck — `InsertCastTransformer` created `DQ→Cast→MatMul` patterns, then `FuseInitializersTransformer` (Level 4) modified FP16 initializers causing the optimization loop to repeat, giving Level 2 QDQ fusions a second pass where the Cast-aware selector matched. In builds without `FuseInitializersTransformer` (e.g., minimal builds, arm packages), the loop didn't repeat and the fusion never applied. The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't claim FP16 MatMul nodes during partitioning. These nodes have an empty EP string, which the `QDQSelectorActionTransformer` and `BaseSelector` both rejected. The fix allows the `DQMatMulToMatMulNBits` selector to match unassigned nodes directly on the first Level 2 pass, before `InsertCastTransformer` runs, eliminating the dependency on the optimization loop repeat.
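The core of the fix is the EP-compatibility gate. A minimal sketch of the described behavior (names illustrative, not the C++ selector classes): before the change, an FP16 MatMul left unassigned during partitioning (empty EP string) was rejected; adding `""` to the compatible set lets the `DQ → MatMulNBits` fusion match it on the first Level 2 pass.

```python
# Sketch of the selector's EP-compatibility check.
COMPATIBLE_EPS = {"CPUExecutionProvider", ""}  # "" = node not yet assigned an EP

def selector_accepts(node_ep: str) -> bool:
    """Return True if the DQ -> MatMulNBits selector may match a node
    whose assigned execution provider is node_ep."""
    return node_ep in COMPATIBLE_EPS
```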
microsoft#27650) ### Description Revert QNN SDK logging verbosity changes introduced in microsoft#24931. This reverts commit ec4f6bf. ### Motivation and Context Logging used to fail on QNN backend destruction (when releasing QNN context handles) with segmentation faults (even with empty user logging functions), hence reverting the changes. Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description - Updates the `ValidateExternalData()` function: - Resolves symlinks when validating external data path for **models loaded from memory**. - Previously only did a lexical check that did not resolve symlinks. Now, we resolve symlinks. - **Now requires external data path to exist.** - Replace the `(base_dir, model_dir)` function parameters with just `model_dir`. `base_dir` was always derived from `model_dir`. - Skip validation for WASM builds that do not have a filesystem. - Return `Status` instead of throwing exceptions when `std::filesystem` functions fail. - Updates `Graph::ConvertInitializersIntoOrtValues()`: - Prevents unnecessary calls to `ValidateExternalData()` for external data paths that have already been validated. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
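The strengthened check (resolve symlinks, require the file to exist, stay inside the model directory) can be sketched in Python. This is a hypothetical helper illustrating the validation described above, not ORT's C++ `ValidateExternalData()`.

```python
from pathlib import Path

def validate_external_data_path(model_dir: str, external_path: str) -> None:
    """Resolve symlinks on both the model directory and the external data
    path, require the target to exist, and reject paths that escape the
    model directory after resolution."""
    base = Path(model_dir).resolve(strict=True)
    # strict=True makes resolution fail if the file does not exist,
    # mirroring the new "external data path must exist" requirement.
    target = (base / external_path).resolve(strict=True)
    if not target.is_relative_to(base):
        raise ValueError(f"external data path escapes model dir: {external_path}")
```

A purely lexical check (string prefix comparison) would miss a symlink inside the model directory pointing outside it, which is why resolution has to happen before the containment test.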
… / GetCompatibilityInfoFromModelBytes (microsoft#27565) ### Description <!-- Describe your changes. --> This change adds C# and Python language bindings and tests for the recently-introduced GetCompatibilityInfoFromModel / GetCompatibilityInfoFromModelBytes API. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> microsoft#27015 introduced a new API to facilitate getting the model compatibility information from the metadata of a model (either a file or the model bytes). For convenience, we should ideally have some other language bindings included to make consumption a little easier. --------- Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### Description The code contains some unreferenced variables and functions, which generate warnings. ### Motivation and Context The change only removes two warnings, but they cause compilation failures when -Werror is enabled.
…oft#27590) ### Description Extend `FusionRotaryEmbeddings` to handle Qwen3's on-the-fly rotary position embedding computation, where cos/sin values are computed from `inv_freq` at runtime instead of being looked up from a pre-computed cache. This is a follow-up to microsoft#27556 (Qwen3 basic model type support). Depends on microsoft#27556. Part of microsoft#25083. ### Motivation and Context Qwen3 models (ranked 4th on MTEB) compute RoPE differently from existing supported models (Phi, LLaMA, etc.). Instead of pre-computing cos/sin caches and looking them up via `Gather(cache, position_ids)`, Qwen3 computes them on-the-fly: ```python freqs = inv_freq_expanded @ position_ids_expanded # MatMul emb = torch.cat((freqs, freqs), dim=-1) # Concat cos = emb.cos() * attention_scaling # Cos, Mul sin = emb.sin() * attention_scaling # Sin, Mul ``` Additionally, TorchScript exports of Qwen3 insert `Cast` nodes in the `rotate_half` pattern (from `torch.floor_divide` tracing), which the existing path patterns don't account for. 
### Changes **`fusion_rotary_attention.py`:** - Add Cast-tolerant `rotate_half` path patterns (`rotate_half_x2_path_2_3`, `_2_4`, `rotate_half_x1_path_2_3`, `_2_4`) that allow 1-2 Cast nodes between Unsqueeze and Div in the dynamic Slice index computation - Add `sin_path_5` / `cos_path_5` patterns matching the on-the-fly computation: `MatMul → Transpose → Concat → Cos/Sin → Mul(scaling) → Unsqueeze → Mul`, with optional Cast variant (the optimizer's earlier Cast fusion pass may remove the Cast) - Add `create_cos_sin_cache_from_on_the_fly_rope()` helper that extracts `inv_freq` weights, computes cos/sin caches as model initializers, and traces `position_ids` from the graph - Handle per-layer vs shared node removal correctly (only remove per-layer Unsqueeze/outer Mul nodes; shared MatMul/Cos/Sin nodes are pruned automatically by the optimizer) **`qwen3_model_generator.py`:** - Add `include_rope=True` parameter to `create_qwen3_decoder_layer()` - Generate full on-the-fly RoPE computation graph: `inv_freq` initializer, `position_ids` input, MatMul/Transpose/Concat/Cos/Sin/Mul nodes, and `rotate_half` pattern with dynamic Slice indices (including Cast nodes from floor division) - Apply RoPE to both Q and K paths **`test_attention_fusion.py`:** - Add `test_qwen3_rotary_embedding_fusion` verifying 2 RotaryEmbedding nodes are fused along with 3 SimplifiedLayerNormalization and 1 SkipSimplifiedLayerNormalization ### Verification - **Unit tests**: All 15 `test_attention_fusion.py` tests pass (14 existing + 1 new) - **Real model**: Verified on Qwen3-Embedding-0.6B (28 layers): 56 RotaryEmbedding nodes fused (28 layers × 2 per layer for Q and K), reducing total node count from 7416 → 4661 (37% reduction) - **No regressions**: All changes are additive alternative path patterns — existing models that use dynamic Slice indices or cache-based RoPE never hit the new paths - **Lint**: `lintrunner -a` clean on all modified files
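The cache-extraction math that `create_cos_sin_cache_from_on_the_fly_rope()` performs follows directly from the Python formula quoted above. The sketch below precomputes the cos/sin caches from `inv_freq` for a fixed maximum position; it mirrors the quoted formula only and makes no claim about the helper's exact signature.

```python
import numpy as np

def build_cos_sin_cache(inv_freq: np.ndarray, max_pos: int, scaling: float = 1.0):
    """Turn Qwen3's on-the-fly RoPE computation (MatMul -> Concat ->
    Cos/Sin -> Mul) into precomputed cache initializers."""
    positions = np.arange(max_pos, dtype=np.float32)
    freqs = np.outer(positions, inv_freq)            # inv_freq_expanded @ position_ids
    emb = np.concatenate([freqs, freqs], axis=-1)    # torch.cat((freqs, freqs), dim=-1)
    return np.cos(emb) * scaling, np.sin(emb) * scaling
```

Once the caches exist as initializers, the fused `RotaryEmbedding` node can look values up by position instead of recomputing them every forward pass.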
It seems some outputs in the spanned list may be nullptr. Checking for nullptr and skipping such entries does not appear to disturb proper execution of models. This check was not required for the legacy EP using the Host API.
…fused nodes in different GraphViews (microsoft#27666) ### Description Fixes a bug where `PluginExecutionProvider::GetCapability()` incorrectly assigned duplicate MetaDef IDs to fused nodes that live in different GraphViewer instances (e.g., the then/else branches of an If node). The root cause was that `GetCapability()` created a new `ModelMetadefIdGenerator` on every invocation. Since the graph partitioner calls `GetCapability()` once per subgraph, the generator's monotonic counter reset each time, producing colliding IDs across subgraphs. This caused session creation to fail with: > Failed to add kernel for example_ep_9433721956998717990_0 example_ep example_ep: Conflicting with a registered kernel with op versions. the since version is: 1 #### Fix - Promoted `ModelMetadefIdGenerator` to an instance member of `PluginExecutionProvider` so the same generator is reused across all `GetCapability()` calls, ensuring unique MetaDef IDs. - This is also consistent with how existing provider-bridge EPs create and use a single generator instance. - **Bonus perf improvement**: No longer recomputes the entire model's hash on every call to `GetCapability()`. #### Testing Example EP changes: - Refactored `SaveConstantInitializers()` → `TrySaveConstantInitializer()` to save initializers per-node-input instead of via `graph.GetInitializers()`, which doesn't return initializers defined in parent or sibling subgraphs. - Extracted `CopiesConstantInitializers()` helper to deduplicate the condition for drop_constant_initializers. Unit testing: - Added unit test called `CompilingPluginEp_MultiSubgraphs_DuplicateMetaDefIdBug` — runs an If model with Mul nodes in both branches, verifying that both fused nodes receive unique MetaDef IDs and the session creates/runs successfully. Credit to @apwojcik for [finding the bug.](microsoft#27608)
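The bug and fix reduce to counter lifetime. A minimal sketch (class names illustrative, not the C++ types): a generator created per `GetCapability()` call restarts at 0 for every subgraph, while an instance-member generator stays monotonic across calls and therefore yields unique MetaDef IDs.

```python
class MetadefIdGenerator:
    """Monotonic per-provider counter for fused-node MetaDef IDs."""
    def __init__(self):
        self._next = 0

    def generate(self, prefix: str) -> str:
        metadef_id = f"{prefix}_{self._next}"
        self._next += 1
        return metadef_id

class PluginEp:
    def __init__(self):
        # The fix: one generator per provider instance, reused across all
        # GetCapability() calls (one call per subgraph), not one per call.
        self._id_gen = MetadefIdGenerator()

    def get_capability(self, prefix: str) -> str:
        return self._id_gen.generate(prefix)
```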
### Description Fix 2 bugs in emdawnwebgpu. 1. Fix incorrect handling for device lost. See also: - issue: [Unexpected exit on device lost handler [492350387] - Chromium](https://issues.chromium.org/issues/492350387) - PR: [emdawnwebgpu: Add runtimeKeepalive for device.lost handler by fs-eire · Pull Request #57 · google/d…](google/dawn#57) (but dawn does not accept PR with copilot as co-author, so just for reference) 2. Fix wrong call to WGPUBufferImpl constructor. See also: - issue: [Incorrect WGPUBufferImpl constructor called from importJsBuffer](https://issues.chromium.org/issues/492539247) --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- [x] Analyze CI failure: `EPContextNode_ForeignSourceSkipped` assertion
expects "OpenVINOExecutionProvider" in error but actual error from
TransformerMemcpyImpl doesn't include it
- [x] Fix `tensorrt_basic_test.cc`: Remove overly-specific
"OpenVINOExecutionProvider" assertion from
`EPContextNode_ForeignSourceSkipped`
- [x] Fix `nv_ep_context_test.cc`: Remove same overly-specific assertion
from `EPContextNode_ForeignSourceSkipped` (proactive)
- [x] Run code review (no actionable findings)
- [x] Run CodeQL security check (no findings)
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
----
*This section details the original issue you should resolve*
<issue_title>NvTensorRTRTXExecutionProvider::GetCapability claims
EPContext nodes belonging to other EPs, causing crash on multi-GPU
systems</issue_title>
<issue_description>### Describe the issue
On multi-GPU systems where both `OpenVINOExecutionProvider` and
`NvTensorRTRTXExecutionProvider` are registered,
loading an EPContext model produced by OpenVINO causes an access
violation (0xC0000005) or
"Could not find an implementation for EPContext(1)" error.
The root cause is that `NvExecutionProvider::GetCapability()` in
`nv_execution_provider.cc` claims **all**
`EPContext` nodes without checking the `source` attribute:
```cpp
// nv_execution_provider.cc ~line 2019
const bool is_context_node = node && !node->OpType().empty() && node->OpType() == EPCONTEXT_OP;
if (is_context_node) {
  // Claims any EPContext node — even those produced by OpenVINO, QNN, etc.
  result.push_back(ComputeCapability::Create(std::move(sub_graph)));
}
```
The `EPContext` contrib op schema defines an optional `source` attribute
specifically for EP identification
(`contrib_defs.cc`). Other EPs already check this attribute:
- **OpenVINO EP** checks `source == kOpenVINOExecutionProvider` in
`EPCtxHandler::CheckForOVEPCtxNode()`
- **QNN EP** checks `cache_source == "qnnexecutionprovider" ||
cache_source == "qnn"` in `PartitionCtxModel()`
The NvTensorRTRTX EP neither checks `source` when claiming EPContext
nodes in `GetCapability()`,
nor writes `source` when creating EPContext nodes in `CreateCtxNode()`.
### Proposed fix
Add a `source` attribute check to `NvExecutionProvider::GetCapability()`
before claiming EPContext nodes:
```cpp
const bool is_context_node = node && !node->OpType().empty() && node->OpType() == EPCONTEXT_OP;
if (is_context_node) {
  // Only claim EPContext nodes that belong to this EP.
  // If the SOURCE attribute is present and doesn't match, skip the node.
  const auto& attrs = node->GetAttributes();
  if (attrs.count(SOURCE) > 0 &&
      attrs.at(SOURCE).s() != kNvTensorRTRTXExecutionProvider) {
    continue;
  }
  // ... claim the node
}
```
This requires adding `static const std::string SOURCE = "source";` to
`onnx_ctx_model_helper.h`
(matching the existing constant in QNN EP's
`builder/onnx_ctx_model_helper.h` and OpenVINO EP's
`onnx_ctx_model_helper.h`).
**Additionally**, `CreateCtxNode()` in `onnx_ctx_model_helper.cc` should
be updated to write the
`source` attribute (set to `kNvTensorRTRTXExecutionProvider`) when
producing EPContext models,
following the same pattern as OpenVINO EP's `AddOVEPCtxNodeToGraph()`.
This ensures NvTensorRTRTX
EPContext models are properly tagged for the future.
### Urgency
This is a **P1 blocker for MLCommons MLPerf Client v1.6** testing on
multi-GPU laptop systems
(Intel iGPU + NVIDIA dGPU). See:
https://github.com/mlcommons/mlperf_client_dev/issues/976
### To reproduce
**System:** Any system with both an Intel GPU (OpenVINO EP) and NVIDIA
GPU (NvTensorRTRTX EP)
1. Register both OpenVINO EP and NvTensorRTRTX EP with ORT
2. Load an EPContext model with `source=OpenVINOExecutionProvider`
(e.g., Phi-3.5 compiled by OpenVINO)
3. Create a session with auto EP selection (`PREFER_GPU`) or manual
multi-EP ordering
**Expected:** OpenVINO EP claims its own EPContext node; NvTensorRTRTX
EP skips it
**Actual:** NvTensorRTRTX EP claims the OpenVINO EPContext node → access
violation crash (0xC0000005)
A self-contained min-repro (C++ source + Python model generator) is
available. The repro uses
`RegisterExecutionProviderLibrary` to register both EP DLLs, creates a
synthetic EPContext model
with `source=OpenVINOExecutionProvider`, and demonstrates the crash.
### Tested fix
We have built and tested the fix described above against:
- Synthetic EPContext model with `source=OpenVINOExecutionProvider`
- Real Phi-3.5 EPContext model from MLPerf (`openvino_model_dy.onnx`)
Both pass with the fix applied: NvTensorRTRTX EP correctly skips the
OpenVINO EPContext node
(`"[NvTensorRTRTX EP] No graph will run on Nv execution provider"`), and
OpenVINO EP successfully
creates the session.
The classic `TensorRTExecutionProvider` has the same issue
(`tensorrt_execution_provider.cc` line 2593)
but is less commonly deployed alongside other EPs.
### Environment
- ORT version: 1.24.3 (rel-1.24.3 tag)
- OS: Windows 11
- Hardware: Intel Core i9-11950H (iGPU) + NVIDIA GeForce RTX 3080 Laptop
GPU (dGPU)
- EP packages: OpenVINO EP
(MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8 1.8.63.0),
NvTensorRTRTX EP (MicrosoftCorporationII.WinML.NVIDIA.TRT-RTX.EP.2
0.0.26.0)
</issue_description>
<agent_instructions>Please provide a fix and add unit
test.</agent_instructions>
## Comments on the Issue (you are @copilot in this section)
<comments>
</comments>
</details>
- Fixes microsoft#27622
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
# Description

This PR addresses several build warnings and a build error in the CUDA provider, primarily focused on improving the stability of Debug builds.

## Changes

### CUDA Provider Fixes

- **Fix signedness comparison warnings**:
  - In [tile.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/core/providers/cuda/tensor/tile.cc), changed the `axis` loop variable type from `size_t` to `int32_t` to match `input_rank`.
  - In [pad.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/core/providers/cuda/tensor/pad.cc), converted `p_pads->size()` to `int32_t` using `narrow` and updated the loop variable type to resolve signedness warnings across template instantiations.
- **Fix GQA build error**:
  - Added a missing include for `common.cuh` in [group_query_attention_qkv.cuh](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/contrib_ops/cuda/bert/group_query_attention_qkv.cuh). This resolves the `identifier "CUDA_KERNEL_ASSERT" is undefined` error encountered in Debug builds.

### Test Improvements

- **Rotary Embedding Tests**:
  - Skipped out-of-bounds position ID tests in [rotary_embedding_op_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc) and [test/contrib_ops/rotary_embedding_op_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc) for Debug builds. This is necessary because CUDA device-side asserts (enabled in Debug mode) can poison the CUDA context when encountering out-of-bounds indices, causing subsequent tests to fail.
### Minor Cleanup

- Simplified initializer list usage in [graph_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/ir/graph_test.cc) to avoid a build error like:

  ```
  inlined from 'constexpr void std::vector<_Tp, _Alloc>::resize(size_type) [with _Tp = onnxruntime::NodeArg*; _Alloc = std::allocator<onnxruntime::NodeArg*>]' at /usr/include/c++/13.2.0/bits/stl_vector.h:1013:21,
  inlined from 'virtual void onnxruntime::test::GraphTest_GraphConstruction_CheckGraphInputOutputOrderMaintained_Test::TestBody()' at /home/tlwu/git/onnxruntime/onnxruntime/test/ir/graph_test.cc:1214:16:
  /usr/include/c++/13.2.0/bits/stl_uninitialized.h:1132:28: error: 'void* __builtin_memmove(void*, const void*, long unsigned int)' forming offset 8 is out of the bounds [0, 8] [-Werror=array-bounds=]
   1132 |         __builtin_memmove(__result, __first, __count * sizeof(_Tp));
  ```
ankitm3k
approved these changes
Mar 18, 2026
Daily backmerge from ORT main to ovep-develop. Do NOT squash or rebase - use merge commit only.