Sync with Microsoft ONNX Runtime - 18032026#980

Merged
ankitm3k merged 40 commits into ovep-develop from sync_msft_18032026
Mar 18, 2026

Conversation

@Jaswanth51

Daily backmerge from ORT main to ovep-develop. Do NOT squash or rebase - use merge commit only.

dependabot bot and others added 30 commits March 11, 2026 04:19
Bumps [tar](https://github.com/isaacs/node-tar) from 7.5.9 to 7.5.11.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/isaacs/node-tar/commit/bf776f673164215074b62749e0fe80e5834588f4"><code>bf776f6</code></a>
7.5.11</li>
<li><a
href="https://github.com/isaacs/node-tar/commit/f48b5fa3b7985ddab96dc0f2125a4ffc9911b6ad"><code>f48b5fa</code></a>
prevent escaping symlinks with drive-relative paths</li>
<li><a
href="https://github.com/isaacs/node-tar/commit/97cff15d3539a37a4095eb3d287147d9d77c2dc3"><code>97cff15</code></a>
docs: more security info</li>
<li><a
href="https://github.com/isaacs/node-tar/commit/2b72abc1d47c3570e1ad95c9ab557fc4c2e6e4b1"><code>2b72abc</code></a>
7.5.10</li>
<li><a
href="https://github.com/isaacs/node-tar/commit/7bc755dd85e623c0279e08eb3784909e6d7e4b9f"><code>7bc755d</code></a>
parse root off paths before sanitizing .. parts</li>
<li><a
href="https://github.com/isaacs/node-tar/commit/c8cb84629dee649feedde03f2f4ea48f2e44e778"><code>c8cb846</code></a>
update deps</li>
<li>See full diff in <a
href="https://github.com/isaacs/node-tar/compare/v7.5.9...v7.5.11">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tar&package-manager=npm_and_yarn&previous-version=7.5.9&new-version=7.5.11)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…soft#27587)

### Description

Add support for `LoggingManager::HasDefaultLogger()`.
…icrosoft#27599)

## Summary

This PR adds a new `OrtApi` entry point for reading repeated string
attributes from `OrtKernelInfo`:

- `KernelInfoGetAttributeArray_string`

It also wires that support through the C++ wrapper so callers can use:

- `Ort::ConstKernelInfo::GetAttributes<std::string>(...)`

## Problem

The existing kernel info APIs already support scalar and array attribute
access for numeric types, but there was no C API for reading
string-array attributes from `OrtKernelInfo`.

That created a gap for code paths that rely on repeated string
attributes in kernel metadata, such as:

- custom op / kernel consumers using the public C API
- C++ wrapper callers expecting `GetAttributes<std::string>` to work
end-to-end
- plugin EP scenarios that need to compile existing kernels against the
adapter/C API surface

One concrete case is CUDA plugin EP RNN support, where the RNN kernels
read the `activations` attribute via
`GetAttrs<std::string>("activations", ...)`. The adapter path needed a
corresponding ORT C API to expose that data.

## Changes

### C API

Added `OrtApi::KernelInfoGetAttributeArray_string` to fetch repeated
string attributes from `OrtKernelInfo`.

Behavior:

- If `out == nullptr`, the API returns the attribute count in `size`.
- Otherwise, the API allocates the pointer array and each UTF-8 string
with the provided `OrtAllocator`.
- For empty attributes, `*out` is set to `nullptr` and `*size` is set to
`0`.
- The caller frees each string and the pointer array with the same
allocator.

### Implementation

Added the implementation in the ORT session/custom-op API layer by:

- reading the underlying attribute with
`OpKernelInfo::GetAttrs<std::string>`
- copying the result into allocator-owned C-style string storage for the
public API

### C++ wrapper

Completed C++ wrapper support so
`Ort::ConstKernelInfo::GetAttributes<std::string>(name)` works through
the new C API.

The wrapper follows the standard two-call pattern:

1. query the number of strings
2. allocate and fetch the returned string array
3. copy into `std::vector<std::string>` and release allocator-owned
memory
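
The two-call contract described above can be sketched in Python. `get_string_array` here is an invented stand-in for illustration, not ORT's actual C API signature:

```python
# Invented stand-in for the C API contract (not ORT's real signature): a call
# with out=None only reports the count; a second call fills caller-owned storage.
def get_string_array(attr, out=None):
    if out is None:
        return len(attr)    # first call: size query only
    out.extend(attr)        # second call: copy into caller-provided storage
    return len(attr)

activations = ["Tanh", "Sigmoid"]
count = get_string_array(activations)     # first call: query the count
strings = []
get_string_array(activations, strings)    # second call: fetch the strings
```

In the real C API the second call additionally allocates each UTF-8 string with the caller's `OrtAllocator`, which the caller must later free.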

### Tests

Added framework tests covering:

- non-empty string-array attributes
- empty string-array attributes
- missing attribute failure path
- C++ wrapper access through `Ort::ConstKernelInfo`

## Files Changed

- `include/onnxruntime/core/session/onnxruntime_c_api.h`
- `include/onnxruntime/core/session/onnxruntime_cxx_api.h`
- `include/onnxruntime/core/session/onnxruntime_cxx_inline.h`
- `onnxruntime/core/session/custom_ops.cc`
- `onnxruntime/core/session/onnxruntime_c_api.cc`
- `onnxruntime/core/session/ort_apis.h`
- `onnxruntime/test/framework/kernel_info_test.cc`

## Why This Change

This closes a real API gap in kernel attribute access and makes the
public API surface more consistent with the existing numeric attribute
helpers.

It also unblocks plugin/adapter-based kernel code that depends on
repeated string attributes without requiring those kernels to
special-case plugin builds.

For example, porting the RNN operator to the CUDA plugin EP will need this API.

## Validation

Validated with new unit coverage in `kernel_info_test.cc` for:

- `KernelInfoGetAttributeArray_string` with populated attributes
- `KernelInfoGetAttributeArray_string` with empty attributes
- missing-attribute error handling
- `Ort::ConstKernelInfo::GetAttributes<std::string>` parity with the C
API
…→fp32) patterns (microsoft#27614)

### Description

Extends the QDQ selector-action `DQ → MatMul → MatMulNBits` fusion in
two ways:

**1. Support 2-bit and 8-bit quantized weights**

The existing fusion only handled 4-bit (`Int4x2`/`UInt4x2`) DQ weights.
This PR broadens it to also support 2-bit (`Int2x4`/`UInt2x4`) and 8-bit
(`int8`/`uint8`) quantized weights.

- qdq_selectors.cc: Added `Is2BitIntType`, `Is8BitIntType`, and
`IsNBitsIntType` helpers. Updated `DQMatMulNodeGroupSelector::Check` to
accept 2/4/8-bit weight types.
- qdq_actions.cc: Added `DQWeightBits` and `IsDQWeightSigned` helpers to
dispatch the correct bit-width and signedness for MLAS transpose and
MatMulNBits attributes.
- `q4_dq.cpp` (MLAS): Added 8-bit `GetElem`/`SetElem` specializations
and an 8-bit `TransposeColumnWiseQuantized` path. Added 6 new template
instantiations for 2-bit (signed/unsigned, float/float16) and 8-bit
(signed/unsigned, float/float16).
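
The element accessors for sub-byte types can be sketched in Python for the 2-bit case (four elements per byte). The low-bits-first ordering here is an assumption for illustration; the actual MLAS packing layout may differ:

```python
# Sketch of 2-bit GetElem/SetElem: 4 unsigned 2-bit elements packed per byte,
# element 0 in the lowest bits (ordering assumed for illustration only).
def get_elem_2bit(packed, i):
    shift = (i % 4) * 2
    return (packed[i // 4] >> shift) & 0b11

def set_elem_2bit(packed, i, v):
    shift = (i % 4) * 2
    packed[i // 4] = (packed[i // 4] & ~(0b11 << shift)) | ((v & 0b11) << shift)

buf = bytearray(1)
for i, v in enumerate([1, 0, 3, 2]):
    set_elem_2bit(buf, i, v)
assert [get_elem_2bit(buf, i) for i in range(4)] == [1, 0, 3, 2]
```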

**2. Handle `Cast(fp16→fp32)` between DQ and MatMul (FP16 model
fusion)**

FP16 models often have `DQ(int4→fp16) → Cast(fp16→fp32) → MatMul(fp32)`
patterns that the existing selector couldn't match. This PR adds a new
`DQCastMatMulToMatMulNBitsSelector` / `DQCastMatMulToMatMulNBitsAction`
pair that:

- Matches the `DQ → Cast(fp16→fp32) → MatMul` pattern on input B.
- Creates a `MatMulNBits` node operating in the DQ scale dtype (fp16).
- Always inserts `Cast` on input A (to DQ dtype) and `Cast` on output
(DQ dtype to MatMul output dtype), relying on ORT's existing
`CastElimination` optimizer to remove redundant back-to-back casts in
subsequent passes.
- Removes the original DQ, Cast (on B), and MatMul nodes.
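
The shape of the new selector's pattern match can be sketched in Python. The dict-based node/producer representation below is invented for the example and is not ORT's graph API:

```python
# Illustrative-only pattern match for DQ -> Cast(to=fp32) -> MatMul on input B.
# Nodes are plain dicts and `producers` maps a tensor name to its producing node;
# both are invented stand-ins, not ORT's internal graph representation.
FLOAT = 1  # ONNX TensorProto.FLOAT

def matches_dq_cast_matmul(matmul, producers):
    cast = producers.get(matmul["inputs"][1])   # producer of input B
    if cast is None or cast["op_type"] != "Cast" or cast["attrs"].get("to") != FLOAT:
        return False
    dq = producers.get(cast["inputs"][0])
    return dq is not None and dq["op_type"] == "DequantizeLinear"
```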

### Motivation and Context

- Many quantized models (e.g., from Olive, AutoAWQ) use 2-bit or 8-bit
quantization, but the `DQ → MatMulNBits` fusion only supported 4-bit
weights, leaving these models unoptimized.
- FP16 models produce `DQ(→fp16) → Cast(fp16→fp32) → MatMul` patterns
because the DQ output type matches the scale type (fp16), but the MatMul
operates in fp32. Without handling the intermediate Cast, the fusion was
blocked entirely for these models.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…t#27597)

## Description

Fix out-of-bounds read in the RotaryEmbedding operator when
user-provided `position_ids` values exceed the cos/sin cache bounds
(`max_sequence_length`).

### Problem

When `position_ids` contains values that are negative or >=
`max_sequence_length`, the kernel computes `cache_offset = position_id *
half_rotary_embedding_dim` and reads out-of-bounds from `cos_cache` /
`sin_cache`. This can cause undefined behavior (incorrect results,
crashes, or memory corruption).

### Fix

**CPU (`rotary_embedding.cc`):**
- Added upfront validation of all `position_ids` values before the
parallel computation loop. Returns an `INVALID_ARGUMENT` error if any
value is out of range `[0, max_sequence_length)`.
- Validation is only applied when `position_ids_format != 0` (i.e., when
position_ids are explicitly provided). When `position_ids` is not
provided (format 0), the cache is shaped `(B, S, H/2)` and the index `b
* S + s` is always in-bounds by construction.
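
The host-side validation amounts to a single range check over the tensor. A minimal sketch in Python (a `ValueError` stands in for ORT's `INVALID_ARGUMENT` status; not the actual kernel code):

```python
import numpy as np

# Sketch of the upfront CPU validation: reject any position id outside
# [0, max_sequence_length) before touching the cos/sin caches.
def validate_position_ids(position_ids, max_sequence_length):
    if ((position_ids < 0) | (position_ids >= max_sequence_length)).any():
        raise ValueError("position_ids must be in [0, max_sequence_length)")

validate_position_ids(np.array([0, 1, 2]), max_sequence_length=4)  # in range: no error
```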

**CUDA (`rotary_embedding_impl.cu`):**
- Plumbed the previously-unused `max_sequence_length` parameter through
to the kernel.
- Added a bounds check inside the `position_ids_format != 0` branch.
Out-of-bounds position IDs cause the kernel to pass through the input
unchanged (errors cannot be propagated from GPU kernels).
- The bounds check is scoped to the `position_ids_format != 0` branch
only. When format is 0 (no position_ids), the cache is `(B*S, H/2)` and
`b_s_index = b * S + s` is deterministically valid — applying the check
unconditionally would incorrectly reject all batches beyond the first
since `max_sequence_length == sequence_length` in that case.

### Tests

Added three CPU test cases for the ONNX domain `RotaryEmbedding` op:
- `RotaryEmbedding_PositionIds_ExceedsMaxSeqLen` — position_id far
exceeding cache size
- `RotaryEmbedding_PositionIds_Negative` — negative position_id
- `RotaryEmbedding_PositionIds_OOB_InBatch` — OOB position_id in a
multi-batch, multi-sequence scenario

### Motivation and Context

Security hardening — prevent out-of-bounds memory access from untrusted
model inputs.
### Description

Forge/config folders (e.g. `.github`) should not be ignored by tooling
such as `rg`, `fd`, etc.
Mark these as not-ignored via `.ignore`.


### Motivation and Context

Having to put `-uu` on every `rg` search for references in
pipelines/workflows defeats the purpose of ignoring hidden paths.
## Description
This PR implements a device-side bounds check for `batch_indices` in the
RoiAlign CUDA operator. This is a follow-up to
microsoft#27543, which fixed the
same vulnerability in the CPU implementation.

Previously, CheckROIAlignValidInput() only validated `batch_indices`
when they were accessible on the host (CPU). For the CUDA EP,
`batch_indices` reside in GPU memory, so host-side validation would
require an expensive GPU-to-CPU copy, which could also break CUDA graph
capture.

This change:
1.  Passes `batch_size` from the host to the CUDA kernel.
2. Adds a check within the `RoIAlignForward` kernel to ensure `0 <=
batch_index < batch_size`.
3. If an invalid `batch_index` is encountered, the kernel sets the
output value for that specific RoI element to 0 and returns early for
that thread.
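
The per-thread guard can be sketched in Python (names and the list-backed output are invented for illustration; the real check lives inside the CUDA kernel):

```python
# Sketch of the device-side guard: an invalid ROI writes 0 for its output
# element and tells the caller to return early, mirroring the kernel behavior.
def roi_align_guard(batch_index, batch_size, output, idx):
    if not (0 <= batch_index < batch_size):
        output[idx] = 0.0
        return False   # thread returns early
    return True        # proceed with the normal RoiAlign computation

out = [1.0, 1.0]
roi_align_guard(5, 2, out, 0)   # out-of-range: out[0] zeroed
roi_align_guard(1, 2, out, 1)   # in range: out[1] untouched
```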

## Impact
- **Vulnerability fixed:** Heap out-of-bounds read on GPU.
- **Performance:** Negligible impact as it's a simple range check within
the existing kernel.
- **Compatibility:** No changes to ONNX models or public APIs.

## Validation
- Existing `RoiAlignTest` suite.
- Added two new test cases: `BatchIndicesOutOfRange_CUDA` and
`BatchIndicesNegative_CUDA` to verify that the CUDA provider correctly
handles out-of-range batch indices.
- Verified that the CUDA provider handles opset 10 without falling back
to the CPU EP for these tests.
…rtDevice` aliases (microsoft#27594)

# Summary

This change replaces the Python-side PCI vendor ID constants in
`onnxruntime_inference_collection.py` with a
public `OrtDeviceVendorId` enum, exports that enum from
`onnxruntime.__init__`, and continues using vendor-aware
`OrtDevice` construction for well-known aliases like `"cuda"`, `"dml"`,
and `"cann"`.

The goal is to make vendor IDs reusable across Python APIs without
duplicating raw integer constants while still
fixing the plugin EP allocator lookup issue for
`OrtValue.ortvalue_from_numpy(..., "cuda", ...)`.

# Problem

The Python wrapper now needs vendor IDs in more than one place.

Keeping them as standalone integer constants is workable for a narrow
fix, but it does not give Python callers a
clear public API for vendor identity. As more APIs accept or return
vendor IDs, users would have to either:

- remember the raw PCI ID values
- depend on private implementation details
- repeat ad hoc constants in their own code

That is not a good public surface for something that is now part of
regular Python device construction flows.

# Why We Need This Change

Dynamic EP registration is intended to let a Python package gain
hardware capability without requiring that hardware
support to be built into the package itself.

That only works if the Python-side device description matches the device
identity used by dynamically registered EP
allocators and data transfers.

Without the underlying vendor-aware alias behavior:

- registering the CUDA plugin library succeeds
- sessions can use `CudaPluginExecutionProvider`
- but Python cannot create CUDA `OrtValue`s with
`OrtValue.ortvalue_from_numpy(..., "cuda", 0)`

At the same time, without a public enum:

- vendor IDs remain scattered as raw integers
- Python callers do not have a clean symbolic way to specify
vendor-specific `"gpu"` / `"npu"` devices
- future vendor-aware APIs would keep expanding the same constant-style
pattern

# Example Use Case

Our immediate use case is the CUDA plugin EP flow from Python.

We register `libonnxruntime_providers_cuda_plugin.so` from Python and
create sessions with
`CudaPluginExecutionProvider`. That part works.

Stage 4 of the plugin flow needs to create GPU-resident `OrtValue`s:

```python
onnxruntime.OrtValue.ortvalue_from_numpy(array, "cuda", 0)
```

Before the vendor-aware alias fix, that failed in a CPU-only Python
package even after the CUDA plugin was
registered, because the Python wrapper constructed a generic GPU
`OrtDevice` without the NVIDIA vendor ID.

With this change, Python also has a public enum for vendor IDs, so
callers can write explicit vendor-aware code
when using generic device names:

```python
onnxruntime.OrtValue.ortvalue_from_numpy(
    array,
    "gpu",
    0,
    onnxruntime.OrtDeviceVendorId.NVIDIA,
)
```

# Fix

The change does three things:

1. Replace the Python-side PCI vendor ID constants with an `IntEnum`
named `OrtDeviceVendorId`.
2. Export `OrtDeviceVendorId` from `onnxruntime.__init__` so it is part
of the public Python API.
3. Keep the vendor-aware alias behavior in `OrtDevice.make(...)` so that
the historical shorthand aliases:
   - `"cuda"` -> `OrtDeviceVendorId.NVIDIA`
   - `"dml"` -> `OrtDeviceVendorId.MICROSOFT`
   - `"cann"` -> `OrtDeviceVendorId.HUAWEI`

use the 4-argument `C.OrtDevice(...)` constructor with an explicit
vendor ID.

Generic device names like `"gpu"` and `"npu"` continue to behave as
before unless the caller explicitly provides a
vendor ID, and callers can now use either an integer or
`OrtDeviceVendorId`.
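
The shape of the new enum can be sketched as follows. The member values shown are the standard PCI vendor IDs for these vendors, but the authoritative definitions live in `ortdevice.h` and the actual member set may differ:

```python
from enum import IntEnum

# Sketch of the public enum's shape; values are the well-known PCI vendor IDs.
# The real enum is defined by onnxruntime and may contain additional members.
class OrtDeviceVendorId(IntEnum):
    NVIDIA = 0x10DE
    MICROSOFT = 0x1414
    HUAWEI = 0x19E5

# IntEnum members still behave as plain ints, so they pass cleanly through pybind.
assert OrtDeviceVendorId.NVIDIA == 0x10DE
```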

# Why This Approach

An enum is the better public API here because it:

- keeps Python aligned with the core runtime vendor ID definitions in
`ortdevice.h`
- preserves integer compatibility because `IntEnum` still works
naturally with the pybind layer
- gives users readable, discoverable names instead of undocumented raw
PCI IDs
- scales better as vendor-aware device APIs become more common

This keeps the original plugin fix intact while improving the Python API
shape instead of just adding more module
constants.

# Validation

Validated in the Python layer by:

- confirming the new enum-based implementation preserves vendor-aware
alias handling for `"cuda"`, `"dml"`, and `"cann"`
- exporting `OrtDeviceVendorId` from the top-level `onnxruntime` package
- adding Python test coverage that checks `OrtDevice.make("cuda", 0)`
resolves to the NVIDIA vendor ID via the enum
- running `python -m compileall` on the updated Python files

Targeted pytest execution could not be completed in this workspace
because the local source tree does not provide an
importable `onnxruntime.capi` module without a built package.

# Notes

This PR keeps backward compatibility for existing Python call sites:

- shorthand aliases like `"cuda"` continue to work
- explicit `vendor_id` arguments can still be passed as integers
- callers now also have the option to use
`onnxruntime.OrtDeviceVendorId`
### Description

Use `_tpause` function defined in `waitpkgintrin.h` instead of calling
the compiler built-in function (`__builtin_ia32_tpause`) directly.

### Motivation and Context

The [`_tpause`][intel-intrinsics-guide] is independent of the compiler,
whereas its implementation via the built-in function
`__builtin_ia32_tpause` varies by compiler. Therefore, it is advisable
not to use it directly. For example, [GCC][waitpkgintrin-gcc] and
[LLVM][waitpkgintrin-llvm] have different arguments, leading to
portability issues.

[intel-intrinsics-guide]:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=tpause&techs=Other&ig_expand=6888
[waitpkgintrin-gcc]:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/waitpkgintrin.h;h=42c6b0cd02866eccdfe3308f4792f17fe8c6ae38;hb=HEAD#l51
[waitpkgintrin-llvm]:
https://github.com/llvm/llvm-project/blob/a682073ae7a49de4b95498ba01b9ea32e6b5f607/clang/lib/Headers/waitpkgintrin.h#L33-L38
…rite (microsoft#27544)

### Description
<!-- Describe your changes. -->
This pull request refactors several tensor operation kernels
(`GatherND`, `ScatterND`, and `GatherGrad`) to improve type safety and
consistency in parallelized code execution. The main change is replacing
`int` loop indices with `ptrdiff_t` to avoid overflow.

### Parallelization and Type Safety Improvements

* Updated lambda functions and parallel loop indices in `gather_nd.cc`
(`GatherNDBase::PrepareForCompute`, `GatherND::GatherNumber`, and
`GatherND::GatherString`) to use `ptrdiff_t` instead of `int64_t`, and
replaced index arithmetic with explicit casts to maintain correctness.
[[1]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL96-R100)
[[2]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL121-R121)
[[3]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL192-R216)
* Refactored `scatter_nd.cc` (`ScatterNDDispatchTarget`) to use
`ptrdiff_t` for loop indices and index arithmetic in all reduction
cases, ensuring consistent type usage in parallel execution.
* Modified `gather_grad.cc` (`GatherGrad::ComputeImpl`) to use
`ptrdiff_t` for parallel loop indices, aligning with the changes in
other tensor kernels.
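
The class of bug being prevented can be demonstrated with NumPy's fixed-width integers (the values below are illustrative, not taken from the kernels in this PR):

```python
import numpy as np

# With 32-bit index arithmetic, index * stride can exceed INT32_MAX and wrap
# negative, producing an out-of-bounds offset. 64-bit types avoid the wrap.
i = np.int32(70000)
stride = np.int32(40000)
with np.errstate(over="ignore"):
    offset = i * stride          # 2,800,000,000 exceeds 2**31 - 1 and wraps
assert offset < 0                # the wrapped offset would index before the buffer
assert np.int64(i) * np.int64(stride) == 2_800_000_000  # wide math stays correct
```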




### Motivation and Context

A similar issue was fixed in
microsoft#27444
…environments (microsoft#27591)

### Description

GPU device discovery on Linux relies exclusively on
`/sys/class/drm/cardN` entries (DRM subsystem). In AKS/Kubernetes
containers, `nvidia-drm` is typically not loaded—only the base NVIDIA
driver is needed for CUDA compute. No DRM entries means no
`OrtHardwareDevice` with `OrtHardwareDeviceType_GPU` is created, so
`GetEpDevices` never matches the CUDA EP.

Adds a fallback path in `GetGpuDevices()` that scans
`/sys/bus/pci/devices/` when DRM yields zero GPUs:

- **`DetectGpuPciPaths()`** — enumerates PCI devices, filters by class
code `0x0300` (VGA) and `0x0302` (3D controller, used by NVIDIA
datacenter GPUs) per the [PCI Code and ID Assignment
Specification](https://pcisig.com/pci-code-and-id-assignment-specification-agreement)
(base class 03h). Accepts an injectable sysfs root path for testability.
- **`GetGpuDeviceFromPci()`** — reads `vendor`/`device` files directly
from the PCI device sysfs path and populates `OrtHardwareDevice` with
`pci_bus_id` and discrete GPU metadata. Note: `card_idx` is
intentionally omitted from PCI-discovered devices since
`directory_iterator` traversal order is unspecified and cannot be made
consistent with DRM's `cardN` ordering.
- **`GetGpuDevices()`** — tries DRM first; if empty, falls back to PCI
scan

The PCI detection functions are exposed via a new
`onnxruntime::pci_device_discovery` namespace (declared in
`core/platform/linux/pci_device_discovery.h`) so they can be tested
hermetically with fake sysfs directories.

The fallback only activates when DRM finds nothing, so no behavioral
change on systems where DRM works.
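
The class-code filter at the heart of the fallback can be sketched in Python against a fake sysfs tree (the function name mirrors the PR's description, but this is not the actual ORT implementation):

```python
from pathlib import Path

# Hermetic sketch of the PCI fallback: scan a sysfs-style devices directory and
# keep entries whose PCI base class + subclass marks them as GPUs.
GPU_PCI_CLASSES = {0x0300, 0x0302}  # VGA controller, 3D controller

def detect_gpu_pci_paths(sysfs_devices_dir):
    gpus = []
    for dev in sorted(Path(sysfs_devices_dir).iterdir()):
        class_file = dev / "class"
        if not class_file.is_file():
            continue
        # sysfs "class" holds e.g. "0x030200"; drop the low 8 bits (prog-if)
        # to get the 16-bit base class + subclass code.
        class_code = int(class_file.read_text().strip(), 16) >> 8
        if class_code in GPU_PCI_CLASSES:
            gpus.append(dev)
    return gpus
```

An injectable `sysfs_devices_dir` root, as in the sketch, is what makes the hermetic fake-sysfs tests possible.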

Also adds:
- A cross-platform `GpuDevicesHaveValidProperties` test that validates
GPU device type and vendor ID when GPUs are present. The test
intentionally does not assert on `device_id` since some platforms (e.g.,
Apple Silicon) do not populate it.
- Comprehensive hermetic Linux unit tests
(`test/platform/linux/pci_device_discovery_test.cc`) that create fake
sysfs directory structures to exercise the PCI fallback path, covering
VGA/3D controller detection, non-GPU filtering, empty/missing paths,
multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata.
Tests use the `ASSERT_STATUS_OK()` macro from
`test/util/include/asserts.h` and use `CreateFakePciDevice` to set up
complete fake PCI device directories for both `DetectGpuPciPaths` and
`GetGpuDeviceFromPci` tests.

### Motivation and Context

CUDA EP registration fails on AKS (Azure Kubernetes Service) because the
NVIDIA device plugin exposes GPUs via `/dev/nvidia*` and the NVIDIA
driver, but does not load `nvidia-drm`. The existing
`/sys/class/drm`-only detection path returns no GPU devices, blocking
`GetEpDevices` from returning the CUDA EP. The same setup works on
bare-metal Linux where DRM is loaded.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description

Fixes multiple issues related to crashes and memory leaks.

1. Fix an uncommon situation where `BucketCacheManager` may hold pending
buffers while cleaning up the WebGPU context, which causes a memory leak.
2. Change the WebGPU default instance from a RAII wrapper
(wgpu::Instance) to a raw pointer (WGPUInstance) so that it is not
destructed automatically at process exit, which could cause a crash from
accessing DXC code after dxcompiler.dll has already been unloaded.
3. Fix a crash where the default ORT logger is destructed before a
WebGPU device; the device callbacks are now guarded by the condition
`logging::LoggingManager::HasDefaultLogger()`.

Also includes a few fixes related to the Node.js binding:
1. `OrtEnv` was used as a function-local variable. This is problematic
because the destruction of OrtEnv may happen too late, after some DLLs
are already unloaded. (The order of DLL unloading at process exit is not
fully controllable.) Change it to:
- if OrtEnv is constructed on the main thread, a cleanup hook is
registered when Node.js starts to exit. If the callback is not called
(e.g. an uncaught exception is thrown), the OrtEnv will not be released.
- if OrtEnv is constructed on a worker thread, just leave it and allow
it to leak at exit.
2. Because of (1), if OrtEnv has already been released, do not release
any active sessions (they are object wraps that are destructed later
than the registered hooks).

All of the changes above cover the different scenarios and ensure:
- if any resource is intentionally leaked, it happens only at process
exit.
- outside of process exit, resource lifecycles are managed correctly.
- best effort (but no guarantee) is made to release resources safely, to
be friendly to memory leak detectors.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…om .cc to headers (microsoft#27617)

## Description

This PR refactors several CPU operator helper functions by moving their
implementations from `.cc` files into `.h` headers, using the `#ifdef
SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for
the **CUDA Plugin EP** work, where CUDA kernels are built into a
standalone shared library (`libonnxruntime_providers_cuda_plugin.so`)
that cannot link against the CPU provider's `.cc` object files.

### Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared
library that communicates with the ORT core through the ORT EP Plugin
API. In this architecture, kernel source files **cannot** depend on
framework-internal symbols that live in the CPU provider static library
(`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base
classes and call shared helper/validation methods (e.g.,
`SliceBase::PrepareForCompute`, `SplitBase::PrepareForCompute`,
`ScatterND::ValidateShapes`, `TileOp::IsTileMemcpy`,
`PadBase::ComputePads`) whose implementations currently live in CPU
`.cc` files.

In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are
accessed through the `ProviderHostCPU` DLL-boundary virtual table
bridge. However, the plugin EP does not use this bridge — it uses EP API
adapters and force-included headers instead. To make these helpers
available in the plugin build without duplicating code, this PR moves
the implementations into headers as `inline` functions under `#ifndef
SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path
retains the existing declaration-only signatures that route through
`ProviderHostCPU`.

This pattern has already been successfully applied to other operators
(e.g., `Einsum`). This PR extends it to the remaining operators that
need it.

## Summary of Changes

### Helper functions moved from `.cc` to `.h` (inline under `#ifndef
SHARED_PROVIDER`)

| Operator | File | Functions Moved |
|----------|------|-----------------|
| **Slice** | `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`,
`SliceBase::PrepareForCompute` (both overloads),
`SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` |
| **Split** | `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` |
| **ScatterND** | `cpu/tensor/scatter_nd.h` |
`ScatterND::ValidateShapes` |
| **Tile** | `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` |
| **Pad** | `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new
template method replacing `ComputePads` for cross-context compatibility)
|
| **BiasGelu** | `contrib_ops/cpu/bert/bias_gelu_helper.h` |
`bias_gelu_helper::CheckInputs` (templatized on context type) |
| **EmbedLayerNorm** | `contrib_ops/cpu/bert/embed_layer_norm_helper.h`
| `embed_layer_norm::CheckInputs` (templatized on context type) |
| **NonMaxSuppression** | `cpu/object_detection/non_max_suppression.h` +
new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored
into `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>`
template for plugin compatibility |

### Deleted `.cc` files (implementations moved to headers)

- `contrib_ops/cpu/bert/bias_gelu_helper.cc`
- `contrib_ops/cpu/bert/embed_layer_norm_helper.cc`

### Provider bridge additions

- Added `Tensor::DataAsSpan<int32_t>()` support through the shared
provider interface (`provider_interfaces.h`, `provider_wrappedtypes.h`,
`provider_bridge_ort.cc`). This was needed because
`slice_detail::CopyInputData<int32_t>` calls
`Tensor::DataAsSpan<int32_t>()`, which was not previously bridged.

### CUDA-side updates

- `cuda/tensor/slice.h`: Updated `Slice` constructor to use the new
`SliceBase(info, dynamic, 0)` overload (template-based constructor
compatible with both adapter and real `OpKernelInfo`).
- `cuda/tensor/pad.cc`: Updated call from `PadBase::ComputePads` to
`PadBase::ComputePadsImpl`.
- `cuda/tensor/scatter_nd.cc`: Templatized
`InitializeElementCountsAndInputDimsSpanOrGpu` on `KernelContextType`
(also fixed typo: `InitiliazeElement...` → `InitializeElement...`).
- `cuda/object_detection/non_max_suppression.h`: Updated to use
`NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext>` instead of
`NonMaxSuppressionBase`.

### New file

- `cpu/object_detection/non_max_suppression_helper.h`: Contains the
template-based `NonMaxSuppressionBaseImpl` class, separating it from the
CPU-specific `NonMaxSuppression` kernel registration.

## Testing

- Existing unit tests cover all affected operators (Slice, Split,
ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, NonMaxSuppression).
- No behavioral changes — all function logic is identical; only the
location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged —
declarations remain and route through the existing `ProviderHostCPU`
bridge.

## Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels
as a standalone shared library that can be updated independently of the
ORT core. The refactoring enables ~10 additional CUDA operators to
compile in the plugin build by making their CPU-side validation and
preparation helpers available as header-inline functions.
…sembly. (microsoft#27575)

### Description
Introduce an optimized POWER10 PackA implementation leveraging VSX
builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes
per row per iteration.

### Motivation and Context
Performance improvements observed in prompt processing:
- 14% speedup (batch size 1)
- 6% speedup (batch size 4)
- 4% speedup (batch size 8)

Tested with granite-3.1-8b

---------

Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>
### Description

Update all builds to C++20. Previously, C++20 was only enabled for macOS builds.

### Motivation and Context

Update to the C++20 standard to enable use of new language and standard library features.
### Description

Fix integer overflow in Col2Im shape calculation by using SafeInt for
attacker-controlled dimension arithmetic
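The actual fix uses `SafeInt` in C++; a minimal Python analog of the same idea (checked arithmetic that fails loudly instead of wrapping; `checked_mul` and the int64 bound are illustrative, not the real helper) looks like:

```python
INT64_MAX = 2**63 - 1

def checked_mul(a, b):
    # SafeInt-style guard: Python ints never wrap, so enforce the
    # int64 bound explicitly and raise instead of silently overflowing.
    r = a * b
    if r > INT64_MAX:
        raise OverflowError("Col2Im shape computation overflows int64")
    return r
```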
…ft#27556)

### Description

Add `qwen3` to the Python transformer optimizer's model type registry,
enabling graph optimization for Qwen3 models (e.g.,
Qwen3-Embedding-0.6B, ranked 4th on MTEB).

### Motivation

Fixes microsoft#25083

Running `optimum-cli export onnx --optimize O3` on Qwen3 models fails
with:
```
ValueError: Unsupported model type: qwen3
```
This PR resolves that by registering the model type and fixing a fusion
gap that blocked normalization fusions.

### Changes

**Model type registration** (`optimizer.py`):
- Add `"qwen3": (Gpt2OnnxModel, "pytorch", 0)` to `MODEL_TYPES`
- Uses `Gpt2OnnxModel` (not `BertOnnxModel`) because its
`fuse_attention()` calls `FusionRotaryAttention`, which searches on
`SkipSimplifiedLayerNormalization` anchors — needed for RMSNorm-based
models

**Fusion option defaults** (`fusion_options.py`):
- Disable `EmbedLayerNormalization` (decoder-only, no BERT-style
embedding)
- Set `AttentionMaskFormat.NoMask` (causal masking is implicit)

**SkipLayerNormalization fusion fallback** (`fusion_skiplayernorm.py`):
- When symbolic shape inference fails (common with dynamo-exported
models), the fusion previously returned early, skipping all
`SkipLayerNormalization` / `SkipSimplifiedLayerNormalization` fusions
- Now it falls through with the safe default `skip_index=1` (the second
Add input is the skip connection), since both inputs are already
verified as non-initializer dynamic tensors (lines 88-90)
- This enables `SkipSimplifiedLayerNormalization` fusion on Qwen3 models
where shape inference fails

**Test** (`test_attention_fusion.py`, `qwen3_model_generator.py`):
- Synthetic Qwen3 decoder layer graph with pre-attention RMSNorm, Q/K/V
projections, QK-Norm, simplified attention, output projection, residual
connection, and post-attention RMSNorm
- Verifies 3× `SimplifiedLayerNormalization` (pre-attn, Q-norm, K-norm)
+ 1× `SkipSimplifiedLayerNormalization` (residual + post-attn RMSNorm)

**Verified on real model**: Running the optimizer on an exported
Qwen3-Embedding-0.6B (2-layer) reduces nodes from 208 → 150 (28%
reduction). All 9 RMSNorm patterns fuse correctly: 5×
`SimplifiedLayerNormalization` + 4× `SkipSimplifiedLayerNormalization`.

**Scope note**: Full RotaryEmbedding + MultiHeadAttention fusion for
Qwen3's dynamo-exported graphs requires additional pattern matching work
(static Slice indices, on-the-fly sin/cos computation, QK-Norm in Q/K
paths, GQA expansion). That will be addressed in a follow-up PR.

### Test Plan

- [x]
`test_attention_fusion.py::TestFusion::test_qwen3_normalization_fusion`
passes
- [x] All 14 existing tests in `test_attention_fusion.py` pass (no
regressions)
- [x] All 4 tests in `test_optimizer_huggingface_bert.py` pass (bert,
distillbert, roberta, xlm_roberta — no regressions from the
SkipLayerNorm fallback change)
- [x] `lintrunner -a` clean
…operator (microsoft#27201)

### Description
1. Supports volumetric input grid sampling in the CUDA EP `GridSample`
operator, i.e., a 5-D input tensor (3-D spatial data)
2. Registers the CUDA `GridSample` operator for opsets 20 and 22
3. Supports both NCHW and NHWC layouts for volumetric inputs
4. Does not support `cubic` mode for volumetric inputs for now. This is
consistent with the CPU implementation, so it causes no functional
regression: `cubic` mode for 3-D spatial data was not supported on CPU
or CUDA before this change either. Supporting it is a TODO for the
future.
5. There are enough unit tests in `grid_sample_test.cc` to cover the
volumetric input case, and they run in both NCHW (NCDHW for the
volumetric case) and NHWC (NDHWC for the volumetric case) layouts for
the CUDA EP
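The volumetric case applies the same per-axis coordinate mapping as 2-D grid sampling, just over three spatial axes (D/H/W). A sketch of the standard ONNX `GridSample` unnormalization for one axis (a conceptual helper, not the CUDA kernel code):

```python
def unnormalize(coord, size, align_corners):
    # Map a normalized grid coordinate in [-1, 1] to a (possibly
    # fractional) input index along one spatial axis.
    if align_corners:
        return (coord + 1) / 2 * (size - 1)  # -1 -> 0, +1 -> size-1
    return ((coord + 1) * size - 1) / 2      # pixel-center convention
```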

### Motivation and Context
Resolve microsoft#21382
Resolve microsoft#18942
Resolve microsoft#16581
Resolve microsoft#18313

Related CPU PRs (for opset 20 and opset 22):
microsoft#17744 &&
microsoft#23344
…-build kernel safety (microsoft#27613)

This pull request introduces several optimizations and safety
improvements to CUDA kernels used in attention, rotary embedding, and
tensor scatter operations for ONNX Runtime's LLM support. The main focus
is on reducing decode overhead, improving memory safety, and ensuring
correct handling of edge cases in mask and index validation. The most
important changes are grouped below by theme.

### Flash Attention & KV Cache Optimization

* Replaced the previous pattern of zero-filling and strided copy for KV
cache updates with a single fused kernel (`LaunchConcatNewToPastKV`),
eliminating redundant memory writes and reducing decode overhead in
`attention.cc`. This streamlines the cache update process and improves
performance.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL313-R350)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL396-R436)
* Updated the documentation and comments to clarify the new fused kernel
approach and its performance benefits, as well as the handling of
sequence lengths for cache and mask conversion.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL190-R192)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL417-R446)

### Mask Validation & Handling

* Improved mask validation in `attention_mask_impl.cu` and
`attention_mask_impl.h` by clarifying that CUDA_KERNEL_ASSERT is only
active in debug builds. In release builds, non-contiguous masks produce
safe output by counting only leading True values, ensuring memory safety
and correctness even with invalid masks.
[[1]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L11-R22)
[[2]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49R100)
[[3]](diffhunk://#diff-8aa9a15a92d7dc138346dce5de055911895d940ba2183b4ba45bd95ac0e5bfc9L32-R35)
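The release-build semantics described above can be sketched in a few lines (function name and list-of-bools representation are illustrative):

```python
def effective_mask_length(mask_row):
    # Count only the leading run of True values; a non-contiguous mask
    # therefore yields a safe, in-bounds effective sequence length
    # rather than an out-of-bounds access.
    n = 0
    for v in mask_row:
        if not v:
            break
        n += 1
    return n
```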

### Rotary Embedding Improvements

* Switched rotary embedding kernel dispatch to use `OrtToCudaType` for
BFloat16, enabling native hardware arithmetic (`__nv_bfloat16`) on SM80+
GPUs for improved performance and correctness.
[[1]](diffhunk://#diff-411fdb2010086b3a0ad9b048bb0d0fd7721a0e8d33d9ad396d709254973448c2R5)
[[2]](diffhunk://#diff-411fdb2010086b3a0ad9b048bb0d0fd7721a0e8d33d9ad396d709254973448c2L67-R71)
[[3]](diffhunk://#diff-b0846d38debfc56c4c9fbb52ae7a201323ec1eab36853cf3627838fce4bb98feR13)
[[4]](diffhunk://#diff-b0846d38debfc56c4c9fbb52ae7a201323ec1eab36853cf3627838fce4bb98feR168-R176)
* Added explicit kernel instantiation for `__nv_bfloat16` in rotary
embedding implementation, ensuring proper support for native CUDA types.

### TensorScatter Safety Enhancements

* Enhanced validation and memory safety for `write_indices` in tensor
scatter operations by adding in-kernel clamping of invalid indices and
clarifying behavior in comments. This prevents out-of-bounds writes and
preserves CUDA graph compatibility.
[[1]](diffhunk://#diff-d69233ff3987fe3093132a31710b6b64cc0a32140e2a5a415a2f1f0907bd22d2L75-R80)
[[2]](diffhunk://#diff-1694a04b8ba9963cc06d651ec6a3be8aa9cb2bcb73c2438dc251ca8cdcb2eb41R32-R40)
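A host-side sketch of the clamping behavior (the exact CUDA kernel semantics may differ in detail, e.g. how negative indices are treated; `scatter_with_clamp` is a hypothetical name):

```python
def scatter_with_clamp(dest, write_indices, values):
    # In-kernel safety idea: clamp each write index into [0, n-1]
    # so no write ever lands out of bounds, keeping the launch shape
    # fixed and CUDA-graph compatible.
    n = len(dest)
    for idx, v in zip(write_indices, values):
        j = min(max(idx, 0), n - 1)
        dest[j] = v
    return dest
```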

These changes collectively improve performance, robustness, and safety
for CUDA-based LLM operations in ONNX Runtime.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ty (microsoft#27637)

### Description

Accept pre-1.25 names `"WebGPU_Buffer"`/`"WebNN_Tensor"` as aliases in
`CreateMemoryInfo` and normalize them to the current short names
`"WebGPU_Buf"`/`"WebNN_Ten"`.

This is the **reverse** of
microsoft#27475 (which added forward
compatibility in the 1.24.x patch branch).

### Motivation and Context

Released onnxruntime-genai still uses the old (pre-1.25) long names when
calling `CreateMemoryInfo`. Without this change, those calls fail with
`ORT_INVALID_ARGUMENT` on main branch.

### Key Design Decision

When an old name is detected, it is **normalized** to the current short
constant (e.g., `"WebGPU_Buffer"` -> `"WebGPU_Buf"`). This is critical
because downstream code (e.g., `external_data_loader.cc`,
`webgpu_context.cc`) compares `OrtMemoryInfo.name` against the current
constants. Simply passing through the old name would cause those
comparisons to fail.
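The normalization amounts to a small alias table consulted once at `CreateMemoryInfo` time; a sketch (the real check lives in C++ in `allocator.cc`):

```python
# Pre-1.25 long names mapped to the current short constants.
LEGACY_MEMORY_INFO_NAMES = {
    "WebGPU_Buffer": "WebGPU_Buf",
    "WebNN_Tensor": "WebNN_Ten",
}

def normalize_memory_info_name(name):
    # Return the current short name so downstream string comparisons
    # against the new constants keep working; pass through anything else.
    return LEGACY_MEMORY_INFO_NAMES.get(name, name)
```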

### Changes
- `onnxruntime/core/framework/allocator.cc`: Accept and normalize legacy
names in `CreateMemoryInfo`
- `onnxruntime/test/shared_lib/test_allocator.cc`: Add test verifying
legacy names are accepted and normalized

### See Also
- microsoft#27207 (original rename)
- microsoft#27475 (forward compat in
1.24.x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

Added the `-Wl,-z,max-page-size=16384` linker flag to the
`onnxruntimejsi` target in `js/react_native/android/CMakeLists.txt` to
support **16KB memory page sizes**.

### **Motivation and Context**

Starting with **Android 15**, Google Play requires apps to be compatible
with 16KB page size devices.

By default, the Android NDK builds shared libraries with 4KB ELF segment
alignment. Without this explicit flag, `libonnxruntimejsi.so` fails the
Play Store's 16KB alignment check, leading to potential crashes or
installation failures on supported hardware.

### **Changes**

- Updated `js/react_native/android/CMakeLists.txt`.
- Applied `target_link_options` to `onnxruntimejsi` to enforce
`max-page-size=16384` (16KB).

### **References**

- [Android Developer Guide: Support 16 KB page
sizes](https://developer.android.com/guide/practices/page-sizes#fix-cmake)

@fs-eire
…27602)

We have some unused-function warnings on Linux and a sprintf warning on Windows that are blocking our CI.
…sion.py (microsoft#27642)

# Description

This PR addresses a build error and subsequent test failures related to
recent changes in GridSample and the transformer optimizer. Related PRs:
microsoft#27201, microsoft#27556.

## Changes

### 1. Fix GridSample Build Error
- Removed an unused local variable `mode_str` in
`onnxruntime/core/providers/cuda/tensor/grid_sample.cc` that was causing
a warning (treated as error) about shadowing a member variable.
- Ref:
[`grid_sample.cc`](https://github.com/microsoft/onnxruntime/blob/c979a2407f/onnxruntime/core/providers/cuda/tensor/grid_sample.cc#L54)

### 2. Update GridSample Tests
- Updated
`onnxruntime/test/providers/cpu/tensor/grid_sample_test_custom.inc` to
use default execution providers in `RunTests` instead of a hardcoded
opset version, ensuring compatibility across different environments.

### 3. Revert Transformer Fusion Fallback
- Reverted a recent change in
`onnxruntime/python/tools/transformers/fusion_skiplayernorm.py` that
enabled a fallback for `SkipLayerNormalization` fusion when symbolic
shape inference fails.
- This revert was necessary to avoid regressions in GPT-2 tests where
model definitions contain typos that intentionally (or coincidentally)
break shape inference.
- Ref:
[`fusion_skiplayernorm.py`](https://github.com/microsoft/onnxruntime/blob/c979a2407f/onnxruntime/python/tools/transformers/fusion_skiplayernorm.py#L113)

### 4. Restore Transformer Test Parity
- Updated
`onnxruntime/test/python/transformers/test_attention_fusion.py`
specifically `test_qwen3_normalization_fusion` to match the expected
node counts after reverting the fusion fallback.
- Ref:
[`test_attention_fusion.py`](https://github.com/microsoft/onnxruntime/blob/c979a2407f/onnxruntime/test/python/transformers/test_attention_fusion.py#L398)

## Verification

- `build_cuda.sh` completed successfully.
- `onnxruntime/test/python/transformers/test_attention_fusion.py` passes
with "OK".
- `lintrunner -a` reports no issues.
…de from .cc to headers (Part 2) (microsoft#27628)

## Description

This PR continues the refactoring effort started in PR microsoft#27617, moving
additional CPU operator helper function implementations from `.cc` files
into `.h` headers using the `#ifdef SHARED_PROVIDER` / `#else` inline
pattern. This is a prerequisite for the **CUDA Plugin EP** work, where
CUDA kernels are built into a standalone shared library
(`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the
CPU provider's `.cc` object files.

### Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared
library that communicates with the ORT core through the ORT EP Plugin
API. In this architecture, kernel source files **cannot** depend on
framework-internal symbols that live in the CPU provider static library
(`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base
classes and call shared helper/validation methods whose implementations
currently live in CPU `.cc` files.

In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are
accessed through the `ProviderHostCPU` DLL-boundary virtual table
bridge. However, the plugin EP does not use this bridge — it uses EP API
adapters and force-included headers instead. To make these helpers
available in the plugin build without duplicating code, this PR moves
the implementations into headers as `inline` functions under `#ifndef
SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path
retains the existing declaration-only signatures that route through
`ProviderHostCPU`.

### Refactoring Patterns Used

1. **Inline move**: Function body moved from `.cc` to `.h`, wrapped in
`#ifndef SHARED_PROVIDER` with `inline` linkage. The `#ifdef
SHARED_PROVIDER` path keeps the original declaration.
2. **Template-on-context**: Methods like `PrepareCompute`,
`PrepareForCompute`, and `GetPresent` are templatized on
`KernelContextType` so they work with both `OpKernelContext` (in-tree)
and the plugin EP's adapter context.
3. **Template-on-info**: Constructors and initialization methods (e.g.,
`RoiAlignBase`, `CropBase`, `SpaceDepthBase`) are templatized on
`KernelInfoType` with `info.template GetAttr<T>(...)` calls, making them
compatible with both `OpKernelInfo` and the plugin's
`OpKernelInfoAdapter`.
4. **Helper extraction**: Free helper functions (e.g.,
`CheckROIAlignValidInput`, `GetAxis`, `AdjustOutputSizeAsPolicy`) moved
inline into headers.

## Summary of Changes

### Helper functions moved from `.cc` to `.h` (inline under `#ifndef
SHARED_PROVIDER`)

| Operator | Header File | Functions Moved |
|----------|-------------|-----------------|
| **AttentionBase** | `contrib_ops/cpu/bert/attention_base.h` |
`AttentionBase::CheckInputs` (both overloads),
`AttentionBase::CheckMask`, `AttentionBase::GetPresent` (templatized on
`TOpKernelContext`) |
| **LongformerAttentionBase** |
`contrib_ops/cpu/bert/longformer_attention_base.h` |
`LongformerAttentionBase::CheckInputs` |
| **CumSum** | `cpu/math/cumsum.h` | `GetAxis` (free function) |
| **RoiAlign** | `cpu/object_detection/roialign.h` |
`CheckROIAlignValidInput` (free function), `RoiAlignBase` constructor
templatized on `TKernelInfo` |
| **Concat** | `cpu/tensor/concatbase.h` |
`ConcatBase::PrepareForCompute` (templatized, delegates to
`PrepareForComputeImpl`) |
| **Gather** | `cpu/tensor/gatherbase.h` |
`GatherBase::PrepareForCompute` (templatized, delegates to
`PrepareForComputeImpl`) |
| **Unsqueeze** | `cpu/tensor/unsqueeze.h` |
`UnsqueezeBase::PrepareCompute` (templatized on `KernelContextType`) |
| **Upsample** | `cpu/tensor/upsamplebase.h` |
`UpsampleBase::AdjustOutputSizeAsPolicy`,
`upsamplebase_helper::AdjustOutputSizeAsPolicy` (free helper) |

### Constructor templatization (for plugin EP adapter compatibility)

| Class | Header File | Change |
|-------|-------------|--------|
| **CropBase** | `contrib_ops/cpu/crop.h` | Constructor templatized on
`KernelInfoType`, `GetAttrsOrDefault` calls use `info.template` syntax |
| **SpaceDepthBase** | `cpu/tensor/space_depth_ops.h` | Constructor
templatized on `KernelInfoType`, `GetAttr` call uses `info.template`
syntax |
| **RoiAlignBase** | `cpu/object_detection/roialign.h` | Constructor
templatized on `TKernelInfo`, all `GetAttr` calls use `info.template`
syntax |

### CUDA-side updates

| File | Change |
|------|--------|
| `cuda/tensor/upsample.cc` | Added explicit template instantiations for
`Upsample<float>`, `Upsample<double>`, `Upsample<MLFloat16>`,
`Upsample<int32_t>`, `Upsample<uint8_t>` (needed because
`AdjustOutputSizeAsPolicy` implementation moved to header) |

### Files with code removed (moved to headers)

| Source File | Lines Removed | Moved To |
|-------------|---------------|----------|
| `contrib_ops/cpu/bert/attention_base.cc` | ~333 | `attention_base.h` |
| `contrib_ops/cpu/bert/longformer_attention_base.cc` | ~133 |
`longformer_attention_base.h` |
| `cpu/math/cumsum.cc` | ~23 | `cumsum.h` |
| `cpu/object_detection/roialign.cc` | ~74 | `roialign.h` |
| `cpu/tensor/concat.cc` | ~8 | `concatbase.h` |
| `cpu/tensor/gather.cc` | ~4 | `gatherbase.h` |
| `cpu/tensor/unsqueeze.cc` | ~51 | `unsqueeze.h` |
| `cpu/tensor/upsample.cc` | ~44 | `upsamplebase.h` |

## Testing

- Existing unit tests cover all affected operators (Attention,
LongformerAttention, CumSum, RoiAlign, Concat, Gather, Unsqueeze,
Upsample, Crop, SpaceToDepth/DepthToSpace).
- No behavioral changes — all function logic is identical; only the
location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged —
declarations remain and route through the existing `ProviderHostCPU`
bridge.

## Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels
as a standalone shared library that can be updated independently of the
ORT core. The refactoring enables additional CUDA operators to compile
in the plugin build by making their CPU-side validation and preparation
helpers available as header-inline functions.

This PR is a direct continuation of PR microsoft#27617 which applied the same
pattern to Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm,
and NonMaxSuppression operators.
…7626)

### Description
Add a Windows VERSIONINFO resource (.rc file) for the Vitis AI provider
DLL, following the same pattern used for CUDA, TensorRT, and QNN EPs
(added in microsoft#24606). This embeds the ORT version into the DLL's PE header
so it shows up in file properties.

### Motivation and Context
We need the version embedded in onnxruntime_providers_vitisai.dll to track changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…#27520)

### Description
Skip building the `custom_op_library` library if CUDA_MINIMAL is enabled.



### Motivation and Context
microsoft#27308 removes cudnn
include for `custom_op_library` target in cmake if CUDA_MINIMAL is
enabled.

In fact, the `custom_op_library` target does not define the
`USE_CUDA_MINIMAL` macro (there is no
`target_compile_definitions(custom_op_library PRIVATE
-DUSE_CUDA_MINIMAL)` in onnxruntime_unittests.cmake), so
`cuda_context.h`, which is included by cuda_ops.cc, still includes
cudnn.h. The CI just got lucky and passed because cudnn.h is under
--cuda_home; a local build might fail to find cudnn.h.
### Description

For FP16 models with block-quantized weights (`DQ(int4/int2/int8,
fp16_scale) → MatMul(fp16)`), the `DQMatMulToMatMulNBitsSelector` failed
to match on CPU EP because FP16 MatMul nodes are not claimed by CPU EP
during graph partitioning, leaving their execution provider unassigned
(empty string `""`). The selector's EP compatibility check rejected
these nodes.

This PR:
- Adds `""` (empty/unassigned EP) to the compatible providers list for
`DQMatMulToMatMulNBitsSelector` so it can match FP16 MatMul nodes not
yet assigned to an EP. The resulting `MatMulNBits` node is assigned to
`kCpuExecutionProvider` by the action (which has both `float` and
`MLFloat16` CPU kernels).
- Adds `""` to the `QDQSelectorActionTransformer` transformer-level
compatible EPs so unassigned nodes reach individual selectors (other
selectors are unaffected since their own provider lists don't include
`""`).
- Removes the `DQCastMatMulToMatMulNBitsSelector` and
`DQCastMatMulToMatMulNBitsAction`, which handled a `DQ → Cast(fp16→fp32)
→ MatMul` pattern that only existed after `InsertCastTransformer` ran.
That fusion only worked incidentally when `FuseInitializersTransformer`
(Level 4) triggered an optimization loop repeat, giving Level 2 QDQ
fusions a second pass — a behavior that didn't occur in all builds
(e.g., minimal/extended-minimal builds without
`FuseInitializersTransformer`).
- Replaces the `DQCastMatMulConvertedToMatMulNBits` test with
`DQMatMulFP16ConvertedToMatMulNBits` that tests the actual scenario:
`DQ(int4, fp16_scale) → MatMul(fp16)` on CPU EP.
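The EP-compatibility change reduces to treating the empty string as a valid provider in the selector's membership check; a sketch (names are illustrative, not the actual `BaseSelector` API):

```python
# "" represents a node left unassigned during partitioning (e.g. an
# FP16 MatMul that the CPU EP has no kernel for and so never claims).
COMPATIBLE_EPS = {"CPUExecutionProvider", ""}

def selector_can_match(node_ep):
    # The fix adds "" to the compatible set so unassigned FP16 MatMul
    # nodes reach the DQMatMulToMatMulNBits selector on the first
    # Level 2 pass.
    return node_ep in COMPATIBLE_EPS
```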

### Motivation and Context

FP16 models with block-quantized weights were not getting `DQ →
MatMulNBits` fusion when running on CPU EP in certain ORT builds. The
fusion worked on x64 full builds by luck — `InsertCastTransformer`
created `DQ→Cast→MatMul` patterns, then `FuseInitializersTransformer`
(Level 4) modified FP16 initializers causing the optimization loop to
repeat, giving Level 2 QDQ fusions a second pass where the Cast-aware
selector matched. In builds without `FuseInitializersTransformer` (e.g.,
minimal builds, arm packages), the loop didn't repeat and the fusion
never applied.

The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't
claim FP16 MatMul nodes during partitioning. These nodes have an empty
EP string, which the `QDQSelectorActionTransformer` and `BaseSelector`
both rejected. The fix allows the `DQMatMulToMatMulNBits` selector to
match unassigned nodes directly on the first Level 2 pass, before
`InsertCastTransformer` runs, eliminating the dependency on the
optimization loop repeat.
microsoft#27650)

### Description

Revert QNN SDK logging verbosity changes introduced in
microsoft#24931.

This reverts commit ec4f6bf.


### Motivation and Context

Logging used to fail on QNN backend destruction (when releasing QNN
context handles) with segmentation faults (even with empty user logging
functions), hence the revert.

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
adrianlizarraga and others added 10 commits March 14, 2026 00:30
### Description
- Updates the `ValidateExternalData()` function:
- Resolves symlinks when validating external data path for **models
loaded from memory**.
- Previously only did a lexical check that did not resolve symlinks.
Now, we resolve symlinks.
  - **Now requires the external data path to exist.**
- Replace the `(base_dir, model_dir)` function parameters with just
`model_dir`. `base_dir` was always derived from `model_dir`.
  - Skip validation for WASM builds that do not have a filesystem.
- Return `Status` instead of throwing exceptions when `std::filesystem`
functions fail.
- Updates `Graph::ConvertInitializersIntoOrtValues()`:
- Prevents unnecessary calls to `ValidateExternalData()` for external
data paths that have already been validated.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
… / GetCompatibilityInfoFromModelBytes (microsoft#27565)

### Description
<!-- Describe your changes. -->
This change adds C# and Python language bindings and tests for the
recently-introduced GetCompatibilityInfoFromModel /
GetCompatibilityInfoFromModelBytes API.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
microsoft#27015 introduced a new API to facilitate getting the model
compatibility information from the metadata of a model (either a file or
the model bytes). For convenience, we should ideally have some other
language bindings included to make consumption a little easier.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### Description
The code contains some unreferenced variables or functions, which will
generate a warning.



### Motivation and Context
The change only removes two warnings, but those warnings cause
compilation failures when -Werror is enabled.
…oft#27590)

### Description

Extend `FusionRotaryEmbeddings` to handle Qwen3's on-the-fly rotary
position embedding computation, where cos/sin values are computed from
`inv_freq` at runtime instead of being looked up from a pre-computed
cache.

This is a follow-up to microsoft#27556 (Qwen3 basic model type support). Depends
on microsoft#27556.

Part of microsoft#25083.

### Motivation and Context

Qwen3 models (ranked 4th on MTEB) compute RoPE differently from existing
supported models (Phi, LLaMA, etc.). Instead of pre-computing cos/sin
caches and looking them up via `Gather(cache, position_ids)`, Qwen3
computes them on-the-fly:

```python
freqs = inv_freq_expanded @ position_ids_expanded   # MatMul
emb = torch.cat((freqs, freqs), dim=-1)             # Concat
cos = emb.cos() * attention_scaling                  # Cos, Mul
sin = emb.sin() * attention_scaling                  # Sin, Mul
```
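The same computation can be done offline to build the caches the fusion substitutes in; a stdlib-only sketch (the real helper's cache layout, e.g. half-dim vs full-dim, may differ from this illustration):

```python
import math

def build_cos_sin_cache(inv_freq, max_pos, scaling=1.0):
    # Offline equivalent of the on-the-fly graph above:
    # freqs[p][j] = p * inv_freq[j]; emb = concat(freqs, freqs);
    # cos/sin of emb scaled by attention_scaling.
    cos_cache, sin_cache = [], []
    for p in range(max_pos):
        emb = [p * f for f in inv_freq] * 2  # concat(freqs, freqs)
        cos_cache.append([math.cos(e) * scaling for e in emb])
        sin_cache.append([math.sin(e) * scaling for e in emb])
    return cos_cache, sin_cache
```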

Additionally, TorchScript exports of Qwen3 insert `Cast` nodes in the
`rotate_half` pattern (from `torch.floor_divide` tracing), which the
existing path patterns don't account for.

### Changes

**`fusion_rotary_attention.py`:**
- Add Cast-tolerant `rotate_half` path patterns
(`rotate_half_x2_path_2_3`, `_2_4`, `rotate_half_x1_path_2_3`, `_2_4`)
that allow 1-2 Cast nodes between Unsqueeze and Div in the dynamic Slice
index computation
- Add `sin_path_5` / `cos_path_5` patterns matching the on-the-fly
computation: `MatMul → Transpose → Concat → Cos/Sin → Mul(scaling) →
Unsqueeze → Mul`, with optional Cast variant (the optimizer's earlier
Cast fusion pass may remove the Cast)
- Add `create_cos_sin_cache_from_on_the_fly_rope()` helper that extracts
`inv_freq` weights, computes cos/sin caches as model initializers, and
traces `position_ids` from the graph
- Handle per-layer vs shared node removal correctly (only remove
per-layer Unsqueeze/outer Mul nodes; shared MatMul/Cos/Sin nodes are
pruned automatically by the optimizer)

**`qwen3_model_generator.py`:**
- Add `include_rope=True` parameter to `create_qwen3_decoder_layer()`
- Generate full on-the-fly RoPE computation graph: `inv_freq`
initializer, `position_ids` input, MatMul/Transpose/Concat/Cos/Sin/Mul
nodes, and `rotate_half` pattern with dynamic Slice indices (including
Cast nodes from floor division)
- Apply RoPE to both Q and K paths

**`test_attention_fusion.py`:**
- Add `test_qwen3_rotary_embedding_fusion` verifying 2 RotaryEmbedding
nodes are fused along with 3 SimplifiedLayerNormalization and 1
SkipSimplifiedLayerNormalization

### Verification

- **Unit tests**: All 15 `test_attention_fusion.py` tests pass (14
existing + 1 new)
- **Real model**: Verified on Qwen3-Embedding-0.6B (28 layers): 56
RotaryEmbedding nodes fused (28 layers × 2 per layer for Q and K),
reducing total node count from 7416 → 4661 (37% reduction)
- **No regressions**: All changes are additive alternative path patterns
— existing models that use dynamic Slice indices or cache-based RoPE
never hit the new paths
- **Lint**: `lintrunner -a` clean on all modified files
It seems some outputs in the spanned list may be nullptr. Checking for
nullptr and skipping such entries does not appear to disturb proper
execution of models.

That check was not required for the legacy EP using the Host API.
…fused nodes in different GraphViews (microsoft#27666)

### Description
Fixes a bug where `PluginExecutionProvider::GetCapability()` incorrectly
assigned duplicate MetaDef IDs to fused nodes that live in different
GraphViewer instances (e.g., the then/else branches of an If node).

The root cause was that `GetCapability()` created a new
`ModelMetadefIdGenerator` on every invocation. Since the graph
partitioner calls `GetCapability()` once per subgraph, the generator's
monotonic counter reset each time, producing colliding IDs across
subgraphs. This caused session creation to fail with:

> Failed to add kernel for example_ep_9433721956998717990_0 example_ep
example_ep: Conflicting with a registered kernel with op versions. the
since version is: 1

#### Fix
- Promoted `ModelMetadefIdGenerator` to an instance member of
`PluginExecutionProvider` so the same generator is reused across all
`GetCapability()` calls, ensuring unique MetaDef IDs.
- This is also consistent with how existing provider-bridge EPs create
and use a single generator instance.
- **Bonus perf improvement**: No longer recomputes the entire model's
hash on every call to `GetCapability()`.
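The bug and fix are easy to see in miniature (a conceptual sketch, not the actual `ModelMetadefIdGenerator` API):

```python
import itertools

class MetadefIdGenerator:
    # Each instance has its own monotonic counter, so constructing a
    # fresh generator inside every GetCapability() call restarts the
    # counter and produces colliding IDs across subgraphs.
    def __init__(self):
        self._next = itertools.count()

    def generate_id(self, model_hash):
        return f"example_ep_{model_hash}_{next(self._next)}"

# Buggy pattern: a new generator per subgraph -> both IDs end in _0.
buggy = [MetadefIdGenerator().generate_id("h") for _ in range(2)]

# Fixed pattern: one instance member reused across calls -> unique IDs.
gen = MetadefIdGenerator()
fixed = [gen.generate_id("h") for _ in range(2)]
```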

#### Testing

Example EP changes:
- Refactored `SaveConstantInitializers()` →
`TrySaveConstantInitializer()` to save initializers per-node-input
instead of via `graph.GetInitializers()`, which doesn't return
initializers defined in parent or sibling subgraphs.
- Extracted `CopiesConstantInitializers()` helper to deduplicate the
condition for drop_constant_initializers.

Unit testing:
- Added unit test called
`CompilingPluginEp_MultiSubgraphs_DuplicateMetaDefIdBug` — runs an If
model with Mul nodes in both branches, verifying that both fused nodes
receive unique MetaDef IDs and the session creates/runs successfully.


Credit to @apwojcik for [finding the
bug.](microsoft#27608)
### Description

Fix 2 bugs in emdawnwebgpu.

1. Fix incorrect handling for device lost. See also:
- issue: [Unexpected exit on device lost handler [492350387] -
Chromium](https://issues.chromium.org/issues/492350387)
- PR: [emdawnwebgpu: Add runtimeKeepalive for device.lost handler by
fs-eire · Pull Request #57 ·
google/d…](google/dawn#57)
(but dawn does not accept PR with copilot as co-author, so just for
reference)

2. Fix wrong call to WGPUBufferImpl constructor. See also:
- issue: [Incorrect WGPUBufferImpl constructor called from
importJsBuffer](https://issues.chromium.org/issues/492539247)

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- [x] Analyze CI failure: `EPContextNode_ForeignSourceSkipped` assertion
expects "OpenVINOExecutionProvider" in error but actual error from
TransformerMemcpyImpl doesn't include it
- [x] Fix `tensorrt_basic_test.cc`: Remove overly-specific
"OpenVINOExecutionProvider" assertion from
`EPContextNode_ForeignSourceSkipped`
- [x] Fix `nv_ep_context_test.cc`: Remove same overly-specific assertion
from `EPContextNode_ForeignSourceSkipped` (proactive)
- [x] Run code review (no actionable findings)
- [x] Run CodeQL security check (no findings)




<details>

<summary>Original prompt</summary>


----

*This section details the original issue you should resolve*

<issue_title>NvTensorRTRTXExecutionProvider::GetCapability claims
EPContext nodes belonging to other EPs, causing crash on multi-GPU
systems</issue_title>
<issue_description>### Describe the issue

On multi-GPU systems where both `OpenVINOExecutionProvider` and
`NvTensorRTRTXExecutionProvider` are registered,
loading an EPContext model produced by OpenVINO causes an access
violation (0xC0000005) or
"Could not find an implementation for EPContext(1)" error.

The root cause is that `NvExecutionProvider::GetCapability()` in
`nv_execution_provider.cc` claims **all**
`EPContext` nodes without checking the `source` attribute:

```cpp
// nv_execution_provider.cc ~line 2019
const bool is_context_node = node && !node->OpType().empty() && node->OpType() == EPCONTEXT_OP;
if (is_context_node) {
    // Claims any EPContext node — even those produced by OpenVINO, QNN, etc.
    result.push_back(ComputeCapability::Create(std::move(sub_graph)));
}
```

The `EPContext` contrib op schema defines an optional `source` attribute
specifically for EP identification
(`contrib_defs.cc`). Other EPs already check this attribute:

- **OpenVINO EP** checks `source == kOpenVINOExecutionProvider` in
`EPCtxHandler::CheckForOVEPCtxNode()`
- **QNN EP** checks `cache_source == "qnnexecutionprovider" ||
cache_source == "qnn"` in `PartitionCtxModel()`

The NvTensorRTRTX EP neither checks `source` when claiming EPContext
nodes in `GetCapability()`,
nor writes `source` when creating EPContext nodes in `CreateCtxNode()`.

### Proposed fix

Add a `source` attribute check to `NvExecutionProvider::GetCapability()`
before claiming EPContext nodes:

```cpp
const bool is_context_node = node && !node->OpType().empty() && node->OpType() == EPCONTEXT_OP;
if (is_context_node) {
    // Only claim EPContext nodes that belong to this EP.
    // If the SOURCE attribute is present and doesn't match, skip the node.
    const auto& attrs = node->GetAttributes();
    if (attrs.count(SOURCE) > 0 &&
        attrs.at(SOURCE).s() != kNvTensorRTRTXExecutionProvider) {
        continue;
    }
    // ... claim the node
}
```

This requires adding `static const std::string SOURCE = "source";` to
`onnx_ctx_model_helper.h`
(matching the existing constant in QNN EP's
`builder/onnx_ctx_model_helper.h` and OpenVINO EP's
`onnx_ctx_model_helper.h`).

**Additionally**, `CreateCtxNode()` in `onnx_ctx_model_helper.cc` should
be updated to write the
`source` attribute (set to `kNvTensorRTRTXExecutionProvider`) when
producing EPContext models,
following the same pattern as OpenVINO EP's `AddOVEPCtxNodeToGraph()`.
This ensures NvTensorRTRTX
EPContext models are properly tagged for the future.
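
The claiming rule described above can be sketched as a small standalone helper. This is an illustration only, not the actual ORT code: `ClaimsEpContextNode` is a hypothetical name, and node attributes are modeled as a plain string map rather than ONNX `AttributeProto`s.

```cpp
#include <cassert>
#include <map>
#include <string>

// Intended rule: an EP claims an EPContext node only when the optional
// "source" attribute is absent (legacy, untagged models) or matches the
// EP's own name. Nodes tagged for a different EP are skipped.
bool ClaimsEpContextNode(const std::string& ep_name,
                         const std::map<std::string, std::string>& attrs) {
  const auto it = attrs.find("source");
  if (it != attrs.end() && it->second != ep_name) {
    return false;  // tagged for another EP -> skip the node
  }
  return true;  // untagged, or tagged for this EP -> claim it
}
```

With this rule, the multi-GPU scenario from the repro resolves cleanly: the OpenVINO-produced node is claimed only by OpenVINO EP, while untagged legacy models keep their old behavior.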

### Urgency

This is a **P1 blocker for MLCommons MLPerf Client v1.6** testing on
multi-GPU laptop systems
(Intel iGPU + NVIDIA dGPU). See:
https://github.com/mlcommons/mlperf_client_dev/issues/976

### To reproduce

**System:** Any system with both an Intel GPU (OpenVINO EP) and NVIDIA
GPU (NvTensorRTRTX EP)

1. Register both OpenVINO EP and NvTensorRTRTX EP with ORT
2. Load an EPContext model with `source=OpenVINOExecutionProvider`
(e.g., Phi-3.5 compiled by OpenVINO)
3. Create a session with auto EP selection (`PREFER_GPU`) or manual
multi-EP ordering

**Expected:** OpenVINO EP claims its own EPContext node; NvTensorRTRTX
EP skips it
**Actual:** NvTensorRTRTX EP claims the OpenVINO EPContext node → access
violation crash (0xC0000005)

A self-contained min-repro (C++ source + Python model generator) is
available. The repro uses
`RegisterExecutionProviderLibrary` to register both EP DLLs, creates a
synthetic EPContext model
with `source=OpenVINOExecutionProvider`, and demonstrates the crash.

### Tested fix

We have built and tested the fix described above against:
- Synthetic EPContext model with `source=OpenVINOExecutionProvider`
- Real Phi-3.5 EPContext model from MLPerf (`openvino_model_dy.onnx`)

Both pass with the fix applied: NvTensorRTRTX EP correctly skips the
OpenVINO EPContext node
(`"[NvTensorRTRTX EP] No graph will run on Nv execution provider"`), and
OpenVINO EP successfully
creates the session.

The classic `TensorRTExecutionProvider` has the same issue
(`tensorrt_execution_provider.cc` line 2593)
but is less commonly deployed alongside other EPs.

### Environment

- ORT version: 1.24.3 (rel-1.24.3 tag)
- OS: Windows 11
- Hardware: Intel Core i9-11950H (iGPU) + NVIDIA GeForce RTX 3080 Laptop
GPU (dGPU)
- EP packages: OpenVINO EP
(MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8 1.8.63.0),
NvTensorRTRTX EP (MicrosoftCorporationII.WinML.NVIDIA.TRT-RTX.EP.2
0.0.26.0)
</issue_description>

<agent_instructions>Please provide a fix and add unit
test.</agent_instructions>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes microsoft#27622

<!-- START COPILOT CODING AGENT TIPS -->
---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
# Description

This PR addresses several build warnings and a build error in the CUDA
provider, primarily focused on improving the stability of Debug builds.

## Changes

### CUDA Provider Fixes
- **Fix signedness comparison warnings**:
- In
[tile.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/core/providers/cuda/tensor/tile.cc),
changed the `axis` loop variable type from `size_t` to `int32_t` to
match `input_rank`.
- In
[pad.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/core/providers/cuda/tensor/pad.cc),
converted `p_pads->size()` to `int32_t` using `narrow` and updated the
loop variable type to resolve signedness warnings across template
instantiations.
- **Fix GQA build error**:
- Added a missing include for `common.cuh` in
[group_query_attention_qkv.cuh](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/contrib_ops/cuda/bert/group_query_attention_qkv.cuh).
This resolves the `identifier "CUDA_KERNEL_ASSERT" is undefined` error
encountered in Debug builds.
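
The signedness fixes above follow a common pattern: convert the unsigned container size to a signed bound once, via a checked narrowing cast, then iterate with a matching signed type. A minimal sketch of that pattern — `narrow_cast` here is a stand-in for ORT's `narrow`, and `SumPads` is a made-up example function, not the actual `pad.cc` code:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for onnxruntime's narrow<>: a narrowing cast that throws when
// the value does not survive the round trip.
template <typename To, typename From>
To narrow_cast(From v) {
  const To result = static_cast<To>(v);
  if (static_cast<From>(result) != v) {
    throw std::runtime_error("narrowing conversion lost information");
  }
  return result;
}

// Convert size() to int32_t once, then loop with int32_t so the
// comparison `i < pads_size` is signed-vs-signed (no -Wsign-compare).
int64_t SumPads(const std::vector<int64_t>& pads) {
  const int32_t pads_size = narrow_cast<int32_t>(pads.size());
  int64_t sum = 0;
  for (int32_t i = 0; i < pads_size; ++i) {
    sum += pads[i];
  }
  return sum;
}
```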

### Test Improvements
- **Rotary Embedding Tests**:
- Skipped out-of-bounds position ID tests in
[rotary_embedding_op_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc)
and
[test/contrib_ops/rotary_embedding_op_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc)
for Debug builds. This is necessary because CUDA device-side asserts
(enabled in Debug mode) can poison the CUDA context when encountering
out-of-bounds indices, causing subsequent tests to fail.

### Minor Cleanup
- Simplified initializer list usage in
[graph_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/ir/graph_test.cc)
to avoid build error like:
```
inlined from ‘constexpr void std::vector<_Tp, _Alloc>::resize(size_type) [with _Tp = onnxruntime::NodeArg*; _Alloc = std::allocator<onnxruntime::NodeArg*>]’ at /usr/include/c++/13.2.0/bits/stl_vector.h:1013:21,
    inlined from ‘virtual void onnxruntime::test::GraphTest_GraphConstruction_CheckGraphInputOutputOrderMaintained_Test::TestBody()’ at /home/tlwu/git/onnxruntime/onnxruntime/test/ir/graph_test.cc:1214:16:
/usr/include/c++/13.2.0/bits/stl_uninitialized.h:1132:28: error: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ forming offset 8 is out of the bounds [0, 8] [-Werror=array-bounds=]
 1132 |           __builtin_memmove(__result, __first, __count * sizeof(_Tp));
```
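
The general direction of that cleanup can be sketched as follows. This is illustrative only — the real test deals with `NodeArg*`, replaced here by `int*` to stay self-contained: constructing the vector from an initializer list avoids the `resize()`-then-assign pattern that triggered the GCC 13 `-Warray-bounds` false positive.

```cpp
#include <vector>

// Instead of:
//   std::vector<int*> outputs;
//   outputs.resize(2);                 // pattern that tripped -Warray-bounds
//   outputs[0] = &a; outputs[1] = &b;
// construct the vector with its elements directly:
std::vector<int*> MakeOutputs(int& a, int& b) {
  return {&a, &b};
}
```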
@ankitm3k ankitm3k merged commit ff04504 into ovep-develop Mar 18, 2026
7 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_18032026 branch March 18, 2026 05:52