Improve GPU defaults, caching, and pointwise scheduling by Rolaand-Jayz · Pull Request #4668 · ROCm/AMDMIGraphX

Rolaand-Jayz · 2026-03-13T20:01:02Z

Summary

This PR improves several GPU-side MIGraphX defaults and hot paths that were leaving performance on the table, especially on recent RDNA-class parts.

Changes in this series:

auto-tune default GPU stream count by device CU count
select NHWC and MLIR attention defaults by architecture instead of relying on manual env tuning
skip a redundant host bounce on final GPU fallback returns
fix GPU linkage when MIGRAPHX_ENABLE_MLIR=Off
cache repeated GPU feature/device-name probes
cache repeated HIP code compilations
cache repeated MIOpen convolution solution lookups
avoid scheduling undersized async stream partitions
tune untiled GPU pointwise launch bounds by wavefront size instead of falling back to oversized local=1024 launches
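
To make the "undersized async stream partitions" item concrete, here is a minimal sketch of threshold-based stream assignment. It is not the scheduler code from this PR: `assign_streams`, its parameters, and the round-robin policy are hypothetical names and choices made up for illustration. The only idea taken from the description is that work below a split threshold should not be moved onto its own stream.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one estimated weight per independent branch of the graph.
// A branch is only moved to another stream when its weight reaches
// `split_threshold`; lighter branches stay on stream 0 so that cross-stream
// synchronization is not paid for a tiny amount of work.
inline std::vector<std::size_t> assign_streams(const std::vector<std::size_t>& weights,
                                               std::size_t split_threshold,
                                               std::size_t nstreams)
{
    std::vector<std::size_t> streams;
    streams.reserve(weights.size());
    std::size_t next = 0;
    for(auto w : weights)
    {
        if(nstreams < 2 || w < split_threshold)
        {
            streams.push_back(0); // too small, or no second stream to split onto
            continue;
        }
        streams.push_back(next); // round-robin over the available streams
        next = (next + 1) % nstreams;
    }
    return streams;
}
```

With two streams and a threshold of 10, a branch of weight 5 stays on stream 0 while the heavier branches alternate between streams 0 and 1.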

The pointwise launch change is the most visible runtime fix in the current OpenProteus path: on wave32 hardware it reduces PReLU-heavy pointwise kernels to local=128, which avoids the oversized launch configuration emitted previously.
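
A minimal sketch of deriving the local size from the wavefront size, under stated assumptions: the helper name, the factor of 4 wavefronts per workgroup, and the cap of 256 are hypothetical and were chosen only so that wave32 yields the local=128 quoted above. The actual constants live in src/targets/gpu/jit/pointwise.cpp and may differ.

```cpp
#include <algorithm>
#include <cstddef>

// Assumed values for illustration only (not taken from the PR diff).
constexpr std::size_t wavefronts_per_group = 4;
constexpr std::size_t max_pointwise_local  = 256;

// Hypothetical helper: pick the workgroup ("local") size for an untiled
// pointwise kernel as a small multiple of the wavefront size, capped so the
// launch never reaches the old local=1024 configuration.
constexpr std::size_t pointwise_local_size(std::size_t wavefront_size)
{
    return std::min(wavefront_size * wavefronts_per_group, max_pointwise_local);
}

// wave32 -> 128, the value reported in the description above.
static_assert(pointwise_local_size(32) == 128, "wave32 should use local=128");
```

Under these assumed constants a wave64 device would get local=256; that figure comes from the sketch, not from the PR.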

Benchmark

Measured on gfx1100 with:

model: 2x_OpenProteus_Compact_i2_70K_fp32.onnx
shape: 8x3x192x192
precision: fp16
command: migraphx-driver perf --migraphx --gpu --enable-offload-copy

Latest pointwise-launch patch vs previous series head:

previous: 11.80 ms total, 11.90 ms mean
this branch: 8.31 ms total, 8.39 ms mean

That is roughly a 30% reduction in run time on the isolated MIGraphX perf path for this workload.
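
For reference, the reduction follows directly from the two measurements listed above:

```latex
\frac{11.80 - 8.31}{11.80} = \frac{3.49}{11.80} \approx 0.296 \quad \text{(total)}
\qquad
\frac{11.90 - 8.39}{11.90} = \frac{3.51}{11.90} \approx 0.295 \quad \text{(mean)}
```

Expressed as a speedup, 11.80 / 8.31 is approximately 1.42x.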

Validation

Locally validated with:

cmake --build ... --target migraphx_gpu driver test_gpu_jit
test_gpu_jit compile_pointwise
test_gpu_jit compile_pointwise_launch_bounds
repeated migraphx-driver compile / migraphx-driver perf runs on gfx1100

Note: local validation was done in an MLIR-disabled build because this environment does not provide rocMLIR; upstream CI should cover the full build matrix.

Changelog Category

Added: New functionality.
Changed: Changes to existing functionality.
Removed: Functionality or support that has been removed. (Compared to a previous release)
Optimized: Component performance that has been optimized or improved.
Resolved Issues: Known issues from a previous version that have been resolved.
Not Applicable: This PR is not to be included in the changelog.

Copilot

Pull request overview

This PR updates MIGraphX's GPU target defaults and hot paths to improve performance on modern AMD GPUs (notably RDNA), adding architecture-aware heuristics, caching for repeated GPU compilation/solver work, and tuning scheduler/pointwise launch behavior.

Changes:

Add adaptive GPU stream-count defaults (based on compute unit count) and document the new MIGRAPHX_NSTREAMS=0 “adaptive” behavior.
Make NHWC layout and MLIR attention defaults architecture-aware via new gfx_* helpers, and cache repeated device feature/name probes.
Add caching for HIP compilation and MIOpen convolution solution lookups, plus scheduling/pointwise launch tuning and accompanying tests.
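
The adaptive stream-count behavior can be sketched roughly as follows. This is an illustration, not the resolver added in context.hpp: `resolve_stream_count` is a hypothetical name, and the compute-unit thresholds and resulting stream counts are invented for the example. The only behavior taken from the review summary is that a value of 0 for MIGRAPHX_NSTREAMS means "adaptive" and that the adaptive choice depends on the compute unit count.

```cpp
#include <cstddef>

// Hypothetical resolver. `requested` models the value read from
// MIGRAPHX_NSTREAMS; 0 selects the adaptive path.
// The thresholds below are illustrative assumptions, not the PR's values.
inline std::size_t resolve_stream_count(std::size_t requested, std::size_t compute_units)
{
    if(requested != 0)
        return requested; // an explicit user setting always wins
    if(compute_units >= 96)
        return 4; // large device: more concurrent streams
    if(compute_units >= 48)
        return 2;
    return 1; // small device: extra streams mostly add overhead
}
```

Keeping the explicit setting ahead of the heuristic means existing deployments that pin a stream count keep their current behavior.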

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

File	Description
test/schedule_test.cpp	Adds coverage for new scheduler split-threshold behavior.
test/gpu/jit.cpp	Adds a test ensuring pointwise launch bounds follow wavefront sizing.
test/gpu/device_name.cpp	Adds tests for new architecture-detection helpers and defaults.
test/gpu/context_serialize.cpp	Adds tests for stream-count heuristic and default stream count resolution.
test/gpu/compile_miopen_cache.cpp	New test validating MIOpen convolution solution caching.
test/gpu/compile_hip_cache.cpp	New test validating HIP compilation caching.
test/gpu/adjust_allocation.cpp	Adds test ensuring offload-copy return path avoids redundant host bounce.
src/targets/gpu/target.cpp	Enables NHWC by default on selected archs unless explicitly overridden by env.
src/targets/gpu/schedule_model.cpp	Introduces GPU split-threshold default to reduce undersized partitions.
src/targets/gpu/mlir.cpp	Fixes MLIR-disabled linkage by gating includes and adding stubbed APIs.
src/targets/gpu/lowering.cpp	Reuses host value for final return when result was only copied to GPU for pipeline reasons.
src/targets/gpu/jit/pointwise.cpp	Tunes untiled pointwise local size by wavefront size (capped).
src/targets/gpu/include/migraphx/gpu/schedule_model.hpp	Extends GPU schedule model API with split threshold.
src/targets/gpu/include/migraphx/gpu/device_name.hpp	Adds `gfx_is_navi` and arch-based default helper declarations.
src/targets/gpu/include/migraphx/gpu/convolution.hpp	Adds in-process caching for MIOpen convolution solution selection.
src/targets/gpu/include/migraphx/gpu/context.hpp	Adds adaptive stream-count resolver and changes default ctor behavior.
src/targets/gpu/fuse_mlir.cpp	Switches MLIR attention defaults to new arch helper logic.
src/targets/gpu/device_name.cpp	Adds cached device-info probing and implements new arch helper APIs.
src/targets/gpu/compile_hip.cpp	Adds caching of HIP compilation results to avoid repeated recompiles.
src/schedule.cpp	Plumbs new split-threshold through scheduler partitioning.
src/msgpack.cpp	Adds missing include needed for compilation.
src/include/migraphx/schedule_model.hpp	Extends type-erased scheduler model interface to include split threshold.
docs/reference/MIGraphX-dev-env-vars.rst	Documents new NHWC and adaptive stream-count default behavior.
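
The compilation and solution caches listed in the table both amount to in-process memoization. A minimal sketch of that pattern, assuming a cache keyed on the source text plus the compiler flags: `compile_cache` and `compile_fn` are hypothetical names, and the callback stands in for the real HIP compilation step, whose interface is not shown in this thread.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical in-process cache mapping (source, flags) to a compiled artifact.
class compile_cache
{
    public:
    using compile_fn = std::function<std::string(const std::string&, const std::string&)>;

    explicit compile_cache(compile_fn f) : compile(std::move(f)) {}

    // Returns by value so callers never hold a reference into the map.
    std::string get(const std::string& source, const std::string& flags)
    {
        std::lock_guard<std::mutex> lock(m);
        const auto key = std::make_pair(source, flags);
        auto it        = entries.find(key);
        if(it != entries.end())
            return it->second; // cache hit: skip recompilation
        auto result = compile(source, flags);
        entries.emplace(key, result);
        return result;
    }

    private:
    compile_fn compile;
    std::mutex m;
    std::map<std::pair<std::string, std::string>, std::string> entries;
};
```

Holding the lock across the compile call keeps the sketch simple but serializes concurrent compilations; a production cache would more likely release the lock while compiling.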

test/gpu/compile_hip_cache.cpp

test/gpu/compile_miopen_cache.cpp

src/targets/gpu/device_name.cpp

anisha-amd

Please add to the changelog as well.

Rolaand-Jayz · 2026-03-16T05:06:53Z

Addressed the inline review items in commit 1d810f1.

Guarded scoped_env_var behind #ifndef _WIN32 in both cache tests so MSVC does not compile setenv/unsetenv.
Changed get_cached_device_info() to return by value to avoid returning a reference into the cached unordered_map after unlocking.
Added CHANGELOG.md entries under Develop / Changed and Develop / Optimized for PR #4668, covering the architecture-aware GPU defaults and the caching/scheduling/pointwise optimizations.
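
The return-by-value change to get_cached_device_info() can be illustrated with a minimal sketch. The `device_info` fields and `probe_device` are hypothetical stand-ins, since the real structure and HIP query are not shown in this thread. The point is that the copy is made while the lock is still held, so no caller keeps a reference into the shared map after the mutex is released.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the cached probe result.
struct device_info
{
    std::string name;
    std::size_t compute_units;
};

// Placeholder for the real (expensive) device query.
inline device_info probe_device(int device_id)
{
    return {"device" + std::to_string(device_id), 0};
}

// Returns a copy. Returning `const device_info&` would hand out a reference
// into the shared map that outlives the lock protecting it.
inline device_info get_cached_device_info(int device_id)
{
    static std::mutex m;
    static std::unordered_map<int, device_info> cache;
    std::lock_guard<std::mutex> lock(m);
    auto it = cache.find(device_id);
    if(it == cache.end())
        it = cache.emplace(device_id, probe_device(device_id)).first;
    return it->second; // the copy is made before `lock` is released
}
```

The cost is one small copy per lookup, which is negligible next to the probe that the cache avoids.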

pfultz2 · 2026-03-17T14:03:45Z

This needs to be split into separate PRs to be reviewed. Also, we need to validate performance against a larger set of models and architectures.

Rolaand-Jayz added 9 commits March 13, 2026 09:30

Auto-tune default GPU stream count

4d025a9

Select NHWC and MLIR attention defaults by arch

8e1771b

Skip redundant host bounce on GPU fallback return

a468a87

Fix MLIR-off GPU linkage in local build

f2785b6

Cache GPU feature probes by device

526c825

Cache repeated HIP code compilations

4ea5a1d

Cache repeated MiOpen convolution solutions

8dc5ce7

Avoid scheduling undersized GPU stream partitions

111dca2

Tune GPU pointwise launch bounds by wavefront

17a8d65

Rolaand-Jayz requested review from a team and causten as code owners March 13, 2026 20:01

Copilot AI review requested due to automatic review settings March 13, 2026 20:01

Copilot started reviewing on behalf of Rolaand-Jayz March 13, 2026 20:02

Copilot AI reviewed Mar 13, 2026

anisha-amd reviewed Mar 13, 2026

Rolaand-Jayz and others added 4 commits March 13, 2026 15:59

Stabilize quantized GPU NHWC compilation

266c172

Support PReLU in MLIR pointwise fusion

3baacd9

Use MLIR tuning DB without CFG gate

3f7d362

fix: address review feedback for PR 4668

1d810f1

causten requested a review from pfultz2 March 16, 2026 15:43

Rolaand-Jayz wants to merge 13 commits into ROCm:develop from Rolaand-Jayz:opt/migraphx-performance-pr