Improve GPU defaults, caching, and pointwise scheduling #4668
Rolaand-Jayz wants to merge 13 commits into ROCm:develop
Pull request overview
This PR updates MIGraphX's GPU target defaults and hot paths to improve performance on modern AMD GPUs (notably RDNA). It adds architecture-aware heuristics, caches repeated GPU compilation and solver work, and tunes scheduler and pointwise launch behavior.
Changes:
- Add adaptive GPU stream-count defaults (based on compute unit count) and document the new `MIGRAPHX_NSTREAMS=0` "adaptive" behavior.
- Make NHWC layout and MLIR attention defaults architecture-aware via new `gfx_*` helpers, and cache repeated device feature/name probes.
- Add caching for HIP compilation and MIOpen convolution solution lookups, plus scheduling/pointwise launch tuning and accompanying tests.
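The adaptive stream-count behavior in the first item can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `streams_for_compute_units` and `resolve_stream_count`, the divisor of 16, and the clamp range of 1 to 4 are all assumptions made for the example. The only parts taken from the description are that a value of 0 selects the adaptive path and that the adaptive default depends on the compute unit count.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical heuristic: one stream per 16 compute units, clamped to [1, 4].
// The divisor and the bounds are illustrative assumptions, not values from the PR.
inline std::size_t streams_for_compute_units(std::size_t compute_units)
{
    const std::size_t per_stream  = 16; // assumption
    const std::size_t min_streams = 1;  // assumption
    const std::size_t max_streams = 4;  // assumption
    return std::max(min_streams, std::min(max_streams, compute_units / per_stream));
}

// requested == 0 selects the adaptive path (mirroring MIGRAPHX_NSTREAMS=0);
// any other value is treated as an explicit stream count and returned unchanged.
inline std::size_t resolve_stream_count(std::size_t requested, std::size_t compute_units)
{
    if(requested != 0)
        return requested;
    return streams_for_compute_units(compute_units);
}
```

Under these assumed constants, `resolve_stream_count(0, 48)` yields 3, while `resolve_stream_count(2, 48)` keeps the explicit value 2. Keeping the explicit override ahead of the heuristic means a user-supplied setting is never silently replaced.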
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/schedule_test.cpp | Adds coverage for new scheduler split-threshold behavior. |
| test/gpu/jit.cpp | Adds a test ensuring pointwise launch bounds follow wavefront sizing. |
| test/gpu/device_name.cpp | Adds tests for new architecture-detection helpers and defaults. |
| test/gpu/context_serialize.cpp | Adds tests for stream-count heuristic and default stream count resolution. |
| test/gpu/compile_miopen_cache.cpp | New test validating MIOpen convolution solution caching. |
| test/gpu/compile_hip_cache.cpp | New test validating HIP compilation caching. |
| test/gpu/adjust_allocation.cpp | Adds test ensuring offload-copy return path avoids redundant host bounce. |
| src/targets/gpu/target.cpp | Enables NHWC by default on selected archs unless explicitly overridden by env. |
| src/targets/gpu/schedule_model.cpp | Introduces GPU split-threshold default to reduce undersized partitions. |
| src/targets/gpu/mlir.cpp | Fixes MLIR-disabled linkage by gating includes and adding stubbed APIs. |
| src/targets/gpu/lowering.cpp | Reuses host value for final return when result was only copied to GPU for pipeline reasons. |
| src/targets/gpu/jit/pointwise.cpp | Tunes untiled pointwise local size by wavefront size (capped). |
| src/targets/gpu/include/migraphx/gpu/schedule_model.hpp | Extends GPU schedule model API with split threshold. |
| src/targets/gpu/include/migraphx/gpu/device_name.hpp | Adds gfx_is_navi and arch-based default helper declarations. |
| src/targets/gpu/include/migraphx/gpu/convolution.hpp | Adds in-process caching for MIOpen convolution solution selection. |
| src/targets/gpu/include/migraphx/gpu/context.hpp | Adds adaptive stream-count resolver and changes default ctor behavior. |
| src/targets/gpu/fuse_mlir.cpp | Switches MLIR attention defaults to new arch helper logic. |
| src/targets/gpu/device_name.cpp | Adds cached device-info probing and implements new arch helper APIs. |
| src/targets/gpu/compile_hip.cpp | Adds caching of HIP compilation results to avoid repeated recompiles. |
| src/schedule.cpp | Plumbs new split-threshold through scheduler partitioning. |
| src/msgpack.cpp | Adds missing include needed for compilation. |
| src/include/migraphx/schedule_model.hpp | Extends type-erased scheduler model interface to include split threshold. |
| docs/reference/MIGraphX-dev-env-vars.rst | Documents new NHWC and adaptive stream-count default behavior. |
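The `src/targets/gpu/jit/pointwise.cpp` row describes sizing the untiled pointwise workgroup from the wavefront size, with a cap. A minimal sketch of that idea is shown here, assuming a rule of four wavefronts per workgroup and a cap of 256 work-items. Both constants and the function name `pointwise_local_size` are hypothetical; they were chosen only so that a wave32 device lands on the `local=128` figure quoted in the PR description.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sizing rule: a fixed number of wavefronts per workgroup,
// capped at a maximum workgroup size. Neither constant is taken from the PR.
inline std::size_t pointwise_local_size(std::size_t wavefront_size)
{
    const std::size_t waves_per_group = 4;   // assumption
    const std::size_t max_local       = 256; // assumption
    return std::min(wavefront_size * waves_per_group, max_local);
}
```

With these assumed values a wave32 device gets a local size of 128 and a wave64 device gets 256, so the workgroup scales with the hardware instead of using one fixed size for every architecture.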
anisha-amd left a comment:
Please add to the changelog as well.
Addressed the inline review items in commit 1d810f1.
This needs to be split into separate PRs to be reviewed. Also, we need to validate performance against a larger set of models and architectures.
Summary
This PR improves several GPU-side MIGraphX defaults and hot paths that were leaving performance on the table, especially on recent RDNA-class parts.
Changes in this series:
- `MIGRAPHX_ENABLE_MLIR=Off`
- `local=1024` launches

The pointwise launch change is the most visible runtime fix in the current OpenProteus path: on wave32 hardware it reduces PReLU-heavy pointwise kernels to `local=128`, which avoids the oversized launch configuration emitted previously.
Benchmark
Measured on `gfx1100` with:
- `2x_OpenProteus_Compact_i2_70K_fp32.onnx`
- `8x3x192x192`
- `fp16`
- `migraphx-driver perf --migraphx --gpu --enable-offload-copy`

Latest pointwise-launch patch vs previous series head:
- Previous series head: `11.80 ms` total, `11.90 ms` mean
- Latest pointwise-launch patch: `8.31 ms` total, `8.39 ms` mean

That is roughly a 30% reduction on the isolated MIGraphX perf path for this workload.
Validation
Locally validated with:
- `cmake --build ... --target migraphx_gpu driver test_gpu_jit`
- `test_gpu_jit compile_pointwise`
- `test_gpu_jit compile_pointwise_launch_bounds`
- `migraphx-driver compile` / `migraphx-driver perf` runs on `gfx1100`

Note: local validation was done in an MLIR-disabled build because this environment does not provide `rocMLIR`; upstream CI should cover the full build matrix.
Changelog Category