Refactor blockwise reduce lowering with DPP SubgroupReduceOp#2301
stefankoncarevic wants to merge 7 commits into ROCm:develop
Conversation
Pull request overview
This PR refactors the Rock BlockwiseBroadcastReduceOp lowering to use gpu.subgroup_reduce (with clustered reductions where applicable) and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions, improving inter-thread reduction performance on supported architectures.
Changes:
- Update blockwise broadcast-reduce lowering to select between shuffle+DPP, serial XOR shuffle, and LDS tree fallback paths, with shared helper functions.
- Introduce the rock-subgroup-reduce-to-dpp pass and wire it into the backend pipeline before convert-gpu-to-rocdl.
- Extend/adjust tests and pipelines to cover the new DPP clustered reduction behavior and pass ordering.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Refactors reduction lowering, adds shuffle/DPP paths, helpers, and emits gpu.shuffle + gpu.subgroup_reduce. |
| mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp | New Rock pass to lower gpu.subgroup_reduce into AMD DPP sequences via GPU transform patterns. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Adds the new pass source and links GPU transforms library. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts rock-subgroup-reduce-to-dpp into the backend pipeline after lowering affine. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new pass and its chip option. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Adds the generated pass decl macro for the new pass. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates expected printed pipelines to include rock-subgroup-reduce-to-dpp{chip=...}. |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Updates lowering checks and parameterizes arch via token substitution. |
| mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir | New integration test covering multiple cluster_size cases and both sum/max reductions. |
Force-pushed from 8636b38 to 918f35c
Force-pushed from 918f35c to 0c4fed5
…per extraction

Restructure the blockwise reduce rewrite pattern in BlockwiseGemmToThreadwise.cpp to improve clarity, maintainability, and enable DPP-based reductions via gpu.SubgroupReduceOp.

Shuffle decision logic:
- Introduce has2DThreadLayout guard (mTidPerWave > 0 && nTidPerWave > 0) to clearly separate GEMM-style 2D thread layouts from general cases
- Path 1 (Shuffle+DPP): activates when blockSize > nrDimProduct and the per-thread subtile is [1,1] with rDim == 1, using gpu.shuffle to transpose data from WMMA/MFMA strided layout into contiguous DPP-compatible layout
- Path 2 (Serial XOR): activates when blockSize <= nrDimProduct, performing log2(rDim) XOR butterfly reduction steps within a wave at stride nTidPerWave
- Initial LDS store is deferred: only performed when neither shuffle path applies, avoiding unnecessary LDS traffic for shuffle-eligible configurations

Parallel reduction with DPP:
- Use gpu.SubgroupReduceOp with cluster_size for DPP-eligible reductions (power-of-2 active threads, cluster_size <= waveSize)
- Only the reduction group leader (rtid == 0) writes the result back to LDS, followed by a barrier and broadcast read
- Use bitwise AND/SHRU for thread ID decomposition (rtid, nrtid) on the DPP path and for power-of-2 non-reduction dimensions; fall back to DIV/REM for non-power-of-2 cases
- Force scalar accumulation (vectorLen = 1) during threadwise pre-reduction on the DPP path to ensure correct element-wise reduction before SubgroupReduceOp

Helper extraction:
- getPerWaveThreadCounts: promote to static member function; extracts m_tid and n_tid counts from the tid slice view merge transform
- shuffleRearrangeForDPP: encapsulates the gpu.shuffle-based transposition from strided WMMA/MFMA layout to contiguous DPP layout (sourceLane = (lane % clusterSize) * stride + lane / clusterSize)
- readReducedResultsFromLDS: consolidates the repeated pattern of barrier + ThreadwiseReadInto from LDS into output registers (and optional extra output)

Tree reduction path:
- Retained as fallback for non-DPP-eligible configurations (non-power-of-2 thread counts or cluster_size > waveSize)
- Scope ceilPowerOf2 computation and treeMaxActiveThreads naming to this path

New test: blockwise_reduce_dpp_cluster_sizes.mlir
- Integration test covering DPP reduction with cluster sizes 2, 4, 8, 16, 32, 64
- Validates both sum (rand=none, all ones) and max (rand=fixed) reductions
- All test configurations use blockSize <= waveSize to ensure single-wave execution on both RDNA (waveSize=32) and CDNA (waveSize=64)
- cluster_size=64 falls back to tree reduction on RDNA since 64 > waveSize=32
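The sourceLane formula extracted into shuffleRearrangeForDPP can be sanity-checked with a small Python sketch (a simulation of the index math only; the wave_size = 64, cluster_size = 4 geometry is an assumed example, and source_lane is an illustrative name, not the C++ helper):

```python
def source_lane(lane: int, cluster_size: int, stride: int) -> int:
    # Transpose from strided WMMA/MFMA layout to contiguous DPP layout:
    # each lane fetches from (lane % cluster_size) * stride + lane / cluster_size.
    return (lane % cluster_size) * stride + lane // cluster_size

# Assumed example geometry: clusters of 4 across a 64-lane wave.
wave_size, cluster_size = 64, 4
stride = wave_size // cluster_size
mapping = [source_lane(l, cluster_size, stride) for l in range(wave_size)]
# The mapping is a permutation of the wave: no source lane is read twice,
# so the shuffle loses no data.
assert sorted(mapping) == list(range(wave_size))
```

Because the mapping is a bijection over the wave, a single gpu.shuffle per element suffices to rearrange the strided layout into the contiguous one DPP expects.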
…ion kernels

Remove the shuffle+DPP transpose path and serial XOR butterfly reduction from BlockwiseBroadcastReduceOp lowering. These paths used gpu.shuffle to rearrange data between WMMA/MFMA strided layout and contiguous DPP layout, adding complexity without consistent performance benefit. The DPP reduction path now uses gpu::SubgroupReduceOp directly with cluster_size, which handles cross-lane communication within a wavefront without requiring explicit data rearrangement through shuffle or LDS.

Key changes:
- Remove shuffleRearrangeForDPP() and all shuffle optimization logic (canUseShuffleOptimization, canUseSerialShuffle, XOR butterfly)
- Restrict DPP activation to partial_r > 2, as configurations with partial_r = 2 do not benefit from DPP due to insufficient work to amortize the instruction overhead; these fall back to LDS-Tree
- Remove forced scalar vectorization for DPP threadwise reduction
- Simplify LDS store to be unconditional (no longer skipped by shuffle)
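As a reference for what the op computes, here is a Python sketch of the clustered subgroup-reduce semantics the lowering now relies on (a behavioral model, not the DPP instruction sequence; the function and variable names are illustrative):

```python
from typing import Callable, List

def subgroup_reduce_clustered(lanes: List[float], cluster_size: int,
                              op: Callable[[float, float], float]) -> List[float]:
    # Model of a clustered subgroup reduce: every lane ends up holding the
    # reduction of its cluster of cluster_size consecutive lanes.
    assert len(lanes) % cluster_size == 0
    out: List[float] = []
    for base in range(0, len(lanes), cluster_size):
        cluster = lanes[base:base + cluster_size]
        acc = cluster[0]
        for v in cluster[1:]:
            acc = op(acc, v)
        out.extend([acc] * cluster_size)
    return out

# Sum over clusters of 4 in a toy 8-lane wave.
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(subgroup_reduce_clustered(vals, 4, lambda a, b: a + b))
# -> [10.0, 10.0, 10.0, 10.0, 26.0, 26.0, 26.0, 26.0]
```

Since every lane of a cluster holds the reduced value, the lowering only needs the leader lane of each cluster to write the result out.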
…ation in NR-Large-Tree path
Force-pushed from f8ed8ea to ac6d253
Pull request overview
This PR introduces a DPP-accelerated reduction path for rock.blockwise_broadcast_reduce by lowering eligible inter-thread reductions to gpu.subgroup_reduce with cluster_size, and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions in the backend pipeline.
Changes:
- Add a DPP-capable lowering path in BlockwiseGemmToThreadwise.cpp that emits gpu.subgroup_reduce for eligible configurations, keeping the LDS tree reduction as fallback.
- Add the rock-subgroup-reduce-to-dpp backend pass and wire it into the Rock backend pipeline before convert-gpu-to-rocdl.
- Add/update lit + integration tests to cover clustered subgroup-reduce scenarios and ensure the new pass is present in dumped pipelines.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Emits gpu.subgroup_reduce for eligible blockwise reductions; refactors final LDS-readback into a helper. |
| mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp | New Rock pass that lowers gpu.subgroup_reduce (clustered and non-clustered) to AMD DPP patterns. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Registers new transform and links GPU transform library support. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts rock-subgroup-reduce-to-dpp into the backend gpu.module pipeline. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new rock-subgroup-reduce-to-dpp pass and its chip option. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Adds the generated pass decl macro for the new pass. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates pipeline-dump checks to include the new pass for gfx90a/gfx942/gfx950. |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Updates lowering checks to reflect the new DPP path IR patterns. |
| mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir | New integration test covering multiple cluster_size values and sum/max reductions. |
…rch DB for wave size

- Change canUseDPP condition from >= to == for blockSize vs clusterSize * nonReductionDimSizeProduct to prevent potential out-of-bounds LDS writes by extra threads when blockSize exceeds the exact thread count needed for the DPP layout.
- Replace hard-coded chipset major version heuristic in SubgroupReduceToDPP with rock::lookupArchInfo(chip).waveSize for more robust subgroup size derivation.
- Update lowering_blockwise_broadcast_reduce test to use dimensions where blockSize == clusterSize * nrDimProd (8 == 2 * 4).
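The eligibility conditions can be collected into one predicate; the following Python sketch mirrors the conditions described in this PR (can_use_dpp and its parameter names are illustrative, not the actual C++ signature):

```python
def can_use_dpp(block_size: int, cluster_size: int,
                nr_dim_product: int, partial_r: int, wave_size: int) -> bool:
    # Illustrative predicate mirroring the PR's description: power-of-2
    # reduction threads, more than one of them, enough per-thread work
    # (partial_r > 2), the cluster fits in one wave, and the block holds
    # exactly cluster_size * nr_dim_product threads (the `==` tightened here).
    is_pow2 = cluster_size > 0 and (cluster_size & (cluster_size - 1)) == 0
    return (is_pow2
            and cluster_size > 1
            and partial_r > 2
            and cluster_size <= wave_size
            and block_size == cluster_size * nr_dim_product)

# The geometry from the updated lowering test: 8 == 2 * 4.
print(can_use_dpp(block_size=8, cluster_size=2, nr_dim_product=4,
                  partial_r=4, wave_size=64))  # -> True
```

With the old `>=` comparison, a block larger than clusterSize * nrDimProd would still pass, leaving extra threads with no slot in the DPP layout; the `==` makes those configurations fall back to the tree path instead.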
Motivation
The BlockwiseBroadcastReduceOp lowering in BlockwiseGemmToThreadwise.cpp handles the reduction of partial results across threads within a workgroup. In the blockSize > nonReductionDimSizeProduct path, all inter-thread reductions currently use an LDS-based tree reduction loop requiring log2(N) barrier-synchronized LDS round-trips. This works correctly but leaves performance on the table for cases where hardware-accelerated subgroup (wave-level) reduction is available.
This PR adds a DPP-based reduction path using gpu::SubgroupReduceOp with cluster_size for eligible configurations, while keeping the existing LDS tree reduction as the fallback for all other cases. The new path works correctly on both CDNA (waveSize=64) and RDNA (waveSize=32) architectures.
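A rough way to see the motivation is to count barrier-synchronized rounds in each path; the sketch below assumes one barrier per tree halving step and a single barrier on the DPP path (between the leader's LDS write and the broadcast read), a simplification of the actual lowering:

```python
import math

def tree_reduction_barriers(reduction_threads: int) -> int:
    # The LDS tree fallback needs one barrier-synchronized LDS round trip
    # per halving step: log2(N) rounds for N reduction threads.
    return math.ceil(math.log2(reduction_threads))

def dpp_path_barriers() -> int:
    # The DPP path reduces entirely in-wave, then needs a single barrier
    # before the broadcast read from LDS.
    return 1

for n in (4, 16, 64):
    print(n, tree_reduction_barriers(n), dpp_path_barriers())
```

For 64 reduction threads this is 6 barriered LDS round trips versus 1, which is where the expected speedup comes from.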
Technical Details
Two reduction paths (blockSize > nonReductionDimSizeProduct)
The lowering now selects one of two paths based on DPP eligibility:
DPP path (canUseDPP = true) — new:
- All 5 conditions met: power-of-2 reduction threads, more than 1 thread, partial_r > 2, threads fit within a single wave (<= waveSize), and the block has enough threads or the non-reduction dim is trivial
- Flow: threadwise pre-reduction in registers → gpu::SubgroupReduceOp with cluster_size → leader thread (rtid == 0) writes the result to LDS → broadcast
- Thread layout: contiguous — rtid = tid & (cluster-1), nrtid = tid >> log2(cluster)

Tree path (existing, unchanged) — fallback:
- DPP conditions not met (non-power-of-2 threads, partial_r <= 2, threads exceed waveSize, etc.)
- Flow: log2(N) LDS tree reduction loop with a barrier per step → broadcast
- Thread layout: scattered — rtid = tid / nonReductionDimSizeProduct, nrtid = tid % nonReductionDimSizeProduct
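The two thread-ID decompositions can be sketched in Python (a simulation of the index arithmetic only, not the MLIR lowering; the tid range and cluster_size = 4 are illustrative):

```python
def dpp_layout(tid: int, cluster_size: int):
    # Contiguous DPP-path layout: bitwise AND/shift, valid because
    # cluster_size is a power of two.
    rtid = tid & (cluster_size - 1)
    nrtid = tid >> (cluster_size.bit_length() - 1)  # shift by log2(cluster_size)
    return rtid, nrtid

def tree_layout(tid: int, nr_dim_product: int):
    # Scattered tree-path layout: DIV/REM, works for any dimension sizes.
    rtid = tid // nr_dim_product
    nrtid = tid % nr_dim_product
    return rtid, nrtid

# With cluster_size = 4, consecutive tids share a reduction group on the
# DPP path, while the tree path strides them by nonReductionDimSizeProduct.
print(dpp_layout(6, 4))   # -> (2, 1)
print(tree_layout(6, 4))  # -> (1, 2)
```

Both decompositions enumerate the same (rtid, nrtid) index set; they differ only in which physical lanes form a reduction group, which is why the DPP path can reduce a group entirely inside one wave.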
Test Plan
- lit test suite for reduce/blockwise_reduce/
- blockwise_reduce_dpp_cluster_sizes.mlir test covers cluster sizes 4, 8, 16, 32, 64 with both sum and max reductions

Test Result
Submission Checklist