[Gluon][gfx1250] Gemm MXFP4 preshuffled by Boss2002n · Pull Request #2332 · ROCm/aiter

Boss2002n · 2026-03-18T15:18:08Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-03-18T15:18:33Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or with `gh pr edit 2332 --add-label <label>`.

Copilot

Pull request overview

This PR adds a gfx1250-focused “preshuffled” MXFP4 GEMM path, including new shuffling helpers, a new Gluon kernel for gfx1250, and updated tests/benchmarks to exercise the preshuffled layout.

Changes:

Added gfx1250-specific weight/scale preshuffle logic and a new Gluon-based preshuffle kernel.
Updated GEMM preshuffle wrapper to route gfx1250 to the Gluon kernel and adjusted tests/benchmarks accordingly.
Introduced a new gfx1250 preshuffle tuning config JSON and a new shuffle helper in aiter/ops/shuffle.py.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
| --- | --- |
| op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py | Adds gfx1250 scale shuffling + a new gfx1250 preshuffled GEMM test and switches to `triton.testing.assert_close`. |
| op_tests/op_benchmarks/triton/bench_gemm_afp4wfp4.py | Renames bench flag to `--preshuffle` (alias `--shuffle`) and benches `gemm_afp4wfp4_preshuffle`. |
| aiter/ops/triton/gluon/gemm_afp4wfp4.py | Expands device allow-list to include gfx1250 for Gluon AFP4WFP4 config loading. |
| aiter/ops/triton/gemm/basic/gemm_afp4wfp4.py | Refactors kernel imports, adds gfx1250 Gluon preshuffle dispatch, and modifies preshuffle K/grid handling. |
| aiter/ops/triton/configs/gemm/gfx1250-GEMM-AFP4WFP4_PRESHUFFLED.json | Replaces gfx1250 preshuffled tuning entries and adds `NUM_BUFFERS` for the new Gluon kernel. |
| aiter/ops/triton/_gluon_kernels/gemm/basic/gemm_mxfp4.py | Adds new Gluon/TDM-based gfx1250 preshuffled MXFP4 GEMM kernel and associated layouts/depreshuffle views. |
| aiter/ops/shuffle.py | Adds `shuffle_weight_gfx1250` helper for preshuffling weights into the TDM-friendly layout. |
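
The flag rename noted for the benchmark script can be illustrated with a minimal `argparse` sketch. This is not the code from `bench_gemm_afp4wfp4.py`; the `build_parser` helper, the `store_true` action, and the help text are assumptions made for illustration. It only shows how one option can be reachable as `--preshuffle` while keeping `--shuffle` as an alias.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the two flag names come from the PR summary."""
    parser = argparse.ArgumentParser(description="Sketch of a GEMM benchmark CLI")
    parser.add_argument(
        "--preshuffle",
        "--shuffle",  # old spelling kept as an alias of the same option
        dest="preshuffle",
        action="store_true",  # assumption: the flag is a boolean switch
        help="Benchmark the preshuffled GEMM path (assumed semantics).",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--shuffle"])
    print(args.preshuffle)
```

Listing both option strings in one `add_argument` call lets existing invocations that use the old flag keep working without a second code path.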

Comments suppressed due to low confidence (2)

aiter/ops/triton/gemm/basic/gemm_afp4wfp4.py:499

`grid` no longer multiplies by `META["NUM_KSPLIT"]`. The underlying Triton preshuffle kernel maps `pid_k` from `program_id(axis=0)` assuming the launch grid is `GRID_MN * NUM_KSPLIT`; with the current grid, split-K launches will be incomplete and the reduction path will be wrong. Restore the `NUM_KSPLIT * cdiv(M, BM) * cdiv(N, BN)` factor for the Triton preshuffle kernel (and handle Gluon separately if needed).

    grid = lambda META: (  # noqa: E731
        (triton.cdiv(M, META["BLOCK_SIZE_M"]) * triton.cdiv(N, META["BLOCK_SIZE_N"])),
    )
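
To make the concern concrete, here is a small pure-Python sketch of the grid arithmetic, assuming the 1-D grid is meant to cover every (K slice, output tile) pair. The helpers `cdiv`, `splitk_grid`, and `decode_pid` are hypothetical and written only for this illustration; in particular, `decode_pid` shows one possible way a kernel could split the program id and is not taken from the aiter kernel.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def splitk_grid(M: int, N: int, meta: dict) -> tuple:
    """1-D launch grid with one program per (K slice, output tile) pair."""
    grid_mn = cdiv(M, meta["BLOCK_SIZE_M"]) * cdiv(N, meta["BLOCK_SIZE_N"])
    return (meta["NUM_KSPLIT"] * grid_mn,)


def decode_pid(pid: int, grid_mn: int) -> tuple:
    """Assumed mapping from a flat program id to (pid_k, tile index)."""
    return pid // grid_mn, pid % grid_mn


if __name__ == "__main__":
    meta = {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64, "NUM_KSPLIT": 4}
    grid_mn = cdiv(100, 32) * cdiv(200, 64)
    # With the NUM_KSPLIT factor, every K slice is visited.
    full = {decode_pid(p, grid_mn)[0] for p in range(splitk_grid(100, 200, meta)[0])}
    # Without it, only grid_mn programs launch and some K slices never run.
    short = {decode_pid(p, grid_mn)[0] for p in range(grid_mn)}
    print(sorted(full), sorted(short))
```

Under this assumed mapping, dropping the `NUM_KSPLIT` factor launches only enough programs for a single K slice, which matches the incomplete-launch behavior the comment describes.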

aiter/ops/triton/gemm/basic/gemm_afp4wfp4.py:551

The gfx1250 Gluon preshuffle path drops `NUM_KSPLIT` from the kernel config but still allows `config["NUM_KSPLIT"] > 1` earlier, and it also sets `stride_c_k` based on the 2D output `y`. Since the Gluon kernel implementation is not split-K aware, this path needs either an explicit guard (e.g., force `NUM_KSPLIT=1` and skip the split-K allocation) or proper split-K semantics in the Gluon kernel.

    if use_gluon:
        layouts = get_gemm_afp4wfp4_preshuffle_layouts(
            config["num_warps"],
            config["BLOCK_SIZE_M"],
            config["BLOCK_SIZE_N"],
            config["BLOCK_SIZE_K"],
        )

        _DROP_KEYS = (
            "NUM_KSPLIT",
            "SPLITK_BLOCK_SIZE",
            "SPLITK_BLOCK",
            "GROUP_SIZE_M",
            "num_stages",
            "waves_per_eu",
            "matrix_instr_nonkdim",
            "cache_modifier",
        )
        kernel_config = {k: v for k, v in config.items() if k not in _DROP_KEYS}
        # Kernel consumes preshuffled scales directly (address math inverts the shuffle in registers)
        assert M >= 32, "gluon mxfp4 preshuffle path requires M >= 32"
        x_scales = x_scales.contiguous()
        w_scales = w_scales.contiguous()
        _gluon_gemm_mxfp4_preshuffle_gfx1250[grid](
            x_fp4,
            w_preshuf,
            y,
            x_scales,
            w_scales,
            M,
            N,
            K_elems,
            x_fp4.stride(0),
            x_fp4.stride(1),
            w_preshuf.stride(0),
            w_preshuf.stride(1),
            0 if config["NUM_KSPLIT"] == 1 else y.stride(0),
            y.stride(-2),
            y.stride(-1),
            x_scales.stride(0),
            x_scales.stride(1),
            w_scales.stride(0),
            w_scales.stride(1),
            **kernel_config,
            **layouts,
        )
        return y
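
One way to express the guard suggested above is sketched below. The helper name `prepare_gluon_config` and the choice to raise rather than silently force `NUM_KSPLIT=1` are assumptions for illustration, not the fix that was merged; the dropped keys are copied from the snippet above.

```python
_DROP_KEYS = (
    "NUM_KSPLIT",
    "SPLITK_BLOCK_SIZE",
    "SPLITK_BLOCK",
    "GROUP_SIZE_M",
    "num_stages",
    "waves_per_eu",
    "matrix_instr_nonkdim",
    "cache_modifier",
)


def prepare_gluon_config(config: dict) -> dict:
    """Hypothetical helper: reject split-K, then strip keys the kernel ignores."""
    num_ksplit = config.get("NUM_KSPLIT", 1)
    if num_ksplit != 1:
        # Assumption: failing loudly is preferable to a silently wrong result.
        raise ValueError(
            f"Gluon preshuffle path is not split-K aware (NUM_KSPLIT={num_ksplit})"
        )
    return {k: v for k, v in config.items() if k not in _DROP_KEYS}


if __name__ == "__main__":
    cfg = {"BLOCK_SIZE_M": 32, "NUM_KSPLIT": 1, "num_warps": 4, "num_stages": 2}
    print(prepare_gluon_config(cfg))
```

Running the check before the key filter matters because the filter removes `NUM_KSPLIT`, after which a bad value can no longer be detected from the kernel config.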

PR to main

e041f6b

Boss2002n added 15 commits March 18, 2026 15:30

fix

a402f8a

temp fix

a9645cd

fix

957ea09

fix

bcc2007

fix?

27f7631

working

858ff7d

gfx-12 pass

3136c6e

lint

df46bc0

fix

9ba6bfc

fix

b17c4b0

remove convert layout

e278456

fix

84c0583

Merge branch 'main' into satya/gfx12_mxfp4_gemm

684fc79

Merge branch 'main' into satya/gfx12_mxfp4_gemm

5d8d768

Update arch_info.py

d380fef

Boss2002n changed the title ~~PR to main~~ [Gluon][gfx1250] Gemm MXFP4 preshuffled Mar 24, 2026

Boss2002n self-assigned this Mar 24, 2026

Boss2002n added 11 commits March 29, 2026 03:19

latest

5f9a79d

small fix

9c01398

fix

a2a82a2

fix

db502af

Fix

be0b1ff

waves =2

ea853bf

fix

99630be

fix

f3185b8

optimized config

ec13e9e

fix

5075a87

fix layout cuz A is not preshuf

b6061b1

Boss2002n added 2 commits May 12, 2026 20:19

32x16

289e5f3

update shuffle

03fb8ef

vgokhale reviewed May 13, 2026

Comment thread aiter/ops/triton/gemm/basic/gemm_afp4wfp4.py Outdated

vgokhale reviewed May 13, 2026

Comment thread aiter/ops/triton/gemm/basic/gemm_afp4wfp4.py

vgokhale reviewed May 13, 2026

Comment thread op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py Outdated

vgokhale reviewed May 13, 2026

Comment thread op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py Outdated

vgokhale previously approved these changes May 13, 2026

fix - depreshuf -scales

4b7222b

Boss2002n dismissed vgokhale’s stale review via 4b7222b May 13, 2026 19:59

Boss2002n added 4 commits May 13, 2026 16:07

Merge branch 'main' into satya/gfx12_mxfp4_gemm

24e4910

address comments

f21584d

black - format

bd773d9

black - format

08e3e1b

Boss2002n marked this pull request as ready for review May 13, 2026 20:15

Boss2002n requested review from a team and Copilot May 13, 2026 20:15

Copilot AI reviewed May 13, 2026

Boss2002n added 2 commits May 19, 2026 16:06

.load instead of relaxed shared load

4278ec0

B32_test

c4371d2

Boss2002n force-pushed the satya/gfx12_mxfp4_gemm branch from 96c77c1 to c4371d2 May 20, 2026 21:30

Boss2002n added 7 commits May 20, 2026 17:43

Merge branch 'main' into satya/gfx12_mxfp4_gemm

eefa2cd

formatting

1716df8

fix formatting

f269068

ruff fix

ac2a49e

fix

0039f8b

Merge branch 'main' into satya/gfx12_mxfp4_gemm

9bf916e

remove unused params from 1250 mxfp4 config

a7359e1

vgokhale approved these changes May 22, 2026

Boss2002n merged commit e00cf3e into main May 23, 2026
36 of 43 checks passed

Boss2002n deleted the satya/gfx12_mxfp4_gemm branch May 23, 2026 00:55

Boss2002n merged 53 commits into main from satya/gfx12_mxfp4_gemm

3 participants
