[Triton] [ATOM] MXFP8 GEMM and A8W4 MOE optimization by k50112113 · Pull Request #3292 · ROCm/aiter

k50112113 · 2026-05-20T21:39:17Z

This PR is co-authored by @azaidy and @k50112113

This PR includes:

Added MXFP8 GEMM (non-shuffled and pre-shuffled), tuned for DSV4 shapes
Added the other MXFP8 GEMM-related kernels: per_1x32_mxfp8_quant_triton, rmsnorm_mxfp8_quant, dual_rmsnorm_mxfp8_quant, fused_flatten_mxfp8_quant
Added A8W4 MOE fused with activation and downcast between the two MOEs
Optimized top-k and routing for the A8W4 MOE
Fixed fused_flatten_fp8_group_quant bug when N is not a power of 2
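
For readers unfamiliar with the 1x32 block scaling that the name per_1x32_mxfp8_quant_triton suggests, here is a minimal pure-Python sketch of MX-style scaling. It is not the kernel from this PR. It assumes FP8 E4M3 elements (max 448), one shared power-of-two scale per 32 contiguous values, and the floor(log2(amax)) - emax scale rule from the OCP microscaling format. The helper name `mx_block_scales` is hypothetical, and the final rounding of each value onto the FP8 grid is omitted.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed element format)
E4M3_EMAX = 8         # exponent of the largest E4M3 normal: 448 = 1.75 * 2**8
BLOCK = 32            # elements sharing one scale (the "1x32" group)


def mx_block_scales(row, block=BLOCK):
    """Hypothetical helper: return (scale_exponents, scaled_values) for one row.

    Each block of `block` values shares a scale 2**e. The values are divided
    by that scale and clamped to the E4M3 range; rounding to FP8 is omitted.
    """
    assert len(row) % block == 0, "sketch assumes the row length is a multiple of the block"
    exps, scaled = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        if amax == 0.0:
            e = -127  # smallest exponent an E8M0 scale can encode
        else:
            # frexp gives amax = m * 2**k with 0.5 <= m < 1, so floor(log2(amax)) = k - 1
            e = math.frexp(amax)[1] - 1 - E4M3_EMAX
            e = max(-127, min(127, e))
        scale = 2.0 ** e
        exps.append(e)
        scaled.extend(
            max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in chunk
        )
    return exps, scaled


row = [1.0] * 32 + [0.5] * 16 + [448.0] * 16
exps, scaled = mx_block_scales(row)
print(exps)  # [-8, 0]
```

The first block has amax 1.0, so its values are scaled up by 2**8 to use the upper part of the E4M3 range. The second block already peaks at 448, so its scale is 1.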

github-actions · 2026-05-20T21:39:35Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3292 --add-label <label>`.

k50112113 added 7 commits May 19, 2026 18:27

add all mxfp8 related kernels

0ebf70b

add ut

cdc7491

add splitk

9f2a6ed

update

68731f6

add tunned config

73d8719

black format

13e7517

merge with main

69db00a

k50112113 requested review from a team and azaidy May 20, 2026 21:39

k50112113 requested a review from lburzawa May 20, 2026 21:39

k50112113 mentioned this pull request May 20, 2026

[Triton] MXFP8 GEMM and A8W4 MOE optimization for DSV4 ROCm/ATOM#861

Open

k50112113 added 5 commits May 21, 2026 10:22

[Triton] merge fused gate + downcast kernel (#3306)

8494fb5

* add fused_gemm_a16w16_copy_x, remove near-moe torch.to and torch.zeros kernels
* tunning
* revert fill(0) change

fused_gemm_a16w16_quant_x

b86707c

rebase ontop of mxfp8-gemm

c47aa50

bug fix

24f68d5

tune topk config, add ut

7a0832e

k50112113 force-pushed the shaoclee/alizaidy-dsv4-moe branch from bbaece4 to 7a0832e May 21, 2026 16:01

k50112113 added 2 commits May 21, 2026 19:28

add flatten mxfp8 quant and fix fused_flatten_fp8_group_quant bug

dd69ad5

some moe tunning

7c5ad49

k50112113 changed the title ~~[Triton] [ATOM] A8W4 MOE and topk routing~~ [Triton] [ATOM] MXFP8 GEMM and A8W4 MOE optimization May 21, 2026

k50112113 added 4 commits May 21, 2026 22:27

black

38fc906

clean

a00c3db

CI fix

95bda07

CI fix

ce126ef

k50112113 wants to merge 18 commits into main from shaoclee/alizaidy-dsv4-moe