[Triton] MXFP8 GEMM and A8W4 MOE optimization for DSV4#861

Open
k50112113 wants to merge 8 commits into main from shaoclee/alizaidy-dsv4-moe

@k50112113 k50112113 commented May 20, 2026

This PR is co-authored by @azaidy and @k50112113
This PR depends on ROCm/aiter#3292

This PR includes:

  1. Added MXFP8 GEMM and related quant fusions for DSV4 (for example, fusing flatten with MXFP8 quant after einsum).
  2. Enabled A8W4 Triton MOE, with the activation and MXFP8 downcast fused into MOE GEMM1.
  3. Optimized Triton top-k and routing for the A8W4 MOE.
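
As a rough illustration of the MXFP8 quantization in item 1, here is a minimal NumPy reference sketch. It is not the Triton kernel from this PR. It assumes the usual MX layout (blocks of 32 elements sharing one power-of-two scale, FP8 E4M3 elements) and emulates E4M3 rounding in float32. The function names and the einsum output shape are hypothetical.

```python
import numpy as np

BLOCK = 32        # elements sharing one scale in the MX formats
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8     # 2**8 <= 448 < 2**9

def round_to_e4m3(x):
    """Emulate FP8 E4M3 rounding in float32 (saturating, round-to-nearest-even)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)                       # |x| lies in [2**(exp-1), 2**exp)
    exp = np.maximum(exp, -5)                  # below 2**-6 the grid is subnormal
    step = np.ldexp(np.float32(1.0), exp - 4)  # 4 significant bits per binade
    return (np.round(x / step) * step).astype(np.float32)

def mxfp8_quantize(x):
    """Return (emulated E4M3 values, per-block power-of-two scale exponents)."""
    x = np.asarray(x, dtype=np.float32)
    blocks = x.reshape(-1, BLOCK)              # last dim must be a multiple of 32
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    _, e = np.frexp(amax)                      # floor(log2(amax)) == e - 1 for amax > 0
    scale_exp = np.clip(e - 1 - E4M3_EMAX, -127, 127).astype(np.int32)
    q = round_to_e4m3(blocks / np.ldexp(np.float32(1.0), scale_exp))
    return q.reshape(x.shape), scale_exp.reshape(*x.shape[:-1], -1)

def mxfp8_dequantize(q, scale_exp):
    """Multiply each block of 32 values by its shared scale 2**scale_exp."""
    scale = np.ldexp(np.float32(1.0), scale_exp.reshape(-1, 1))
    return (q.reshape(-1, BLOCK) * scale).reshape(q.shape)

# Hypothetical einsum output of shape (tokens, heads, dim): flatten, then quantize.
rng = np.random.default_rng(0)
y = rng.standard_normal((4, 8, 64)).astype(np.float32)
q, s = mxfp8_quantize(y.reshape(y.shape[0], -1))  # values (4, 512), scales (4, 16)
```

In this sketch the flatten and the quantization are two separate array operations. A fused kernel would instead produce the quantized values and scales in a single pass over the einsum output.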

We see a 3-7% end-to-end throughput improvement across the tested concurrency levels. To enable it, set:

export ATOM_FP8_BLOCKSCALE_USE_MXFP8=1
export ATOM_USE_TRITON_MOE=1

lm_eval results on the full gsm8k test set:

local-completions ({'model': '/data/deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 2000.0, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9515|±  |0.0059|
|     |       |strict-match    |     3|exact_match|↑  |0.9522|±  |0.0059|
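
As a quick sanity check on the reported stderr, the sketch below recomputes it under the assumption that it is the plain binomial standard error over the 1319 problems of the GSM8K test split (the `limit` of 2000 exceeds that, so the whole split is used). The helper name is hypothetical.

```python
import math

N_GSM8K_TEST = 1319  # number of problems in the GSM8K test split

def binomial_stderr(p, n):
    """Standard error of the mean of n independent 0/1 outcomes with mean p."""
    return math.sqrt(p * (1.0 - p) / n)

stderrs = {
    name: round(binomial_stderr(acc, N_GSM8K_TEST), 4)
    for name, acc in [("flexible-extract", 0.9515), ("strict-match", 0.9522)]
}
print(stderrs)  # both round to 0.0059, matching the table
```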

Total token throughput (toks/s):

|Conc|Baseline|This PR (ATOM_FP8_BLOCKSCALE_USE_MXFP8=1, ATOM_USE_TRITON_MOE=1)|Speedup|
|---:|-------:|-------:|------:|
|   4|  417.48|  439.87|  1.054|
|   8|  788.78|  821.97|  1.042|
|  16| 1427.29| 1473.69|  1.033|
|  32| 2507.72| 2593.96|  1.034|
|  64|  4088.5| 4367.02|  1.068|
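
The speedup column is the ratio of the two throughput columns. A short sketch that reproduces it from the numbers above (variable names are illustrative):

```python
# Total token throughput (toks/s) per concurrency level, copied from the table.
baseline = {4: 417.48, 8: 788.78, 16: 1427.29, 32: 2507.72, 64: 4088.5}
this_pr = {4: 439.87, 8: 821.97, 16: 1473.69, 32: 2593.96, 64: 4367.02}

# Speedup = throughput with this PR / baseline throughput, rounded to 3 decimals.
speedup = {conc: round(this_pr[conc] / baseline[conc], 3) for conc in baseline}
print(speedup)  # {4: 1.054, 8: 1.042, 16: 1.033, 32: 1.034, 64: 1.068}
```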

@k50112113 k50112113 changed the title Shaoclee/alizaidy dsv4 moe [Triton] A8W4 MOE for DSV4 May 20, 2026
@k50112113 k50112113 changed the title [Triton] A8W4 MOE for DSV4 [Triton] MXFP8 GEMM and A8W4 MOE optimization for DSV4 May 21, 2026