add asmjit AOT kernels for qwen35/Hunyuan3 by tingqli · Pull Request #3309 · ROCm/aiter

tingqli · 2026-05-22T02:47:56Z

Motivation

These kernels are designed and optimized specifically for fused MoE (TP8, FP8-per-tensor, FP8-ptpc) problems:

Avoid quantization overhead in small-batch cases by dequantizing weights on the fly and computing in bf16 precision.
Avoid moe_sorting overhead in the single-token case.
Use a finely pipelined 4-wave gate-up kernel.
In the down kernel, use store_dwordx4 + reduce-sum instead of atomic_add.
In the down kernel, load the A matrix and keep it resident in registers, loop over the output-channel dimension in 1x4 warps, and pipeline MFMAs with store_dwordx4.
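
The store-then-reduce idea from the down-kernel item can be illustrated on the host. This is a minimal pure-Python sketch, not the kernel itself: `down_with_accumulate` and `down_with_store_reduce` are hypothetical names, and the per-slot partial outputs are plain lists chosen for illustration. It only shows that writing each top-k slot to its own row and summing afterwards gives the same result as accumulating into one shared row, which is what allows wide plain stores to replace atomic adds.

```python
from typing import List


def down_with_accumulate(partials: List[List[float]], weights: List[float]) -> List[float]:
    """Accumulate every top-k slot into one shared output row (atomic_add style)."""
    out = [0.0] * len(partials[0])
    for row, w in zip(partials, weights):
        for i, v in enumerate(row):
            # On a GPU this read-modify-write on shared memory would need an atomic.
            out[i] += w * v
    return out


def down_with_store_reduce(partials: List[List[float]], weights: List[float]) -> List[float]:
    """Store each weighted slot to its own row, then reduce over the top-k axis."""
    # Each slot owns its row, so these are plain (non-atomic) stores.
    staged = [[w * v for v in row] for row, w in zip(partials, weights)]
    # A separate pass sums the rows column by column.
    return [sum(col) for col in zip(*staged)]


# Illustrative values only: two top-k slots, hidden size 4.
partials = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
weights = [0.75, 0.25]
acc = down_with_accumulate(partials, weights)
red = down_with_store_reduce(partials, weights)
```

The trade-off is extra staging memory (one row per top-k slot) and a second pass, in exchange for avoiding contention on the output.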

Technical Details

The current tuning methods do not take extra overheads (moe-sorting/quant) into consideration, whereas fmoe_asmjit_aot introduces optimizations that target exactly these overheads. We therefore add a new --e2e_tune flag to gemm_moe_tune.py. This mode directly compares the performance of fmoe_asmjit_aot against that of the current best fused_moe(); if the former is better, it is recorded in the destination tuned_fmoe file under model_configs.

python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py -i aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_untuned_fmoe.csv -o aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_tuned_fmoe.csv --timeout 300 -v --all --e2e_tune
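
The selection rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in gemm_moe_tune.py: `Candidate` and `pick_e2e_winner` are hypothetical names, and the latencies are made-up placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str          # hypothetical kernel/config identifier
    latency_us: float  # end-to-end latency, including sorting/quant overheads


def pick_e2e_winner(baseline: Candidate, candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the fastest candidate only if it beats the baseline, else None."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: c.latency_us)
    # Only a strict improvement replaces the existing tuned entry.
    return best if best.latency_us < baseline.latency_us else None


# Placeholder numbers for illustration only.
baseline = Candidate("current_best_fused_moe", 30.0)
candidates = [Candidate("asmjit_cfg_a", 28.0), Candidate("asmjit_cfg_b", 31.0)]
winner = pick_e2e_winner(baseline, candidates)
no_winner = pick_e2e_winner(baseline, [Candidate("asmjit_cfg_c", 35.0)])
```

Returning `None` when nothing beats the baseline keeps the existing tuned entry untouched.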

Test Plan

Test Result

Performance improvement in e2e tune: Hunyuan fp8-per-tensor (TP8)

Performance improvement in e2e tune: qwen3.5 fp8-ptpc (TP8)

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authors: Cheng.Luo@amd.com Luwei.Zhou@amd.com

github-actions · 2026-05-22T02:48:09Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3309 --add-label <label>

Copilot

Pull request overview

Adds an asmjit AOT MoE fused implementation and an end-to-end tuning mode so tuning decisions account for real production overheads (sorting/quant/etc.), and wires tuned configs to select the new implementation at runtime.

Changes:

Introduces fused_moe_asmjit_aot and a lightweight HSACO loader/launcher to run AOT kernels.
Adds --e2e_tune path in gemm_moe_tune.py to compare current best fused_moe() vs asmjit-AOT variants and write winners into tuned CSVs.
Extends fused_moe config lookup to dispatch to asmjit-AOT when tuned CSV selects it; adds new model config CSVs for Qwen3.5/Hunyuan3.

Reviewed changes

Copilot reviewed 9 out of 29 changed files in this pull request and generated 11 comments.


File	Description
csrc/cpp_itfs/hsaco_tools.py	New ctypes-based HSACO loader/launcher and kernel symbol discovery.
csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py	Adds e2e tuning mode and hooks to benchmark asmjit-AOT configs.
aiter/utility/base_tuner.py	Adds `--e2e_tune` CLI flag to the shared tuner argument set.
aiter/fused_moe.py	Enables tuned-config dispatch into the asmjit-AOT fused MoE implementation.
aiter/fused_moe_asmjit_aot.py	New asmjit AOT fused MoE implementation + tuning space definition.
aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_untuned_fmoe.csv	New untuned shape list for Qwen3.5 fp8-ptpc MoE.
aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_tuned_fmoe.csv	New tuned results selecting asmjit-AOT kernels for Qwen3.5.
aiter/configs/model_configs/hunyuan3_fp8_per_tensor_untuned_fmoe.csv	New untuned shape list for Hunyuan3 fp8-per-tensor MoE.
aiter/configs/model_configs/hunyuan3_fp8_per_tensor_tuned_fmoe.csv	New tuned results selecting asmjit-AOT kernels for Hunyuan3.


tingqli · 2026-05-22T03:22:31Z

+        ExtraType = ctypes.c_void_p * 5
+        kernel_args_size = ctypes.c_uint64(ctypes.sizeof(kernel_args))
+        kernel_config = ExtraType(
+            1, ctypes.addressof(kernel_args), 2, ctypes.addressof(kernel_args_size), 3


No, HIP_LAUNCH_PARAM_END is always 3; please check the definition here: https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___global_defs.html#ga86cd80c0b352a6679a7fac89e026f0f7
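
For readers following this thread, the literal values 1, 2, and 3 in the hunk can be given names. The sketch below builds the same five-element `extra` array with ctypes and needs no HIP runtime. The constant values follow the HIP definitions referenced above; `KernelArgs` and its fields, and the helper `build_extra`, are hypothetical and exist only for illustration.

```python
import ctypes

# Values as defined by HIP on AMD platforms (see the linked HIP documentation).
HIP_LAUNCH_PARAM_BUFFER_POINTER = 0x01
HIP_LAUNCH_PARAM_BUFFER_SIZE = 0x02
HIP_LAUNCH_PARAM_END = 0x03


class KernelArgs(ctypes.Structure):
    # Hypothetical argument layout, for illustration only.
    _fields_ = [
        ("ptr_in", ctypes.c_void_p),
        ("ptr_out", ctypes.c_void_p),
        ("n", ctypes.c_int32),
    ]


def build_extra(kernel_args, kernel_args_size):
    """Pack the argument buffer pointer and size into the `extra` launch array."""
    ExtraType = ctypes.c_void_p * 5
    return ExtraType(
        HIP_LAUNCH_PARAM_BUFFER_POINTER,
        ctypes.addressof(kernel_args),
        HIP_LAUNCH_PARAM_BUFFER_SIZE,
        ctypes.addressof(kernel_args_size),
        HIP_LAUNCH_PARAM_END,
    )


# Both objects must stay alive until the launch has consumed them,
# because the array stores raw addresses rather than references.
args = KernelArgs(None, None, 16)
size = ctypes.c_uint64(ctypes.sizeof(args))
extra = build_extra(args, size)
```

Naming the constants does not change behavior; it only makes the terminator value explicit at the call site.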

+        kernel_config = ExtraType(
+            1, ctypes.addressof(kernel_args), 2, ctypes.addressof(kernel_args_size), 3
+        )
+        stream = ctypes.cast(torch.cuda.current_stream(), ctypes.c_void_p)


+        while len(gridDims) < 3:
+            gridDims.append(1)
+        while len(blockDims) < 3:
+            blockDims.append(1)
+        hip_check_error(
+            hip.hipModuleLaunchKernel(
+                p_func,
+                *gridDims,
+                *blockDims,
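
The padding loops in this hunk append to the lists they are given. One possible alternative, shown here as a sketch rather than as the PR's code, is a small helper that returns a new 3-tuple and leaves the caller's sequence unchanged. `normalize_dims` is a hypothetical name.

```python
from typing import Sequence, Tuple


def normalize_dims(dims: Sequence[int]) -> Tuple[int, ...]:
    """Pad a 1 to 3 element launch dimension to (x, y, z) without mutating the input."""
    if not 1 <= len(dims) <= 3:
        raise ValueError(f"expected 1 to 3 dimensions, got {len(dims)}")
    if any(d < 1 for d in dims):
        raise ValueError("all dimensions must be positive")
    # Missing trailing dimensions default to 1, matching the loops in the hunk.
    return tuple(int(d) for d in dims) + (1,) * (3 - len(dims))


grid_in = [64]
grid = normalize_dims(grid_in)
block = normalize_dims((256, 2))
```

The result can be unpacked into the launch call in the same way as the padded lists.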


+    dynamic_syms_raw = subprocess.check_output(
+        ["/opt/rocm/llvm/bin/llvm-objdump", "--dynamic-syms", co_path]
+    ).decode("utf-8")
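
As a sketch of how kernel names might be recovered from this output, the snippet below separates the parsing from the subprocess call so the parser can be exercised on text alone. It rests on assumptions: that each kernel exposes a descriptor symbol ending in `.kd`, and that the symbol name is the last whitespace-separated field of a line. `find_objdump`, `parse_kernel_names`, `list_kernels`, and the hand-written `SAMPLE` text are all hypothetical and not taken from the PR.

```python
import shutil
import subprocess
from typing import List


def find_objdump() -> str:
    """Prefer llvm-objdump on PATH, falling back to the default ROCm location."""
    return shutil.which("llvm-objdump") or "/opt/rocm/llvm/bin/llvm-objdump"


def parse_kernel_names(dynamic_syms: str) -> List[str]:
    """Collect names of symbols that end in '.kd' (assumed kernel descriptors)."""
    names = []
    for line in dynamic_syms.splitlines():
        fields = line.split()
        if fields and fields[-1].endswith(".kd"):
            names.append(fields[-1][: -len(".kd")])
    return names


def list_kernels(co_path: str) -> List[str]:
    """Run llvm-objdump on a code object and return the discovered kernel names."""
    out = subprocess.check_output(
        [find_objdump(), "--dynamic-syms", co_path], text=True
    )
    return parse_kernel_names(out)


# Hand-written sample in the general shape of objdump output; not real tool output.
SAMPLE = """
DYNAMIC SYMBOL TABLE:
0000000000001000 g     F .text   0000000000000400 gate_up_kernel
0000000000000540 g     O .rodata 0000000000000040 gate_up_kernel.kd
0000000000001400 g     F .text   0000000000000400 down_kernel
0000000000000580 g     O .rodata 0000000000000040 down_kernel.kd
"""
names = parse_kernel_names(SAMPLE)
```

Looking the tool up on PATH first avoids depending on a single install prefix.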


+        assert kernel_cnt.value > 0
+        kernels = (ctypes.c_void_p * kernel_cnt.value)()


+        return cls(*[eval(p) for p in parts])
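
If the parts being evaluated are always plain literals, `ast.literal_eval` is a stricter drop-in for `eval`. The following is a sketch under that assumption; `TuneConfig`, its fields, and the comma separator are hypothetical and do not reflect the actual config class in the PR.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class TuneConfig:
    # Hypothetical fields; the real config class may differ.
    block_m: int
    use_sorting: bool
    layout: str

    @classmethod
    def from_string(cls, text: str, sep: str = ",") -> "TuneConfig":
        parts = [p.strip() for p in text.split(sep)]
        # literal_eval accepts only Python literals (numbers, strings, booleans,
        # None, and containers of them), so arbitrary expressions are rejected.
        return cls(*[ast.literal_eval(p) for p in parts])


cfg = TuneConfig.from_string("16,False,'row'")
```

A string such as `__import__('os').getcwd()` raises `ValueError` here instead of being executed.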
+
+


+from aiter.fused_moe_asmjit_aot import fused_moe_asmjit_aot
+from aiter.fused_moe_asmjit_aot import get_tune_space


+            "--e2e_tune",
+            action="store_true",
+            required=False,
+            help="Run an extra round of e2e tuning after main tuning is done, using production-op benchmark as the indicator",


+    if kernelName1.startswith("fused_moe_asmjit_aot"):
+        from aiter.fused_moe_asmjit_aot import fused_moe_asmjit_aot
+
+        return MOEMetadata(
+            None,
+            None,
+            block_m,
+            ksplit,
+            run_1stage,
+            stage0=functools.partial(
+                fused_moe_asmjit_aot, config_string=kernelName1.split("__")[1]
+            ),
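
The dispatch pattern in this hunk, where a tuned kernel name carries a config string after a `__` separator and the string is bound with `functools.partial`, can be sketched in isolation. `fake_fused_moe_asmjit_aot` is a stand-in for the real entry point and `resolve_stage0` is a hypothetical helper. Unlike `split("__")[1]`, `partition` keeps everything after the first separator, which is a deliberate difference in this sketch.

```python
import functools
from typing import Callable, Optional

PREFIX = "fused_moe_asmjit_aot"


def fake_fused_moe_asmjit_aot(hidden_states, config_string: str):
    """Stand-in for the real kernel entry point; returns its inputs for inspection."""
    return hidden_states, config_string


def resolve_stage0(kernel_name: str) -> Optional[Callable]:
    """Return a callable bound to the embedded config, or None for other kernels."""
    if not kernel_name.startswith(PREFIX):
        return None
    _, sep, config_string = kernel_name.partition("__")
    if not sep or not config_string:
        raise ValueError(f"missing config string in kernel name: {kernel_name!r}")
    return functools.partial(fake_fused_moe_asmjit_aot, config_string=config_string)


stage0 = resolve_stage0("fused_moe_asmjit_aot__cfgA")
result = stage0("x")
other = resolve_stage0("some_other_kernel")
```

Validating the name up front turns a malformed tuned-CSV entry into a clear error instead of an `IndexError`.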


+    E, N1, K1 = w1.shape
+    N2, K2 = w2.shape[1], w2.shape[2]
+    TOPK = topk_ids.shape[1]
+    fp8_ptpc = w1.dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz) and (
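
To make the two weight-scale layouts concrete, here is a pure-Python sketch of dequantizing a weight matrix with either a single per-tensor scale or one scale per output channel, then multiplying in floating point. It assumes rows of the matrix are output channels and uses small illustrative values; it does not model FP8 rounding or the actual kernels, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def dequant_per_tensor(w_q: Matrix, scale: float) -> Matrix:
    """Apply one scale to every element of the quantized weight."""
    return [[v * scale for v in row] for row in w_q]


def dequant_per_channel(w_q: Matrix, scales: List[float]) -> Matrix:
    """Apply one scale per output channel (per row of the weight)."""
    if len(scales) != len(w_q):
        raise ValueError("need exactly one scale per output channel")
    return [[v * s for v in row] for row, s in zip(w_q, scales)]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply a dequantized weight matrix by an unquantized activation vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


# Illustrative values only: 2 output channels, 2 input features.
w_q = [[2.0, -4.0], [8.0, 16.0]]
x = [1.0, 0.5]
y_tensor = matvec(dequant_per_tensor(w_q, 0.5), x)
y_channel = matvec(dequant_per_channel(w_q, [0.5, 0.25]), x)
```

Because the activation vector is never quantized in this path, no activation-quantization step is needed before the matmul.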


add asmjit AOT kernels for qwen35/Hunyuan3

877c200

tingqli marked this pull request as ready for review May 22, 2026 02:57

tingqli requested review from a team and Copilot May 22, 2026 02:57

Copilot started reviewing on behalf of tingqli May 22, 2026 02:57

Copilot AI reviewed May 22, 2026

tingqli wants to merge 1 commit into ROCm:main from tingqli:fmoe-aot