add asmjit AOT kernels for qwen35/Hunyuan3 #3309
Pull request overview
Adds an asmjit AOT MoE fused implementation and an end-to-end tuning mode so tuning decisions account for real production overheads (sorting/quant/etc.), and wires tuned configs to select the new implementation at runtime.
Changes:
- Introduces `fused_moe_asmjit_aot` and a lightweight HSACO loader/launcher to run AOT kernels.
- Adds `--e2e_tune` path in `gemm_moe_tune.py` to compare current best `fused_moe()` vs asmjit-AOT variants and write winners into tuned CSVs.
- Extends `fused_moe` config lookup to dispatch to asmjit-AOT when tuned CSV selects it; adds new model config CSVs for Qwen3.5/Hunyuan3.
Reviewed changes
Copilot reviewed 9 out of 29 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| csrc/cpp_itfs/hsaco_tools.py | New ctypes-based HSACO loader/launcher and kernel symbol discovery. |
| csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py | Adds e2e tuning mode and hooks to benchmark asmjit-AOT configs. |
| aiter/utility/base_tuner.py | Adds --e2e_tune CLI flag to the shared tuner argument set. |
| aiter/fused_moe.py | Enables tuned-config dispatch into the asmjit-AOT fused MoE implementation. |
| aiter/fused_moe_asmjit_aot.py | New asmjit AOT fused MoE implementation + tuning space definition. |
| aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_untuned_fmoe.csv | New untuned shape list for Qwen3.5 fp8-ptpc MoE. |
| aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_tuned_fmoe.csv | New tuned results selecting asmjit-AOT kernels for Qwen3.5. |
| aiter/configs/model_configs/hunyuan3_fp8_per_tensor_untuned_fmoe.csv | New untuned shape list for Hunyuan3 fp8-per-tensor MoE. |
| aiter/configs/model_configs/hunyuan3_fp8_per_tensor_tuned_fmoe.csv | New tuned results selecting asmjit-AOT kernels for Hunyuan3. |
| ExtraType = ctypes.c_void_p * 5 | ||
| kernel_args_size = ctypes.c_uint64(ctypes.sizeof(kernel_args)) | ||
| kernel_config = ExtraType( | ||
| 1, ctypes.addressof(kernel_args), 2, ctypes.addressof(kernel_args_size), 3 |
No, HIP_LAUNCH_PARAM_END is always 3; please check the HIP documentation: https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___global_defs.html#ga86cd80c0b352a6679a7fac89e026f0f7
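For readers auditing the bare literals 1, 2 and 3 in the launch configuration, here is a minimal sketch of building the same five-slot array with named constants. The numeric values come from the snippet and the reply above; `KernelArgs` and `build_extra` are hypothetical names, and the argument layout is invented purely for illustration.

```python
import ctypes

# Marker values for the `extra` launch array, matching the literals 1, 2, 3
# used in the snippet under review.
HIP_LAUNCH_PARAM_BUFFER_POINTER = 0x01
HIP_LAUNCH_PARAM_BUFFER_SIZE = 0x02
HIP_LAUNCH_PARAM_END = 0x03


class KernelArgs(ctypes.Structure):
    # Hypothetical argument layout; a real kernel defines its own.
    _fields_ = [("data", ctypes.c_void_p), ("count", ctypes.c_int32)]


def build_extra(kernel_args):
    """Return (extra, size).

    The caller must keep kernel_args, extra and size alive until the
    launch call has been issued, because extra only stores raw addresses.
    """
    size = ctypes.c_size_t(ctypes.sizeof(kernel_args))
    extra = (ctypes.c_void_p * 5)(
        HIP_LAUNCH_PARAM_BUFFER_POINTER,
        ctypes.addressof(kernel_args),
        HIP_LAUNCH_PARAM_BUFFER_SIZE,
        ctypes.addressof(size),
        HIP_LAUNCH_PARAM_END,
    )
    return extra, size


args = KernelArgs(None, 128)
extra, size = build_extra(args)
```

Returning `size` alongside `extra` is a deliberate choice in this sketch: if the size object were garbage-collected, the address stored in slot 3 would dangle.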
| kernel_config = ExtraType( | ||
| 1, ctypes.addressof(kernel_args), 2, ctypes.addressof(kernel_args_size), 3 | ||
| ) | ||
| stream = ctypes.cast(torch.cuda.current_stream(), ctypes.c_void_p) |
| while len(gridDims) < 3: | ||
| gridDims.append(1) | ||
| while len(blockDims) < 3: | ||
| blockDims.append(1) | ||
| hip_check_error( | ||
| hip.hipModuleLaunchKernel( | ||
| p_func, | ||
| *gridDims, | ||
| *blockDims, |
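The padding loops above extend the caller's lists in place. As a hedged alternative, the sketch below pads a copy instead, so a list reused across launches is never modified; `pad_dims` is a hypothetical helper name, not something defined in the PR.

```python
def pad_dims(dims, rank=3):
    """Return a new list of length `rank`, padding missing dimensions with 1.

    Raises ValueError if `dims` is empty or has more than `rank` entries,
    since either case would produce a malformed launch configuration.
    """
    dims = list(dims)  # copy, so the caller's sequence is left untouched
    if not 1 <= len(dims) <= rank:
        raise ValueError(f"expected 1..{rank} dimensions, got {len(dims)}")
    return dims + [1] * (rank - len(dims))


grid = [64]
padded_grid = pad_dims(grid)
padded_block = pad_dims((256, 1))
```

The padded results can then be unpacked into the launch call the same way `*gridDims` and `*blockDims` are in the snippet.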
| dynamic_syms_raw = subprocess.check_output( | ||
| ["/opt/rocm/llvm/bin/llvm-objdump", "--dynamic-syms", co_path] | ||
| ).decode("utf-8") |
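The snippet above hard-codes the `llvm-objdump` path. A minimal sketch of one way to relax that and to extract kernel names from the output follows. It assumes each kernel has a descriptor symbol named `<kernel>.kd`; the helper names are hypothetical, and the sample text is illustrative rather than captured from a real code object.

```python
import shutil
import subprocess


def find_objdump(default="/opt/rocm/llvm/bin/llvm-objdump"):
    """Prefer an llvm-objdump found on PATH, else fall back to the ROCm path."""
    return shutil.which("llvm-objdump") or default


def kernel_names_from_dynsyms(text):
    """Extract kernel names from `llvm-objdump --dynamic-syms` text.

    Assumption: the last field of a symbol line is the symbol name, and
    kernel descriptors are the symbols ending in `.kd`.
    """
    names = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[-1].endswith(".kd"):
            names.append(fields[-1][: -len(".kd")])
    return names


def list_kernels(co_path):
    """Run llvm-objdump on a code object and return its kernel names."""
    out = subprocess.check_output(
        [find_objdump(), "--dynamic-syms", co_path]
    ).decode("utf-8")
    return kernel_names_from_dynsyms(out)


# Illustrative text only, not captured from a real code object.
SAMPLE = """\
DYNAMIC SYMBOL TABLE:
0000000000001000 g    DF .text  0000000000000200 .protected moe_stage1
0000000000000400 g    DO .rodata  0000000000000040 .protected moe_stage1.kd
"""

sample_kernels = kernel_names_from_dynsyms(SAMPLE)
```

Keeping the parsing separate from the subprocess call makes the text handling testable on machines without ROCm installed.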
| assert kernel_cnt.value > 0 | ||
| kernels = (ctypes.c_void_p * kernel_cnt.value)() |
| return cls(*[eval(p) for p in parts]) | ||
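Because `eval` executes arbitrary expressions, a config string read from a CSV could run code. A hedged sketch of a safer variant uses `ast.literal_eval`, which only accepts Python literals. `TuneConfig`, its fields, and the comma separator are hypothetical stand-ins; the PR's actual config class and string format are not shown in this excerpt.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class TuneConfig:
    # Hypothetical fields, used only to illustrate the parsing pattern.
    block_m: int
    ksplit: int
    use_nt: bool

    @classmethod
    def from_string(cls, text, sep=","):
        """Parse e.g. "32,0,True" using literal_eval instead of eval."""
        parts = text.split(sep)
        # literal_eval accepts literals only (numbers, strings, True/False,
        # tuples, ...) and raises on names, calls, or attribute access.
        return cls(*[ast.literal_eval(p) for p in parts])


cfg = TuneConfig.from_string("32,0,True")

try:
    TuneConfig.from_string("os.getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```

For inputs that are plain literals the result matches what `eval` would return, so the swap should be behavior-preserving for well-formed config strings.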
| from aiter.fused_moe_asmjit_aot import fused_moe_asmjit_aot | ||
| from aiter.fused_moe_asmjit_aot import get_tune_space |
| "--e2e_tune", | ||
| action="store_true", | ||
| required=False, | ||
| help="Run an extra round of e2e tuning after main tuning is done, using production-op benchmark as the indicator", |
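As a rough illustration of what the extra round gated by this flag does, the sketch below pairs the flag with a winner-selection rule: a candidate is recorded only if it is strictly faster than the current best baseline. `build_parser`, `pick_winner`, and the timing dictionary are hypothetical; the actual logic in `gemm_moe_tune.py` is not shown in this excerpt.

```python
import argparse


def build_parser():
    """Build a parser exposing only the flag discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--e2e_tune",
        action="store_true",
        help="Run an extra round of e2e tuning after main tuning is done",
    )
    return parser


def pick_winner(baseline_us, candidates_us):
    """Return (name, time_us) of the fastest candidate if it beats baseline.

    `candidates_us` maps a candidate kernel name to its measured end-to-end
    latency in microseconds. Returns None when no candidate is faster, in
    which case the existing tuned entry would be left unchanged.
    """
    if not candidates_us:
        return None
    name = min(candidates_us, key=candidates_us.get)
    best = candidates_us[name]
    return (name, best) if best < baseline_us else None


cli_args = build_parser().parse_args(["--e2e_tune"])
winner = pick_winner(100.0, {"cfg_a": 90.0, "cfg_b": 95.0})
no_winner = pick_winner(100.0, {"cfg_a": 110.0})
```

The strict `<` comparison means ties keep the baseline, which avoids churning tuned CSV entries on measurement noise alone.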
| if kernelName1.startswith("fused_moe_asmjit_aot"): | ||
| from aiter.fused_moe_asmjit_aot import fused_moe_asmjit_aot | ||
| return MOEMetadata( | ||
| None, | ||
| None, | ||
| block_m, | ||
| ksplit, | ||
| run_1stage, | ||
| stage0=functools.partial( | ||
| fused_moe_asmjit_aot, config_string=kernelName1.split("__")[1] | ||
| ), |
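The dispatch above recovers the config string from the tuned kernel name and binds it with `functools.partial`. A minimal sketch of that pattern, runnable without a GPU, is shown below; `fused_moe_stub` stands in for the real `fused_moe_asmjit_aot`, and `select_stage0` and the example kernel name are hypothetical.

```python
import functools

PREFIX = "fused_moe_asmjit_aot"


def fused_moe_stub(hidden_states, config_string):
    # Stand-in for the real implementation; just echoes its inputs.
    return (hidden_states, config_string)


def select_stage0(kernel_name):
    """Return a callable bound to the config encoded in `kernel_name`.

    Returns None for names that do not select the asmjit-AOT path, and
    raises ValueError when the "__<config>" suffix is missing.
    """
    if not kernel_name.startswith(PREFIX):
        return None
    _, sep, config = kernel_name.partition("__")
    if not sep or not config:
        raise ValueError(f"no config string in kernel name: {kernel_name!r}")
    return functools.partial(fused_moe_stub, config_string=config)


stage0 = select_stage0("fused_moe_asmjit_aot__32,0,True")
result = stage0("x")
```

Unlike `split("__")[1]`, `partition` keeps everything after the first `__`, so a config string that itself contains a double underscore is not truncated, and a missing suffix produces an explicit error instead of an `IndexError`.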
| E, N1, K1 = w1.shape | ||
| N2, K2 = w2.shape[1], w2.shape[2] | ||
| TOPK = topk_ids.shape[1] | ||
| fp8_ptpc = w1.dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz) and ( |
Motivation
These kernels are specially designed/optimized for fused MOE (TP8, FP8-per-tensor, FP8-ptpc) problems:
Technical Details
The current tuning methods do not account for extra overheads (MoE sorting, quantization), whereas fmoe_asmjit_aot includes optimizations that target these overheads. We therefore add a new `--e2e_tune` flag to gemm_moe_tune.py. This mode directly compares the performance of fmoe_asmjit_aot against that of the current best fused_moe(); if the former is faster, it is recorded in the destination tuned_fmoe file under model_configs.
python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py -i aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_untuned_fmoe.csv -o aiter/configs/model_configs/qwen3_5_397b_fp8_ptpc_tuned_fmoe.csv --timeout 300 -v --all --e2e_tune
Test Plan
Test Result
Performance improvement from e2e tuning: Hunyuan fp8-per-tensor (TP8)

Performance improvement from e2e tuning: Qwen3.5 fp8-ptpc (TP8)

Submission Checklist
Co-authors: Cheng.Luo@amd.com Luwei.Zhou@amd.com