Add Kimi K2.5/K2.6 FP4 fused MoE tunings for TP2 (inter_dim=1024, 385 experts, top-9) by xaguilar-amd · Pull Request #3287 · ROCm/aiter

xaguilar-amd · 2026-05-20T16:34:35Z

Summary

Adds tuned fused-MoE kernel selections for the Kimi K2.5 / K2.6 FP4 TP2
shape group (cu_num=256, inter_dim=1024, expert=385, topk=9) to
aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv, tuned on MI355X.

This shape group was previously missing from the config (the file only covered
the TP4 / TP8 shapes).

What changed

Adds 32 new rows under (256, *, 7168, 1024, 385, 9), where * is the token count:
- 16 FlyDSL rows (one per token count: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
  1024, 2048, 4096, 8192, 16384, 32768).
- 16 matching flydsl_fallback rows (pure CK two-stage GEMMs) for runtime
  paths where FlyDSL is unavailable.
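
As a quick sanity check on the row count in the list above, the 16 token counts are the powers of two from 1 to 32768, and each one gets a FlyDSL row plus a fallback row. The sketch below only enumerates the shape keys. The tuple layout and the variant labels are illustrative assumptions, not the CSV's actual column layout.

```python
# Token counts listed in the PR: powers of two from 1 (2**0) to 32768 (2**15).
token_counts = [2**i for i in range(16)]

# Fixed part of the shape key, taken from the PR description.
# MODEL_DIM is an assumed name for the 7168 field; the PR does not name it.
CU_NUM, MODEL_DIM, INTER_DIM, EXPERT, TOPK = 256, 7168, 1024, 385, 9

# Variant labels used here for illustration only.
VARIANTS = ("flydsl", "flydsl_fallback")

# One key per (variant, token count) pair.
rows = [
    (CU_NUM, tokens, MODEL_DIM, INTER_DIM, EXPERT, TOPK, variant)
    for variant in VARIANTS
    for tokens in token_counts
]

print(len(token_counts), len(rows))
```

This gives 16 token counts and 2 x 16 = 32 keys, matching the 32 inserted rows.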

Motivation

Inference deployments of Kimi K2.5 / K2.6 MXFP4 on TP2 / 256 CU (MI355X)
currently fall through to untuned defaults for inter_dim=1024, since the
table only carries TP4 (inter_dim=512) and TP8 (inter_dim=256) entries
for the 385/9 expert geometry. Adding the TP2 rows lets the existing
fused-MoE dispatch pick the tuned FlyDSL kernels (or the CK fallback) for
this geometry instead.
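
The fall-through described above can be pictured as a keyed table lookup with a default. The following is a minimal sketch under stated assumptions: `lookup`, the key layout, and the contents of the `before` table are hypothetical stand-ins, not aiter's dispatch code or the real CSV contents. It also shows that the three inter_dim values quoted in this PR are consistent with one total width split across TP ranks.

```python
# inter_dim per tensor-parallel degree, as stated in the PR description.
INTER_DIM_BY_TP = {2: 1024, 4: 512, 8: 256}

# tp * inter_dim is the same for all three entries.
totals = {tp * dim for tp, dim in INTER_DIM_BY_TP.items()}


def lookup(tuned_keys, cu_num, tokens, inter_dim, expert=385, topk=9):
    """Report whether a shape hits the tuned table or falls through.

    Hypothetical stand-in for the real dispatch; only hit/miss is modeled.
    """
    key = (cu_num, tokens, 7168, inter_dim, expert, topk)
    return "tuned" if key in tuned_keys else "untuned default"


tokens = [2**i for i in range(16)]

# Illustrative "before" table holding only TP4/TP8 shapes. The real table's
# token coverage for those shapes is not stated in the PR.
before = {(256, t, 7168, d, 385, 9) for t in tokens for d in (512, 256)}

# "After" table: the same entries plus the TP2 (inter_dim=1024) shapes.
after = before | {(256, t, 7168, 1024, 385, 9) for t in tokens}

print(totals)
print(lookup(before, 256, 64, 1024), "->", lookup(after, 256, 64, 1024))
```

With the illustrative tables, a TP2 shape misses before the change and hits after it, while TP4 and TP8 lookups are unaffected.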

Technical details

Touches only aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv.
Strictly additive vs. current main: 1 file changed, 32 insertions(+), 0 deletions(-).
No existing TP4 (inter_dim=512) or TP8 (inter_dim=256) row, in either
the FlyDSL or the flydsl_fallback variant, is modified, renamed, or
reordered.
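
One way a reviewer could check the "strictly additive" claim is to confirm that every line of the old file survives unchanged, in the same relative order, in the new file. The helper below is a hypothetical sketch of that check on lists of lines. It is not part of aiter, and the sample lines are made up.

```python
def is_strictly_additive(old_lines, new_lines):
    """True if old_lines is an in-order subsequence of new_lines.

    A modified, removed, or reordered old line makes this return False.
    """
    remaining = iter(new_lines)
    # `line in remaining` advances the iterator past the match, so each old
    # line must appear after the previous one.
    return all(line in remaining for line in old_lines)


def added_count(old_lines, new_lines):
    """Number of inserted lines, meaningful only when the change is additive."""
    return len(new_lines) - len(old_lines)


# Made-up sample data for illustration.
old = ["header", "row_a", "row_b"]
new = ["header", "row_a", "row_new_1", "row_b", "row_new_2"]

print(is_strictly_additive(old, new), added_count(old, new))
```

To apply it to the real change, the two line lists could be read from the CSV on main and on the PR branch; for this PR the expected difference is 32 lines.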

Signed-off-by: Xavier Aguilar <Xavier.AguilarFruto@amd.com>

github-actions · 2026-05-20T16:37:00Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3287 --add-label <label>`.

Adds tuned fused-MoE kernel selections for the Kimi K2.5 / K2.6 FP4 TP2

b7349c4

Signed-off-by: Xavier Aguilar <Xavier.AguilarFruto@amd.com>

xaguilar-amd marked this pull request as ready for review May 20, 2026 16:35

xaguilar-amd requested a review from a team May 20, 2026 16:35

xaguilar-amd wants to merge 1 commit into ROCm:main from xaguilar-amd:add_kimi_k2_fp4_tp2_tunings
