Add Kimi K2.5/K2.6 FP4 fused MoE tunings for TP2 (inter_dim=1024, 385 experts, top-9) #3287

Open
xaguilar-amd wants to merge 1 commit into ROCm:main from xaguilar-amd:add_kimi_k2_fp4_tp2_tunings

Conversation

@xaguilar-amd
Contributor

Summary

Adds tuned fused-MoE kernel selections for the Kimi K2.5 / K2.6 FP4 TP2
shape group (cu_num=256, inter_dim=1024, expert=385, topk=9) to
aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv, tuned on MI355X.

This shape group was previously missing from the config (the file only covered
the TP4 / TP8 shapes).

What changed

  • Adds 32 new rows under (256, *, 7168, 1024, 385, 9):
    • 16 FlyDSL rows (one per token count: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
      1024, 2048, 4096, 8192, 16384, 32768).
    • 16 matching flydsl_fallback rows (pure CK two-stage GEMMs) for runtime
      paths where FlyDSL is unavailable.
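
The row count above can be reproduced with a short sketch. The column order `(cu_num, token, model_dim, inter_dim, expert, topk)` and the reading of 7168 as the model dimension are assumptions made for illustration; the actual CSV header may differ, and the variant labels are used here only as tags.

```python
# Shape group from this PR: (256, *, 7168, 1024, 385, 9).
# Assumed column order (not confirmed by the CSV header):
# cu_num, token, model_dim, inter_dim, expert, topk.
CU_NUM, MODEL_DIM, INTER_DIM, EXPERTS, TOPK = 256, 7168, 1024, 385, 9

# Token counts 1, 2, 4, ..., 32768, i.e. 2**0 through 2**15.
token_counts = [2 ** i for i in range(16)]

# One row per (variant, token count); the labels are illustrative tags only.
variants = ("flydsl", "flydsl_fallback")
row_keys = [
    (CU_NUM, tokens, MODEL_DIM, INTER_DIM, EXPERTS, TOPK, variant)
    for variant in variants
    for tokens in token_counts
]

print(len(token_counts), len(row_keys))  # 16 32
```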

Motivation

Inference deployments of Kimi K2.5 / K2.6 MXFP4 on TP2 / 256 CU (MI355X)
currently fall through to untuned defaults for inter_dim=1024, since the
table only carries TP4 (inter_dim=512) and TP8 (inter_dim=256) entries
for the 385/9 expert geometry. Adding the TP2 rows lets the existing
fused-MoE dispatch pick the tuned FlyDSL kernels (or the CK fallback) for
this geometry instead.
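
A minimal sketch of the lookup behaviour described above, not aiter's actual dispatch code. The helper names, the kernel-name strings, and the key layout are hypothetical. The unsharded width of 2048 is inferred from the three TP / inter_dim pairs in this description (2 x 1024 = 4 x 512 = 8 x 256).

```python
from typing import Dict, Optional, Tuple

# Hypothetical key layout: (cu_num, token, model_dim, inter_dim, expert, topk).
ShapeKey = Tuple[int, int, int, int, int, int]


def inter_dim_for_tp(tp: int, full_inter_dim: int = 2048) -> int:
    """Per-rank inter_dim when the expert intermediate width is split across tp ranks."""
    if full_inter_dim % tp != 0:
        raise ValueError("full_inter_dim must be divisible by tp")
    return full_inter_dim // tp


def select_kernel(table: Dict[ShapeKey, str], key: ShapeKey) -> Optional[str]:
    """Return the tuned kernel name for key, or None to signal the untuned default."""
    return table.get(key)


# Before this PR: only TP4 / TP8 shapes are present (kernel names are placeholders).
table: Dict[ShapeKey, str] = {
    (256, 64, 7168, inter_dim_for_tp(4), 385, 9): "tuned_tp4_placeholder",
    (256, 64, 7168, inter_dim_for_tp(8), 385, 9): "tuned_tp8_placeholder",
}

tp2_key: ShapeKey = (256, 64, 7168, inter_dim_for_tp(2), 385, 9)
before = select_kernel(table, tp2_key)  # None: falls through to the default

# After this PR: a TP2 row exists, so the same lookup finds a tuned entry.
table[tp2_key] = "tuned_tp2_placeholder"
after = select_kernel(table, tp2_key)
```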

Technical details

  • Touches only aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv.
  • Strictly additive vs. current main: 1 file changed, 32 insertions(+), 0 deletions(-).
  • No existing TP4 (inter_dim=512) or TP8 (inter_dim=256) row — neither
    the FlyDSL nor the flydsl_fallback variant — is modified, renamed, or
    reordered.
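
One way a reviewer might confirm the "strictly additive" property locally is sketched below. It treats each CSV line as an opaque string and checks that the old rows survive unchanged and in the same relative order. The function name and the sample rows are hypothetical, and this is not part of the repository's tooling.

```python
from typing import Iterable, List


def is_strictly_additive(old_rows: Iterable[str], new_rows: List[str]) -> bool:
    """True if every old row appears in new_rows unchanged and in the same relative order."""
    remaining = iter(new_rows)
    # `row in remaining` advances the iterator, so matches must occur in order.
    return all(row in remaining for row in old_rows)


# Placeholder rows standing in for real CSV lines.
old = ["tp4_row_a", "tp8_row_a"]
new_ok = ["tp4_row_a", "tp8_row_a", "tp2_row_a", "tp2_row_b"]
new_reordered = ["tp8_row_a", "tp4_row_a", "tp2_row_a"]

print(is_strictly_additive(old, new_ok))         # True
print(is_strictly_additive(old, new_reordered))  # False
print(len(new_ok) - len(old))                    # 2 rows added
```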

Signed-off-by: Xavier Aguilar <Xavier.AguilarFruto@amd.com>
xaguilar-amd marked this pull request as ready for review May 20, 2026 16:35
xaguilar-amd requested a review from a team May 20, 2026 16:35
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label            Tests
ci:triton-300x   Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang        SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom          ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full     ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm          vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all           All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3287 --add-label <label>
