[FlyDSL][MOE] Enable a8w8 blockscale moe splitk in flydsl #3280

Open
lalala-sh wants to merge 1 commit into main from wjx/a8w8_moe_perf

Conversation

@lalala-sh
Contributor

@lalala-sh lalala-sh commented May 20, 2026

Motivation

Enable the FlyDSL backend for a8w8 FP8 blockscale (per_1x128 / per_128x128) MoE in fused_moe, and add the FlyDSL stage1/stage2 blockscale kernels, tuner integration, and tuned configs for the four dsv3 v3 shapes: (model_dim=7168, inter_dim={256,512}) × (E,topk)={(256,8), (257,9)}.
For small-token decode (M ≤ 4), the FlyDSL 2-stage path now consistently beats the ASM 1-stage blockscale kernel on gfx950 (e.g. M=1: 23.6 us vs 26.6 us, ~13% faster). For medium and large M, the tuner still picks the ASM 1-stage kernel where it is faster, so the change is strictly opt-in via the tuned CSV.
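As a quick check of the M=1 figure, the sketch below recomputes the gain from the two latencies quoted above. The helper names are illustrative and not part of the PR. It shows that "~13% faster" corresponds to the speedup ratio (ASM time divided by FlyDSL time), while the same numbers expressed as a latency reduction come out closer to 11%.

```python
# Latencies quoted in the PR description for M=1 on gfx950, in microseconds.
FLYDSL_2STAGE_US = 23.6
ASM_1STAGE_US = 26.6


def speedup(baseline_us: float, candidate_us: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


def latency_reduction(baseline_us: float, candidate_us: float) -> float:
    """Return the fraction of baseline latency removed by the candidate."""
    return (baseline_us - candidate_us) / baseline_us


ratio = speedup(ASM_1STAGE_US, FLYDSL_2STAGE_US)
saved = latency_reduction(ASM_1STAGE_US, FLYDSL_2STAGE_US)

# Prints: speedup: 1.127x, latency reduction: 11.3%
print(f"speedup: {ratio:.3f}x, latency reduction: {saved:.1%}")
```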

Technical Details

Test Plan

Test Result

Submission Checklist

@lalala-sh lalala-sh requested a review from a team May 20, 2026 06:09
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3280 --add-label <label>

if out_dtype not in ("f16", "bf16"):
    raise ValueError(f"out_dtype must be 'f16' or 'bf16', got {out_dtype!r}")
# NOTE: don't materialize MLIR types outside an active MLIR Context.
out_mlir = lambda: (lambda ty: ty() if callable(ty) else ty)(T.f16 if out_dtype == "f16" else T.bf16)

⚠️ [ruff] <E731> reported by reviewdog 🐶
Do not assign a lambda expression, use a def

Suggested change
- out_mlir = lambda: (lambda ty: ty() if callable(ty) else ty)(T.f16 if out_dtype == "f16" else T.bf16)
+ def out_mlir():
+     return (lambda ty: ty() if callable(ty) else ty)(T.f16 if out_dtype == "f16" else T.bf16)
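One possible way to go a step beyond the linter's suggestion is to name the inner "call it if it is callable" step as well, so the deferred lookup reads as two small functions. The sketch below is only an illustration: the `T` class is a hypothetical stand-in for the MLIR type namespace (the real one needs an active MLIR Context), and `resolve` and `make_out_mlir` are names invented here, not functions from the PR.

```python
class T:
    """Hypothetical stand-in for the MLIR type namespace used in the PR."""

    @staticmethod
    def f16():
        # Modeled as a callable that builds the type on demand.
        return "f16-type"

    # Modeled as a plain value to exercise the non-callable branch.
    bf16 = "bf16-type"


def resolve(ty):
    """Call ty if it is callable, otherwise return it unchanged."""
    return ty() if callable(ty) else ty


def make_out_mlir(out_dtype: str):
    """Return a zero-argument function that resolves the output type lazily."""
    if out_dtype not in ("f16", "bf16"):
        raise ValueError(f"out_dtype must be 'f16' or 'bf16', got {out_dtype!r}")

    def out_mlir():
        # Nothing is materialized until the caller invokes out_mlir().
        return resolve(T.f16 if out_dtype == "f16" else T.bf16)

    return out_mlir


print(make_out_mlir("f16")())   # f16-type
print(make_out_mlir("bf16")())  # bf16-type
```

Behavior is the same as the lambda form; the type is still looked up only when `out_mlir()` is called.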

    elem_type_tag = "bf16"
else:
    raise ValueError(f"Unsupported dtype: {dtype_str}")
compute_type = lambda: T.f32

⚠️ [ruff] <E731> reported by reviewdog 🐶
Do not assign a lambda expression, use a def

Suggested change
- compute_type = lambda: T.f32
+ def compute_type():
+     return T.f32
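For context on why ruff flags E731: a lambda bound to a name behaves the same as a `def` when called, but it carries the generic name `<lambda>`, which is what shows up in tracebacks and reprs. A minimal sketch, using a plain string in place of `T.f32` so that it runs without MLIR:

```python
# Lambda assigned to a name: the pattern E731 flags.
compute_type_lambda = lambda: "f32"  # noqa: E731


# Equivalent def: the form the linter suggests.
def compute_type():
    return "f32"


# Both return the same value when called.
print(compute_type_lambda() == compute_type())  # True

# Only the def carries a meaningful name for tracebacks and debugging.
print(compute_type_lambda.__name__)  # <lambda>
print(compute_type.__name__)         # compute_type
```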

else:
    raise ValueError(f"Unsupported dtype: {dtype_str}")
compute_type = lambda: T.f32
i8_type = lambda: T.i8

⚠️ [ruff] <E731> reported by reviewdog 🐶
Do not assign a lambda expression, use a def

Suggested change
- i8_type = lambda: T.i8
+ def i8_type():
+     return T.i8

Comment on lines +2911 to +2912
a_dtype_str = "fp8"


⚠️ [ruff] <F841> reported by reviewdog 🐶
Local variable a_dtype_str is assigned to but never used

Suggested change
- a_dtype_str = "fp8"

@coderfeli
Collaborator

@lalala-sh CK or FlyDSL?

2 participants