Skip to content

Zhimding/fix aot moe pad 0522#3313

Open
coderfeli wants to merge 5 commits into
mainfrom
zhimding/fix_aot_moe_pad_0522
Open

Zhimding/fix aot moe pad 0522#3313
coderfeli wants to merge 5 commits into
mainfrom
zhimding/fix_aot_moe_pad_0522

Conversation

@coderfeli
Copy link
Copy Markdown
Collaborator

Zhimding/fix aot moe pad 0522. Enable aot for dim pad

zhiding512 and others added 2 commits May 22, 2026 11:08
Signed-off-by: root <zhimding@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@coderfeli coderfeli requested a review from a team May 22, 2026 15:28
@github-actions
Copy link
Copy Markdown
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3313 --add-label <label>

@coderfeli coderfeli marked this pull request as draft May 22, 2026 15:29
coderfeli and others added 3 commits May 23, 2026 11:08
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@coderfeli coderfeli marked this pull request as ready for review May 23, 2026 12:33
Copy link
Copy Markdown
Collaborator Author

@coderfeli coderfeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plumbing looks fine, but the pad column handling and the test-reference patching feel hacky. A few inline comments.

Comment thread aiter/aot/flydsl/moe.py


def _row_optional_int(row: dict[str, str], name: str, *aliases: str) -> int:
for key in (name, *aliases):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carrying the typo hiddne_pad as a permanent alias bakes the bug into the schema forever. Prefer a one-shot rename of the CSV column + code, and drop the alias. The runtime cache key is (model_dim_pad, inter_dim_pad) ints, so renaming the CSV column does not cause cache misses.

Comment thread aiter/aot/flydsl/moe.py
shape_str = (
f"{kernel_name} "
f"model_dim={model_dim} inter_dim={inter_dim} "
f"hidden_pad={kwargs.get('hidden_pad', 0)} "
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hidden_pad / intermediate_pad are already declared as explicit kwargs on _precompile_to_cache; reading them back through kwargs.get(...) here is inconsistent. Either accept them as named params on compile_one_config or pass them through a single dict — picking one keeps the call graph honest.

Comment thread aiter/fused_moe.py
else:
df["hidden_pad"] = 0
df["hidden_pad"] = df["hidden_pad"].fillna(0).astype(int)
if "intermediate_pad" not in df.columns:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three places now have to know about the pad columns (_normalize_pad_cols, _FILL_DEFAULTS, dedup keys). Consider a single PAD_COLUMNS = ("hidden_pad", "intermediate_pad") constant and iterate, so future pad fields don't have to be added in N spots.

Comment thread aiter/jit/core.py
)

_FILL_DEFAULTS = {"xbf16": 0, "run_1stage": 0, "ksplit": 0}
_FILL_DEFAULTS = {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as on the AOT side: please don't enshrine hiddne_pad in _FILL_DEFAULTS. Fix the CSV(s) and drop the misspelled key.

w1[:, -intermediate_pad:, :] = 0
w1[:, inter_dim - intermediate_pad : inter_dim, :] = 0
exp_bias1 = torch.clamp(torch.randn((E, inter_dim * 2), dtype=dtype), -1.0, 1.0)
# Dense torch reference still evaluates padded lanes; keep padded
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zeroing padded bias lanes in the reference path is masking a semantic mismatch — the dense torch reference still evaluates padded lanes, while the kernel does not. A cleaner approach: build the reference at logical dims (model_dim - hidden_pad, inter_dim - intermediate_pad), then pad to physical dims only for the kernel input. That way you don't need to patch bias/output to make the compare pass.

):
if get_gfx() not in ["gfx950"] and qType in [aiter.QuantType.per_1x32]:
return
assert (
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert 0 <= hidden_pad < model_dim allows nonsense values like model_dim - 1. The real constraint is probably (model_dim - hidden_pad) % tile_n == 0 (or similar tile alignment). Consider a stronger check, or at minimum a comment stating which invariant this guards.

valid_model_dim = model_dim - hidden_pad
out2_ref_check = out2_ref[:, :valid_model_dim]
out2_ck_check = out2_ck[:, :valid_model_dim]
err = checkAllclose(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slicing both ref and ck to valid_model_dim here is fine, but combined with the bias zeroing above it's a double patch. If the reference is rebuilt at logical dims (see comment on bias), this slice goes away.

256,16384,3072,512,128,4,ActivationType.Swiglu,torch.bfloat16,torch.float4_e2m1fn_x2,torch.float4_e2m1fn_x2,QuantType.per_1x32,1,0,64,0,206.4827,flydsl_moe1_afp4_wfp4_bf16_t64x128x256_w4_bnt0_xcd4,0.0%,285.1907,flydsl_moe2_afp4_wfp4_bf16_t64x256x256_reduce_xcd4,0.6%,491.6734,0,0,1257.9,1535.52,
256,32768,3072,512,128,4,ActivationType.Swiglu,torch.bfloat16,torch.float4_e2m1fn_x2,torch.float4_e2m1fn_x2,QuantType.per_1x32,1,0,64,0,367.6363,flydsl_moe1_afp4_wfp4_bf16_t64x128x256_w2_bnt0_xcd4,0.0%,517.2156,flydsl_moe2_afp4_wfp4_bf16_t64x256x256_reduce_xcd4,0.6%,884.8519,0,0,1397.92,1023.87,
cu_num,token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1,block_m,ksplit,us1,kernelName1,err1,us2,kernelName2,err2,us,run_1stage,xbf16,hidden_pad,intermediate_pad,tflops,bw,_tag
256,256,3072,3072,128,4,ActivationType.Swiglu,torch.bfloat16,torch.float4_e2m1fn_x2,torch.float4_e2m1fn_x2,QuantType.per_1x32,1,0,32,0,208.4462,flydsl_moe1_afp4_wfp4_bf16_t32x128x256_w2,0.0%,109.1431,flydsl_moe2_afp4_wfp4_bf16_t32x256x256_atomic_bnt2_persist,5.0%,317.5893,0,0,192,192,182.57,11418.01,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 24 rows use the same 3 constants for the pad columns (192 / {192,96,152}). If pad is a function of model_dim / inter_dim + tile alignment, it should be derived in code rather than duplicated in the tuned CSV — otherwise every retune has to remember to fill these columns consistently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants