Zhimding/fix aot moe pad 0522#3313
Conversation
Signed-off-by: root <zhimding@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
🏷️ CI GuideRuns automatically on every PR:
Extended tests (opt-in via labels):
|
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
coderfeli
left a comment
There was a problem hiding this comment.
Plumbing looks fine, but the pad column handling and the test-reference patching feel hacky. A few inline comments.
|
|
||
|
|
||
| def _row_optional_int(row: dict[str, str], name: str, *aliases: str) -> int: | ||
| for key in (name, *aliases): |
There was a problem hiding this comment.
Carrying the typo hiddne_pad as a permanent alias bakes the bug into the schema forever. Prefer a one-shot rename of the CSV column + code, and drop the alias. The runtime cache key is (model_dim_pad, inter_dim_pad) ints, so renaming the CSV column does not cause cache misses.
| shape_str = ( | ||
| f"{kernel_name} " | ||
| f"model_dim={model_dim} inter_dim={inter_dim} " | ||
| f"hidden_pad={kwargs.get('hidden_pad', 0)} " |
There was a problem hiding this comment.
hidden_pad / intermediate_pad are already declared as explicit kwargs on _precompile_to_cache; reading them back through kwargs.get(...) here is inconsistent. Either accept them as named params on compile_one_config or pass them through a single dict — picking one keeps the call graph honest.
| else: | ||
| df["hidden_pad"] = 0 | ||
| df["hidden_pad"] = df["hidden_pad"].fillna(0).astype(int) | ||
| if "intermediate_pad" not in df.columns: |
There was a problem hiding this comment.
Three places now have to know about the pad columns (_normalize_pad_cols, _FILL_DEFAULTS, dedup keys). Consider a single PAD_COLUMNS = ("hidden_pad", "intermediate_pad") constant and iterate, so future pad fields don't have to be added in N spots.
| ) | ||
|
|
||
| _FILL_DEFAULTS = {"xbf16": 0, "run_1stage": 0, "ksplit": 0} | ||
| _FILL_DEFAULTS = { |
There was a problem hiding this comment.
Same comment as on the AOT side: please don't enshrine hiddne_pad in _FILL_DEFAULTS. Fix the CSV(s) and drop the misspelled key.
| w1[:, -intermediate_pad:, :] = 0 | ||
| w1[:, inter_dim - intermediate_pad : inter_dim, :] = 0 | ||
| exp_bias1 = torch.clamp(torch.randn((E, inter_dim * 2), dtype=dtype), -1.0, 1.0) | ||
| # Dense torch reference still evaluates padded lanes; keep padded |
There was a problem hiding this comment.
Zeroing padded bias lanes in the reference path is masking a semantic mismatch — the dense torch reference still evaluates padded lanes, while the kernel does not. A cleaner approach: build the reference at logical dims (model_dim - hidden_pad, inter_dim - intermediate_pad), then pad to physical dims only for the kernel input. That way you don't need to patch bias/output to make the compare pass.
| ): | ||
| if get_gfx() not in ["gfx950"] and qType in [aiter.QuantType.per_1x32]: | ||
| return | ||
| assert ( |
There was a problem hiding this comment.
assert 0 <= hidden_pad < model_dim allows nonsense values like model_dim - 1. The real constraint is probably (model_dim - hidden_pad) % tile_n == 0 (or similar tile alignment). Consider a stronger check, or at minimum a comment stating which invariant this guards.
| valid_model_dim = model_dim - hidden_pad | ||
| out2_ref_check = out2_ref[:, :valid_model_dim] | ||
| out2_ck_check = out2_ck[:, :valid_model_dim] | ||
| err = checkAllclose( |
There was a problem hiding this comment.
Slicing both ref and ck to valid_model_dim here is fine, but combined with the bias zeroing above it's a double patch. If the reference is rebuilt at logical dims (see comment on bias), this slice goes away.
| 256,16384,3072,512,128,4,ActivationType.Swiglu,torch.bfloat16,torch.float4_e2m1fn_x2,torch.float4_e2m1fn_x2,QuantType.per_1x32,1,0,64,0,206.4827,flydsl_moe1_afp4_wfp4_bf16_t64x128x256_w4_bnt0_xcd4,0.0%,285.1907,flydsl_moe2_afp4_wfp4_bf16_t64x256x256_reduce_xcd4,0.6%,491.6734,0,0,1257.9,1535.52, | ||
| 256,32768,3072,512,128,4,ActivationType.Swiglu,torch.bfloat16,torch.float4_e2m1fn_x2,torch.float4_e2m1fn_x2,QuantType.per_1x32,1,0,64,0,367.6363,flydsl_moe1_afp4_wfp4_bf16_t64x128x256_w2_bnt0_xcd4,0.0%,517.2156,flydsl_moe2_afp4_wfp4_bf16_t64x256x256_reduce_xcd4,0.6%,884.8519,0,0,1397.92,1023.87, | ||
| cu_num,token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1,block_m,ksplit,us1,kernelName1,err1,us2,kernelName2,err2,us,run_1stage,xbf16,hidden_pad,intermediate_pad,tflops,bw,_tag | ||
| 256,256,3072,3072,128,4,ActivationType.Swiglu,torch.bfloat16,torch.float4_e2m1fn_x2,torch.float4_e2m1fn_x2,QuantType.per_1x32,1,0,32,0,208.4462,flydsl_moe1_afp4_wfp4_bf16_t32x128x256_w2,0.0%,109.1431,flydsl_moe2_afp4_wfp4_bf16_t32x256x256_atomic_bnt2_persist,5.0%,317.5893,0,0,192,192,182.57,11418.01, |
There was a problem hiding this comment.
All 24 rows use the same 3 constants for the pad columns (192 / {192,96,152}). If pad is a function of model_dim / inter_dim + tile alignment, it should be derived in code rather than duplicated in the tuned CSV — otherwise every retune has to remember to fill these columns consistently.
Zhimding/fix aot moe pad 0522. Enable aot for dim pad