[Feat] Align quant and fused layernorm kernels with aiter/triton #549

Open
cschenjunlin wants to merge 6 commits into main from cjl/fused_quant_layernorm

cschenjunlin (Contributor) commented May 20, 2026

Motivation

Align quant and fused layernorm kernels with aiter/triton

Technical Details

Test Plan

Test Result

Tested on MI308+ROCm7.1:

Quant layernorm performance comparison:

====================================================================================================
Perf Compare (gpu us): FlyDSL vs AIter
====================================================================================================
op            shape       dtype  FlyDSL(gpu us)  AIter(gpu us)  speedup
layernorm_dq  64x256      f32              29.2           37.3    1.28x
layernorm_dq  128x1024    f32              28.9           37.7    1.30x
layernorm_dq  32x128      f16              28.8           36.8    1.28x
layernorm_dq  64x2000     f32              29.7           39.3    1.32x
layernorm_dq  16x512      bf16             29.7           39.0    1.32x
layernorm_dq  1024x8192   bf16             65.8           55.0    0.84x
layernorm_dq  32768x8192  bf16          1,934.4        1,483.2    0.77x
layernorm_sq  64x256      f32              32.6           40.8    1.25x
layernorm_sq  128x1024    f32              33.2           42.2    1.27x
layernorm_sq  32x128      f16              33.2           40.3    1.21x
layernorm_sq  64x2000     f32              32.6           42.7    1.31x
layernorm_sq  16x512      bf16             32.1           42.3    1.32x
layernorm_sq  1024x8192   bf16             70.1           58.2    0.83x
layernorm_sq  32768x8192  bf16          2,113.5        1,559.7    0.74x
====================================================================================================
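
The speedup column appears to be the AIter time divided by the FlyDSL time, so values above 1.00x mean FlyDSL is faster and values below 1.00x (the two 8192-wide bf16 shapes) mean it is slower. A small Python sketch of that reading, using two rows copied from the table; the helper name `speedup` is made up for illustration and is not part of the benchmark script:

```python
def speedup(flydsl_us: float, aiter_us: float) -> str:
    """Format AIter time / FlyDSL time the way the table prints it."""
    return f"{aiter_us / flydsl_us:.2f}x"


# layernorm_dq 64x256 f32
print(speedup(29.2, 37.3))      # 1.28x
# layernorm_dq 32768x8192 bf16
print(speedup(1934.4, 1483.2))  # 0.77x
```

Because the printed timings are already rounded to 0.1 us, recomputing a ratio from the table can differ from the printed speedup in the last digit.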

Submission Checklist

  • fused_add_layernorm_kernel
  • quant_layernorm_kernel
  • quant_fused_add_layernorm_kernel
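
For readers unfamiliar with these variants, the following is a plain-Python reference sketch of one plausible reading of what they compute per row. It is not the FlyDSL or aiter implementation. It assumes a residual add before the norm for the fused-add variants and symmetric per-row int8 dynamic quantization for the quant variants; the PR does not state the quantization scheme, and all function names below are hypothetical.

```python
import math


def layernorm_row(x, gamma, beta, eps=1e-5):
    """Standard layernorm over one row (biased variance)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def dynamic_quant_row(y, qmax=127):
    """Assumed scheme: symmetric per-row quantization to [-qmax, qmax]."""
    amax = max(abs(v) for v in y)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in y]
    return q, scale


def quant_fused_add_layernorm_row(x, residual, gamma, beta, eps=1e-5):
    """Residual add, then layernorm, then per-row quantization."""
    summed = [a + b for a, b in zip(x, residual)]
    y = layernorm_row(summed, gamma, beta, eps)
    q, scale = dynamic_quant_row(y)
    # The summed row is returned too, since fused-add kernels typically
    # write it back as the next residual (an assumption here).
    return q, scale, summed


q, scale, summed = quant_fused_add_layernorm_row(
    [1.0, 2.0, 3.0, 4.0], [0.0] * 4, [1.0] * 4, [0.0] * 4
)
print(q, scale)
```

A reference like this is mainly useful for checking kernel output on small shapes; it says nothing about how the kernels tile or reduce the row.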

Use the current fx.* numeric and register helper style in layernorm quant variants so they stay consistent with main's RMSNorm cleanup.
Comment thread on kernels/layernorm_kernel.py (outdated)
RED_SLOTS = max(1, (BLOCK_THREADS + WARP_SIZE - 1) // WARP_SIZE)
elem_bits = 32 if dtype_str == "f32" else 16

allocator = SmemAllocator(None, arch=arch)
Collaborator


Could you update to use SharedAllocator? Old interface SmemAllocator may be deprecated in the future.

Contributor Author


I have replaced SmemAllocator with SharedAllocator in all the variant kernels.
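
For context on the snippet under review: the `RED_SLOTS` expression is a ceiling division of the block's thread count by the warp size, clamped to at least 1, which presumably gives one reduction slot per warp. A minimal sketch in plain Python; the function name `red_slots` and the example sizes are illustrative and not taken from the PR:

```python
def red_slots(block_threads: int, warp_size: int) -> int:
    """ceil(block_threads / warp_size), but never fewer than one slot."""
    return max(1, (block_threads + warp_size - 1) // warp_size)


print(red_slots(256, 64))  # 4: four full warps
print(red_slots(100, 64))  # 2: a partial warp still needs its own slot
print(red_slots(32, 64))   # 1: fewer threads than one warp
```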
