Enable per-token scaled FP4 grouped gemm on B200 by jwfromm · Pull Request #356 · meta-pytorch/MSLK

jwfromm · 2026-05-22T21:44:06Z

Summary: Adds templating so that the per-token scaled FP4 grouped GEMM kernel can run on GB200 as well as GB300. The target is selected statically at compile time, so there is no impact on performance. On GB200, the new path performs similarly to the standard global-scale grouped GEMM.

Differential Revision: D106103018
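
For readers unfamiliar with the pattern, the snippet below is a minimal host-side sketch of static, template-based selection of a target. It is not code from this PR: `Arch`, `ArchTraits`, `kPlaceholderTile`, `launch_per_token_fp4_grouped_gemm`, and `dispatch` are hypothetical names, and the tile values are placeholders rather than real kernel settings. It only illustrates why such templating need not add runtime cost: everything target-specific is fixed when the template is instantiated, and the only runtime branch is one host-side switch per launch.

```cpp
#include <cassert>
#include <string>

// Hypothetical target tags (illustrative only; not identifiers from the PR).
enum class Arch { GB200, GB300 };

// Per-target compile-time settings. The values are placeholders and do not
// describe real hardware or the kernel's actual configuration.
template <Arch A>
struct ArchTraits;

template <>
struct ArchTraits<Arch::GB200> {
  static constexpr const char* kName = "gb200";
  static constexpr int kPlaceholderTile = 128;
};

template <>
struct ArchTraits<Arch::GB300> {
  static constexpr const char* kName = "gb300";
  static constexpr int kPlaceholderTile = 256;
};

// Stand-in for a kernel launcher. Every target-dependent value is a
// compile-time constant of the instantiation, so the body contains no
// runtime check on the target.
template <Arch A>
std::string launch_per_token_fp4_grouped_gemm(int num_groups) {
  assert(num_groups > 0);
  constexpr int tile = ArchTraits<A>::kPlaceholderTile;
  return std::string(ArchTraits<A>::kName) + ":tile" + std::to_string(tile) +
         ":groups" + std::to_string(num_groups);
}

// The single runtime decision: pick an instantiation once, on the host.
std::string dispatch(Arch arch, int num_groups) {
  switch (arch) {
    case Arch::GB200:
      return launch_per_token_fp4_grouped_gemm<Arch::GB200>(num_groups);
    case Arch::GB300:
      return launch_per_token_fp4_grouped_gemm<Arch::GB300>(num_groups);
  }
  return {};
}
```

In this sketch, supporting another target would mean adding a tag, a traits specialization, and one `case`, while the existing instantiations stay unchanged.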

meta-codesync · 2026-05-22T21:44:16Z

@jwfromm has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106103018.

meta-cla Bot added the cla signed label May 22, 2026

meta-codesync Bot added the fb-exported and meta-exported labels May 22, 2026

jwfromm wants to merge 1 commit into meta-pytorch:main from jwfromm:export-D106103018