Fused CuteDSL kernel for block selection scoring (#301) by jduprat · Pull Request #301 · meta-pytorch/MSLK

jduprat · 2026-04-02T05:48:00Z

Summary:

Replace ~10 kernel launches per query tile with a GEMM-based scoring pipeline
using the Q_mean algebraic identity: mean(Q @ K) = mean(Q) @ K, where the mean
is taken over the query rows of a tile. This reduces scoring from a
(q_tile_size, D) x (D, N_cmp) GEMM per tile to a single (D,) . (D, N_cmp)
GEMV: 256x fewer FLOPs.
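
The identity holds because the matrix product is linear: averaging the rows of Q @ K gives the same vector as averaging the rows of Q first and multiplying once. The following is a minimal NumPy sketch, not the PR's PyTorch code. The sizes are illustrative assumptions; q_tile_size = 256 is chosen only because it is consistent with the 256x figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
q_tile_size, D, N_cmp = 256, 128, 512  # assumed sizes, not taken from the PR

Q = rng.standard_normal((q_tile_size, D)).astype(np.float32)  # one query tile
K = rng.standard_normal((D, N_cmp)).astype(np.float32)        # compressed keys, already transposed

# Formulation 1: full per-tile GEMM, then mean over the tile's rows.
scores_gemm = (Q @ K).mean(axis=0)   # shape (N_cmp,)

# Formulation 2: mean over the tile's rows first, then a single GEMV.
q_mean = Q.mean(axis=0)              # shape (D,)
scores_gemv = q_mean @ K             # shape (N_cmp,)

# The two agree up to floating-point rounding.
max_abs_diff = float(np.abs(scores_gemm - scores_gemv).max())

# FLOP counts (2 per multiply-accumulate) for each formulation.
flops_gemm = 2 * q_tile_size * D * N_cmp
flops_gemv = 2 * D * N_cmp
flop_ratio = flops_gemm // flops_gemv  # equals q_tile_size
```

The FLOP ratio is exactly q_tile_size, since the GEMV drops the M dimension of the per-tile GEMM from q_tile_size rows to one.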

Key components:

- _compute_q_mean(): Single PyTorch kernel computes per-tile mean of Q in
  fp32. Supports both 4D fixed-length and 3D varlen (with cu_seqlens).
- _score_and_topk(): GQA-aware bmm that folds GQA groups into the M
  dimension of the GEMM, avoiding K_cmp expansion from H_kv to H heads:
  (B*H_kv, n_tiles*groups, D) @ (B*H_kv, D, N_cmp). cuBLAS GEMM.
- fused_score_and_select_blocks(): Unified entry for selected branch.
- fused_score_and_select_all(): Computes GEMM once, derives indices for
  both selected and compressed branches (avoids duplicate GEMM).
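
The GQA folding described for _score_and_topk() can be sketched as follows. This is an illustration in NumPy under stated assumptions, not the PR's implementation: the function names score_gqa_folded and topk_blocks are hypothetical, the tensor layouts (B, H, n_tiles, D) and (B, H_kv, N_cmp, D) are assumed, and query head h is assumed to share KV head h // groups. How the compressed-branch indices are derived from the same scores is not specified in the description, so the sketch stops at top-k selection.

```python
import numpy as np

def score_gqa_folded(q_mean, k_cmp):
    """q_mean: (B, H, n_tiles, D); k_cmp: (B, H_kv, N_cmp, D).

    Returns scores of shape (B, H, n_tiles, N_cmp) without expanding k_cmp
    from H_kv to H heads.
    """
    B, H, n_tiles, D = q_mean.shape
    _, H_kv, N_cmp, _ = k_cmp.shape
    groups = H // H_kv
    # Assumed layout: query head h uses KV head h // groups.
    q = q_mean.reshape(B, H_kv, groups, n_tiles, D)
    # Fold the group axis into the GEMM M dimension: (B*H_kv, n_tiles*groups, D).
    q = q.transpose(0, 1, 3, 2, 4).reshape(B * H_kv, n_tiles * groups, D)
    # Keys become (B*H_kv, D, N_cmp); each KV head is read once.
    k = k_cmp.reshape(B * H_kv, N_cmp, D).transpose(0, 2, 1)
    s = np.matmul(q, k)  # one batched GEMM: (B*H_kv, n_tiles*groups, N_cmp)
    # Undo the fold to recover per-query-head scores.
    s = s.reshape(B, H_kv, n_tiles, groups, N_cmp).transpose(0, 1, 3, 2, 4)
    return s.reshape(B, H, n_tiles, N_cmp)

def topk_blocks(scores, k):
    """Indices of the k highest-scoring blocks per tile, sorted ascending."""
    idx = np.argsort(-scores, axis=-1)[..., :k]
    return np.sort(idx, axis=-1)

# Small demo with assumed sizes.
rng = np.random.default_rng(0)
B, H, H_kv, n_tiles, D, N_cmp = 2, 8, 2, 5, 16, 12
q_mean = rng.standard_normal((B, H, n_tiles, D)).astype(np.float32)
k_cmp = rng.standard_normal((B, H_kv, N_cmp, D)).astype(np.float32)
scores = score_gqa_folded(q_mean, k_cmp)
selected = topk_blocks(scores, k=4)
```

A reference that expands K with np.repeat(k_cmp, groups, axis=1) produces the same scores, but materializes groups times as much key data; the fold avoids that copy.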

Chunked processing (64 Q-tiles per chunk) bounds peak memory.
All scoring in fp32 for numerical stability with bf16/fp16 inputs.
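
A minimal sketch of the chunking idea, assuming a single head, a fixed-length (N, D) query with N divisible by q_tile_size, and hypothetical helper names (per_tile_mean_fp32, chunked_topk). Only the 64-tile chunk size and the fp32 scoring come from the description; everything else is illustrative NumPy rather than the PR's code.

```python
import numpy as np

def per_tile_mean_fp32(q, q_tile_size):
    """q: (N, D) in fp16/bf16-like precision. Returns (n_tiles, D) in fp32."""
    n, d = q.shape
    q32 = q.astype(np.float32)  # upcast before reducing
    return q32.reshape(n // q_tile_size, q_tile_size, d).mean(axis=1)

def chunked_topk(q, k_cmp, q_tile_size, top_k, tiles_per_chunk=64):
    """Score tiles against k_cmp (N_cmp, D) one chunk at a time.

    Peak score memory is (tiles_per_chunk, N_cmp) instead of (n_tiles, N_cmp).
    """
    q_mean = per_tile_mean_fp32(q, q_tile_size)
    k32_t = k_cmp.astype(np.float32).T  # (D, N_cmp), fp32
    n_tiles = q_mean.shape[0]
    selected = np.empty((n_tiles, top_k), dtype=np.int64)
    for start in range(0, n_tiles, tiles_per_chunk):
        stop = min(start + tiles_per_chunk, n_tiles)
        scores = q_mean[start:stop] @ k32_t  # at most (64, N_cmp) live at once
        order = np.argsort(-scores, axis=-1, kind="stable")
        selected[start:stop] = order[:, :top_k]
    return selected

# Demo with assumed sizes: 150 tiles are processed as chunks of 64, 64, and 22.
rng = np.random.default_rng(0)
q = rng.standard_normal((150 * 4, 8)).astype(np.float16)
k_cmp = rng.standard_normal((20, 8)).astype(np.float16)
selected = chunked_topk(q, k_cmp, q_tile_size=4, top_k=3)
```

Because each tile's scores depend only on that tile's mean, chunking changes peak memory but not which blocks are selected.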

NSA becomes faster than dense FA4 at ~21K tokens with this optimization;
reaches 20.8x speedup at 1M.
{F1987648198}

Differential Revision: D99181843

meta-codesync · 2026-04-02T05:48:07Z

@jduprat has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99181843.

Summary: Replace ~10 kernel launches per query tile with a GEMM-based scoring pipeline using the Q_mean algebraic identity: mean(Q @ K) = mean(Q) @ K, reducing scoring from a (q_tile_size, D) x (D, N_cmp) GEMM per tile to a single (D,) . (D, N_cmp) GEMV: 256x fewer FLOPs.

Key components:
- _compute_q_mean(): Single PyTorch kernel computes per-tile mean of Q in fp32. Supports both 4D fixed-length and 3D varlen (with cu_seqlens).
- _score_and_topk(): GQA-aware bmm that folds GQA groups into the M dimension of the GEMM, avoiding K_cmp expansion from H_kv to H heads: (B*H_kv, n_tiles*groups, D) @ (B*H_kv, D, N_cmp). cuBLAS GEMM.
- fused_score_and_select_blocks(): Unified entry for selected branch.
- fused_score_and_select_all(): Computes GEMM once, derives indices for both selected and compressed branches (avoids duplicate GEMM).

Chunked processing (64 Q-tiles per chunk) bounds peak memory. All scoring in fp32 for numerical stability with bf16/fp16 inputs.

NSA becomes faster than dense FA4 at ~24K tokens with this optimization; reaches 12.5x speedup at 512K (was barely faster before due to scoring bottleneck). Performance chart will be updated after Diff 4.

Differential Revision: D99181843

Summary: Establish the NSA (Native Sparse Attention) module with reference implementations, compact block-sparse metadata format, and the FA4-based forward pass orchestrator.

Three attention branches combined via learned gating:
1. Compressed: FA4 on mean-pooled KV (short sequence)
2. Selected: FA4 with block sparsity (top-k important blocks per Q-tile)
3. Sliding window: FA4 with window_size_left

Key components:
- compress.py: compress_kv(), mean-pool + optional learned projection
- select.py: score_and_select_blocks(), tiled scoring with O(N) peak memory
- gating.py: compute_gates() + gate_and_combine(), sigmoid gating, chunked
- sparsity_masks.py: build_fa4_block_sparse_tensors(), compact index format (last dim = k selected blocks, not n_blocks_k total). Handles both expansion (compress_block_size >= n_block_size) and contraction (with sort + dedup).
- nsa_forward.py: nsa_forward() orchestrator + _fa4_fwd() wrapper
- reference.py: Pure PyTorch differentiable reference for correctness validation

FA4 dependency: imports from mslk.attention.flash_attn.interface shim (tries internal fork, falls back to upstream flash_attn). Uses compress_factor for compressed causal masking (not mask_mod).

All non-FA4 accumulation paths use fp32 for numerical stability with bf16/fp16.

No performance impact: this is the foundation diff (reference implementations only, no CuteDSL fused kernels yet). Performance chart N/A for this diff.

Differential Revision: D99181841
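
The expansion and contraction cases named for sparsity_masks.py can be illustrated for a single row of selected block indices. This is a sketch under assumptions, not the code in build_fa4_block_sparse_tensors(): the helper name to_fa4_blocks is hypothetical, the two block sizes are assumed to divide one another evenly, and the result is returned as a variable-length array (how the real format pads deduplicated rows is not stated in the description).

```python
import numpy as np

def to_fa4_blocks(cmp_idx, compress_block_size, n_block_size):
    """Map selected compressed-block indices to FA4 key-block indices.

    Hypothetical helper for illustration; assumes one block size is an
    integer multiple of the other.
    """
    cmp_idx = np.asarray(cmp_idx, dtype=np.int64)
    if compress_block_size >= n_block_size:
        # Expansion: each compressed block covers `ratio` consecutive FA4 blocks.
        ratio = compress_block_size // n_block_size
        expanded = cmp_idx[:, None] * ratio + np.arange(ratio)
        return np.sort(expanded.reshape(-1))
    # Contraction: several compressed blocks fall inside one FA4 block,
    # so indices can collide and must be sorted and deduplicated.
    ratio = n_block_size // compress_block_size
    return np.unique(cmp_idx // ratio)  # np.unique sorts and deduplicates

expanded = to_fa4_blocks([3, 0], compress_block_size=128, n_block_size=64)
contracted = to_fa4_blocks([5, 1, 4, 0], compress_block_size=32, n_block_size=64)
```

In the expansion call, compressed blocks 0 and 3 become FA4 blocks 0, 1, 6, 7. In the contraction call, compressed blocks 0 and 1 collapse into FA4 block 0 and blocks 4 and 5 into FA4 block 2, which is why deduplication is needed.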

Summary: Replace the multi-kernel PyTorch gating (compute_gates + gate_and_combine) with a single fused CuteDSL kernel (fused_gate_and_combine): 4-7x faster on B200.

Key design:
- One warp (32 threads) per (b,n,h) row: each warp handles one output position
- Warp-shuffle butterfly reduction for 3 gate dot-products (no shared memory)
- elems_per_thread = D // 32, staying in registers (4 for D=128)
- Sigmoid via log2-exp2 trick: uses fast hardware exp2
- All accumulation in Float32 for numerical stability with bf16/fp16 inputs
- In-memory compile cache keyed by (dtype, D, has_gate_weight)

When gate_proj_weight is None, skips the CuteDSL kernel entirely and returns a simple (O_cmp + O_slc + O_sld) / 3 average, which avoids kernel launch overhead for the ungated case.

Returns (output, gates) tuple so gates are available for the backward pass.

PyTorch reference implementations (compute_gates, gate_and_combine) retained for testing and fallback.

No performance chart yet: gating alone is not the bottleneck. Chart will be updated after fused scoring (Diff 3) and fused compression (Diff 4).

Differential Revision: D99181847
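
The per-row work of the fused gating kernel can be emulated on the CPU to show the two tricks it names: the XOR-butterfly reduction across 32 lanes and the sigmoid computed through exp2. This is a NumPy emulation under assumptions, not CuteDSL code. The gate input x, the (3, D) weight layout, the lane-to-element mapping, and the function names are all hypothetical; returning None for the gates in the ungated case is also an assumption.

```python
import numpy as np

LOG2_E = 1.4426950408889634  # log2(e)
WARP = 32

def sigmoid_exp2(x):
    # sigmoid(x) = 1 / (1 + e^-x) = 1 / (1 + 2^(-x * log2(e)))
    return 1.0 / (1.0 + np.exp2(-x * LOG2_E))

def butterfly_sum(lane_vals):
    """Emulate an XOR-shuffle all-reduce: every lane ends with the full sum."""
    vals = np.asarray(lane_vals, dtype=np.float32).copy()
    lanes = np.arange(WARP)
    offset = WARP // 2
    while offset >= 1:
        vals = vals + vals[lanes ^ offset]  # each lane adds its partner's value
        offset //= 2
    return vals

def gate_and_combine_row(x, w, o_cmp, o_slc, o_sld):
    """x: (D,) gate input (assumed); w: (3, D) gate weights or None."""
    branches = [o.astype(np.float32) for o in (o_cmp, o_slc, o_sld)]
    if w is None:
        # Ungated case from the description: plain average of the branches.
        return sum(branches) / 3.0, None
    per_lane = x.shape[0] // WARP  # elems_per_thread, e.g. 4 for D=128
    x32 = x.astype(np.float32).reshape(WARP, per_lane)
    w32 = w.astype(np.float32).reshape(3, WARP, per_lane)
    gates = np.empty(3, dtype=np.float32)
    for g in range(3):
        partial = (x32 * w32[g]).sum(axis=1)  # per-lane partial dot product
        gates[g] = sigmoid_exp2(butterfly_sum(partial)[0])
    out = sum(g * o for g, o in zip(gates, branches))  # fp32 accumulation
    return out, gates
```

After the five butterfly steps (offsets 16, 8, 4, 2, 1) every lane holds the full dot product, so no shared memory or broadcast step is needed before the sigmoid.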

meta-cla Bot added the cla signed label Apr 2, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 2, 2026

jduprat added 2 commits April 2, 2026 07:06

meta-codesync Bot changed the title ~~Fused CuteDSL kernel for block selection scoring~~ Fused CuteDSL kernel for block selection scoring (#301) Apr 3, 2026

jduprat force-pushed the export-D99181843 branch from 106b4b7 to 23a4079 Compare April 3, 2026 00:48

jduprat force-pushed the export-D99181843 branch from 23a4079 to 795e72f Compare April 3, 2026 00:51

jduprat wants to merge 3 commits into meta-pytorch:main from jduprat:export-D99181843