docs: ForgeAttention — fused 3-bit KV dequant inside Metal attention kernels #76
Open
user-23xyz wants to merge 2 commits into TheTom:main from
Conversation
…kernels

First implementation of fused quantized KV attention on Apple Silicon Metal. Reads packed 3-bit K/V directly inside the attention dot product — the decompressed FP16 tensors never touch device memory.

Results: 82% per-layer memory reduction, 0.99x baseline speed, 300K NIAH on 16GB M4 Mini. Per-head adaptive sparse attention with tile-level early exit. Interacts with TriAttention V3 (stacks: eviction × compression).

Includes: methodology, 6 Metal kernels, NIAH results, bug reports (MLX grid semantics), and hybrid budget scaling interaction with PR TheTom#75.

Code: github.com/user-23xyz/forgeattention (MIT)
Adds fused 3-bit KV dequantization inside the attention dot product for MLX/Metal. Reads packed K/V directly in the kernel — no FP16 intermediate in device memory.

New files:
- `turboquant/mlx_fused_attention.py`: 6 Metal kernels (QK, tiled SV, flash decode, sparse SV, Phase 1 score, Phase 2 sparse attend with tile early exit)
- `turboquant/mlx_calibration.py`: per-head budget calibration + redundancy selection
- `tests/test_fused_attention.py`: 5 tests (rotation roundtrip, compress/decompress, fused QK correctness, multi-head, all bit widths)
- `docs/papers/forgeattention-fused-metal-kernels.md`: methodology + results

Results on M4 Mini 16GB:
- 82% per-layer memory reduction
- 0.99x baseline decode speed
- NIAH to 300K (E2B) / 100K (E4B)
- Per-head sparse attention with tile early exit (98% of tiles skipped at top-1024)
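For intuition on the memory arithmetic, here is a minimal NumPy sketch of per-group 3-bit affine quantization. The `quantize_3bit` helper, group size, and metadata layout are hypothetical illustrations, not the PR's PlanarQuant code; with an fp16 scale and offset per 32-value group this sketch lands at 75% reduction, so the reported 82% per-layer figure presumably reflects a different layout or group size.

```python
import numpy as np

def quantize_3bit(x, group_size=32):
    # Per-group affine quantization to 3-bit codes 0..7.
    # Group size and scheme are illustrative assumptions.
    g = x.astype(np.float32).reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 7.0                        # 3 bits -> 8 levels
    safe = np.where(scale == 0, 1.0, scale)        # avoid divide-by-zero
    codes = np.clip(np.round((g - lo) / safe), 0, 7).astype(np.uint8)
    return codes, scale, lo

def dequantize_3bit(codes, scale, lo, shape):
    return (codes.astype(np.float32) * scale + lo).reshape(shape)

tokens, head_dim = 4096, 128
kv = np.random.default_rng(0).standard_normal((tokens, head_dim)).astype(np.float16)
codes, scale, lo = quantize_3bit(kv)

fp16_bytes = kv.size * 2                 # baseline FP16 KV cache
packed_bytes = kv.size * 3 // 8          # 3 bits/value once bit-packed
meta_bytes = (scale.size + lo.size) * 2  # fp16 scale + offset per group
reduction = 1 - (packed_bytes + meta_bytes) / fp16_bytes   # 0.75 here
```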
Summary
ForgeAttention eliminates the FP16 materialization bottleneck on Apple Silicon by fusing PlanarQuant 3-bit dequantization directly inside the Metal attention kernel. Decompressed FP16 values never touch device memory.
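The fused read pattern can be sketched in NumPy: dequantize each quantized K row inside the score loop (where the Metal kernel would use registers or threadgroup memory), so a full FP16 K matrix is never materialized. The layout, per-token scales, and function names below are illustrative assumptions, not the PR's actual kernel.

```python
import numpy as np

def fused_qk_scores(query, k_codes, k_scale, k_lo):
    # Dequantize one K row at a time inside the dot-product loop;
    # only a head_dim-sized temporary exists, never the full FP16 K.
    n_tokens = k_codes.shape[0]
    scores = np.empty(n_tokens, dtype=np.float32)
    for t in range(n_tokens):
        k_row = k_codes[t].astype(np.float32) * k_scale[t] + k_lo[t]
        scores[t] = query @ k_row
    return scores

rng = np.random.default_rng(0)
tokens, head_dim = 256, 64
k_codes = rng.integers(0, 8, size=(tokens, head_dim)).astype(np.uint8)  # 3-bit codes
k_scale = rng.uniform(0.1, 0.5, size=tokens).astype(np.float32)
k_lo = rng.uniform(-1.0, 0.0, size=tokens).astype(np.float32)
query = rng.standard_normal(head_dim).astype(np.float32)

fused = fused_qk_scores(query, k_codes, k_scale, k_lo)
# Reference path: materialize the full dequantized K, then matmul.
k_full = k_codes.astype(np.float32) * k_scale[:, None] + k_lo[:, None]
ref = k_full @ query
```

The two paths produce the same scores; the fused one just trades a big intermediate buffer for per-row dequant work, which is the bandwidth-vs-compute bet the kernel makes.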
Interaction with TriAttention V3
Orthogonal to V3: V3 evicts tokens (fewer tokens), while ForgeAttention compresses storage (fewer bits per token), so the memory savings multiply.
We independently confirm the V3 hybrid failure mode on Qwen3.5 (Section 5). PR #75's budget scaling formula is directly applicable.
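The multiplicative stacking is simple arithmetic; a quick sketch, where the 25% eviction budget is a hypothetical number and only the 0.18 compression ratio follows from this PR's 82% figure:

```python
# Why eviction and compression multiply: they act on independent axes
# (token count vs bits per token) of the KV-cache footprint.
eviction_keep = 0.25       # hypothetical V3-style budget: keep 25% of tokens
compression_ratio = 0.18   # this PR: 82% per-layer reduction -> 18% of bits

combined = eviction_keep * compression_ratio  # fraction of FP16 baseline memory
# combined = 0.045, i.e. ~4.5% of the full FP16 cache footprint
```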
What's in this PR
A paper-style document in `docs/papers/` covering methodology, 6 Metal kernels, NIAH results (following your Section 3.3 protocol), bugs found, and interaction analysis.

Code (MIT): github.com/user-23xyz/forgeattention