
docs: ForgeAttention — fused 3-bit KV dequant inside Metal attention kernels#76

Open
user-23xyz wants to merge 2 commits into TheTom:main from user-23xyz:feat/forgeattention-fused-metal-kernels

Conversation

@user-23xyz

Summary

ForgeAttention eliminates the FP16 materialization bottleneck on Apple Silicon by fusing PlanarQuant 3-bit dequantization directly inside the Metal attention kernel. Decompressed FP16 values never touch device memory.

  • 82% per-layer KV cache memory reduction
  • 0.99x baseline decode speed (near-zero overhead)
  • 300K context on 16GB M4 Mini — flat memory pressure, NIAH verified
  • Per-head adaptive sparse attention with tile-level early exit
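As a rough illustration of the fused-dequant idea, here is a minimal NumPy sketch. All names are hypothetical and the layout is simplified: the actual kernels operate on packed buffers inside Metal threadgroups, not host-side arrays.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit codes (values 0..7) into a contiguous byte stream."""
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, -3:]
    return np.packbits(bits.reshape(-1))

def unpack_3bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover n 3-bit codes from the packed stream."""
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    return bits @ np.array([4, 2, 1], dtype=np.uint8)

def fused_qk_score(q: np.ndarray, packed_k: np.ndarray,
                   scale: float, zero: float) -> float:
    """Q.K dot product that dequantizes K inline: the affine transform
    happens inside the reduction, so no FP16 K tensor is written out."""
    k = unpack_3bit(packed_k, q.size).astype(np.float32) * scale + zero
    return float(q @ k)
```

The point of the fusion is the last function: the only FP16 values that exist are transient registers inside the reduction, never a materialized tensor.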

Interaction with TriAttention V3

Orthogonal to V3: V3 evicts tokens (fewer tokens), ForgeAttention compresses storage (fewer bits). They multiply.
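A back-of-envelope sketch of why the savings multiply. The 25% token-keep ratio for V3 is a hypothetical number chosen for illustration; the 82% reduction is the figure reported above.

```python
def kv_bytes(n_tokens: int, n_heads: int, head_dim: int,
             bytes_per_elem: float = 2.0) -> float:
    """Per-layer KV cache size: K and V, FP16 baseline."""
    return 2 * n_tokens * n_heads * head_dim * bytes_per_elem

baseline = kv_bytes(300_000, 8, 128)   # FP16, all tokens
v3_only  = kv_bytes(75_000, 8, 128)    # eviction alone: 25% of tokens kept
fa_only  = baseline * (1 - 0.82)       # compression alone: fewer bits
combined = v3_only * (1 - 0.82)        # both applied together
# baseline / combined: 4x from eviction times ~5.6x from compression, ~22x total
```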

We independently confirm the V3 hybrid failure mode on Qwen3.5 (Section 5). PR #75's budget scaling formula is directly applicable.

What's in this PR

A paper-style document in docs/papers/ covering methodology, 6 Metal kernels, NIAH results (following your Section 3.3 protocol), bugs found, and interaction analysis.

Code (MIT): github.com/user-23xyz/forgeattention
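The two-phase sparse attention with tile-level early exit mentioned above can be sketched as follows. This is a toy NumPy model under assumed semantics (score tiles cheaply, then attend only within the top-scoring tiles), not the Metal implementation; all names are hypothetical.

```python
import numpy as np

def sparse_attend(q, k_tiles, v_tiles, keep_tiles=2):
    """Two-phase sketch: Phase 1 scores each tile coarsely; Phase 2 runs
    full softmax attention only over the kept tiles (the rest early-exit)."""
    # Phase 1: one cheap score per tile (max |q.k| over the tile's keys)
    tile_scores = [np.abs(kt @ q).max() for kt in k_tiles]
    keep = sorted(np.argsort(tile_scores)[-keep_tiles:])
    # Phase 2: softmax attention restricted to the surviving tiles
    ks = np.concatenate([k_tiles[i] for i in keep])
    vs = np.concatenate([v_tiles[i] for i in keep])
    s = ks @ q / np.sqrt(q.size)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ vs
```

With `keep_tiles` equal to the total tile count this reduces to dense attention, which is a convenient correctness check for the sparse path.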

user-23xyz and others added 2 commits April 10, 2026 18:53
…kernels

First implementation of fused quantized KV attention on Apple Silicon Metal.
Reads packed 3-bit K/V directly inside the attention dot product — the
decompressed FP16 tensors never touch device memory.

Results: 82% per-layer memory reduction, 0.99x baseline speed, 300K NIAH
on 16GB M4 Mini. Per-head adaptive sparse attention with tile-level early
exit. Interacts with TriAttention V3 (stacks: eviction × compression).

Includes: methodology, 6 Metal kernels, NIAH results, bug reports
(MLX grid semantics), and hybrid budget scaling interaction with PR TheTom#75.

Code: github.com/user-23xyz/forgeattention (MIT)

Adds fused 3-bit KV dequantization inside attention dot product for MLX/Metal.
Reads packed K/V directly in the kernel — no FP16 intermediate in device memory.

New files:
- turboquant/mlx_fused_attention.py: 6 Metal kernels (QK, tiled SV, flash decode,
  sparse SV, Phase 1 score, Phase 2 sparse attend with tile early exit)
- turboquant/mlx_calibration.py: per-head budget calibration + redundancy selection
- tests/test_fused_attention.py: 5 tests (rotation roundtrip, compress/decompress,
  fused QK correctness, multi-head, all bit widths)
- docs/papers/forgeattention-fused-metal-kernels.md: methodology + results

Results on M4 Mini 16GB:
- 82% per-layer memory reduction
- 0.99x baseline decode speed
- NIAH to 300K (E2B) / 100K (E4B)
- Per-head sparse attention with tile early exit (98% tiles skip at top-1024)
