
docs: ForgeAttention — fused 3-bit KV dequant inside Metal attention kernels#76

Open
user-23xyz wants to merge 2 commits into TheTom:main from user-23xyz:feat/forgeattention-fused-metal-kernels

Conversation

@user-23xyz

Summary

ForgeAttention eliminates the FP16 materialization bottleneck on Apple Silicon by fusing PlanarQuant 3-bit dequantization directly inside the Metal attention kernel. Decompressed FP16 values never touch device memory.

  • 82% per-layer KV cache memory reduction
  • 0.99x baseline decode speed (near-zero overhead)
  • 300K context on 16GB M4 Mini — flat memory pressure, NIAH verified
  • Per-head adaptive sparse attention with tile-level early exit
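As a rough illustration of the fused-dequant idea, here is a minimal NumPy sketch. All names are hypothetical and the layout is simplified: the actual kernels operate on packed buffers inside Metal threadgroups, not host-side arrays.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit codes (values 0..7) into a contiguous byte stream."""
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, -3:]
    return np.packbits(bits.reshape(-1))

def unpack_3bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover n 3-bit codes from the packed stream."""
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    return bits @ np.array([4, 2, 1], dtype=np.uint8)

def fused_qk_score(q: np.ndarray, packed_k: np.ndarray,
                   scale: float, zero: float) -> float:
    """Q.K dot product that dequantizes K inline: the affine transform
    happens inside the reduction, so no FP16 K tensor is written out."""
    k = unpack_3bit(packed_k, q.size).astype(np.float32) * scale + zero
    return float(q @ k)
```

The point of the fusion is the last function: the only FP16 values that exist are transient registers inside the reduction, never a materialized tensor.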

Interaction with TriAttention V3

Orthogonal to V3: V3 evicts tokens (fewer tokens), ForgeAttention compresses storage (fewer bits). They multiply.
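A back-of-envelope sketch of why the savings multiply. The 25% token-keep ratio for V3 is a hypothetical number chosen for illustration; the 82% reduction is the figure reported above.

```python
def kv_bytes(n_tokens: int, n_heads: int, head_dim: int,
             bytes_per_elem: float = 2.0) -> float:
    """Per-layer KV cache size: K and V, FP16 baseline."""
    return 2 * n_tokens * n_heads * head_dim * bytes_per_elem

baseline = kv_bytes(300_000, 8, 128)   # FP16, all tokens
v3_only  = kv_bytes(75_000, 8, 128)    # eviction alone: 25% of tokens kept
fa_only  = baseline * (1 - 0.82)       # compression alone: fewer bits
combined = v3_only * (1 - 0.82)        # both applied together
# baseline / combined: 4x from eviction times ~5.6x from compression, ~22x total
```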

We independently confirm the V3 hybrid failure mode on Qwen3.5 (Section 5). PR #75's budget scaling formula is directly applicable.

What's in this PR

A paper-style document in docs/papers/ covering methodology, 6 Metal kernels, NIAH results (following your Section 3.3 protocol), bugs found, and interaction analysis.

Code (MIT): github.com/user-23xyz/forgeattention
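The two-phase sparse attention with tile-level early exit mentioned above can be sketched as follows. This is a toy NumPy model under assumed semantics (score tiles cheaply, then attend only within the top-scoring tiles), not the Metal implementation; all names are hypothetical.

```python
import numpy as np

def sparse_attend(q, k_tiles, v_tiles, keep_tiles=2):
    """Two-phase sketch: Phase 1 scores each tile coarsely; Phase 2 runs
    full softmax attention only over the kept tiles (the rest early-exit)."""
    # Phase 1: one cheap score per tile (max |q.k| over the tile's keys)
    tile_scores = [np.abs(kt @ q).max() for kt in k_tiles]
    keep = sorted(np.argsort(tile_scores)[-keep_tiles:])
    # Phase 2: softmax attention restricted to the surviving tiles
    ks = np.concatenate([k_tiles[i] for i in keep])
    vs = np.concatenate([v_tiles[i] for i in keep])
    s = ks @ q / np.sqrt(q.size)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ vs
```

With `keep_tiles` equal to the total tile count this reduces to dense attention, which is a convenient correctness check for the sparse path.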

user-23xyz and others added 2 commits April 10, 2026 18:53
…kernels

First implementation of fused quantized KV attention on Apple Silicon Metal.
Reads packed 3-bit K/V directly inside the attention dot product — the
decompressed FP16 tensors never touch device memory.

Results: 82% per-layer memory reduction, 0.99x baseline speed, 300K NIAH
on 16GB M4 Mini. Per-head adaptive sparse attention with tile-level early
exit. Interacts with TriAttention V3 (stacks: eviction × compression).

Includes: methodology, 6 Metal kernels, NIAH results, bug reports
(MLX grid semantics), and hybrid budget scaling interaction with PR TheTom#75.

Code: github.com/user-23xyz/forgeattention (MIT)

Adds fused 3-bit KV dequantization inside attention dot product for MLX/Metal.
Reads packed K/V directly in the kernel — no FP16 intermediate in device memory.

New files:
- turboquant/mlx_fused_attention.py: 6 Metal kernels (QK, tiled SV, flash decode,
  sparse SV, Phase 1 score, Phase 2 sparse attend with tile early exit)
- turboquant/mlx_calibration.py: per-head budget calibration + redundancy selection
- tests/test_fused_attention.py: 5 tests (rotation roundtrip, compress/decompress,
  fused QK correctness, multi-head, all bit widths)
- docs/papers/forgeattention-fused-metal-kernels.md: methodology + results

Results on M4 Mini 16GB:
- 82% per-layer memory reduction
- 0.99x baseline decode speed
- NIAH to 300K (E2B) / 100K (E4B)
- Per-head sparse attention with tile early exit (98% tiles skip at top-1024)
