
UPSTREAM PR #21452: metal : add GATED_LINEAR_ATTN op #1334

Open
loci-dev wants to merge 1 commit into main from loci/pr-21452-metal-gla-op

Conversation


@loci-dev loci-dev commented Apr 5, 2026

Note

Source pull request: ggml-org/llama.cpp#21452

Add Metal backend support for GGML_OP_GATED_LINEAR_ATTN (GLA). Supports head_size 64 and 128, f32 only.

Tested with test-backend-ops (7/7 passed):

  • M5 Max 128GB (Apple10)
  • M2 Mac Mini (Apple8)

Overview

Adds Metal backend support for GGML_OP_GATED_LINEAR_ATTN, which previously fell back to the CPU on Apple devices.

The kernel follows the RWKV WKV6 Metal structure: one thread per head element, threadgroup shared memory for k/q/gate, and a float4-vectorized inner loop.
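For context, the recurrence such a kernel evaluates can be written as a scalar CPU reference. The sketch below is illustrative only: the function name, buffer layout, and the placement of the scale factor are assumptions, not the actual ggml implementation, which fuses the readout into the state-update loop and vectorizes it with float4.

```c
#include <stddef.h>

// Scalar sketch of gated linear attention for one head (hypothetical
// reference, not the ggml kernel). Per token t:
//   S[i][j] <- g_t[i] * S[i][j] + k_t[i] * v_t[j]   (decayed outer-product state)
//   o_t[j]  <- scale * sum_i q_t[i] * S[i][j]       (query readout)
// T = sequence length, D = head size; q/k/v/g are row-major [T][D],
// state is [D][D] and carried across calls, out is [T][D].
static void gla_ref(int T, int D,
                    const float *q, const float *k, const float *v,
                    const float *g, float scale,
                    float *state, float *out) {
    for (int t = 0; t < T; t++) {
        // Decay the state along the key dimension, then add the k v^T update.
        for (int i = 0; i < D; i++) {
            for (int j = 0; j < D; j++) {
                state[i*D + j] = g[t*D + i] * state[i*D + j]
                               + k[t*D + i] * v[t*D + j];
            }
        }
        // Read out the state with the query.
        for (int j = 0; j < D; j++) {
            float acc = 0.0f;
            for (int i = 0; i < D; i++) {
                acc += q[t*D + i] * state[i*D + j];
            }
            out[t*D + j] = scale * acc;
        }
    }
}
```

With D = 1, T = 2, q = k = 1, v = (2, 3), a constant decay g = 0.5, scale = 1, and a zero initial state, the outputs are 2 and 4. The Metal kernel parallelizes the j loop (one thread per head element) and stages k, q, and the gate in threadgroup memory, but the arithmetic is the same.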

supports_op restricts execution to F32 and head sizes 64 or 128.
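That guard amounts to a simple predicate. The function below is an illustrative sketch of the condition described, not the actual ggml-metal supports_op code, which is structured as a switch over op types.

```c
#include <stdbool.h>

// Hypothetical sketch of the support check described above: the Metal
// GLA path is taken only for f32 tensors with head size 64 or 128;
// anything else falls back to the CPU backend.
static bool gla_metal_supported(bool is_f32, int head_size) {
    return is_f32 && (head_size == 64 || head_size == 128);
}
```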

No dedicated performance benchmarks are currently configured for this op.

Additional information

docs/ops.md and docs/ops/Metal.csv were regenerated using the required workflow (test-backend-ops support --output csv and ./scripts/create_ops_docs.py). The resulting diff includes updates to multiple ops due to recent changes in main; the only functional change in this PR is support for GATED_LINEAR_ATTN.

Mentions #14909

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI used for research and guidance, all code manually reviewed and understood.


loci-review Bot commented Apr 5, 2026

No meaningful performance changes were detected across 125473 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libggml-cpu.so, build.bin.libggml-base.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 8 times, most recently from 55afbee to ef0eff4 on April 12, 2026 at 02:18
loci-dev force-pushed the main branch 9 times, most recently from 63ab8d1 to 7638ab4 on April 19, 2026 at 02:19
Add Metal backend support for GGML_OP_GATED_LINEAR_ATTN (GLA).
Supports head_size 64 and 128, f32 only.

Tested with test-backend-ops (7/7 passed):
- M5 Max 128GB (Apple10)
- M2 Mac Mini (Apple8)

loci-review Bot commented Apr 20, 2026

No meaningful performance changes were detected across 46867 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize.

💬 Questions? Tag @loci-dev
