ggml-metal: add Metal kernel for ggml_roll by stephencox-ict · Pull Request #21782 · ggml-org/llama.cpp

stephencox-ict · 2026-04-11T23:49:11Z

Overview

Add a native Metal kernel for GGML_OP_ROLL (circular shift). This op currently has no Metal implementation, so it falls back to CPU on Apple Silicon, creating graph splits at every call site.

The Gemma 4 audio conformer uses two ggml_roll calls per layer across 12 layers, resulting in 73 graph splits on Metal. With this kernel, all conformer ops stay on the GPU.

The shader follows the same wrapping logic as the CPU implementation: for each element, compute the source index as (dst_idx - shift) mod dim_size per dimension.
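As an illustration of that index rule, here is a minimal CPU-side sketch in C. It is not the code from this PR: `wrap_index`, `offset4`, and `roll_ref` are hypothetical names, the tensor is assumed to be contiguous `float` data with `ne[0]` as the innermost dimension, and a full modulo is used so that any shift value wraps correctly.

```c
#include <stdint.h>

/* Wrap i into [0, n). C's % can return a negative remainder, so fix it up. */
static int64_t wrap_index(int64_t i, int64_t n) {
    const int64_t r = i % n;
    return r < 0 ? r + n : r;
}

/* Flat offset of element (i0, i1, i2, i3) in a contiguous tensor, ne[0] innermost. */
static int64_t offset4(const int64_t ne[4], int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return ((i3 * ne[2] + i2) * ne[1] + i1) * ne[0] + i0;
}

/* Hypothetical reference: dst[i] = src[(i - shift) mod ne], applied per dimension. */
static void roll_ref(const float *src, float *dst,
                     const int64_t ne[4], const int64_t shift[4]) {
    for (int64_t i3 = 0; i3 < ne[3]; i3++) {
        const int64_t s3 = wrap_index(i3 - shift[3], ne[3]);
        for (int64_t i2 = 0; i2 < ne[2]; i2++) {
            const int64_t s2 = wrap_index(i2 - shift[2], ne[2]);
            for (int64_t i1 = 0; i1 < ne[1]; i1++) {
                const int64_t s1 = wrap_index(i1 - shift[1], ne[1]);
                for (int64_t i0 = 0; i0 < ne[0]; i0++) {
                    const int64_t s0 = wrap_index(i0 - shift[0], ne[0]);
                    dst[offset4(ne, i0, i1, i2, i3)] = src[offset4(ne, s0, s1, s2, s3)];
                }
            }
        }
    }
}
```

With `ne = {4, 1, 1, 1}` and a shift of 1 along the first dimension, the input `{0, 1, 2, 3}` becomes `{3, 0, 1, 2}`; a shift of -1 gives `{1, 2, 3, 0}`.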

Files changed:

- ggml-metal-impl.h: kargs struct
- ggml-metal-device.m: register GGML_OP_ROLL as supported
- ggml-metal-device.cpp: pipeline name mapping
- ggml-metal-ops.h: declare dispatch function
- ggml-metal-ops.cpp: dispatch function (sets args, binds buffers, dispatches threadgroups)
- ggml-metal.metal: kernel_roll shader

Ref: #21421 (Gemma 4 audio conformer PR where this was requested)

Additional information

Tested locally on CPU only (the index logic matches the CPU implementation). The kernel has not been run on Metal yet and needs validation on macOS via CI or manual testing.

The dispatch uses the same pattern as kernel_concat: threadgroups over (ne01, ne02, ne03) with threads iterating over ne00.
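To make the work split concrete, the following C sketch simulates that dispatch pattern on the CPU: one threadgroup per row `(i1, i2, i3)`, with each thread striding over `ne00`. It is an illustration under assumptions, not the kernel from this PR. `wrap_idx`, `roll_thread`, `roll_grid_sim`, `tpitg` (thread position in threadgroup), and `ntg` (threads per threadgroup) are hypothetical names, and the data is assumed to be contiguous `float`.

```c
#include <stdint.h>

/* Wrap i into [0, n), also for negative i. */
static int64_t wrap_idx(int64_t i, int64_t n) {
    const int64_t r = i % n;
    return r < 0 ? r + n : r;
}

/* One simulated thread: handles elements tpitg, tpitg + ntg, ... of row (i1, i2, i3). */
static void roll_thread(const float *src, float *dst,
                        const int64_t ne[4], const int64_t shift[4],
                        int64_t i1, int64_t i2, int64_t i3,
                        int64_t tpitg, int64_t ntg) {
    /* The source row is the same for every element this threadgroup writes. */
    const int64_t s1 = wrap_idx(i1 - shift[1], ne[1]);
    const int64_t s2 = wrap_idx(i2 - shift[2], ne[2]);
    const int64_t s3 = wrap_idx(i3 - shift[3], ne[3]);
    const int64_t dst_row = ((i3 * ne[2] + i2) * ne[1] + i1) * ne[0];
    const int64_t src_row = ((s3 * ne[2] + s2) * ne[1] + s1) * ne[0];
    for (int64_t i0 = tpitg; i0 < ne[0]; i0 += ntg) {
        dst[dst_row + i0] = src[src_row + wrap_idx(i0 - shift[0], ne[0])];
    }
}

/* Simulated grid: (ne[1], ne[2], ne[3]) threadgroups of ntg threads each. */
static void roll_grid_sim(const float *src, float *dst,
                          const int64_t ne[4], const int64_t shift[4], int64_t ntg) {
    for (int64_t i3 = 0; i3 < ne[3]; i3++)
        for (int64_t i2 = 0; i2 < ne[2]; i2++)
            for (int64_t i1 = 0; i1 < ne[1]; i1++)
                for (int64_t tpitg = 0; tpitg < ntg; tpitg++)
                    roll_thread(src, dst, ne, shift, i1, i2, i3, tpitg, ntg);
}
```

In this layout every destination element is written by exactly one thread, so the result does not depend on the threadgroup size and no synchronization between threads is needed.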

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - Claude Code was used to help research the Metal kernel patterns in the codebase and draft the implementation. All code was reviewed and verified against the CPU reference implementation.

Add native Metal GPU support for the ROLL operation, which performs circular shifts along tensor dimensions. Previously this op had no Metal kernel, causing CPU fallbacks and graph splits on Apple Silicon.

The kernel uses the same wrap-around index logic as the CPU implementation: for each element, compute the source index as (dst_idx - shift) mod dim_size for each dimension.

Files changed:

- ggml-metal-impl.h: add ggml_metal_kargs_roll struct
- ggml-metal-device.m: register GGML_OP_ROLL as supported
- ggml-metal-device.cpp: add pipeline name mapping
- ggml-metal-ops.h: declare ggml_metal_op_roll
- ggml-metal-ops.cpp: dispatch function
- ggml-metal.metal: kernel_roll shader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

stephencox-ict · 2026-04-11T23:51:33Z

@ngxson can you give this a test? I do not have the hardware

ggml-gh-bot · 2026-04-11T23:53:11Z

Hi @stephencox-ict, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

stephencox-ict requested a review from a team as a code owner April 11, 2026 23:49

github-actions bot added the ggml and Apple Metal labels Apr 11, 2026

stephencox-ict wants to merge 1 commit into ggml-org:master from stephencox-ict:ggml-roll-metal