
ggml-metal: add Metal kernel for ggml_roll #21782

Open
stephencox-ict wants to merge 1 commit into ggml-org:master from stephencox-ict:ggml-roll-metal


@stephencox-ict

Overview

Add a native Metal kernel for GGML_OP_ROLL (circular shift). This op currently has no Metal implementation, so it falls back to CPU on Apple Silicon, creating graph splits at every call site.

The Gemma 4 audio conformer uses two ggml_roll calls per layer across 12 layers, resulting in 73 graph splits on Metal. With this kernel, all conformer ops stay on the GPU.

The shader follows the same wrapping logic as the CPU implementation: for each element, compute the source index as (dst_idx - shift) mod dim_size per dimension.
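As a minimal sketch of that wrapping rule for a single dimension, the host-side C++ below computes `dst[i] = src[(i - shift) mod n]`. The names `wrap_index` and `roll_1d` are hypothetical and are not taken from the PR or from ggml. The helper exists because the C++ `%` operator can return a negative remainder when `i - shift` is negative, so the result is normalized into `[0, n)`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): modulo that stays in [0, n)
// even when `i` is negative, unlike the built-in `%` operator.
static int64_t wrap_index(int64_t i, int64_t n) {
    assert(n > 0);
    return ((i % n) + n) % n;
}

// Circular shift of a 1-D array: dst[i] = src[(i - shift) mod n].
// A positive shift moves elements toward higher indices.
static std::vector<float> roll_1d(const std::vector<float> & src, int64_t shift) {
    const int64_t n = (int64_t) src.size();
    std::vector<float> dst(src.size());
    for (int64_t i = 0; i < n; ++i) {
        dst[i] = src[wrap_index(i - shift, n)];
    }
    return dst;
}
```

With this sketch, rolling `{0, 1, 2, 3}` by `1` yields `{3, 0, 1, 2}`, and rolling by `-1` yields `{1, 2, 3, 0}`.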

Files changed:

  • ggml-metal-impl.h - kargs struct
  • ggml-metal-device.m - register GGML_OP_ROLL as supported
  • ggml-metal-device.cpp - pipeline name mapping
  • ggml-metal-ops.h - declare dispatch function
  • ggml-metal-ops.cpp - dispatch function (sets args, binds buffers, dispatches threadgroups)
  • ggml-metal.metal - kernel_roll shader

Ref: #21421 (Gemma 4 audio conformer PR where this was requested)

Additional information

Tested locally on CPU only (the index logic matches the CPU reference implementation). Still needs validation on macOS Metal via CI or manual testing.

The dispatch uses the same pattern as kernel_concat: threadgroups over (ne01, ne02, ne03) with threads iterating over ne00.
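The C++ sketch below emulates that dispatch shape on the CPU so the index mapping can be checked without Metal hardware. It is not the PR's shader: `roll_args`, `wrap`, and `roll_emulated` are hypothetical names, and the sketch assumes a contiguous 4-D float tensor, whereas a real kernel would typically address elements through byte strides. Each `(i1, i2, i3)` triple stands in for one threadgroup, and each of the `ntg` emulated threads strides over the innermost dimension.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical argument holder for this sketch (not the PR's kargs struct).
struct roll_args {
    int64_t ne[4]; // elements per dimension, ne[0] is innermost
    int64_t s[4];  // circular shift per dimension
};

// Modulo that stays in [0, n) for negative inputs.
static int64_t wrap(int64_t i, int64_t n) {
    return ((i % n) + n) % n;
}

// Emulates a grid of (ne[1], ne[2], ne[3]) threadgroups with `ntg` threads each.
// Assumes contiguous src/dst buffers of ne[0]*ne[1]*ne[2]*ne[3] floats.
static void roll_emulated(const roll_args & a, const float * src, float * dst, int64_t ntg) {
    assert(ntg > 0);
    for (int64_t i3 = 0; i3 < a.ne[3]; ++i3) {
        for (int64_t i2 = 0; i2 < a.ne[2]; ++i2) {
            for (int64_t i1 = 0; i1 < a.ne[1]; ++i1) {
                // The source row is the same for every thread in the group.
                const int64_t j1 = wrap(i1 - a.s[1], a.ne[1]);
                const int64_t j2 = wrap(i2 - a.s[2], a.ne[2]);
                const int64_t j3 = wrap(i3 - a.s[3], a.ne[3]);
                const int64_t dst_row = ((i3*a.ne[2] + i2)*a.ne[1] + i1)*a.ne[0];
                const int64_t src_row = ((j3*a.ne[2] + j2)*a.ne[1] + j1)*a.ne[0];
                for (int64_t tid = 0; tid < ntg; ++tid) {
                    // Each thread handles i0 = tid, tid + ntg, tid + 2*ntg, ...
                    for (int64_t i0 = tid; i0 < a.ne[0]; i0 += ntg) {
                        const int64_t j0 = wrap(i0 - a.s[0], a.ne[0]);
                        dst[dst_row + i0] = src[src_row + j0];
                    }
                }
            }
        }
    }
}
```

The result should not depend on `ntg`, since every destination element is written exactly once regardless of how the innermost dimension is divided among threads.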

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude Code was used to help research the Metal kernel patterns in the codebase and draft the implementation. All code was reviewed and verified against the CPU reference implementation.

Add native Metal GPU support for the ROLL operation, which performs
circular shifts along tensor dimensions. Previously this op had no
Metal kernel, causing CPU fallbacks and graph splits on Apple Silicon.

The kernel uses the same wrap-around index logic as the CPU
implementation: for each element, compute the source index as
(dst_idx - shift) mod dim_size for each dimension.

Files changed:
- ggml-metal-impl.h: add ggml_metal_kargs_roll struct
- ggml-metal-device.m: register GGML_OP_ROLL as supported
- ggml-metal-device.cpp: add pipeline name mapping
- ggml-metal-ops.h: declare ggml_metal_op_roll
- ggml-metal-ops.cpp: dispatch function
- ggml-metal.metal: kernel_roll shader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

stephencox-ict requested a review from a team as a code owner on April 11, 2026, 23:49

github-actions bot added the ggml and Apple Metal labels on Apr 11, 2026

@stephencox-ict
Author

@ngxson, could you test this? I do not have the hardware.

@ggml-gh-bot

ggml-gh-bot bot commented Apr 11, 2026

Hi @stephencox-ict, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

