ggml-metal: add Metal kernel for ggml_roll #21782
Open
stephencox-ict wants to merge 1 commit into ggml-org:master from
Add native Metal GPU support for the ROLL operation, which performs circular shifts along tensor dimensions. Previously this op had no Metal kernel, causing CPU fallbacks and graph splits on Apple Silicon.

The kernel uses the same wrap-around index logic as the CPU implementation: for each element, compute the source index as (dst_idx - shift) mod dim_size for each dimension.

Files changed:
- ggml-metal-impl.h: add ggml_metal_kargs_roll struct
- ggml-metal-device.m: register GGML_OP_ROLL as supported
- ggml-metal-device.cpp: add pipeline name mapping
- ggml-metal-ops.h: declare ggml_metal_op_roll
- ggml-metal-ops.cpp: dispatch function
- ggml-metal.metal: kernel_roll shader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
@ngxson can you give this a test? I do not have the hardware.
Hi @stephencox-ict, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Add a native Metal kernel for `GGML_OP_ROLL` (circular shift). This op currently has no Metal implementation, so it falls back to CPU on Apple Silicon, creating graph splits at every call site.

The Gemma 4 audio conformer uses two `ggml_roll` calls per layer across 12 layers, resulting in 73 graph splits on Metal. With this kernel, all conformer ops stay on the GPU.

The shader follows the same wrapping logic as the CPU implementation: for each element, compute the source index as `(dst_idx - shift) mod dim_size` per dimension.

Files changed:
- `ggml-metal-impl.h` - kargs struct
- `ggml-metal-device.m` - register `GGML_OP_ROLL` as supported
- `ggml-metal-device.cpp` - pipeline name mapping
- `ggml-metal-ops.h` - declare dispatch function
- `ggml-metal-ops.cpp` - dispatch function (sets args, binds buffers, dispatches threadgroups)
- `ggml-metal.metal` - `kernel_roll` shader

Ref: #21421 (Gemma 4 audio conformer PR where this was requested)
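For reviewers who want to check the wrapping rule without Metal hardware, here is a minimal CPU-side sketch in C++ of the `(dst_idx - shift) mod dim_size` logic described above. It is not the shader from this PR: `wrap_index` and `roll_4d` are hypothetical names, and the sketch assumes a contiguous float tensor with `ne[0]` as the fastest-varying dimension.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Wrap idx into [0, n). C++ '%' can return a negative remainder,
// so a negative result is shifted back into range. Assumes n > 0.
int64_t wrap_index(int64_t idx, int64_t n) {
    const int64_t r = idx % n;
    return r < 0 ? r + n : r;
}

// Hypothetical reference roll over a contiguous 4-D float tensor.
// For every destination element, the source index in each dimension is
// (dst_idx - shift) mod dim_size, as stated in the PR description.
std::vector<float> roll_4d(const std::vector<float>& src,
                           const int64_t ne[4],
                           const int64_t shift[4]) {
    std::vector<float> dst(src.size());
    for (int64_t i3 = 0; i3 < ne[3]; ++i3) {
        const int64_t s3 = wrap_index(i3 - shift[3], ne[3]);
        for (int64_t i2 = 0; i2 < ne[2]; ++i2) {
            const int64_t s2 = wrap_index(i2 - shift[2], ne[2]);
            for (int64_t i1 = 0; i1 < ne[1]; ++i1) {
                const int64_t s1 = wrap_index(i1 - shift[1], ne[1]);
                for (int64_t i0 = 0; i0 < ne[0]; ++i0) {
                    const int64_t s0 = wrap_index(i0 - shift[0], ne[0]);
                    // Flatten 4-D indices assuming a contiguous layout.
                    const int64_t d = ((i3 * ne[2] + i2) * ne[1] + i1) * ne[0] + i0;
                    const int64_t s = ((s3 * ne[2] + s2) * ne[1] + s1) * ne[0] + s0;
                    dst[static_cast<std::size_t>(d)] = src[static_cast<std::size_t>(s)];
                }
            }
        }
    }
    return dst;
}
```

With a shift of 2 along the first dimension, a 5-element row `{0, 1, 2, 3, 4}` becomes `{3, 4, 0, 1, 2}`. The explicit wrap helper matters because a plain `%` on a negative `dst_idx - shift` would produce a negative index.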
Additional information
Tested locally on CPU (the logic matches). Needs validation on macOS Metal via CI or manual testing.
The dispatch uses the same pattern as `kernel_concat`: threadgroups over (ne01, ne02, ne03) with threads iterating over ne00.

Requirements