Subgraph-level Offloading in llama.cpp #21392
taimur-10x
started this conversation in Ideas
Replies: 1 comment

For RPC, in `supports_op` you could have the RPC client ask the server whether the actual backend behind it supports that particular op, and cache the result.
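A minimal sketch of the caching idea suggested above. Note that `rpc_remote_supports_op` and the `op_id` enum are placeholders invented for illustration, not existing llama.cpp functions; the point is only that the first query per op pays the network round-trip and later queries are answered locally.

```cpp
#include <mutex>
#include <unordered_map>

// Hypothetical op identifiers standing in for ggml's op enum.
enum op_id { OP_MUL_MAT, OP_SOFT_MAX, OP_ROPE };

// Placeholder for a round-trip to the RPC server asking the real backend
// behind it whether it supports the op. This function does not exist in
// llama.cpp; it only marks where the network request would go.
static bool rpc_remote_supports_op(op_id op) {
    return op == OP_MUL_MAT || op == OP_SOFT_MAX;
}

// Cached wrapper: the first query for each op hits the server; every later
// query for the same op is answered from the local cache.
static bool rpc_supports_op_cached(op_id op) {
    static std::unordered_map<int, bool> cache;
    static std::mutex m;
    std::lock_guard<std::mutex> lock(m);
    auto it = cache.find(op);
    if (it != cache.end()) {
        return it->second;                 // cache hit: no round-trip
    }
    bool ok = rpc_remote_supports_op(op);  // cache miss: ask the server once
    cache.emplace(op, ok);
    return ok;
}
```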
Background
From what I understand, llama.cpp's current scheduling model operates at the tensor level at each step. Operator assignment happens in two phases:
1. Weight placement, during the `load_tensors` phase in `llama_load_model_from_file`: each weight is assigned to a device that can handle the corresponding buffer type, based on the GPU and CPU `buft_list`s.
2. Operator assignment, at graph scheduling time: each op is placed based on where its weights live and on the operator support the backend reports.

Problem
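The two phases can be sketched roughly as below. This is a simplification invented for illustration, not the actual llama.cpp code: device index 0 stands in for the CPU, `place_weight` mirrors the first-match walk over a `buft_list`, and `assign_op` shows an op following its weight's device unless that device rejects the op.

```cpp
#include <cstddef>
#include <vector>

// Phase 1 (weight placement, simplified): take the first device whose
// buffer-type list accepts the weight; index 0 stands in for the CPU.
static size_t place_weight(const std::vector<bool> & device_accepts_buft) {
    for (size_t d = 1; d < device_accepts_buft.size(); ++d) {
        if (device_accepts_buft[d]) {
            return d;   // first accepting device wins
        }
    }
    return 0;           // no device accepted it: fall back to CPU
}

// Phase 2 (operator assignment, simplified): an op follows the device
// holding its weight, unless that device rejects the op, in which case
// it falls back to CPU.
static size_t assign_op(size_t weight_device, bool device_supports_op) {
    return device_supports_op ? weight_device : 0;
}
```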
There does not seem to be a way right now to suggest that an entire functional subgraph (for example, a full attention block) be offloaded to a particular device. The current assignment is based on weight placement and operator support capabilities (determined from the device backend itself). There seems to be no user-facing mechanism to manually split the graph at a subgraph level, and no way for a backend to declare that it only wants to implement a particular functional region of the graph.
Potential Use Case
If a user has an RPC server running that is backed by some specialized hardware, they might want to route a particular subgraph (for example, the attention block) to the RPC node, with the rest of the compute handled locally. Right now, the RPC backend is hard-coded as a GPU-type backend and always returns true from `supports_op`.
The subgraph offloading mechanism need not be limited to the RPC backend; it could be used by any locally registered backend as well. However, the RPC backend currently seems like the natural place to prototype and demo this idea.
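To make the idea concrete, here is a purely hypothetical sketch of what letting a backend claim a functional region could look like. Nothing below exists in llama.cpp; `node_stub`, `backend_claims_node`, and `partition` are invented names, and matching on node-name prefixes is only one possible way to delimit a region such as attention.

```cpp
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a graph node; real ggml nodes carry much more state.
struct node_stub {
    std::string name;   // e.g. "attn_q-0", "ffn_up-0"
};

// Hypothetical predicate a backend could expose to claim a functional region
// of the graph by node name, instead of answering op by op.
static bool backend_claims_node(const node_stub & n) {
    return n.name.rfind("attn_", 0) == 0;   // claim attention nodes only
}

// Split a flat graph into (node index, goes-to-specialized-backend) pairs,
// which a scheduler could then use to cut the graph at subgraph boundaries.
static std::vector<std::pair<size_t, bool>> partition(const std::vector<node_stub> & graph) {
    std::vector<std::pair<size_t, bool>> out;
    for (size_t i = 0; i < graph.size(); ++i) {
        out.emplace_back(i, backend_claims_node(graph[i]));
    }
    return out;
}
```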
Discussion
Is this something that is feasible within the current architecture, or are there any constraints in the graph and scheduler system that would make this difficult to support?
I'm interested in contributing towards this effort. Looking forward to any guidance or feedback from the community on the approach and how it could be made part of mainline llama.cpp.