Subgraph-level Offloading in llama.cpp #21392
taimur-10x
started this conversation in Ideas
Replies: 1 comment

For RPC, in `supports_op` you could have the RPC client ask the server whether the actual backend behind it supports that particular op, and cache the result.
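A minimal sketch of the caching idea suggested above. Note that `rpc_remote_supports_op` and the `op_id` enum are placeholders invented for illustration, not existing llama.cpp functions; the point is only that the first query per op pays the network round-trip and later queries are answered locally.

```cpp
#include <mutex>
#include <unordered_map>

// Hypothetical op identifiers standing in for ggml's op enum.
enum op_id { OP_MUL_MAT, OP_SOFT_MAX, OP_ROPE };

// Placeholder for a round-trip to the RPC server asking the real backend
// behind it whether it supports the op. This function does not exist in
// llama.cpp; it only marks where the network request would go.
static bool rpc_remote_supports_op(op_id op) {
    return op == OP_MUL_MAT || op == OP_SOFT_MAX;
}

// Cached wrapper: the first query for each op hits the server; every later
// query for the same op is answered from the local cache.
static bool rpc_supports_op_cached(op_id op) {
    static std::unordered_map<int, bool> cache;
    static std::mutex m;
    std::lock_guard<std::mutex> lock(m);
    auto it = cache.find(op);
    if (it != cache.end()) {
        return it->second;                 // cache hit: no round-trip
    }
    bool ok = rpc_remote_supports_op(op);  // cache miss: ask the server once
    cache.emplace(op, ok);
    return ok;
}
```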
Background
From what I understand, llama.cpp's current scheduling model operates at the tensor level at each step. Operator assignment happens in two phases:
1. Weight placement, during the `load_tensors` phase in `llama_load_model_from_file`: each weight is assigned to a device that can handle the corresponding buffer type, based on the GPU and CPU `buft_list`s.
2. Operator assignment, at graph scheduling time: each op is placed based on where its weights live and on the operator support the backend reports.

Problem
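The two phases can be sketched roughly as below. This is a simplification invented for illustration, not the actual llama.cpp code: device index 0 stands in for the CPU, `place_weight` mirrors the first-match walk over a `buft_list`, and `assign_op` shows an op following its weight's device unless that device rejects the op.

```cpp
#include <cstddef>
#include <vector>

// Phase 1 (weight placement, simplified): take the first device whose
// buffer-type list accepts the weight; index 0 stands in for the CPU.
static size_t place_weight(const std::vector<bool> & device_accepts_buft) {
    for (size_t d = 1; d < device_accepts_buft.size(); ++d) {
        if (device_accepts_buft[d]) {
            return d;   // first accepting device wins
        }
    }
    return 0;           // no device accepted it: fall back to CPU
}

// Phase 2 (operator assignment, simplified): an op follows the device
// holding its weight, unless that device rejects the op, in which case
// it falls back to CPU.
static size_t assign_op(size_t weight_device, bool device_supports_op) {
    return device_supports_op ? weight_device : 0;
}
```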
There does not seem to be a way right now to suggest that an entire functional subgraph (for example, a full attention block) be offloaded to a particular device. The current assignment is based on weight placement and operator support capabilities (determined from the device backend itself). There seems to be no user-facing mechanism to manually split the graph at a subgraph level, and no way for a backend to declare that it only wants to implement a particular functional region of the graph.
Potential Use Case
If a user has an RPC server running that is backed by some specialized hardware, they might want to route a particular subgraph (for example, the attention block) to the RPC node, with the rest of the compute handled locally. Right now, the RPC backend is hard-coded as a GPU-type backend and always returns true from `supports_op`.
The subgraph offloading mechanism need not be limited to the RPC backend; it could be used by any locally registered backend as well. However, the RPC backend currently seems like the natural place to prototype and demo this idea.
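To make the idea concrete, here is a purely hypothetical sketch of what letting a backend claim a functional region could look like. Nothing below exists in llama.cpp; `node_stub`, `backend_claims_node`, and `partition` are invented names, and matching on node-name prefixes is only one possible way to delimit a region such as attention.

```cpp
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a graph node; real ggml nodes carry much more state.
struct node_stub {
    std::string name;   // e.g. "attn_q-0", "ffn_up-0"
};

// Hypothetical predicate a backend could expose to claim a functional region
// of the graph by node name, instead of answering op by op.
static bool backend_claims_node(const node_stub & n) {
    return n.name.rfind("attn_", 0) == 0;   // claim attention nodes only
}

// Split a flat graph into (node index, goes-to-specialized-backend) pairs,
// which a scheduler could then use to cut the graph at subgraph boundaries.
static std::vector<std::pair<size_t, bool>> partition(const std::vector<node_stub> & graph) {
    std::vector<std::pair<size_t, bool>> out;
    for (size_t i = 0; i < graph.size(); ++i) {
        out.emplace_back(i, backend_claims_node(graph[i]));
    }
    return out;
}
```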
Discussion
Is this something that is feasible within the current architecture, or are there any constraints in the graph and scheduler system that would make this difficult to support?
I'm interested in contributing towards this effort. Looking forward to any guidance or feedback from the community on the approach and how it could be made part of mainline llama.cpp.