Fixes #21686.
The problem is that the meta backend always assumes a mirrored layout for the graph inputs created by the backend scheduler. However, with options like `-nkvo` this is not correct: in order to correctly align the KV cache with the split weights surrounding it, the KV cache also needs to be split. The meta backend determines split states by going up the chain of `ggml_tensor::src` until it finds statically allocated weights that have a fixed split state. So the issue can be fixed by finding the original tensor that is being copied to the meta backend and propagating the split state from there (since this logic only depends on tensor ops and shapes). In the meta backend this requires 1. determining the split states of the weights and 2. access to the original tensor being copied. This PR implements 1. by comparing the tensor name and 2. by setting `ggml_tensor::src[0]` of the graph input to the original tensor; since the graph input has `GGML_OP_NONE` this should be safe.

I am not happy with this implementation though. For the split states of weights I am currently comparing tensor names, which I think is a hacky and bad way to do it. I've been thinking that it would be better to, when creating the statically allocated tensors for the model, create a map of `ggml_tensor *` -> `enum llm_tensor` and to use that to determine which tensors should receive which split state. But this would not work for the dynamically allocated tensors of the backend scheduler. One solution would be to set a flag for those tensors. But long-term I've been thinking about whether it would maybe make sense to add something like a `GGML_OP_BACKEND_COPY` and to do the data copies as part of the ggml graph. That would also give the meta backend a natural way to handle this edge case.