I tried the mainline TP with the 31b. It's F16 cache only, and for some reason it eats itself during a perplexity run at the exact same point every time. The base model completes, but the IT model does not. If my main GPU is 0, the card drops down to PCIe 2x until reboot. It could be a weak riser on my system, but any other GPU order also crashes at iteration 260.
@Ph0rk0z I wanted to try the

For me it's
And here are some speed tests for the 31b: same quant, same settings except for swa-full and the name of the TP param. I haven't applied the PR yet.

Mainline - tensor / swa-full
IK - split graph
Took the idea from PR 21764 in llama.cpp. Or rather, since the idea itself is obvious, the motivation to do it. The idea: when using CUDA graphs, if a compute graph is reused, one can skip the checks for whether the graph properties have changed. Up to 6% performance gains for TG are being claimed in the mainline PR, so that gave me the motivation to do the same for ik_llama.cpp.

The outcome is rather disappointing: sub-1% gains (measured on a 2x3090 system).
Having noticed that the new tensor parallel option in llama.cpp (`-sm tensor`) has been merged (PR 19378), I was curious to see how it does now, 2 months after the PR was first submitted. I made some observations about this effort back in February, so let's see what we get today.

A complete evaluation would be interesting, but here is just a quick check with the Gemma4 models on a 2x3090 system. The results are below and speak for themselves.
Gemma4-26B-A4B-IQ4_XS: llama.cpp vs ik_llama.cpp