Add reuse property to ggml_cgraph #1617

Open
ikawrakow wants to merge 1 commit into main from ik/graph_reuse_field

Conversation

@ikawrakow
Owner

Took the idea from PR 21764 in llama.cpp, or rather, since the idea is obvious, the motivation to do it. The idea is that if a compute graph is reused, one can skip the checks of whether the graph properties have changed when using CUDA graphs. Up to 6% performance gains for TG are claimed in the mainline PR, so that gave me the motivation to do the same for ik_llama.cpp.
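
For illustration, here is a minimal sketch in C of what the flag buys. This is not the actual patch: the helper names (`cuda_graph_needs_update`, `check_node_properties_changed`) and the stub body are assumptions made for the sketch.

```c
#include <stdbool.h>

// Sketch only: ggml_cgraph gains a flag that the caller sets when it
// re-submits an unchanged graph.
struct ggml_cgraph {
    // ... existing fields (size, n_nodes, n_leafs, nodes, grads, leafs, ...) ...
    bool reuse; // true => graph is identical to the previous submission
};

// Stub standing in for the existing per-node comparison against the
// properties recorded when the CUDA graph was captured.
static bool check_node_properties_changed(const struct ggml_cgraph * cgraph) {
    (void) cgraph;
    return true; // conservatively assume a change in this sketch
}

// Before replaying a captured CUDA graph, decide whether it must be
// re-captured (hypothetical helper name).
static bool cuda_graph_needs_update(const struct ggml_cgraph * cgraph) {
    if (cgraph->reuse) {
        // The caller guarantees nothing changed, so the per-node
        // property walk can be skipped entirely.
        return false;
    }
    return check_node_properties_changed(cgraph);
}
```

The point is that the caller, which already knows whether it rebuilt the graph, asserts reuse once, instead of the backend re-deriving that fact by walking every node on every evaluation.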

The outcome is rather disappointing - sub-1% gains (measured on a 2x3090 system).

Having noticed that the new tensor parallel option in llama.cpp (`-sm tensor`) has been merged (PR 19378), I was curious to see how it does now, two months after the PR was first submitted. I made some observations about this effort back in February, so let's see what we get today.

A complete evaluation would be interesting, but here is just a quick check with the Gemma4 models on a 2x3090 system. The results are below and speak for themselves.

Gemma4-26B-A4B-IQ4_XS

llama.cpp

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.434 | 4716.75 | 1.080 | 118.49 |
| 2048 | 128 | 2048 | 0.458 | 4472.55 | 1.109 | 115.42 |
| 2048 | 128 | 4096 | 0.463 | 4425.22 | 1.126 | 113.63 |
| 2048 | 128 | 6144 | 0.472 | 4336.73 | 1.132 | 113.05 |
| 2048 | 128 | 8192 | 0.485 | 4226.80 | 1.139 | 112.41 |
| 2048 | 128 | 10240 | 0.491 | 4172.78 | 1.140 | 112.27 |
| 2048 | 128 | 12288 | 0.501 | 4087.37 | 1.137 | 112.62 |
| 2048 | 128 | 14336 | 0.513 | 3992.38 | 1.141 | 112.18 |
| 2048 | 128 | 16384 | 0.521 | 3933.51 | 1.145 | 111.79 |
| 2048 | 128 | 18432 | 0.530 | 3863.57 | 1.150 | 111.29 |
| 2048 | 128 | 20480 | 0.543 | 3770.79 | 1.153 | 111.02 |
| 2048 | 128 | 22528 | 0.550 | 3723.11 | 1.158 | 110.53 |
| 2048 | 128 | 24576 | 0.563 | 3639.96 | 1.162 | 110.19 |
| 2048 | 128 | 26624 | 0.573 | 3572.90 | 1.165 | 109.91 |
| 2048 | 128 | 28672 | 0.582 | 3519.25 | 1.169 | 109.50 |
| 2048 | 128 | 30720 | 0.591 | 3464.45 | 1.172 | 109.20 |

ik_llama.cpp

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.327 | 6265.35 | 0.841 | 152.26 |
| 2048 | 128 | 2048 | 0.290 | 7057.27 | 0.866 | 147.84 |
| 2048 | 128 | 4096 | 0.301 | 6812.09 | 0.870 | 147.07 |
| 2048 | 128 | 6144 | 0.312 | 6555.22 | 0.877 | 146.00 |
| 2048 | 128 | 8192 | 0.324 | 6319.15 | 0.885 | 144.66 |
| 2048 | 128 | 10240 | 0.336 | 6092.81 | 0.890 | 143.81 |
| 2048 | 128 | 12288 | 0.348 | 5885.90 | 0.892 | 143.53 |
| 2048 | 128 | 14336 | 0.361 | 5677.94 | 0.901 | 142.06 |
| 2048 | 128 | 16384 | 0.374 | 5481.73 | 0.909 | 140.75 |
| 2048 | 128 | 18432 | 0.386 | 5308.63 | 0.927 | 138.12 |
| 2048 | 128 | 20480 | 0.397 | 5159.07 | 0.929 | 137.83 |
| 2048 | 128 | 22528 | 0.409 | 5009.88 | 0.930 | 137.60 |
| 2048 | 128 | 24576 | 0.418 | 4894.51 | 0.930 | 137.61 |
| 2048 | 128 | 26624 | 0.432 | 4744.07 | 0.938 | 136.53 |
| 2048 | 128 | 28672 | 0.444 | 4607.92 | 0.939 | 136.30 |
| 2048 | 128 | 30720 | 0.456 | 4487.35 | 0.949 | 134.95 |

@Ph0rk0z

Ph0rk0z commented Apr 11, 2026

I tried the mainline TP with the 31b. It's F16 cache only, and for some reason it falls over during a perplexity run at the exact same point every time. The base model completes, but the IT one does not. If my main GPU is 0, the card drops down to PCIe 2x until reboot. It's possible I've found a weak riser on my system, but any other device order also crashes at iteration 260.

@ikawrakow
Owner Author

ikawrakow commented Apr 11, 2026

@Ph0rk0z I wanted to try llama.cpp's tensor split mode on a few more models, but it crashed for me with Qwen-3.5-35B-A3B, so I gave up and posted only the results above. It fails with this assert:

```
ggml/src/ggml-backend-meta.cpp:1276: GGML_ASSERT(split_state.n_segments == 1) failed
```

@Ph0rk0z

Ph0rk0z commented Apr 11, 2026

For me it's:

```
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at /home/supermicro/ai/llama.cpp.main/ggml/src/ggml-cuda/ggml-cuda.cu:1504
  cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, CUDA_R_16F, ne00, src1_ptr, CUDA_R_16F, ne10, &beta_f16, dst_f16.get(), CUDA_R_16F, ldc, CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```

@Ph0rk0z

Ph0rk0z commented Apr 11, 2026

And here are some speed tests for the 31b: same quant, same settings except for swa-full and the name of the TP parameter. I haven't applied this PR yet.

Mainline - tensor/swa-full

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 0.521 | 1964.13 | 5.021 | 50.99 |
| 1024 | 256 | 1024 | 0.541 | 1893.61 | 5.056 | 50.63 |
| 1024 | 256 | 2048 | 0.555 | 1843.72 | 5.136 | 49.85 |
| 1024 | 256 | 3072 | 0.571 | 1794.35 | 5.213 | 49.11 |
| 1024 | 256 | 4096 | 0.586 | 1747.41 | 5.296 | 48.34 |
| 1024 | 256 | 5120 | 0.603 | 1697.94 | 5.357 | 47.79 |
| 1024 | 256 | 6144 | 0.618 | 1658.29 | 5.430 | 47.14 |
| 1024 | 256 | 7168 | 0.633 | 1617.65 | 5.480 | 46.71 |
| 1024 | 256 | 8192 | 0.647 | 1581.89 | 5.550 | 46.12 |
| 1024 | 256 | 9216 | 0.664 | 1541.27 | 5.625 | 45.52 |
| 1024 | 256 | 10240 | 0.678 | 1510.22 | 5.693 | 44.97 |
| 1024 | 256 | 11264 | 0.693 | 1476.60 | 5.774 | 44.34 |
| 1024 | 256 | 12288 | 0.712 | 1439.11 | 5.839 | 43.84 |
| 1024 | 256 | 13312 | 0.725 | 1412.07 | 5.941 | 43.09 |
| 1024 | 256 | 14336 | 0.740 | 1384.24 | 5.970 | 42.88 |
| 1024 | 256 | 15360 | 0.754 | 1357.54 | 6.027 | 42.48 |
| 1024 | 256 | 16384 | 0.770 | 1329.16 | 6.110 | 41.90 |
| 1024 | 256 | 17408 | 0.788 | 1299.77 | 6.157 | 41.58 |
| 1024 | 256 | 18432 | 0.801 | 1278.47 | 6.221 | 41.15 |
| 1024 | 256 | 19456 | 0.818 | 1251.71 | 6.290 | 40.70 |
| 1024 | 256 | 20480 | 0.832 | 1230.17 | 6.351 | 40.31 |
| 1024 | 256 | 21504 | 0.849 | 1206.48 | 6.411 | 39.93 |
| 1024 | 256 | 22528 | 0.863 | 1187.11 | 6.458 | 39.64 |
| 1024 | 256 | 23552 | 0.879 | 1165.30 | 6.533 | 39.19 |
| 1024 | 256 | 24576 | 0.895 | 1144.35 | 6.596 | 38.81 |
| 1024 | 256 | 25600 | 0.908 | 1128.15 | 6.649 | 38.50 |
| 1024 | 256 | 26624 | 0.925 | 1107.27 | 6.761 | 37.87 |
| 1024 | 256 | 27648 | 0.939 | 1090.76 | 6.828 | 37.49 |
| 1024 | 256 | 28672 | 0.955 | 1072.21 | 6.878 | 37.22 |
| 1024 | 256 | 29696 | 0.972 | 1053.89 | 6.975 | 36.70 |
| 1024 | 256 | 30720 | 0.987 | 1038.01 | 7.051 | 36.31 |
| 1024 | 256 | 31744 | 1.000 | 1023.57 | 7.101 | 36.05 |

IK - split graph

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 0.551 | 1858.34 | 4.345 | 58.92 |
| 1024 | 256 | 1024 | 0.468 | 2185.76 | 4.448 | 57.55 |
| 1024 | 256 | 2048 | 0.474 | 2160.78 | 4.438 | 57.69 |
| 1024 | 256 | 3072 | 0.483 | 2120.64 | 4.451 | 57.52 |
| 1024 | 256 | 4096 | 0.491 | 2086.79 | 4.488 | 57.04 |
| 1024 | 256 | 5120 | 0.497 | 2058.80 | 4.500 | 56.89 |
| 1024 | 256 | 6144 | 0.506 | 2025.25 | 4.516 | 56.69 |
| 1024 | 256 | 7168 | 0.513 | 1995.76 | 4.528 | 56.54 |
| 1024 | 256 | 8192 | 0.521 | 1966.55 | 4.541 | 56.37 |
| 1024 | 256 | 9216 | 0.529 | 1937.09 | 4.563 | 56.10 |
| 1024 | 256 | 10240 | 0.537 | 1906.63 | 4.572 | 56.00 |
| 1024 | 256 | 11264 | 0.544 | 1880.65 | 4.593 | 55.73 |
| 1024 | 256 | 12288 | 0.553 | 1852.93 | 4.604 | 55.60 |
| 1024 | 256 | 13312 | 0.560 | 1827.05 | 4.625 | 55.35 |
| 1024 | 256 | 14336 | 0.568 | 1804.16 | 4.632 | 55.27 |
| 1024 | 256 | 15360 | 0.576 | 1778.48 | 4.647 | 55.09 |
| 1024 | 256 | 16384 | 0.584 | 1753.80 | 4.714 | 54.30 |
| 1024 | 256 | 17408 | 0.592 | 1731.16 | 4.730 | 54.13 |
| 1024 | 256 | 18432 | 0.599 | 1710.52 | 4.748 | 53.92 |
| 1024 | 256 | 19456 | 0.607 | 1687.50 | 4.755 | 53.83 |
| 1024 | 256 | 20480 | 0.618 | 1656.97 | 4.771 | 53.65 |
| 1024 | 256 | 21504 | 0.624 | 1640.95 | 4.782 | 53.54 |
| 1024 | 256 | 22528 | 0.631 | 1623.23 | 4.795 | 53.39 |
| 1024 | 256 | 23552 | 0.639 | 1602.26 | 4.803 | 53.30 |
| 1024 | 256 | 24576 | 0.647 | 1582.18 | 4.807 | 53.25 |
| 1024 | 256 | 25600 | 0.656 | 1561.67 | 4.830 | 53.01 |
| 1024 | 256 | 26624 | 0.665 | 1539.90 | 4.831 | 53.00 |
| 1024 | 256 | 27648 | 0.672 | 1524.19 | 4.852 | 52.76 |
| 1024 | 256 | 28672 | 0.679 | 1508.12 | 4.857 | 52.71 |
| 1024 | 256 | 29696 | 0.687 | 1489.86 | 4.860 | 52.68 |
| 1024 | 256 | 30720 | 0.696 | 1472.29 | 4.872 | 52.55 |
| 1024 | 256 | 31744 | 0.704 | 1453.92 | 4.880 | 52.46 |

```
llama_print_timings:        load time =   18401.67 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   18663.35 ms / 32768 tokens (    0.57 ms per token,  1755.74 tokens per second)
llama_print_timings:        eval time =  149364.76 ms /  8192 runs   (   18.23 ms per token,    54.85 tokens per second)
llama_print_timings:       total time =  168050.38 ms / 40960 tokens
```
