Add reuse property to ggml_cgraph #1617

Open
ikawrakow wants to merge 1 commit into main from ik/graph_reuse_field

Conversation

@ikawrakow
Owner

Took the idea from PR 21764 in llama.cpp, or rather, since the idea is obvious, the motivation to do it. The idea is that if a compute graph is reused, one can skip the checks of whether the graph properties have changed when using CUDA graphs. Up to 6% performance gains for TG are claimed in the mainline PR, so that gave me the motivation to do the same for ik_llama.cpp.
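
For illustration, here is a minimal sketch in C of what the flag buys. This is not the actual patch: the helper names (`cuda_graph_needs_update`, `check_node_properties_changed`) and the stub body are assumptions made for the sketch.

```c
#include <stdbool.h>

// Sketch only: ggml_cgraph gains a flag that the caller sets when it
// re-submits an unchanged graph.
struct ggml_cgraph {
    // ... existing fields (size, n_nodes, n_leafs, nodes, grads, leafs, ...) ...
    bool reuse; // true => graph is identical to the previous submission
};

// Stub standing in for the existing per-node comparison against the
// properties recorded when the CUDA graph was captured.
static bool check_node_properties_changed(const struct ggml_cgraph * cgraph) {
    (void) cgraph;
    return true; // conservatively assume a change in this sketch
}

// Before replaying a captured CUDA graph, decide whether it must be
// re-captured (hypothetical helper name).
static bool cuda_graph_needs_update(const struct ggml_cgraph * cgraph) {
    if (cgraph->reuse) {
        // The caller guarantees nothing changed, so the per-node
        // property walk can be skipped entirely.
        return false;
    }
    return check_node_properties_changed(cgraph);
}
```

The point is that the caller, which already knows whether it rebuilt the graph, asserts reuse once, instead of the backend re-deriving that fact by walking every node on every evaluation.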

The outcome is rather disappointing - sub-1% gains (measured on a 2x3090 system).

Having noticed that the new tensor parallel option in llama.cpp (`-sm tensor`) has been merged (PR 19378), I was curious to see how it does now, two months after the PR was first submitted. I made some observations about this effort back in February, so let's see what we get today.

A complete evaluation would be interesting, but here is just a quick check with the Gemma4 models on a 2x3090 system. The results are below and speak for themselves.

Gemma4-26B-A4B-IQ4_XS

llama.cpp

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.434 | 4716.75 | 1.080 | 118.49 |
| 2048 | 128 | 2048 | 0.458 | 4472.55 | 1.109 | 115.42 |
| 2048 | 128 | 4096 | 0.463 | 4425.22 | 1.126 | 113.63 |
| 2048 | 128 | 6144 | 0.472 | 4336.73 | 1.132 | 113.05 |
| 2048 | 128 | 8192 | 0.485 | 4226.80 | 1.139 | 112.41 |
| 2048 | 128 | 10240 | 0.491 | 4172.78 | 1.140 | 112.27 |
| 2048 | 128 | 12288 | 0.501 | 4087.37 | 1.137 | 112.62 |
| 2048 | 128 | 14336 | 0.513 | 3992.38 | 1.141 | 112.18 |
| 2048 | 128 | 16384 | 0.521 | 3933.51 | 1.145 | 111.79 |
| 2048 | 128 | 18432 | 0.530 | 3863.57 | 1.150 | 111.29 |
| 2048 | 128 | 20480 | 0.543 | 3770.79 | 1.153 | 111.02 |
| 2048 | 128 | 22528 | 0.550 | 3723.11 | 1.158 | 110.53 |
| 2048 | 128 | 24576 | 0.563 | 3639.96 | 1.162 | 110.19 |
| 2048 | 128 | 26624 | 0.573 | 3572.90 | 1.165 | 109.91 |
| 2048 | 128 | 28672 | 0.582 | 3519.25 | 1.169 | 109.50 |
| 2048 | 128 | 30720 | 0.591 | 3464.45 | 1.172 | 109.20 |

ik_llama.cpp

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.327 | 6265.35 | 0.841 | 152.26 |
| 2048 | 128 | 2048 | 0.290 | 7057.27 | 0.866 | 147.84 |
| 2048 | 128 | 4096 | 0.301 | 6812.09 | 0.870 | 147.07 |
| 2048 | 128 | 6144 | 0.312 | 6555.22 | 0.877 | 146.00 |
| 2048 | 128 | 8192 | 0.324 | 6319.15 | 0.885 | 144.66 |
| 2048 | 128 | 10240 | 0.336 | 6092.81 | 0.890 | 143.81 |
| 2048 | 128 | 12288 | 0.348 | 5885.90 | 0.892 | 143.53 |
| 2048 | 128 | 14336 | 0.361 | 5677.94 | 0.901 | 142.06 |
| 2048 | 128 | 16384 | 0.374 | 5481.73 | 0.909 | 140.75 |
| 2048 | 128 | 18432 | 0.386 | 5308.63 | 0.927 | 138.12 |
| 2048 | 128 | 20480 | 0.397 | 5159.07 | 0.929 | 137.83 |
| 2048 | 128 | 22528 | 0.409 | 5009.88 | 0.930 | 137.60 |
| 2048 | 128 | 24576 | 0.418 | 4894.51 | 0.930 | 137.61 |
| 2048 | 128 | 26624 | 0.432 | 4744.07 | 0.938 | 136.53 |
| 2048 | 128 | 28672 | 0.444 | 4607.92 | 0.939 | 136.30 |
| 2048 | 128 | 30720 | 0.456 | 4487.35 | 0.949 | 134.95 |

@Ph0rk0z

Ph0rk0z commented Apr 11, 2026

I tried the mainline TP with the 31b. It's F16 cache only, and for some reason it falls over during a perplexity run at the exact same point every time. The base model completes, but the IT one does not. If my main GPU is 0, the card drops down to PCIe 2x until reboot. It's possible I've found a weak riser on my system, but any other device order also crashes at iteration 260.

@ikawrakow
Owner Author

ikawrakow commented Apr 11, 2026

@Ph0rk0z I wanted to try llama.cpp's tensor split mode on a few more models, but it crashed for me with Qwen-3.5-35B-A3B, so I gave up and posted only the results above. It fails with this assert:

```
ggml/src/ggml-backend-meta.cpp:1276: GGML_ASSERT(split_state.n_segments == 1) failed
```

@Ph0rk0z

Ph0rk0z commented Apr 11, 2026

For me it's:

```
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at /home/supermicro/ai/llama.cpp.main/ggml/src/ggml-cuda/ggml-cuda.cu:1504
  cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, CUDA_R_16F, ne00, src1_ptr, CUDA_R_16F, ne10, &beta_f16, dst_f16.get(), CUDA_R_16F, ldc, CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```

@Ph0rk0z

Ph0rk0z commented Apr 11, 2026

And here are some speed tests for the 31b: same quant, same settings except for swa-full and the name of the TP parameter. I haven't applied this PR yet.

Mainline - tensor/swa-full

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 0.521 | 1964.13 | 5.021 | 50.99 |
| 1024 | 256 | 1024 | 0.541 | 1893.61 | 5.056 | 50.63 |
| 1024 | 256 | 2048 | 0.555 | 1843.72 | 5.136 | 49.85 |
| 1024 | 256 | 3072 | 0.571 | 1794.35 | 5.213 | 49.11 |
| 1024 | 256 | 4096 | 0.586 | 1747.41 | 5.296 | 48.34 |
| 1024 | 256 | 5120 | 0.603 | 1697.94 | 5.357 | 47.79 |
| 1024 | 256 | 6144 | 0.618 | 1658.29 | 5.430 | 47.14 |
| 1024 | 256 | 7168 | 0.633 | 1617.65 | 5.480 | 46.71 |
| 1024 | 256 | 8192 | 0.647 | 1581.89 | 5.550 | 46.12 |
| 1024 | 256 | 9216 | 0.664 | 1541.27 | 5.625 | 45.52 |
| 1024 | 256 | 10240 | 0.678 | 1510.22 | 5.693 | 44.97 |
| 1024 | 256 | 11264 | 0.693 | 1476.60 | 5.774 | 44.34 |
| 1024 | 256 | 12288 | 0.712 | 1439.11 | 5.839 | 43.84 |
| 1024 | 256 | 13312 | 0.725 | 1412.07 | 5.941 | 43.09 |
| 1024 | 256 | 14336 | 0.740 | 1384.24 | 5.970 | 42.88 |
| 1024 | 256 | 15360 | 0.754 | 1357.54 | 6.027 | 42.48 |
| 1024 | 256 | 16384 | 0.770 | 1329.16 | 6.110 | 41.90 |
| 1024 | 256 | 17408 | 0.788 | 1299.77 | 6.157 | 41.58 |
| 1024 | 256 | 18432 | 0.801 | 1278.47 | 6.221 | 41.15 |
| 1024 | 256 | 19456 | 0.818 | 1251.71 | 6.290 | 40.70 |
| 1024 | 256 | 20480 | 0.832 | 1230.17 | 6.351 | 40.31 |
| 1024 | 256 | 21504 | 0.849 | 1206.48 | 6.411 | 39.93 |
| 1024 | 256 | 22528 | 0.863 | 1187.11 | 6.458 | 39.64 |
| 1024 | 256 | 23552 | 0.879 | 1165.30 | 6.533 | 39.19 |
| 1024 | 256 | 24576 | 0.895 | 1144.35 | 6.596 | 38.81 |
| 1024 | 256 | 25600 | 0.908 | 1128.15 | 6.649 | 38.50 |
| 1024 | 256 | 26624 | 0.925 | 1107.27 | 6.761 | 37.87 |
| 1024 | 256 | 27648 | 0.939 | 1090.76 | 6.828 | 37.49 |
| 1024 | 256 | 28672 | 0.955 | 1072.21 | 6.878 | 37.22 |
| 1024 | 256 | 29696 | 0.972 | 1053.89 | 6.975 | 36.70 |
| 1024 | 256 | 30720 | 0.987 | 1038.01 | 7.051 | 36.31 |
| 1024 | 256 | 31744 | 1.000 | 1023.57 | 7.101 | 36.05 |

IK - split graph

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 0.551 | 1858.34 | 4.345 | 58.92 |
| 1024 | 256 | 1024 | 0.468 | 2185.76 | 4.448 | 57.55 |
| 1024 | 256 | 2048 | 0.474 | 2160.78 | 4.438 | 57.69 |
| 1024 | 256 | 3072 | 0.483 | 2120.64 | 4.451 | 57.52 |
| 1024 | 256 | 4096 | 0.491 | 2086.79 | 4.488 | 57.04 |
| 1024 | 256 | 5120 | 0.497 | 2058.80 | 4.500 | 56.89 |
| 1024 | 256 | 6144 | 0.506 | 2025.25 | 4.516 | 56.69 |
| 1024 | 256 | 7168 | 0.513 | 1995.76 | 4.528 | 56.54 |
| 1024 | 256 | 8192 | 0.521 | 1966.55 | 4.541 | 56.37 |
| 1024 | 256 | 9216 | 0.529 | 1937.09 | 4.563 | 56.10 |
| 1024 | 256 | 10240 | 0.537 | 1906.63 | 4.572 | 56.00 |
| 1024 | 256 | 11264 | 0.544 | 1880.65 | 4.593 | 55.73 |
| 1024 | 256 | 12288 | 0.553 | 1852.93 | 4.604 | 55.60 |
| 1024 | 256 | 13312 | 0.560 | 1827.05 | 4.625 | 55.35 |
| 1024 | 256 | 14336 | 0.568 | 1804.16 | 4.632 | 55.27 |
| 1024 | 256 | 15360 | 0.576 | 1778.48 | 4.647 | 55.09 |
| 1024 | 256 | 16384 | 0.584 | 1753.80 | 4.714 | 54.30 |
| 1024 | 256 | 17408 | 0.592 | 1731.16 | 4.730 | 54.13 |
| 1024 | 256 | 18432 | 0.599 | 1710.52 | 4.748 | 53.92 |
| 1024 | 256 | 19456 | 0.607 | 1687.50 | 4.755 | 53.83 |
| 1024 | 256 | 20480 | 0.618 | 1656.97 | 4.771 | 53.65 |
| 1024 | 256 | 21504 | 0.624 | 1640.95 | 4.782 | 53.54 |
| 1024 | 256 | 22528 | 0.631 | 1623.23 | 4.795 | 53.39 |
| 1024 | 256 | 23552 | 0.639 | 1602.26 | 4.803 | 53.30 |
| 1024 | 256 | 24576 | 0.647 | 1582.18 | 4.807 | 53.25 |
| 1024 | 256 | 25600 | 0.656 | 1561.67 | 4.830 | 53.01 |
| 1024 | 256 | 26624 | 0.665 | 1539.90 | 4.831 | 53.00 |
| 1024 | 256 | 27648 | 0.672 | 1524.19 | 4.852 | 52.76 |
| 1024 | 256 | 28672 | 0.679 | 1508.12 | 4.857 | 52.71 |
| 1024 | 256 | 29696 | 0.687 | 1489.86 | 4.860 | 52.68 |
| 1024 | 256 | 30720 | 0.696 | 1472.29 | 4.872 | 52.55 |
| 1024 | 256 | 31744 | 0.704 | 1453.92 | 4.880 | 52.46 |

```
llama_print_timings:        load time =   18401.67 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   18663.35 ms / 32768 tokens (    0.57 ms per token,  1755.74 tokens per second)
llama_print_timings:        eval time =  149364.76 ms /  8192 runs   (   18.23 ms per token,    54.85 tokens per second)
llama_print_timings:       total time =  168050.38 ms / 40960 tokens
```
