
ggml: add graph_reused #21764

Open

am17an wants to merge 2 commits into ggml-org:master from am17an:ggml_graph_reuse

Conversation

@am17an
Contributor

@am17an am17an commented Apr 11, 2026

Overview

Add a reused member variable to ggml_cgraph so that backends can take advantage of the graph reuse functionality. Currently, when graph_reuse is invoked, the CUDA backend still runs the node-properties comparison to determine whether the graph has changed, even though graph_reuse (to my understanding) guarantees that it has not. This lets the backend bypass a mildly expensive O(n) check.
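The idea described above could be sketched roughly as follows. Note that the types and names here are simplified stand-ins for illustration, not the actual ggml or CUDA-backend API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for ggml types.
struct ggml_cgraph_sim {
    int  n_nodes;
    bool reused;  // proposed flag: true when the scheduler reuses the previous graph
};

struct node_props_sim { int dummy; };

struct cuda_graph_sim {
    std::vector<node_props_sim> node_props;
    int props_checks = 0;  // counts how many times the O(n) props check ran
};

// With the proposed flag, the backend can skip the per-node property
// comparison entirely when the scheduler guarantees the graph is unchanged.
bool cuda_graph_properties_changed(cuda_graph_sim & g, const ggml_cgraph_sim & cgraph) {
    if (cgraph.reused) {
        return false;  // scheduler guarantees topology and properties match
    }
    g.props_checks++;  // stands in for the O(n_nodes) property comparison
    return (int) g.node_props.size() != cgraph.n_nodes;
}
```

The point of the sketch is that the expensive branch is never reached on the reuse path, which is where the token-generation speedups in the table below come from.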

Additional information

Testing: I tested various combinations such as --n-cpu-moe, -nkvo, and -ngl and verified that it works. Additional testing would be welcome.

Results on a 5090:

| Model | Test | t/s cuda_mul_fused | t/s ggml_graph_reuse | Speedup |
| --- | --- | --- | --- | --- |
| gemma4 ?B Q4_0 | tg128 | 219.43 | 231.50 | 1.06 |
| gemma4 ?B Q4_0 | tg128@d16384 | 183.23 | 191.91 | 1.05 |
| gemma4 ?B Q4_0 | tg128@d32768 | 175.15 | 182.58 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg128 | 313.29 | 323.01 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d16384 | 279.59 | 288.05 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d32768 | 258.80 | 266.12 | 1.03 |
| qwen35 27B Q4_K_M | tg128 | 65.04 | 66.61 | 1.02 |
| qwen35 27B Q4_K_M | tg128@d16384 | 62.45 | 63.61 | 1.02 |
| qwen35 27B Q4_K_M | tg128@d32768 | 59.65 | 60.70 | 1.02 |
| qwen35moe 35B.A3B Q4_K_S | tg128 | 194.52 | 206.56 | 1.06 |
| qwen35moe 35B.A3B Q4_K_S | tg128@d16384 | 184.11 | 196.36 | 1.07 |
| qwen35moe 35B.A3B Q4_K_S | tg128@d32768 | 175.95 | 187.60 | 1.07 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for general understanding of the scheduler

@am17an am17an requested review from a team and ggerganov as code owners April 11, 2026 08:37
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 11, 2026
Contributor

@JohannesGaessler JohannesGaessler left a comment


This would work from a llama.cpp perspective but I'm not convinced it's the right way to handle this from a ggml perspective. In my opinion we should attach a ggml_guid and write the code around that. With a GUID the contract for "user code" would be "if you change the graph you must change the GUID", it's not clear to me what the correct way to use a reused flag would be.
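The GUID contract described in this comment could be sketched as below. This is a hedged illustration of the suggested design, not existing ggml code; all names here are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the suggested GUID contract: user code must refresh
// the GUID whenever it mutates the graph, and the backend compares GUIDs
// instead of re-checking node properties.
struct graph_with_guid {
    uint64_t guid    = 0;
    int      n_nodes = 0;
};

uint64_t next_guid() {
    static uint64_t counter = 0;  // a real implementation would make this atomic
    return ++counter;
}

// Contract: user code calls this after any mutation of the graph.
void graph_mutated(graph_with_guid & g) {
    g.guid = next_guid();
}

// Backend side: reuse cached state only if the GUID is unchanged.
bool can_reuse_cached(uint64_t last_guid, const graph_with_guid & g) {
    return last_guid != 0 && last_guid == g.guid;
}
```

The design choice here is that the invariant ("changed graph implies changed GUID") is owned by the code that mutates the graph, which makes the backend's reuse check a single integer comparison.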


```cpp
if (cgraph->version != 0 &&
    cgraph->version == graph->last_graph_version &&
    (int)graph->node_props.size() == cgraph->n_nodes) {
```
Contributor


Will there ever be a case where versions match, but node count doesn't?
If not, maybe we can make this an assert?

Contributor Author


Yes, I will add an assert; it should not happen.
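The reviewer's suggestion could look roughly like this. The structure and names below are simplified stand-ins for the quoted check, used only to illustrate turning the node-count comparison into an assert:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the quoted reuse check: if the version matches,
// the node count must also match, so the count comparison becomes an assert
// rather than part of the reuse condition.
struct props_check_sim {
    uint64_t version;
    int      n_nodes;
};

bool graph_unchanged(uint64_t last_version, const std::vector<int> & node_props,
                     const props_check_sim & cgraph) {
    if (cgraph.version != 0 && cgraph.version == last_version) {
        // Matching versions imply an identical graph, so a node-count
        // mismatch would indicate a bug, not a normal graph change.
        assert((int) node_props.size() == cgraph.n_nodes);
        return true;
    }
    return false;
}
```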

@am17an
Contributor Author

am17an commented Apr 11, 2026

I added a version number; currently we don't have a GUID generator in master. If the current approach is acceptable I can add a GUID generator, but I think a simple static version counter is also fine. The idea of versioning is that anything that mutates the cgraph should increment the version number.

```cpp
}

for (int i = 0; i < sched->n_splits; i++) {
    sched->splits[i].graph.version = graph->version;
```
Contributor


Should all splits really be the same uuid?

Contributor Author


I don't see any additional benefit to a per-split uuid as of now.

Contributor Author


I think you're right, though; it makes sense to have a per-split uuid.
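The per-split idea could be sketched as follows. This is a hypothetical illustration (split_sim and next_version are made-up names, not the ggml scheduler API): each split draws its own id instead of inheriting the parent graph's version, so a backend caching state per split can never confuse one split's graph with another's.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical global counter handing out distinct version ids.
static uint64_t g_version_counter = 0;

uint64_t next_version() {
    return ++g_version_counter;  // a real implementation would make this atomic
}

struct split_sim {
    uint64_t version = 0;
};

// Give every split its own version id rather than copying the parent's.
void assign_split_versions(std::vector<split_sim> & splits) {
    for (auto & s : splits) {
        s.version = next_version();
    }
}
```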

```c
void ggml_print_backtrace(void);

static inline uint64_t ggml_graph_next_version(void) {
    static uint64_t counter = 0;
```
Contributor


This should be atomic
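A minimal sketch of the suggested fix, assuming the counter only needs to be unique rather than ordered. The actual header is C (where one would use C11 atomics or the ggml threading helpers); this C++ version is for illustration only:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomic variant of the version counter: concurrent graph builds can no
// longer hand out the same version twice. Relaxed ordering suffices because
// only uniqueness of the returned value matters, not ordering of other
// memory operations around it.
static inline uint64_t ggml_graph_next_version_atomic(void) {
    static std::atomic<uint64_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}
```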

