[Mirror] model : refactor QKV into common build_qkv and create_tensor_qkv helpers #94
ngxson wants to merge 2 commits into ngxson:master
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable in the CodeRabbit settings.
📝 Walkthrough
Add a new POD Q/K/V bundle (`llm_graph_qkv`) and route per-model QKV construction through shared `build_qkv` / `create_tensor_qkv` helpers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as Model (per-layer)
    participant Graph as llm_graph_context::build_qkv
    participant GGML as GGML ops
    participant Attn as Attention (RoPE / build_attn)
    Model->>Graph: build_qkv(layer, cur, n_embd_head, n_head, n_head_kv, il, f_clamp_kqv)
    Graph->>GGML: build_lora_mm (fused wqkv OR separate wq/wk/wv)
    GGML-->>Graph: projected tensor(s)
    Graph->>GGML: optional ggml_add (bias) / ggml_clamp
    Graph->>GGML: ggml_view_3d / ggml_reshape_3d => Q, K, V
    Graph-->>Model: return {Qcur, Kcur, Vcur}
    Model->>GGML: ggml_rope_ext on Qcur/Kcur (and optional norm)
    Model->>Attn: build_attn(Qcur, Kcur, Vcur, ...)
    Attn-->>Model: attention output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/models/refact.cpp (1)
27-32: ⚠️ Potential issue | 🟡 Minor — Remove the now-duplicated `cb()` calls in this block.
`build_qkv()` already emits `cb(Qcur/Kcur/Vcur)` before returning, and `cb` is the hook used for custom tensor logic such as allocation/offloading. With no tensor mutation between the helper call and lines 30-32, those callbacks now run twice on the same tensors here.
✂️ Suggested cleanup
```diff
 auto [Qcur, Kcur, Vcur] = build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il);
-
-cb(Qcur, "Qcur", il);
-cb(Kcur, "Kcur", il);
-cb(Vcur, "Vcur", il);

 cur = build_attn(inp_attn, model.layers[il].wo, NULL,
     Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/refact.cpp` around lines 27 - 32, The three redundant cb() invocations after build_qkv are duplicating work because build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) already calls cb(Qcur/Kcur/Vcur) before returning; remove the lines cb(Qcur, "Qcur", il); cb(Kcur, "Kcur", il); cb(Vcur, "Vcur", il); so the tensors from build_qkv are only passed through the hook once (leave build_qkv, Qcur/Kcur/Vcur, cb, and the surrounding call-site with model.layers[il] and cur unchanged).
🧹 Nitpick comments (5)
src/models/llada.cpp (1)
32-34: Update the stale projection comment.
After this refactor, the block no longer guarantees "separate" projections or "without bias": `build_qkv()` now handles both fused and separate QKV paths and applies optional biases when present. The comment above misdescribes the implementation.
📝 Suggested wording
```diff
-    // compute separate Q, K, V projections without bias, matching LLaDALlamaBlock
+    // build Q/K/V projections via the shared helper, then apply RoPE
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/llada.cpp` around lines 32 - 34, The existing comment above the build_qkv call is stale — it states projections are "separate" and "without bias" even though build_qkv now handles both fused and separate QKV paths and applies optional biases; update the inline comment above the call to build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) to accurately state that build_qkv produces Q, K, V projections and supports fused/separate QKV implementations and optional biases when present (i.e., neutral, factual description referencing build_qkv behavior instead of asserting "separate" or "without bias").

src/models/starcoder.cpp (1)
35-36: Consider restoring the per-tensor callbacks here.
The helper swap looks fine functionally, but this builder now skips the `Qcur`/`Kcur`/`Vcur` traces that the other touched builders still emit, which makes graph-dump comparisons less consistent.
Suggested change
```diff
 auto [Qcur, Kcur, Vcur] = build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il);
+
+cb(Qcur, "Qcur", il);
+cb(Kcur, "Kcur", il);
+cb(Vcur, "Vcur", il);

 cur = build_attn(inp_attn,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/starcoder.cpp` around lines 35 - 36, The call to build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) bypasses the per-tensor tracing/callbacks for Qcur/Kcur/Vcur that other builders still emit, causing inconsistent graph dumps; restore those per-tensor callbacks after the helper swap by invoking the same trace/callback functions used elsewhere (the ones that record Qcur, Kcur, Vcur) immediately after build_qkv returns (or integrate the tracing into build_qkv), ensuring the unique symbols Qcur, Kcur, Vcur and the builder path that emits tensor traces are preserved so graph-dump comparisons remain consistent.

src/models/falcon-h1.cpp (1)
30-31: Make the K/V head-width invariant explicit here.
`build_qkv()` now receives one `n_embd_head` for all three projections, but this constructor no longer documents that Falcon-H1's K and V head widths match. A local `GGML_ASSERT(n_embd_head == hparams.n_embd_head_k())` near the declaration would make future asymmetric variants fail fast instead of being reshaped silently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/falcon-h1.cpp` around lines 30 - 31, Add an explicit invariant check that the K/V head width matches the Q head width before calling build_qkv: insert a GGML_ASSERT comparing n_embd_head and hparams.n_embd_head_k() (e.g., GGML_ASSERT(n_embd_head == hparams.n_embd_head_k())) near the declaration/assignment of n_embd_head (immediately before the build_qkv(...) call) so any future asymmetric-head changes fail fast instead of silently reshaping.

src/models/gemma2-iswa.cpp (1)
34-35: Assert the symmetric K/V head-width assumption before using the shared helper.
This file now funnels Q/K/V through a single `n_embd_head`, but it does not guard that `hparams.n_embd_head_v()` matches the K head width chosen here. Adding a local assert would document that Gemma2-ISWA still satisfies the helper's reshape assumption.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/gemma2-iswa.cpp` around lines 34 - 35, Add a local assertion before calling build_qkv to verify the symmetric K/V head-width assumption: check that hparams.n_embd_head_v() equals the K head width used here (n_head_kv) so the shared helper’s reshape assumptions hold; place the assert near the call site that invokes build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) and reference hparams.n_embd_head_v(), n_head_kv, n_embd_head and build_qkv in the message for clarity.

src/llama-graph.cpp (1)
1027-1052: Minor: Duplicate callbacks with identical names in separate Q/K/V path.
In the separate path, `cb()` is called 2-3 times per tensor with the same name `"Qcur"`/`"Kcur"`/`"Vcur"`:
- After `build_lora_mm` (lines 1028, 1034, 1040)
- After the bias add, if a bias exists (lines 1031, 1037, 1043)
- After the reshape at the end (lines 1050-1052)
In contrast, the fused path only has the final callbacks. This inconsistency could cause confusion when debugging tensor states or using callbacks for profiling.
Consider removing the intermediate callbacks in the separate path to match the fused path behavior, or use distinct names (e.g., `"Qcur_mm"`, `"Qcur_bias"`) for intermediate states.
♻️ Optional: Remove intermediate callbacks to match fused path
```diff
 } else { // separate Q/K/V path
     Qcur = build_lora_mm(layer.wq, cur, layer.wq_s);
-    cb(Qcur, "Qcur", il);
     if (layer.bq) {
         Qcur = ggml_add(ctx0, Qcur, layer.bq);
-        cb(Qcur, "Qcur", il);
     }
     Kcur = build_lora_mm(layer.wk, cur, layer.wk_s);
-    cb(Kcur, "Kcur", il);
     if (layer.bk) {
         Kcur = ggml_add(ctx0, Kcur, layer.bk);
-        cb(Kcur, "Kcur", il);
     }
     Vcur = build_lora_mm(layer.wv, cur, layer.wv_s);
-    cb(Vcur, "Vcur", il);
     if (layer.bv) {
         Vcur = ggml_add(ctx0, Vcur, layer.bv);
-        cb(Vcur, "Vcur", il);
     }
     Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
     Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
     Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama-graph.cpp` around lines 1027 - 1052, The cb() calls for intermediate Qcur/Kcur/Vcur states are duplicated; after build_lora_mm and after optional ggml_add the code calls cb() with the same names and then again after ggml_reshape_3d, causing inconsistent behavior versus the fused path. Remove the intermediate cb(...) invocations that follow build_lora_mm and the ones immediately after ggml_add for Qcur, Kcur, and Vcur, leaving only the final cb(...) calls after ggml_reshape_3d for each tensor (references: functions/vars build_lora_mm, ggml_add, ggml_reshape_3d, and cb with symbols Qcur/Kcur/Vcur).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/llama-model.cpp`:
- Around line 2723-2737: The helper create_tensor_qkv currently ignores the
incoming flags (it uses hardcoded TENSOR_NOT_REQUIRED / TENSOR_SKIP_IF_VIRTUAL)
which causes layers that should be skipped (e.g., NextN with TENSOR_SKIP) to
materialize QKV; update create_tensor_qkv so all create_tensor calls (the merged
path layer.wqkv and its bias layer.bqkv, and the per-split path layer.wq/wk/wv
and their biases layer.bq/bk/bv) pass through the caller-provided flags (or
combine caller flags with TENSOR_NOT_REQUIRED where appropriate) instead of the
hardcoded constants so callers relying on flags (from sites using
create_tensor_qkv) preserve TENSOR_SKIP behavior.
In `@src/models/afmoe.cpp`:
- Around line 44-45: The code redundantly reshapes Vcur after build_qkv; remove
the extra reshape and pass Vcur directly to build_attn (i.e., stop doing the
Vcur.reshape(...) step and use the Vcur returned by build_qkv when calling
build_attn). Update the call site where build_attn(...) currently receives the
reshaped Vcur to use the original Vcur variable (from auto [Qcur, Kcur, Vcur] =
build_qkv(...)), and ensure the function signatures (build_qkv, build_attn)
still match expected 3D tensor shapes (n_head_kv, n_embd_head, seq_len) after
this change.
In `@src/models/mimo2-iswa.cpp`:
- Around line 27-28: The call to build_qkv is using global head counts (n_head,
n_head_kv) but Mimo2-ISWA needs per-layer counts; update the call that builds
Qcur/Kcur/Vcur so it passes hparams.n_head(il) and hparams.n_head_kv(il)
(keeping n_embd_head_k, model.layers[il], cur, il unchanged) so each layer uses
its layer-specific head and head_kv values as done in granite.cpp.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2a04487a-f98a-4ac4-a410-11dc7aa590d3
📒 Files selected for processing (80)
src/llama-graph.cpp, src/llama-graph.h, src/llama-model.cpp, src/models/afmoe.cpp, src/models/apertus.cpp, src/models/arcee.cpp, src/models/arctic.cpp, src/models/baichuan.cpp, src/models/bailingmoe.cpp, src/models/bailingmoe2.cpp, src/models/bitnet.cpp, src/models/bloom.cpp, src/models/chameleon.cpp, src/models/chatglm.cpp, src/models/codeshell.cpp, src/models/cohere2-iswa.cpp, src/models/command-r.cpp, src/models/deci.cpp, src/models/deepseek.cpp, src/models/dots1.cpp, src/models/dream.cpp, src/models/ernie4-5-moe.cpp, src/models/ernie4-5.cpp, src/models/eurobert.cpp, src/models/exaone-moe.cpp, src/models/exaone.cpp, src/models/exaone4.cpp, src/models/falcon-h1.cpp, src/models/falcon.cpp, src/models/gemma-embedding.cpp, src/models/gemma.cpp, src/models/gemma2-iswa.cpp, src/models/gemma3.cpp, src/models/glm4-moe.cpp, src/models/glm4.cpp, src/models/gpt2.cpp, src/models/gptneox.cpp, src/models/granite.cpp, src/models/grok.cpp, src/models/grovemoe.cpp, src/models/hunyuan-dense.cpp, src/models/hunyuan-moe.cpp, src/models/internlm2.cpp, src/models/jais.cpp, src/models/jais2.cpp, src/models/jamba.cpp, src/models/llada-moe.cpp, src/models/llada.cpp, src/models/llama-iswa.cpp, src/models/llama.cpp, src/models/maincoder.cpp, src/models/mimo2-iswa.cpp, src/models/mistral3.cpp, src/models/modern-bert.cpp, src/models/nemotron.cpp, src/models/neo-bert.cpp, src/models/openai-moe-iswa.cpp, src/models/orion.cpp, src/models/paddleocr.cpp, src/models/pangu-embedded.cpp, src/models/phi2.cpp, src/models/phi3.cpp, src/models/plamo.cpp, src/models/qwen.cpp, src/models/qwen2.cpp, src/models/qwen2moe.cpp, src/models/qwen2vl.cpp, src/models/qwen3.cpp, src/models/qwen3moe.cpp, src/models/qwen3vl-moe.cpp, src/models/qwen3vl.cpp, src/models/refact.cpp, src/models/rnd1.cpp, src/models/seed-oss.cpp, src/models/smallthinker.cpp, src/models/smollm3.cpp, src/models/stablelm.cpp, src/models/starcoder.cpp, src/models/starcoder2.cpp, src/models/xverse.cpp
Force-pushed bcc69fd to 42eae08 (Compare)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/models/smallthinker.cpp (1)
48-59: ⚠️ Potential issue | 🟡 Minor — Add missing `cb(Vcur, "Vcur", il)` callback.
The `Vcur` variable is constructed via `build_qkv()` and used in `build_attn()`, but lacks an instrumentation callback. Every other model file in `src/models/` includes this callback after `build_qkv()`. Add `cb(Vcur, "Vcur", il);` for consistency, placed after line 59, alongside the existing `cb(Qcur, ...)` and `cb(Kcur, ...)` calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/smallthinker.cpp` around lines 48 - 59, The code calls build_qkv(...) which produces Vcur but never instruments it; add a callback invocation for Vcur to match Qcur and Kcur instrumentation by inserting cb(Vcur, "Vcur", il) immediately after the existing cb(Kcur, "Kcur", il) call so Vcur is logged/instrumented before it is used (e.g., in build_attn()).
🧹 Nitpick comments (2)
src/llama-graph.h (2)
701-705: Default-initialize the Q/K/V bundle.
`llm_graph_qkv` is the only new pointer bundle here that does not start from a known state. Initializing these members to `nullptr` makes future early-return paths in `build_qkv()` fail obviously instead of handing garbage pointers to the callers.
♻️ Suggested tweak
```diff
 struct llm_graph_qkv {
-    ggml_tensor * q; // [n_embd_head, n_head,    n_tokens]
-    ggml_tensor * k; // [n_embd_head, n_head_kv, n_tokens]
-    ggml_tensor * v; // [n_embd_head, n_head_kv, n_tokens]
+    ggml_tensor * q = nullptr; // [n_embd_head, n_head,    n_tokens]
+    ggml_tensor * k = nullptr; // [n_embd_head, n_head_kv, n_tokens]
+    ggml_tensor * v = nullptr; // [n_embd_head, n_head_kv, n_tokens]
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama-graph.h` around lines 701 - 705, The llm_graph_qkv struct's pointer members q, k, v must be default-initialized to nullptr so early-return paths in build_qkv() don't propagate uninitialized garbage; update the struct llm_graph_qkv to initialize q = nullptr, k = nullptr, v = nullptr (or use an inline constructor/default member initializers) so any new instance starts in a known null state.
794-802: Add a precondition assertion or expand the signature for separate K/V head widths.
`llm_graph_context` tracks `n_embd_head_k` and `n_embd_head_v` separately, but this function accepts only a single `n_embd_head` and uses it for both K and V projections. The implementation lacks a `GGML_ASSERT` to enforce the implicit precondition that K and V must share the same per-head width. Either add an assertion in `src/llama-graph.cpp::llm_graph_context::build_qkv()` to enforce equal K/V widths, or extend the signature to accept separate `n_embd_head_k` and `n_embd_head_v` parameters.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama-graph.h` around lines 794 - 802, The build_qkv API accepts a single n_embd_head but llm_graph_context tracks n_embd_head_k and n_embd_head_v separately; add a precondition or expand the signature to fix this: either update llm_graph_context::build_qkv (and its declaration llm_graph_qkv build_qkv(...)) to take separate int64_t n_embd_head_k and int64_t n_embd_head_v parameters, or keep the single n_embd_head and add a GGML_ASSERT in build_qkv implementation that verifies n_embd_head == n_embd_head_k && n_embd_head == n_embd_head_v (use the context’s n_embd_head_k and n_embd_head_v symbols) so K and V per-head widths are enforced. Ensure callers are updated if you change the signature.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 939bff27-0530-4317-a513-3629f47b85d5
✅ Files skipped from review due to trivial changes (14)
- src/models/qwen2moe.cpp
- src/models/starcoder2.cpp
- src/models/jamba.cpp
- src/models/glm4-moe.cpp
- src/models/grovemoe.cpp
- src/models/cohere2-iswa.cpp
- src/models/bailingmoe.cpp
- src/models/mimo2-iswa.cpp
- src/models/deepseek.cpp
- src/models/bailingmoe2.cpp
- src/models/mistral3.cpp
- src/models/llada-moe.cpp
- src/models/command-r.cpp
- src/models/smollm3.cpp
🚧 Files skipped from review as they are similar to previous changes (32)
- src/models/gemma3.cpp
- src/models/ernie4-5-moe.cpp
- src/models/gemma.cpp
- src/models/eurobert.cpp
- src/models/gptneox.cpp
- src/models/exaone4.cpp
- src/models/plamo.cpp
- src/models/bloom.cpp
- src/models/qwen3vl.cpp
- src/models/jais2.cpp
- src/models/stablelm.cpp
- src/models/hunyuan-moe.cpp
- src/models/gemma-embedding.cpp
- src/models/apertus.cpp
- src/models/paddleocr.cpp
- src/models/qwen2.cpp
- src/models/qwen.cpp
- src/models/qwen3.cpp
- src/models/exaone-moe.cpp
- src/models/granite.cpp
- src/models/afmoe.cpp
- src/models/nemotron.cpp
- src/models/falcon-h1.cpp
- src/models/qwen3vl-moe.cpp
- src/models/xverse.cpp
- src/models/qwen2vl.cpp
- src/models/jais.cpp
- src/llama-graph.cpp
- src/models/seed-oss.cpp
- src/models/neo-bert.cpp
- src/llama-model.cpp
- src/models/internlm2.cpp
Force-pushed 42eae08 to 75d759d (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/models/smallthinker.cpp (1)
48-63: ⚠️ Potential issue | 🟡 Minor — Restore the final `Vcur` callback in this block.
`Qcur` and `Kcur` are still reported after the optional RoPE step, but `Vcur` is no longer emitted here. That makes tensor tracing inconsistent in what is otherwise a no-op refactor.
Suggested fix
```diff
 cb(Qcur, "Qcur", il);
 cb(Kcur, "Kcur", il);
+cb(Vcur, "Vcur", il);

 cur = build_attn(inp_attn,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/smallthinker.cpp` around lines 48 - 63, The Vcur tensor callback was accidentally removed after building Qcur/Kcur, causing inconsistent tracing; re-add a cb(Vcur, "Vcur", il) right after the Qcur/Kcur callbacks (after the optional use_rope block) so all three Qcur, Kcur, Vcur are reported before calling build_attn; locate the block where build_qkv(...) returns Qcur, Kcur, Vcur and where cb(Qcur, "Qcur", il) and cb(Kcur, "Kcur", il) are invoked and insert the Vcur callback in the same position.

src/models/hunyuan-dense.cpp (1)
46-59: ⚠️ Potential issue | 🟠 Major — Callback placement for `Kcur` is inconsistent with `Qcur` and with the pattern in `llama.cpp`.
The `cb(Kcur, "Kcur", il)` is called before RoPE is applied to `Kcur` (lines 50-54), while `cb(Qcur, "Qcur", il)` is called after RoPE is applied to `Qcur`. The `llama.cpp` model applies both RoPE operations first, then performs all callbacks in sequence (`cb(Qcur)`, `cb(Kcur)`, `cb(Vcur)`), maintaining symmetry.
Move `cb(Kcur, "Kcur", il)` to after the Kcur RoPE application (after line 54) to align with the Qcur pattern and match the established convention in `llama.cpp`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/hunyuan-dense.cpp` around lines 46 - 59, The Kcur callback is invoked before applying RoPE, breaking symmetry with Qcur and llama.cpp; move the cb(Kcur, "Kcur", il) call so it runs after Kcur is passed through ggml_rope_ext (and after build_norm if you prefer the post-norm callback pattern used for Qcur/Vcur) — locate the Kcur handling around ggml_rope_ext, build_norm, and the existing cb calls and reposition the cb(Kcur, "Kcur", il) to follow the RoPE application (and normalization) for model.layers[il].attn_k_norm.
♻️ Duplicate comments (1)
src/llama-model.cpp (1)
2723-2736: ⚠️ Potential issue | 🟠 Major — Propagate caller `flags` through `create_tensor_qkv` again.
This regresses the earlier skip-flag fix: `layer.wqkv`/`layer.bqkv` still ignore `flags`, and the split biases still drop them too. That means the NextN skip callers at lines 5211, 5269, and 5639 can still materialize QKV tensors in layers that were supposed to stay skipped.
🛠️ Proposed fix
```diff
 auto create_tensor_qkv = [&](llama_layer & layer, int bid, int64_t n_embd_, int64_t n_embd_q_, int64_t n_embd_k_, int64_t n_embd_v_, int flags) {
     const int64_t n_embd_qkv = n_embd_q_ + n_embd_k_ + n_embd_v_;
-    layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", bid), {n_embd_, n_embd_qkv}, TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL);
+    const int qkv_flags = flags | TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL;
+    layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", bid), {n_embd_, n_embd_qkv}, qkv_flags);
     if (layer.wqkv) {
-        layer.bqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "bias", bid), {n_embd_qkv}, TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL);
+        layer.bqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "bias", bid), {n_embd_qkv}, qkv_flags);
     } else {
         layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", bid), {n_embd_, n_embd_q_}, flags);
         layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", bid), {n_embd_, n_embd_k_}, flags);
         layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", bid), {n_embd_, n_embd_v_}, flags);
-        layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", bid), {n_embd_q_}, TENSOR_NOT_REQUIRED);
-        layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", bid), {n_embd_k_}, TENSOR_NOT_REQUIRED);
-        layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", bid), {n_embd_v_}, TENSOR_NOT_REQUIRED);
+        layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", bid), {n_embd_q_}, flags | TENSOR_NOT_REQUIRED);
+        layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", bid), {n_embd_k_}, flags | TENSOR_NOT_REQUIRED);
+        layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", bid), {n_embd_v_}, flags | TENSOR_NOT_REQUIRED);
     }
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama-model.cpp` around lines 2723 - 2736, The helper create_tensor_qkv is ignoring the caller-provided flags for combined QKV and split bias tensors, which breaks the skip behavior; update the create_tensor calls so they use the incoming flags parameter rather than hard-coded TENSOR_NOT_REQUIRED (and TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL) – specifically, pass flags into create_tensor for layer.wqkv and layer.bqkv (instead of TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL) and for layer.bq, layer.bk, layer.bv (instead of TENSOR_NOT_REQUIRED); keep using flags for layer.wq, layer.wk, layer.wv as already done so the skip/virtual semantics propagate correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/llama-graph.cpp`:
- Around line 1048-1098: The helper llm_graph_context::build_qkv assumes Q/K/V
all use the same per-head width (n_embd_head) and collapses/reshapes V using
that width, which breaks asymmetric V heads; update the function to accept a
separate n_embd_head_v (or n_embd_k and n_embd_v) parameter and use
n_embd_head_v when computing Vcur’s ggml_view_3d/ggml_reshape_3d and its byte
offsets (and when computing n_embd_kv/n_embd_q if needed), or if asymmetric
heads are unsupported, add an explicit assert at the start of build_qkv that
n_embd_head == n_embd_head_v to fail fast; refer to
llm_graph_context::build_qkv, Qcur/Kcur/Vcur, n_embd_head, n_embd_kv, and the
ggml_view_3d/ggml_reshape_3d calls to locate the changes.
In `@src/llama-model.cpp`:
- Around line 7012-7015: The generic create_tensor_qkv helper must not be used
for asymmetric/doubled-Q attention layouts (e.g., where Q size = n_embd_head_k *
n_head * 2 or where a separate wqkv_gate is used); update the code paths around
the non-recurrent attention branch so instead of calling
create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, ...), explicitly
create the independent Q, K, V (and gate if present) tensors via create_tensor
(and preserve layer.wo creation) so the custom split logic in Qwen35, Qwen35MoE,
Qwen3Next, and Step35 graph builders remains compatible and no fused wqkv is
produced by create_tensor_qkv.
---
Outside diff comments:
In `@src/models/hunyuan-dense.cpp`:
- Around line 46-59: The Kcur callback is invoked before applying RoPE, breaking
symmetry with Qcur and llama.cpp; move the cb(Kcur, "Kcur", il) call so it runs
after Kcur is passed through ggml_rope_ext (and after build_norm if you prefer
the post-norm callback pattern used for Qcur/Vcur) — locate the Kcur handling
around ggml_rope_ext, build_norm, and the existing cb calls and reposition the
cb(Kcur, "Kcur", il) to follow the RoPE application (and normalization) for
model.layers[il].attn_k_norm.
In `@src/models/smallthinker.cpp`:
- Around line 48-63: The Vcur tensor callback was accidentally removed after
building Qcur/Kcur, causing inconsistent tracing; re-add a cb(Vcur, "Vcur", il)
right after the Qcur/Kcur callbacks (after the optional use_rope block) so all
three Qcur, Kcur, Vcur are reported before calling build_attn; locate the block
where build_qkv(...) returns Qcur, Kcur, Vcur and where cb(Qcur, "Qcur", il) and
cb(Kcur, "Kcur", il) are invoked and insert the Vcur callback in the same
position.
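Both callback notes above reduce to the same invariant: each of Qcur/Kcur/Vcur gets exactly one `cb()`, fired after its last transformation. A toy C++ check of that invariant (the event strings and `callbacks_ok` are illustrative, not llama.cpp API):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Given an ordered event log from graph construction, verify that each of
// Qcur/Kcur/Vcur has exactly one "cb:" event and that Kcur's callback fires
// after its "rope:" event, mirroring the review's requested ordering.
bool callbacks_ok(const std::vector<std::string> & events) {
    for (const std::string t : {"Qcur", "Kcur", "Vcur"}) {
        if (std::count(events.begin(), events.end(), "cb:" + t) != 1) {
            return false;
        }
    }
    auto rope_k = std::find(events.begin(), events.end(), std::string("rope:Kcur"));
    auto cb_k   = std::find(events.begin(), events.end(), std::string("cb:Kcur"));
    return rope_k == events.end() || rope_k < cb_k; // cb after rope, if rope exists
}
```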
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 15e971bd-0a6e-4873-9e72-a0fd7bd74c38
📒 Files selected for processing (80)
src/llama-graph.cpp, src/llama-graph.h, src/llama-model.cpp, src/models/afmoe.cpp, src/models/apertus.cpp, src/models/arcee.cpp, src/models/arctic.cpp, src/models/baichuan.cpp, src/models/bailingmoe.cpp, src/models/bailingmoe2.cpp, src/models/bitnet.cpp, src/models/bloom.cpp, src/models/chameleon.cpp, src/models/chatglm.cpp, src/models/codeshell.cpp, src/models/cohere2-iswa.cpp, src/models/command-r.cpp, src/models/deci.cpp, src/models/deepseek.cpp, src/models/dots1.cpp, src/models/dream.cpp, src/models/ernie4-5-moe.cpp, src/models/ernie4-5.cpp, src/models/eurobert.cpp, src/models/exaone-moe.cpp, src/models/exaone.cpp, src/models/exaone4.cpp, src/models/falcon-h1.cpp, src/models/falcon.cpp, src/models/gemma-embedding.cpp, src/models/gemma.cpp, src/models/gemma2-iswa.cpp, src/models/gemma3.cpp, src/models/glm4-moe.cpp, src/models/glm4.cpp, src/models/gpt2.cpp, src/models/gptneox.cpp, src/models/granite.cpp, src/models/grok.cpp, src/models/grovemoe.cpp, src/models/hunyuan-dense.cpp, src/models/hunyuan-moe.cpp, src/models/internlm2.cpp, src/models/jais.cpp, src/models/jais2.cpp, src/models/jamba.cpp, src/models/llada-moe.cpp, src/models/llada.cpp, src/models/llama-iswa.cpp, src/models/llama.cpp, src/models/maincoder.cpp, src/models/mimo2-iswa.cpp, src/models/mistral3.cpp, src/models/modern-bert.cpp, src/models/nemotron.cpp, src/models/neo-bert.cpp, src/models/openai-moe-iswa.cpp, src/models/orion.cpp, src/models/paddleocr.cpp, src/models/pangu-embedded.cpp, src/models/phi2.cpp, src/models/phi3.cpp, src/models/plamo.cpp, src/models/qwen.cpp, src/models/qwen2.cpp, src/models/qwen2moe.cpp, src/models/qwen2vl.cpp, src/models/qwen3.cpp, src/models/qwen3moe.cpp, src/models/qwen3vl-moe.cpp, src/models/qwen3vl.cpp, src/models/refact.cpp, src/models/rnd1.cpp, src/models/seed-oss.cpp, src/models/smallthinker.cpp, src/models/smollm3.cpp, src/models/stablelm.cpp, src/models/starcoder.cpp, src/models/starcoder2.cpp, src/models/xverse.cpp
✅ Files skipped from review due to trivial changes (10)
- src/models/glm4-moe.cpp
- src/models/qwen2moe.cpp
- src/models/bailingmoe.cpp
- src/models/gemma.cpp
- src/models/command-r.cpp
- src/models/paddleocr.cpp
- src/models/arcee.cpp
- src/models/neo-bert.cpp
- src/models/starcoder.cpp
- src/models/ernie4-5.cpp
🚧 Files skipped from review as they are similar to previous changes (37)
- src/models/falcon-h1.cpp
- src/models/gemma3.cpp
- src/models/codeshell.cpp
- src/models/chatglm.cpp
- src/models/exaone4.cpp
- src/models/bloom.cpp
- src/models/grovemoe.cpp
- src/models/internlm2.cpp
- src/models/arctic.cpp
- src/models/ernie4-5-moe.cpp
- src/models/apertus.cpp
- src/models/llada-moe.cpp
- src/models/modern-bert.cpp
- src/models/plamo.cpp
- src/models/smollm3.cpp
- src/models/exaone-moe.cpp
- src/models/seed-oss.cpp
- src/models/llama.cpp
- src/models/gptneox.cpp
- src/models/gemma-embedding.cpp
- src/models/nemotron.cpp
- src/models/qwen2vl.cpp
- src/models/rnd1.cpp
- src/models/jais2.cpp
- src/llama-graph.h
- src/models/phi2.cpp
- src/models/qwen.cpp
- src/models/qwen3vl.cpp
- src/models/eurobert.cpp
- src/models/qwen3vl-moe.cpp
- src/models/deepseek.cpp
- src/models/mistral3.cpp
- src/models/glm4.cpp
- src/models/xverse.cpp
- src/models/gemma2-iswa.cpp
- src/models/stablelm.cpp
- src/models/cohere2-iswa.cpp
```diff
 if (!hparams.is_recurrent(i)) {
     // Attention layers
-    layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", i), { n_embd, n_embd_head_k * n_head * 2 }, 0);
-    layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), { n_embd, n_embd_k_gqa }, 0);
-    layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), { n_embd, n_embd_v_gqa }, 0);
+    create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, n_embd_k_gqa, n_embd_v_gqa, 0);
     layer.wo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "weight", i), { n_embd_head_k * n_head, n_embd }, 0);
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
fd -i '(qwen35|qwen3next|step35).*\.cpp$' src/models | while read -r f; do
echo "=== $f ==="
rg -n -C3 'build_qkv\(|wqkv|wqkv_gate|ATTN_GATE|build_lora_mm' "$f"
echo
done
Repository: ngxson/llama.cpp
Length of output: 10471
🏁 Script executed:
# Check the specific lines in llama-model.cpp
# First, let's see lines around 7012-7015
sed -n '7010,7020p' src/llama-model.cpp
# Then 7076-7079
echo "---"
sed -n '7074,7084p' src/llama-model.cpp
# Then 7139-7142
echo "---"
sed -n '7137,7147p' src/llama-model.cpp
# Then 7239-7240
echo "---"
sed -n '7237,7245p' src/llama-model.cpp
Repository: ngxson/llama.cpp
Length of output: 3178
🏁 Script executed:
# Find which model architectures these code paths apply to
# Check context around line 7012
sed -n '6990,7020p' src/llama-model.cpp | head -50
# Search for model type conditions around these lines
rg -n 'LLM_ARCH.*QWEN|case LLM_ARCH' src/llama-model.cpp | head -30
Repository: ngxson/llama.cpp
Length of output: 3136
🏁 Script executed:
# Find which case statement contains line 7012
sed -n '6800,7012p' src/llama-model.cpp | grep -n 'case LLM_ARCH' | tail -1
# Let's get the full context by searching backwards from line 7012
sed -n '6900,7012p' src/llama-model.cpp | grep -E 'case LLM_ARCH|break;' | tail -5
# More direct approach - look at the broader context
sed -n '6950,7012p' src/llama-model.cpp | grep -E 'case LLM_ARCH|^ }'
Repository: ngxson/llama.cpp
Length of output: 263
🏁 Script executed:
# Find architecture for line 7076-7079
sed -n '6950,7076p' src/llama-model.cpp | grep 'case LLM_ARCH' | tail -1
# Find architecture for line 7139-7142
sed -n '7050,7139p' src/llama-model.cpp | grep 'case LLM_ARCH' | tail -1
# Find architecture for line 7239-7240
sed -n '7150,7239p' src/llama-model.cpp | grep 'case LLM_ARCH' | tail -1
Repository: ngxson/llama.cpp
Length of output: 164
🏁 Script executed:
# Find and examine the create_tensor_qkv function
rg -n 'void create_tensor_qkv|create_tensor_qkv.*{' src/llama-model.cpp | head -5
# Get the function implementation
sed -n '6600,6700p' src/llama-model.cpp | grep -A 30 'create_tensor_qkv'
Repository: ngxson/llama.cpp
Length of output: 6429
🏁 Script executed:
# Find the function definition using a simpler pattern
rg -n 'void create_tensor_qkv' src/llama-model.cpp
# Or search for where it's defined
rg -n 'create_tensor_qkv.*layer.*i.*n_embd' src/llama-model.cpp | head -3
# Let's find the actual function definition
grep -n 'create_tensor_qkv' src/llama-model.cpp | grep -v '(' | head -5
grep -n 'inline.*create_tensor_qkv\|auto.*create_tensor_qkv\|void.*create_tensor_qkv' src/llama-model.cpp
Repository: ngxson/llama.cpp
Length of output: 474
🏁 Script executed:
# Get the implementation of create_tensor_qkv lambda
sed -n '2723,2770p' src/llama-model.cpp
Repository: ngxson/llama.cpp
Length of output: 2897
Do not wire doubled-Q and asymmetric-head layouts through the generic create_tensor_qkv helper.
The calls at lines 7014, 7078, 7141 with n_embd_head_k * n_head * 2 for Q, and line 7239 with separate wqkv_gate, represent non-standard attention layouts. The generic create_tensor_qkv function attempts to create a fused wqkv tensor first, which would be incompatible with the custom splitting logic in Qwen35, Qwen35MoE, Qwen3Next, and Step35 graph builders. These models are intentionally excluded from this refactor because their loaders require custom split paths for these asymmetric projections.
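The width mismatch can be seen with plain arithmetic. This hedged sketch (hypothetical function names; the assumption that the doubled Q carries a same-sized gating path alongside the queries is taken from the review's description of these architectures) shows that a generic [Q|K|V] split would place K and V at the wrong offsets:

```cpp
#include <cassert>
#include <cstdint>

// Standard fused width a generic QKV helper assumes: q + k + v.
int64_t fused_width_standard(int64_t head_k, int64_t head_v,
                             int64_t n_head, int64_t n_head_kv) {
    return head_k * n_head + head_k * n_head_kv + head_v * n_head_kv;
}

// Doubled-Q layout: the Q projection is twice as wide (queries plus an
// extra path of equal size), matching the n_embd_head_k * n_head * 2
// tensor shapes flagged above.
int64_t fused_width_doubled_q(int64_t head_k, int64_t head_v,
                              int64_t n_head, int64_t n_head_kv) {
    return head_k * n_head * 2 + head_k * n_head_kv + head_v * n_head_kv;
}
```

A loader slicing the fused tensor at the standard `n_embd_q` boundary would read the extra Q channels as if they were K/V data, so these architectures keep their explicit per-tensor split paths.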
Force-pushed from ccd1f60 to 67a8492 (Compare)
…e-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
Force-pushed from 67a8492 to d8bf733 (Compare)
Upstream PR: ggml-org#21245