[Mirror] model : refactor QKV into common build_qkv and create_tensor_qkv helpers #94

Open

ngxson wants to merge 2 commits into ngxson:master from JoursBleu:refactor/build-qkv-helper

Conversation

@ngxson (Owner) commented Apr 1, 2026

Upstream PR: ggml-org#21245

Summary by CodeRabbit

  • Refactor
    • Consolidated attention query/key/value preparation into a shared implementation across many model backends to reduce duplicated internal logic.
    • Streamlined internal tracing timing and per-layer instrumentation; no changes to model outputs, public APIs, or user-facing behavior and performance.

@coderabbitai bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Add a new POD struct llm_graph_qkv and a llm_graph_context::build_qkv(...) helper to centralize per-layer Q/K/V projection (handling the fused wqkv and separate wq/wk/wv paths, biases, optional clamping, and callbacks). Replace many per-model inline Q/K/V build sequences with calls to this helper, and prefer merged QKV weights during tensor loading.
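The fused-path slicing described in the walkthrough comes down to row-offset arithmetic over one projected tensor. The sketch below is hypothetical and only illustrates that arithmetic (the struct and function names are invented; the real build_qkv creates ggml views, it does not return offsets):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not the actual llama.cpp code: on the fused path,
// build_qkv projects once through wqkv and then views three slices out of
// the result, whose rows are laid out as [Q | K | V]. These are the row
// offsets such views would start at.
struct qkv_row_offsets {
    int64_t q; // Q starts at row 0
    int64_t k; // K starts after the n_embd_head * n_head rows of Q
    int64_t v; // V starts after the Q and K rows
};

qkv_row_offsets fused_qkv_row_offsets(int64_t n_embd_head, int64_t n_head, int64_t n_head_kv) {
    return {
        0,
        n_embd_head * n_head,
        n_embd_head * (n_head + n_head_kv),
    };
}
```

For a layout like n_embd_head = 128, n_head = 32, n_head_kv = 32, the K slice starts at row 4096 and the V slice at row 8192. Centralizing this kind of arithmetic is the byte-offset/stride math the "Wide cleanup" cohort removes from individual model files.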

Changes

Cohort / File(s) — Summary

• Graph core (src/llama-graph.h, src/llama-graph.cpp)
  Add the llm_graph_qkv struct and implement llm_graph_context::build_qkv(...) with fused (wqkv) and separate (wq/wk/wv) paths, bias handling, optional clamping, 3D view/reshape creation for Q/K/V, and instrumentation callback invocations.
• Model tensor loader (src/llama-model.cpp)
  Introduce create_tensor_qkv in llama_model::load_tensors to prefer merged QKV weights/biases (wqkv/bqkv) and fall back to separate wq/wk/wv plus their biases; replaces many per-architecture allocation branches.
• Per-model callsite refactor, bulk (src/models/..., e.g. afmoe.cpp, apertus.cpp, bloom.cpp, falcon.cpp, gpt2.cpp, llama.cpp, glm4.cpp, starcoder.cpp, stablelm.cpp, ...)
  Replace inline per-layer Q/K/V sequences (individual build_lora_mm, optional ggml_add bias, ggml_reshape_3d/ggml_view_3d) with build_qkv(...) returning {Qcur, Kcur, Vcur}. Move or consolidate cb(...) instrumentation, typically to after norm/RoPE; preserve downstream RoPE/attention logic.
• Clamping & shape adjustments (src/models/mpt.cpp, src/models/dbrx.cpp, src/models/olmo.cpp, ...)
  Shift optional clamping and clamped-tensor instrumentation into build_qkv (via the f_clamp_kqv parameter) and remove model-local combined-tensor clamp/view math.
• Wide cleanup, head/gqa removal (src/models/*, many files)
  Remove many local n_embd_gqa usages, manual byte-offset/stride math, and duplicated reshape/view logic across numerous model files; centralize projection/reshape logic in build_qkv.
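The loader preference in the "Model tensor loader" cohort can be modeled abstractly. The sketch below is purely illustrative (load_qkv, the tensor-name strings, and the option types are all invented here; the real create_tensor_qkv lives in src/llama-model.cpp and operates on ggml tensors):

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical model of the described loader behavior: try the merged QKV
// tensor first, otherwise fall back to the three separate projections.
struct split_qkv { std::string wq, wk, wv; };

struct loaded_qkv {
    std::optional<std::string> wqkv;  // set when the merged tensor exists
    std::optional<split_qkv>   split; // set on the fallback path
};

loaded_qkv load_qkv(bool has_merged) {
    loaded_qkv out;
    if (has_merged) {
        out.wqkv = "attn_qkv.weight"; // preferred: one fused matmul at runtime
    } else {
        out.split = split_qkv{"attn_q.weight", "attn_k.weight", "attn_v.weight"};
    }
    return out;
}
```

Exactly one of the two paths is populated, which is what lets the per-architecture allocation branches collapse into a single helper call.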

Sequence Diagram(s)

sequenceDiagram
    participant Model as Model (per-layer)
    participant Graph as llm_graph_context::build_qkv
    participant GGML as GGML ops
    participant Attn as Attention (RoPE / build_attn)

    Model->>Graph: build_qkv(layer, cur, n_embd_head, n_head, n_head_kv, il, f_clamp_kqv)
    Graph->>GGML: build_lora_mm (fused wqkv OR separate wq/wk/wv)
    GGML-->>Graph: projected tensor(s)
    Graph->>GGML: optional ggml_add (bias) / ggml_clamp
    Graph->>GGML: ggml_view_3d / ggml_reshape_3d => Q, K, V
    Graph-->>Model: return {Qcur, Kcur, Vcur}
    Model->>GGML: ggml_rope_ext on Qcur/Kcur (and optional norm)
    Model->>Attn: build_attn(Qcur, Kcur, Vcur, ...)
    Attn-->>Model: attention output
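From a model builder's point of view, the sequence above reduces to one structured-binding call. The stub below only illustrates that call shape (the tensor_stub type and build_qkv_stub are invented stand-ins; in the PR the helper returns real ggml_tensor pointers):

```cpp
#include <cassert>
#include <string>

// Illustration only: how a refactored per-model builder consumes the helper's
// result. The names Qcur/Kcur/Vcur mirror the PR; the types are stand-ins.
struct tensor_stub { std::string name; };

struct llm_graph_qkv_stub {
    tensor_stub q, k, v;
};

// Stands in for llm_graph_context::build_qkv(layer, cur, n_embd_head, ...).
llm_graph_qkv_stub build_qkv_stub() {
    return { {"Qcur"}, {"Kcur"}, {"Vcur"} };
}

std::string attn_input_names() {
    // One call replaces the former build_lora_mm + ggml_add + reshape sequence.
    auto [Qcur, Kcur, Vcur] = build_qkv_stub();
    return Qcur.name + "," + Kcur.name + "," + Vcur.name; // fed to build_attn
}
```

The structured binding is why the bulk call-site diffs are mostly deletions: the three projections arrive already reshaped, and only RoPE/attention logic remains in each model file.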

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through matrices, three into one,

fused the kernels and basked in the sun.
Q, K, V bundled — no messy view,
tidy layers hum, the graph feels new.
🍃🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

• Description check — ⚠️ Warning
  The PR description references an upstream PR and explains the purpose, but lacks required template sections such as overview details and the AI usage disclosure. Resolution: add a proper overview section explaining what this PR does and why, and include the AI usage disclosure required by the template.
• Docstring Coverage — ⚠️ Warning
  Docstring coverage is 1.79%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

• Title check — ✅ Passed
  The title clearly summarizes the main change: refactoring QKV construction into common helper functions across models.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/models/refact.cpp (1)

27-32: ⚠️ Potential issue | 🟡 Minor

Remove the now-duplicated cb() calls in this block.

build_qkv() already emits cb(Qcur/Kcur/Vcur) before returning, and cb is the hook used for custom tensor logic such as allocation/offloading. With no tensor mutation between the helper call and Lines 30-32, those callbacks now run twice on the same tensors here.

✂️ Suggested cleanup
             auto [Qcur, Kcur, Vcur] = build_qkv(model.layers[il], cur,
                     n_embd_head, n_head, n_head_kv, il);
-
-            cb(Qcur, "Qcur", il);
-            cb(Kcur, "Kcur", il);
-            cb(Vcur, "Vcur", il);
 
             cur = build_attn(inp_attn,
                     model.layers[il].wo, NULL,
                     Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/refact.cpp` around lines 27 - 32, The three redundant cb()
invocations after build_qkv are duplicating work because
build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) already
calls cb(Qcur/Kcur/Vcur) before returning; remove the lines cb(Qcur, "Qcur",
il); cb(Kcur, "Kcur", il); cb(Vcur, "Vcur"); so the tensors from build_qkv are
only passed through the hook once (leave build_qkv, Qcur/Kcur/Vcur, cb, and the
surrounding call-site with model.layers[il] and cur unchanged).
🧹 Nitpick comments (5)
src/models/llada.cpp (1)

32-34: Update the stale projection comment.

After this refactor, the block no longer guarantees “separate” projections or “without bias”: build_qkv() now handles both fused and separate QKV paths and applies optional biases when present. The comment above misdescribes the implementation.

📝 Suggested wording
-            // compute separate Q, K, V projections without bias, matching LLaDALlamaBlock
+            // build Q/K/V projections via the shared helper, then apply RoPE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/llada.cpp` around lines 32 - 34, The existing comment above the
build_qkv call is stale — it states projections are "separate" and "without
bias" even though build_qkv now handles both fused and separate QKV paths and
applies optional biases; update the inline comment above the call to
build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) to
accurately state that build_qkv produces Q, K, V projections and supports
fused/separate QKV implementations and optional biases when present (i.e.,
neutral, factual description referencing build_qkv behavior instead of asserting
"separate" or "without bias").
src/models/starcoder.cpp (1)

35-36: Consider restoring the per-tensor callbacks here.

The helper swap looks fine functionally, but this builder now skips the Qcur / Kcur / Vcur traces that the other touched builders still emit, which makes graph-dump comparisons less consistent.

Suggested change
             auto [Qcur, Kcur, Vcur] = build_qkv(model.layers[il], cur,
                     n_embd_head, n_head, n_head_kv, il);
+
+            cb(Qcur, "Qcur", il);
+            cb(Kcur, "Kcur", il);
+            cb(Vcur, "Vcur", il);
 
             cur = build_attn(inp_attn,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/starcoder.cpp` around lines 35 - 36, The call to
build_qkv(model.layers[il], cur, n_embd_head, n_head, n_head_kv, il) bypasses
the per-tensor tracing/callbacks for Qcur/Kcur/Vcur that other builders still
emit, causing inconsistent graph dumps; restore those per-tensor callbacks after
the helper swap by invoking the same trace/callback functions used elsewhere
(the ones that record Qcur, Kcur, Vcur) immediately after build_qkv returns (or
integrate the tracing into build_qkv), ensuring the unique symbols Qcur, Kcur,
Vcur and the builder path that emits tensor traces are preserved so graph-dump
comparisons remain consistent.
src/models/falcon-h1.cpp (1)

30-31: Make the K/V head-width invariant explicit here.

build_qkv() now receives one n_embd_head for all three projections, but this constructor no longer documents that Falcon-H1's K and V head widths match. A local GGML_ASSERT(n_embd_head == hparams.n_embd_head_k()) near the declaration would make future asymmetric variants fail fast instead of being reshaped silently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/falcon-h1.cpp` around lines 30 - 31, Add an explicit invariant
check that the K/V head width matches the Q head width before calling build_qkv:
insert a GGML_ASSERT comparing n_embd_head and hparams.n_embd_head_k() (e.g.,
GGML_ASSERT(n_embd_head == hparams.n_embd_head_k())) near the
declaration/assignment of n_embd_head (immediately before the build_qkv(...)
call) so any future asymmetric-head changes fail fast instead of silently
reshaping.
src/models/gemma2-iswa.cpp (1)

34-35: Assert the symmetric K/V head-width assumption before using the shared helper.

This file now funnels Q/K/V through a single n_embd_head, but it does not guard that hparams.n_embd_head_v() matches the K head width chosen here. Adding a local assert would document that Gemma2-ISWA still satisfies the helper's reshape assumption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/gemma2-iswa.cpp` around lines 34 - 35, Add a local assertion
before calling build_qkv to verify the symmetric K/V head-width assumption:
check that hparams.n_embd_head_v() equals the K head width used here (n_head_kv)
so the shared helper’s reshape assumptions hold; place the assert near the call
site that invokes build_qkv(model.layers[il], cur, n_embd_head, n_head,
n_head_kv, il) and reference hparams.n_embd_head_v(), n_head_kv, n_embd_head and
build_qkv in the message for clarity.
src/llama-graph.cpp (1)

1027-1052: Minor: Duplicate callbacks with identical names in separate Q/K/V path.

In the separate path, cb() is called 2-3 times per tensor with the same name "Qcur"/"Kcur"/"Vcur":

  1. After build_lora_mm (lines 1028, 1034, 1040)
  2. After bias add if bias exists (lines 1031, 1037, 1043)
  3. After reshape at the end (lines 1050-1052)

In contrast, the fused path only has the final callbacks. This inconsistency could cause confusion when debugging tensor states or using callbacks for profiling.

Consider removing the intermediate callbacks in the separate path to match the fused path behavior, or use distinct names (e.g., "Qcur_mm", "Qcur_bias") for intermediate states.

♻️ Optional: Remove intermediate callbacks to match fused path
     } else {
         // separate Q/K/V path
         Qcur = build_lora_mm(layer.wq, cur, layer.wq_s);
-        cb(Qcur, "Qcur", il);
         if (layer.bq) {
             Qcur = ggml_add(ctx0, Qcur, layer.bq);
-            cb(Qcur, "Qcur", il);
         }
         Kcur = build_lora_mm(layer.wk, cur, layer.wk_s);
-        cb(Kcur, "Kcur", il);
         if (layer.bk) {
             Kcur = ggml_add(ctx0, Kcur, layer.bk);
-            cb(Kcur, "Kcur", il);
         }
         Vcur = build_lora_mm(layer.wv, cur, layer.wv_s);
-        cb(Vcur, "Vcur", il);
         if (layer.bv) {
             Vcur = ggml_add(ctx0, Vcur, layer.bv);
-            cb(Vcur, "Vcur", il);
         }
         Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
         Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
         Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama-graph.cpp` around lines 1027 - 1052, The cb() calls for
intermediate Qcur/Kcur/Vcur states are duplicated; after build_lora_mm and after
optional ggml_add the code calls cb() with the same names and then again after
ggml_reshape_3d, causing inconsistent behavior versus the fused path. Remove the
intermediate cb(...) invocations that follow build_lora_mm and the ones
immediately after ggml_add for Qcur, Kcur, and Vcur, leaving only the final
cb(...) calls after ggml_reshape_3d for each tensor (references: functions/vars
build_lora_mm, ggml_add, ggml_reshape_3d, and cb with symbols Qcur/Kcur/Vcur).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llama-model.cpp`:
- Around line 2723-2737: The helper create_tensor_qkv currently ignores the
incoming flags (it uses hardcoded TENSOR_NOT_REQUIRED / TENSOR_SKIP_IF_VIRTUAL)
which causes layers that should be skipped (e.g., NextN with TENSOR_SKIP) to
materialize QKV; update create_tensor_qkv so all create_tensor calls (the merged
path layer.wqkv and its bias layer.bqkv, and the per-split path layer.wq/wk/wv
and their biases layer.bq/bk/bv) pass through the caller-provided flags (or
combine caller flags with TENSOR_NOT_REQUIRED where appropriate) instead of the
hardcoded constants so callers relying on flags (from sites using
create_tensor_qkv) preserve TENSOR_SKIP behavior.

In `@src/models/afmoe.cpp`:
- Around line 44-45: The code redundantly reshapes Vcur after build_qkv; remove
the extra reshape and pass Vcur directly to build_attn (i.e., stop doing the
Vcur.reshape(...) step and use the Vcur returned by build_qkv when calling
build_attn). Update the call site where build_attn(...) currently receives the
reshaped Vcur to use the original Vcur variable (from auto [Qcur, Kcur, Vcur] =
build_qkv(...)), and ensure the function signatures (build_qkv, build_attn)
still match expected 3D tensor shapes (n_head_kv, n_embd_head, seq_len) after
this change.

In `@src/models/mimo2-iswa.cpp`:
- Around line 27-28: The call to build_qkv is using global head counts (n_head,
n_head_kv) but Mimo2-ISWA needs per-layer counts; update the call that builds
Qcur/Kcur/Vcur so it passes hparams.n_head(il) and hparams.n_head_kv(il)
(keeping n_embd_head_k, model.layers[il], cur, il unchanged) so each layer uses
its layer-specific head and head_kv values as done in granite.cpp.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a04487a-f98a-4ac4-a410-11dc7aa590d3

📥 Commits

Reviewing files that changed from the base of the PR and between 6b949d1 and bcc69fd.

📒 Files selected for processing (80)
  • src/llama-graph.cpp
  • src/llama-graph.h
  • src/llama-model.cpp
  • src/models/afmoe.cpp
  • src/models/apertus.cpp
  • src/models/arcee.cpp
  • src/models/arctic.cpp
  • src/models/baichuan.cpp
  • src/models/bailingmoe.cpp
  • src/models/bailingmoe2.cpp
  • src/models/bitnet.cpp
  • src/models/bloom.cpp
  • src/models/chameleon.cpp
  • src/models/chatglm.cpp
  • src/models/codeshell.cpp
  • src/models/cohere2-iswa.cpp
  • src/models/command-r.cpp
  • src/models/deci.cpp
  • src/models/deepseek.cpp
  • src/models/dots1.cpp
  • src/models/dream.cpp
  • src/models/ernie4-5-moe.cpp
  • src/models/ernie4-5.cpp
  • src/models/eurobert.cpp
  • src/models/exaone-moe.cpp
  • src/models/exaone.cpp
  • src/models/exaone4.cpp
  • src/models/falcon-h1.cpp
  • src/models/falcon.cpp
  • src/models/gemma-embedding.cpp
  • src/models/gemma.cpp
  • src/models/gemma2-iswa.cpp
  • src/models/gemma3.cpp
  • src/models/glm4-moe.cpp
  • src/models/glm4.cpp
  • src/models/gpt2.cpp
  • src/models/gptneox.cpp
  • src/models/granite.cpp
  • src/models/grok.cpp
  • src/models/grovemoe.cpp
  • src/models/hunyuan-dense.cpp
  • src/models/hunyuan-moe.cpp
  • src/models/internlm2.cpp
  • src/models/jais.cpp
  • src/models/jais2.cpp
  • src/models/jamba.cpp
  • src/models/llada-moe.cpp
  • src/models/llada.cpp
  • src/models/llama-iswa.cpp
  • src/models/llama.cpp
  • src/models/maincoder.cpp
  • src/models/mimo2-iswa.cpp
  • src/models/mistral3.cpp
  • src/models/modern-bert.cpp
  • src/models/nemotron.cpp
  • src/models/neo-bert.cpp
  • src/models/openai-moe-iswa.cpp
  • src/models/orion.cpp
  • src/models/paddleocr.cpp
  • src/models/pangu-embedded.cpp
  • src/models/phi2.cpp
  • src/models/phi3.cpp
  • src/models/plamo.cpp
  • src/models/qwen.cpp
  • src/models/qwen2.cpp
  • src/models/qwen2moe.cpp
  • src/models/qwen2vl.cpp
  • src/models/qwen3.cpp
  • src/models/qwen3moe.cpp
  • src/models/qwen3vl-moe.cpp
  • src/models/qwen3vl.cpp
  • src/models/refact.cpp
  • src/models/rnd1.cpp
  • src/models/seed-oss.cpp
  • src/models/smallthinker.cpp
  • src/models/smollm3.cpp
  • src/models/stablelm.cpp
  • src/models/starcoder.cpp
  • src/models/starcoder2.cpp
  • src/models/xverse.cpp

@github-actions github-actions bot added the model label Apr 1, 2026
@JoursBleu JoursBleu force-pushed the refactor/build-qkv-helper branch from bcc69fd to 42eae08 Compare April 1, 2026 13:19

@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/models/smallthinker.cpp (1)

48-59: ⚠️ Potential issue | 🟡 Minor

Add missing cb(Vcur, "Vcur", il) callback.

The Vcur variable is constructed via build_qkv() and used in build_attn(), but lacks an instrumentation callback. Every other model file in src/models/ includes this callback after build_qkv(). Add it for consistency:

cb(Vcur, "Vcur", il);

Place this after line 59, alongside the existing cb(Qcur, ...) and cb(Kcur, ...) calls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/smallthinker.cpp` around lines 48 - 59, The code calls
build_qkv(...) which produces Vcur but never instruments it; add a callback
invocation for Vcur to match Qcur and Kcur instrumentation by inserting cb(Vcur,
"Vcur", il) immediately after the existing cb(Kcur, "Kcur", il) call so Vcur is
logged/instrumented before it is used (e.g., in build_attn()).
🧹 Nitpick comments (2)
src/llama-graph.h (2)

701-705: Default-initialize the Q/K/V bundle.

llm_graph_qkv is the only new pointer bundle here that does not start from a known state. Initializing these members to nullptr makes future early-return paths in build_qkv() fail obviously instead of handing garbage pointers to the callers.

♻️ Suggested tweak
 struct llm_graph_qkv {
-    ggml_tensor * q; // [n_embd_head, n_head,    n_tokens]
-    ggml_tensor * k; // [n_embd_head, n_head_kv, n_tokens]
-    ggml_tensor * v; // [n_embd_head, n_head_kv, n_tokens]
+    ggml_tensor * q = nullptr; // [n_embd_head, n_head,    n_tokens]
+    ggml_tensor * k = nullptr; // [n_embd_head, n_head_kv, n_tokens]
+    ggml_tensor * v = nullptr; // [n_embd_head, n_head_kv, n_tokens]
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama-graph.h` around lines 701 - 705, The llm_graph_qkv struct's pointer
members q, k, v must be default-initialized to nullptr so early-return paths in
build_qkv() don't propagate uninitialized garbage; update the struct
llm_graph_qkv to initialize q = nullptr, k = nullptr, v = nullptr (or use an
inline constructor/default member initializers) so any new instance starts in a
known null state.

794-802: Add precondition assertion or expand signature for separate K/V head widths.

llm_graph_context tracks n_embd_head_k and n_embd_head_v separately, but this function accepts only a single n_embd_head and uses it for both K and V projections. The implementation lacks a GGML_ASSERT to enforce the implicit precondition that K and V must share the same per-head width. Either add an assertion in src/llama-graph.cpp::llm_graph_context::build_qkv() to enforce equal K/V widths, or extend the signature to accept separate n_embd_head_k and n_embd_head_v parameters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama-graph.h` around lines 794 - 802, The build_qkv API accepts a single
n_embd_head but llm_graph_context tracks n_embd_head_k and n_embd_head_v
separately; add a precondition or expand the signature to fix this: either
update llm_graph_context::build_qkv (and its declaration llm_graph_qkv
build_qkv(...)) to take separate int64_t n_embd_head_k and int64_t n_embd_head_v
parameters, or keep the single n_embd_head and add a GGML_ASSERT in build_qkv
implementation that verifies n_embd_head == n_embd_head_k && n_embd_head ==
n_embd_head_v (use the context’s n_embd_head_k and n_embd_head_v symbols) so K
and V per-head widths are enforced. Ensure callers are updated if you change the
signature.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 939bff27-0530-4317-a513-3629f47b85d5

📥 Commits

Reviewing files that changed from the base of the PR and between bcc69fd and 42eae08.

📒 Files selected for processing (80)
  • src/llama-graph.cpp
  • src/llama-graph.h
  • src/llama-model.cpp
  • src/models/afmoe.cpp
  • src/models/apertus.cpp
  • src/models/arcee.cpp
  • src/models/arctic.cpp
  • src/models/baichuan.cpp
  • src/models/bailingmoe.cpp
  • src/models/bailingmoe2.cpp
  • src/models/bitnet.cpp
  • src/models/bloom.cpp
  • src/models/chameleon.cpp
  • src/models/chatglm.cpp
  • src/models/codeshell.cpp
  • src/models/cohere2-iswa.cpp
  • src/models/command-r.cpp
  • src/models/deci.cpp
  • src/models/deepseek.cpp
  • src/models/dots1.cpp
  • src/models/dream.cpp
  • src/models/ernie4-5-moe.cpp
  • src/models/ernie4-5.cpp
  • src/models/eurobert.cpp
  • src/models/exaone-moe.cpp
  • src/models/exaone.cpp
  • src/models/exaone4.cpp
  • src/models/falcon-h1.cpp
  • src/models/falcon.cpp
  • src/models/gemma-embedding.cpp
  • src/models/gemma.cpp
  • src/models/gemma2-iswa.cpp
  • src/models/gemma3.cpp
  • src/models/glm4-moe.cpp
  • src/models/glm4.cpp
  • src/models/gpt2.cpp
  • src/models/gptneox.cpp
  • src/models/granite.cpp
  • src/models/grok.cpp
  • src/models/grovemoe.cpp
  • src/models/hunyuan-dense.cpp
  • src/models/hunyuan-moe.cpp
  • src/models/internlm2.cpp
  • src/models/jais.cpp
  • src/models/jais2.cpp
  • src/models/jamba.cpp
  • src/models/llada-moe.cpp
  • src/models/llada.cpp
  • src/models/llama-iswa.cpp
  • src/models/llama.cpp
  • src/models/maincoder.cpp
  • src/models/mimo2-iswa.cpp
  • src/models/mistral3.cpp
  • src/models/modern-bert.cpp
  • src/models/nemotron.cpp
  • src/models/neo-bert.cpp
  • src/models/openai-moe-iswa.cpp
  • src/models/orion.cpp
  • src/models/paddleocr.cpp
  • src/models/pangu-embedded.cpp
  • src/models/phi2.cpp
  • src/models/phi3.cpp
  • src/models/plamo.cpp
  • src/models/qwen.cpp
  • src/models/qwen2.cpp
  • src/models/qwen2moe.cpp
  • src/models/qwen2vl.cpp
  • src/models/qwen3.cpp
  • src/models/qwen3moe.cpp
  • src/models/qwen3vl-moe.cpp
  • src/models/qwen3vl.cpp
  • src/models/refact.cpp
  • src/models/rnd1.cpp
  • src/models/seed-oss.cpp
  • src/models/smallthinker.cpp
  • src/models/smollm3.cpp
  • src/models/stablelm.cpp
  • src/models/starcoder.cpp
  • src/models/starcoder2.cpp
  • src/models/xverse.cpp
✅ Files skipped from review due to trivial changes (14)
  • src/models/qwen2moe.cpp
  • src/models/starcoder2.cpp
  • src/models/jamba.cpp
  • src/models/glm4-moe.cpp
  • src/models/grovemoe.cpp
  • src/models/cohere2-iswa.cpp
  • src/models/bailingmoe.cpp
  • src/models/mimo2-iswa.cpp
  • src/models/deepseek.cpp
  • src/models/bailingmoe2.cpp
  • src/models/mistral3.cpp
  • src/models/llada-moe.cpp
  • src/models/command-r.cpp
  • src/models/smollm3.cpp
🚧 Files skipped from review as they are similar to previous changes (32)
  • src/models/gemma3.cpp
  • src/models/ernie4-5-moe.cpp
  • src/models/gemma.cpp
  • src/models/eurobert.cpp
  • src/models/gptneox.cpp
  • src/models/exaone4.cpp
  • src/models/plamo.cpp
  • src/models/bloom.cpp
  • src/models/qwen3vl.cpp
  • src/models/jais2.cpp
  • src/models/stablelm.cpp
  • src/models/hunyuan-moe.cpp
  • src/models/gemma-embedding.cpp
  • src/models/apertus.cpp
  • src/models/paddleocr.cpp
  • src/models/qwen2.cpp
  • src/models/qwen.cpp
  • src/models/qwen3.cpp
  • src/models/exaone-moe.cpp
  • src/models/granite.cpp
  • src/models/afmoe.cpp
  • src/models/nemotron.cpp
  • src/models/falcon-h1.cpp
  • src/models/qwen3vl-moe.cpp
  • src/models/xverse.cpp
  • src/models/qwen2vl.cpp
  • src/models/jais.cpp
  • src/llama-graph.cpp
  • src/models/seed-oss.cpp
  • src/models/neo-bert.cpp
  • src/llama-model.cpp
  • src/models/internlm2.cpp

@JoursBleu JoursBleu force-pushed the refactor/build-qkv-helper branch from 42eae08 to 75d759d Compare April 2, 2026 01:56
@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/models/smallthinker.cpp (1)

48-63: ⚠️ Potential issue | 🟡 Minor

Restore the final Vcur callback in this block.

Qcur and Kcur are still reported after the optional RoPE step, but Vcur is no longer emitted here. That makes tensor tracing inconsistent in what is otherwise a no-op refactor.

Suggested fix
             cb(Qcur, "Qcur", il);
             cb(Kcur, "Kcur", il);
+            cb(Vcur, "Vcur", il);

             cur = build_attn(inp_attn,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/smallthinker.cpp` around lines 48 - 63, The Vcur tensor callback
was accidentally removed after building Qcur/Kcur, causing inconsistent tracing;
re-add a cb(Vcur, "Vcur", il) right after the Qcur/Kcur callbacks (after the
optional use_rope block) so all three Qcur, Kcur, Vcur are reported before
calling build_attn; locate the block where build_qkv(...) returns Qcur, Kcur,
Vcur and where cb(Qcur, "Qcur", il) and cb(Kcur, "Kcur", il) are invoked and
insert the Vcur callback in the same position.
src/models/hunyuan-dense.cpp (1)

46-59: ⚠️ Potential issue | 🟠 Major

Callback placement for Kcur is inconsistent with Qcur and with the pattern in llama.cpp.

The cb(Kcur, "Kcur", il) is called before RoPE is applied to Kcur (lines 50-54), while cb(Qcur, "Qcur", il) is called after RoPE is applied to Qcur. The llama.cpp model applies both RoPE operations first, then performs all callbacks in sequence (cb(Qcur), cb(Kcur), cb(Vcur)), maintaining symmetry.

Move cb(Kcur, "Kcur", il) to after the Kcur RoPE application (after line 54) to align with the Qcur pattern and match the established convention in llama.cpp.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/hunyuan-dense.cpp` around lines 46 - 59, The Kcur callback is
invoked before applying RoPE, breaking symmetry with Qcur and llama.cpp; move
the cb(Kcur, "Kcur", il) call so it runs after Kcur is passed through
ggml_rope_ext (and after build_norm if you prefer the post-norm callback pattern
used for Qcur/Vcur) — locate the Kcur handling around ggml_rope_ext, build_norm,
and the existing cb calls and reposition the cb(Kcur, "Kcur", il) to follow the
RoPE application (and normalization) for model.layers[il].attn_k_norm.
♻️ Duplicate comments (1)
src/llama-model.cpp (1)

2723-2736: ⚠️ Potential issue | 🟠 Major

Propagate caller flags through create_tensor_qkv again.

This regresses the earlier skip-flag fix: layer.wqkv / layer.bqkv still ignore flags, and the split biases still drop them too. That means the NextN skip callers at Lines 5211, 5269, and 5639 can still materialize QKV tensors in layers that were supposed to stay skipped.

🛠️ Proposed fix
         auto create_tensor_qkv = [&](llama_layer & layer, int bid,
                 int64_t n_embd_, int64_t n_embd_q_, int64_t n_embd_k_, int64_t n_embd_v_,
                 int flags) {
             const int64_t n_embd_qkv = n_embd_q_ + n_embd_k_ + n_embd_v_;
-            layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", bid), {n_embd_, n_embd_qkv}, TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL);
+            const int qkv_flags = flags | TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL;
+            layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", bid), {n_embd_, n_embd_qkv}, qkv_flags);
             if (layer.wqkv) {
-                layer.bqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "bias", bid), {n_embd_qkv}, TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL);
+                layer.bqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "bias", bid), {n_embd_qkv}, qkv_flags);
             } else {
                 layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", bid), {n_embd_, n_embd_q_}, flags);
                 layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", bid), {n_embd_, n_embd_k_}, flags);
                 layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", bid), {n_embd_, n_embd_v_}, flags);
-                layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", bid), {n_embd_q_}, TENSOR_NOT_REQUIRED);
-                layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", bid), {n_embd_k_}, TENSOR_NOT_REQUIRED);
-                layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", bid), {n_embd_v_}, TENSOR_NOT_REQUIRED);
+                layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", bid), {n_embd_q_}, flags | TENSOR_NOT_REQUIRED);
+                layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", bid), {n_embd_k_}, flags | TENSOR_NOT_REQUIRED);
+                layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", bid), {n_embd_v_}, flags | TENSOR_NOT_REQUIRED);
             }
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama-model.cpp` around lines 2723 - 2736, The helper create_tensor_qkv
is ignoring the caller-provided flags for combined QKV and split bias tensors,
which breaks the skip behavior; update the create_tensor calls so they use the
incoming flags parameter rather than hard-coded TENSOR_NOT_REQUIRED (and
TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL) – specifically, pass flags into
create_tensor for layer.wqkv and layer.bqkv (instead of TENSOR_NOT_REQUIRED |
TENSOR_SKIP_IF_VIRTUAL) and for layer.bq, layer.bk, layer.bv (instead of
TENSOR_NOT_REQUIRED); keep using flags for layer.wq, layer.wk, layer.wv as
already done so the skip/virtual semantics propagate correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llama-graph.cpp`:
- Around line 1048-1098: The helper llm_graph_context::build_qkv assumes Q/K/V
all use the same per-head width (n_embd_head) and collapses/reshapes V using
that width, which breaks asymmetric V heads; update the function to accept a
separate n_embd_head_v (or n_embd_k and n_embd_v) parameter and use
n_embd_head_v when computing Vcur’s ggml_view_3d/ggml_reshape_3d and its byte
offsets (and when computing n_embd_kv/n_embd_q if needed), or if asymmetric
heads are unsupported, add an explicit assert at the start of build_qkv that
n_embd_head == n_embd_head_v to fail fast; refer to
llm_graph_context::build_qkv, Qcur/Kcur/Vcur, n_embd_head, n_embd_kv, and the
ggml_view_3d/ggml_reshape_3d calls to locate the changes.

In `@src/llama-model.cpp`:
- Around line 7012-7015: The generic create_tensor_qkv helper must not be used
for asymmetric/doubled-Q attention layouts (e.g., where Q size = n_embd_head_k *
n_head * 2 or where a separate wqkv_gate is used); update the code paths around
the non-recurrent attention branch so instead of calling
create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, ...), explicitly
create the independent Q, K, V (and gate if present) tensors via create_tensor
(and preserve layer.wo creation) so the custom split logic in Qwen35, Qwen35MoE,
Qwen3Next, and Step35 graph builders remains compatible and no fused wqkv is
produced by create_tensor_qkv.

---

Outside diff comments:
In `@src/models/hunyuan-dense.cpp`:
- Around line 46-59: The Kcur callback is invoked before applying RoPE, breaking
symmetry with Qcur and llama.cpp; move the cb(Kcur, "Kcur", il) call so it runs
after Kcur is passed through ggml_rope_ext (and after build_norm if you prefer
the post-norm callback pattern used for Qcur/Vcur) — locate the Kcur handling
around ggml_rope_ext, build_norm, and the existing cb calls and reposition the
cb(Kcur, "Kcur", il) to follow the RoPE application (and normalization) for
model.layers[il].attn_k_norm.

In `@src/models/smallthinker.cpp`:
- Around line 48-63: The Vcur tensor callback was accidentally removed after
building Qcur/Kcur, causing inconsistent tracing; re-add a cb(Vcur, "Vcur", il)
right after the Qcur/Kcur callbacks (after the optional use_rope block) so all
three Qcur, Kcur, Vcur are reported before calling build_attn; locate the block
where build_qkv(...) returns Qcur, Kcur, Vcur and where cb(Qcur, "Qcur", il) and
cb(Kcur, "Kcur", il) are invoked and insert the Vcur callback in the same
position.

---

Duplicate comments:
In `@src/llama-model.cpp`:
- Around line 2723-2736: The helper create_tensor_qkv is ignoring the
caller-provided flags for combined QKV and split bias tensors, which breaks the
skip behavior; update the create_tensor calls so they use the incoming flags
parameter rather than hard-coded TENSOR_NOT_REQUIRED (and TENSOR_NOT_REQUIRED |
TENSOR_SKIP_IF_VIRTUAL) – specifically, pass flags into create_tensor for
layer.wqkv and layer.bqkv (instead of TENSOR_NOT_REQUIRED |
TENSOR_SKIP_IF_VIRTUAL) and for layer.bq, layer.bk, layer.bv (instead of
TENSOR_NOT_REQUIRED); keep using flags for layer.wq, layer.wk, layer.wv as
already done so the skip/virtual semantics propagate correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 15e971bd-0a6e-4873-9e72-a0fd7bd74c38

📥 Commits

Reviewing files that changed from the base of the PR and between 42eae08 and 75d759d.

📒 Files selected for processing (80)
  • src/llama-graph.cpp
  • src/llama-graph.h
  • src/llama-model.cpp
  • src/models/afmoe.cpp
  • src/models/apertus.cpp
  • src/models/arcee.cpp
  • src/models/arctic.cpp
  • src/models/baichuan.cpp
  • src/models/bailingmoe.cpp
  • src/models/bailingmoe2.cpp
  • src/models/bitnet.cpp
  • src/models/bloom.cpp
  • src/models/chameleon.cpp
  • src/models/chatglm.cpp
  • src/models/codeshell.cpp
  • src/models/cohere2-iswa.cpp
  • src/models/command-r.cpp
  • src/models/deci.cpp
  • src/models/deepseek.cpp
  • src/models/dots1.cpp
  • src/models/dream.cpp
  • src/models/ernie4-5-moe.cpp
  • src/models/ernie4-5.cpp
  • src/models/eurobert.cpp
  • src/models/exaone-moe.cpp
  • src/models/exaone.cpp
  • src/models/exaone4.cpp
  • src/models/falcon-h1.cpp
  • src/models/falcon.cpp
  • src/models/gemma-embedding.cpp
  • src/models/gemma.cpp
  • src/models/gemma2-iswa.cpp
  • src/models/gemma3.cpp
  • src/models/glm4-moe.cpp
  • src/models/glm4.cpp
  • src/models/gpt2.cpp
  • src/models/gptneox.cpp
  • src/models/granite.cpp
  • src/models/grok.cpp
  • src/models/grovemoe.cpp
  • src/models/hunyuan-dense.cpp
  • src/models/hunyuan-moe.cpp
  • src/models/internlm2.cpp
  • src/models/jais.cpp
  • src/models/jais2.cpp
  • src/models/jamba.cpp
  • src/models/llada-moe.cpp
  • src/models/llada.cpp
  • src/models/llama-iswa.cpp
  • src/models/llama.cpp
  • src/models/maincoder.cpp
  • src/models/mimo2-iswa.cpp
  • src/models/mistral3.cpp
  • src/models/modern-bert.cpp
  • src/models/nemotron.cpp
  • src/models/neo-bert.cpp
  • src/models/openai-moe-iswa.cpp
  • src/models/orion.cpp
  • src/models/paddleocr.cpp
  • src/models/pangu-embedded.cpp
  • src/models/phi2.cpp
  • src/models/phi3.cpp
  • src/models/plamo.cpp
  • src/models/qwen.cpp
  • src/models/qwen2.cpp
  • src/models/qwen2moe.cpp
  • src/models/qwen2vl.cpp
  • src/models/qwen3.cpp
  • src/models/qwen3moe.cpp
  • src/models/qwen3vl-moe.cpp
  • src/models/qwen3vl.cpp
  • src/models/refact.cpp
  • src/models/rnd1.cpp
  • src/models/seed-oss.cpp
  • src/models/smallthinker.cpp
  • src/models/smollm3.cpp
  • src/models/stablelm.cpp
  • src/models/starcoder.cpp
  • src/models/starcoder2.cpp
  • src/models/xverse.cpp
✅ Files skipped from review due to trivial changes (10)
  • src/models/glm4-moe.cpp
  • src/models/qwen2moe.cpp
  • src/models/bailingmoe.cpp
  • src/models/gemma.cpp
  • src/models/command-r.cpp
  • src/models/paddleocr.cpp
  • src/models/arcee.cpp
  • src/models/neo-bert.cpp
  • src/models/starcoder.cpp
  • src/models/ernie4-5.cpp
🚧 Files skipped from review as they are similar to previous changes (37)
  • src/models/falcon-h1.cpp
  • src/models/gemma3.cpp
  • src/models/codeshell.cpp
  • src/models/chatglm.cpp
  • src/models/exaone4.cpp
  • src/models/bloom.cpp
  • src/models/grovemoe.cpp
  • src/models/internlm2.cpp
  • src/models/arctic.cpp
  • src/models/ernie4-5-moe.cpp
  • src/models/apertus.cpp
  • src/models/llada-moe.cpp
  • src/models/modern-bert.cpp
  • src/models/plamo.cpp
  • src/models/smollm3.cpp
  • src/models/exaone-moe.cpp
  • src/models/seed-oss.cpp
  • src/models/llama.cpp
  • src/models/gptneox.cpp
  • src/models/gemma-embedding.cpp
  • src/models/nemotron.cpp
  • src/models/qwen2vl.cpp
  • src/models/rnd1.cpp
  • src/models/jais2.cpp
  • src/llama-graph.h
  • src/models/phi2.cpp
  • src/models/qwen.cpp
  • src/models/qwen3vl.cpp
  • src/models/eurobert.cpp
  • src/models/qwen3vl-moe.cpp
  • src/models/deepseek.cpp
  • src/models/mistral3.cpp
  • src/models/glm4.cpp
  • src/models/xverse.cpp
  • src/models/gemma2-iswa.cpp
  • src/models/stablelm.cpp
  • src/models/cohere2-iswa.cpp

Comment on lines 7012 to 7015
             if (!hparams.is_recurrent(i)) {
                 // Attention layers
-                layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", i), { n_embd, n_embd_head_k * n_head * 2 }, 0);
-                layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), { n_embd, n_embd_k_gqa }, 0);
-                layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), { n_embd, n_embd_v_gqa }, 0);
+                create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, n_embd_k_gqa, n_embd_v_gqa, 0);
                 layer.wo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "weight", i), { n_embd_head_k * n_head, n_embd }, 0);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

fd -i '(qwen35|qwen3next|step35).*\.cpp$' src/models | while read -r f; do
  echo "=== $f ==="
  rg -n -C3 'build_qkv\(|wqkv|wqkv_gate|ATTN_GATE|build_lora_mm' "$f"
  echo
done

Repository: ngxson/llama.cpp

Length of output: 10471


🏁 Script executed:

# Check the specific lines in llama-model.cpp
# First, let's see lines around 7012-7015
sed -n '7010,7020p' src/llama-model.cpp

# Then 7076-7079
echo "---"
sed -n '7074,7084p' src/llama-model.cpp

# Then 7139-7142
echo "---"
sed -n '7137,7147p' src/llama-model.cpp

# Then 7239-7240
echo "---"
sed -n '7237,7245p' src/llama-model.cpp

Repository: ngxson/llama.cpp

Length of output: 3178


🏁 Script executed:

# Find which model architectures these code paths apply to
# Check context around line 7012
sed -n '6990,7020p' src/llama-model.cpp | head -50

# Search for model type conditions around these lines
rg -n 'LLM_ARCH.*QWEN|case LLM_ARCH' src/llama-model.cpp | head -30

Repository: ngxson/llama.cpp

Length of output: 3136


🏁 Script executed:

# Find which case statement contains line 7012
sed -n '6800,7012p' src/llama-model.cpp | grep -n 'case LLM_ARCH' | tail -1

# Let's get the full context by searching backwards from line 7012
sed -n '6900,7012p' src/llama-model.cpp | grep -E 'case LLM_ARCH|break;' | tail -5

# More direct approach - look at the broader context
sed -n '6950,7012p' src/llama-model.cpp | grep -E 'case LLM_ARCH|^        }'

Repository: ngxson/llama.cpp

Length of output: 263


🏁 Script executed:

# Find architecture for line 7076-7079
sed -n '6950,7076p' src/llama-model.cpp | grep 'case LLM_ARCH' | tail -1

# Find architecture for line 7139-7142
sed -n '7050,7139p' src/llama-model.cpp | grep 'case LLM_ARCH' | tail -1

# Find architecture for line 7239-7240
sed -n '7150,7239p' src/llama-model.cpp | grep 'case LLM_ARCH' | tail -1

Repository: ngxson/llama.cpp

Length of output: 164


🏁 Script executed:

# Find and examine the create_tensor_qkv function
rg -n 'void create_tensor_qkv|create_tensor_qkv.*{' src/llama-model.cpp | head -5

# Get the function implementation
sed -n '6600,6700p' src/llama-model.cpp | grep -A 30 'create_tensor_qkv'

Repository: ngxson/llama.cpp

Length of output: 6429


🏁 Script executed:

# Find the function definition using a simpler pattern
rg -n 'void create_tensor_qkv' src/llama-model.cpp

# Or search for where it's defined
rg -n 'create_tensor_qkv.*layer.*i.*n_embd' src/llama-model.cpp | head -3

# Let's find the actual function definition
grep -n 'create_tensor_qkv' src/llama-model.cpp | grep -v '(' | head -5
grep -n 'inline.*create_tensor_qkv\|auto.*create_tensor_qkv\|void.*create_tensor_qkv' src/llama-model.cpp

Repository: ngxson/llama.cpp

Length of output: 474


🏁 Script executed:

# Get the implementation of create_tensor_qkv lambda
sed -n '2723,2770p' src/llama-model.cpp

Repository: ngxson/llama.cpp

Length of output: 2897


Do not wire doubled-Q and asymmetric-head layouts through the generic create_tensor_qkv helper.

The calls at lines 7014, 7078, 7141 with n_embd_head_k * n_head * 2 for Q, and line 7239 with separate wqkv_gate, represent non-standard attention layouts. The generic create_tensor_qkv function attempts to create a fused wqkv tensor first, which would be incompatible with the custom splitting logic in Qwen35, Qwen35MoE, Qwen3Next, and Step35 graph builders. These models are intentionally excluded from this refactor because their loaders require custom split paths for these asymmetric projections.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama-model.cpp` around lines 7012 - 7015, The generic create_tensor_qkv
helper must not be used for asymmetric/doubled-Q attention layouts (e.g., where
Q size = n_embd_head_k * n_head * 2 or where a separate wqkv_gate is used);
update the code paths around the non-recurrent attention branch so instead of
calling create_tensor_qkv(layer, i, n_embd, n_embd_head_k * n_head * 2, ...),
explicitly create the independent Q, K, V (and gate if present) tensors via
create_tensor (and preserve layer.wo creation) so the custom split logic in
Qwen35, Qwen35MoE, Qwen3Next, and Step35 graph builders remains compatible and
no fused wqkv is produced by create_tensor_qkv.

@JoursBleu JoursBleu force-pushed the refactor/build-qkv-helper branch 5 times, most recently from ccd1f60 to 67a8492 on April 11, 2026 05:29