fix: support Gemma 4 architecture#9

Merged
t8 merged 4 commits into main from fix/issue-8-gemma4-support
Apr 7, 2026

Conversation

Owner

@t8 t8 commented Apr 7, 2026

Summary

  • Bumps vendor/llama.cpp to 66c4f9ded to pick up upstream Gemma 4 support (ggml-org/llama.cpp#21309 plus seven follow-up patches)
  • Teaches Hypura's expert tensor classifier about Gemma 4's fused ffn_gate_up_exps
  • Adds array-valued metadata handling so per-layer attention.head_count_kv works on Gemma 4
  • Fixes tokenize to pass parse_special=true so chat-template tokens (<|turn>, <|think|>, etc.) are recognized as their vocab IDs instead of being tokenized as raw bytes

Fixes #8.

Why

The error AlexHarrowell reported (unknown model architecture: 'gemma4') came from llama.cpp itself, not Hypura — the vendored submodule was at b5e1212 from March 13, while gemma4 support landed upstream on April 2 in #21309. Hypura classifies tensors by name patterns rather than maintaining its own per-architecture tables, so once upstream recognizes the architecture, almost everything just works. The Hypura-side adaptations:

  1. ExpertTensorType::GateUp — Mixtral keeps gate/up/down as three separate fused expert tensors. Gemma 4 fuses gate+up into one (ffn_gate_up_exps), giving two fused tensors per MoE layer. Without a distinct enum variant, the neuron-cache key for the fused tensor was collapsing to Gate via .unwrap_or().
  2. get_u32_array — Gemma 4 encodes attention.head_count_kv as arr[i32, n_layer] because layers with sliding-window attention have different KV head counts than full-attention layers. The old get_u32 path returned None and fell back to num_heads, over-reserving KV cache headroom.
  3. parse_special=true — discovered while running a real bench against the model. With parse_special=false, Gemma 4's chat-template tokens (<|turn>, <turn|>, <|think|> — gemma4 uses different turn markers than gemma3's <start_of_turn>) get tokenized as raw bytes. The model produced high tok/s but gibberish output ("modelle_modelle_modelle...", "0.5.0.5.0.5..."). Vanilla llama-completion uses parse_special=true and produces coherent output; Hypura now matches.

Test plan

  • cargo build --release clean against new llama.cpp (no internal-header drift in hypura_buft.c)
  • cargo test --release --lib — 41 passed, 0 failed
  • hypura inspect correctly identifies arch=gemma4, MoE 128/8, layers=30, KV heads=8 (validates the array-fallback path; the underlying value is arr[i32,30] = [8,8,8,8,8,2,...])
  • End-to-end inference with gemma-4-26B-A4B-it-Q4_K_M (the lmstudio file from the issue):
    • Placement decision: Sparse MoE mmap: 6% activation (8/128 experts), 1.0 GB active of 15.6 GB total → 15.6 GB GPU | 0 B RAM | 0 B NVMe
    • bench --max-tokens 50: 51 tok/s decode, 577 tok/s prompt eval, 2.9s wall time ✓
    • run --prompt '<|turn>user\nThe capital of France is<turn|>\n<|turn>model\n': "The capital of France is Paris."
    • run --prompt '<|turn>user\nWrite one sentence about cats.<turn|>\n<|turn>model\n': "Cats are fascinating creatures known for their independent personalities and graceful movements."

Notes

  • Tested on M1 Max 32GB. Model fits entirely in unified memory and runs via the SparseMoeMmap path — the fastest mode in Hypura, since the OS page cache handles 6.25% MoE sparsity without any router-interception machinery.
  • The Ollama-pulled gemma4 (8B variant) is a multimodal bundle (vision + audio + projector) — llama_model_load errors on it because the LLM-only loader doesn't consume the multimodal projector tensors. That's a separate, upstream-side limitation; the lmstudio text-only file from the issue doesn't hit it.
  • A follow-up nice-to-have would be auto-applying the model's chat template in hypura run, so users don't need to hand-construct <|turn>user\n...<turn|>\n<|turn>model\n prefixes themselves. Out of scope for this PR.

t8 added 4 commits April 7, 2026 18:02
Picks up gemma4 architecture support merged in upstream
ggml-org/llama.cpp#21309 (LLM_ARCH_GEMMA4 + src/models/gemma4-iswa.cpp)
plus seven follow-up patches:

- vocab byte token handling for BPE detokenizer (#21488)
- convert: set add_bos to True (#21500)
- specialized parser (#21418)
- final logit softcapping (#21390)
- custom newline split (#21406)
- tokenizer fixes (#21343)
- chat template fix (#21326)

373 commits across the gap. No API breakage observed in hypura-sys —
release build and full test suite (41 tests) still pass.

Mixtral-style MoE models keep gate/up/down as three separate fused
expert tensors. Gemma 4 fuses gate+up into a single tensor
(blk.N.ffn_gate_up_exps.weight), leaving two fused tensors per MoE
layer instead of three.

Add a GateUp variant to ExpertTensorType so the per-(layer, expert,
type) neuron cache key stays distinct from plain Gate. Without this,
ExpertTensorType::from_name returned None for the fused tensor and
callers fell back to .unwrap_or(Gate), collapsing the cache key.

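
A rough sketch of the classifier change described above. The enum and `from_name` follow the identifiers used in this commit message, but the exact shape of Hypura's real classifier is an assumption here:

```rust
// Sketch of the expert-tensor classifier with the new fused variant.
// Variant names follow the commit message; the real code may differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ExpertTensorType {
    Gate,   // blk.N.ffn_gate_exps.weight (Mixtral-style)
    Up,     // blk.N.ffn_up_exps.weight
    Down,   // blk.N.ffn_down_exps.weight
    GateUp, // blk.N.ffn_gate_up_exps.weight (Gemma 4 fused gate+up)
}

impl ExpertTensorType {
    fn from_name(name: &str) -> Option<Self> {
        if name.contains("ffn_gate_up_exps") {
            Some(Self::GateUp)
        } else if name.contains("ffn_gate_exps") {
            Some(Self::Gate)
        } else if name.contains("ffn_up_exps") {
            Some(Self::Up)
        } else if name.contains("ffn_down_exps") {
            Some(Self::Down)
        } else {
            None // not an expert tensor; callers must not paper over this
        }
    }
}
```

With the Gate/Up/Down-only enum, the fused name matched no pattern and `.unwrap_or(Gate)` silently merged the fused tensor into Gate's cache slot; a distinct variant keeps the (layer, expert, type) key unique.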
Gemma 4 encodes attention.head_count_kv as a per-layer i32 array
(arr[i32, n_layer]) because layers with sliding-window attention
have different KV head counts than full-attention layers (mostly 8,
sometimes 2). The old get_u32 path returned None for arrays and
fell back to num_heads, over-reserving KV cache headroom.

Add as_u32_array on GgufValue and get_u32_array on GgufFile, then
take the array max for KV reservation. Conservative — over-reserves
rather than under-reserves, which keeps placement decisions safe.

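
A minimal sketch of that fallback, under the assumption that GGUF metadata values are modeled as an enum (the variant and helper names here are illustrative, not Hypura's exact API):

```rust
// Illustrative model of GGUF metadata values; real GGUF has more types.
#[derive(Debug)]
enum GgufValue {
    U32(u32),
    I32Array(Vec<i32>), // e.g. attention.head_count_kv as arr[i32, n_layer]
}

impl GgufValue {
    /// Accept either a scalar or a per-layer array for keys like
    /// attention.head_count_kv; returns None if any element is negative.
    fn as_u32_array(&self) -> Option<Vec<u32>> {
        match self {
            GgufValue::U32(v) => Some(vec![*v]),
            GgufValue::I32Array(a) => a.iter().map(|&v| u32::try_from(v).ok()).collect(),
        }
    }
}

/// KV-cache reservation takes the per-layer maximum: conservative in that
/// it over-reserves for sliding-window layers with fewer KV heads, but it
/// never under-reserves.
fn kv_heads_for_reservation(value: Option<&GgufValue>, num_heads: u32) -> u32 {
    value
        .and_then(GgufValue::as_u32_array)
        .and_then(|heads| heads.into_iter().max())
        .unwrap_or(num_heads) // old behavior: fall back to attention heads
}
```

For a Gemma 4-style value of [8, 8, 8, 8, 8, 2, ...] this reserves for 8 KV heads on every layer instead of falling back to the full attention head count.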
Hypura's tokenize wrapper hardcoded `parse_special=false` when calling
`llama_tokenize`. With this setting, special tokens like Gemma 4's
`<|turn>`, `<turn|>`, and `<|think|>` get tokenized as raw byte
sequences instead of as their actual vocab IDs, so chat-templated
prompts feed nonsense to instruction-tuned models. The model was
recovering enough to produce high tok/s but the output was gibberish
("modelle_modelle_modelle...", "0.5.0.5.0.5...").

Add a `parse_special` parameter to LlamaModel::tokenize and default
the three call sites in the inference paths to pass `true`, matching
upstream `llama-cli`/`llama-completion` behavior.

Verified end-to-end against gemma-4-26B-A4B-it-Q4_K_M:
  Prompt:   <|turn>user\nThe capital of France is<turn|>\n<|turn>model\n
  Output:   "thought\n<channel|>The capital of France is **Paris**."
  Speed:    51 tok/s decode (M1 Max 32GB, SparseMoeMmap path)
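
What the flag changes can be illustrated with a toy tokenizer. This is not Hypura's actual wrapper around `llama_tokenize`, just a sketch of the behavior `parse_special` toggles:

```rust
use std::collections::HashMap;

// Toy tokenizer: with parse_special=true, registered special strings map
// to single vocab IDs; with false they fall through to the byte-level
// fallback (standing in for byte BPE), which is what fed Gemma 4 garbage.
fn tokenize(text: &str, special: &HashMap<&str, u32>, parse_special: bool) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = text;
    'outer: while !rest.is_empty() {
        if parse_special {
            for (tok, id) in special {
                if rest.starts_with(*tok) {
                    out.push(*id);
                    rest = &rest[tok.len()..];
                    continue 'outer;
                }
            }
        }
        // Byte fallback: emit each UTF-8 byte as its own pseudo-token.
        let ch = rest.chars().next().unwrap();
        out.extend(ch.to_string().bytes().map(u32::from));
        rest = &rest[ch.len_utf8()..];
    }
    out
}
```

With a hypothetical special map of {"<|turn>": 200000}, tokenizing "<|turn>user" yields 5 tokens when parse_special is true (one special ID plus four letter bytes) but 11 raw bytes when it is false, so the instruction-tuned model never sees its turn marker in the latter case.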
@AlexHarrowell

Still doesn't work, same error as in #8 - I've reset git to head, pulled, built from source.

The problem is in llama.cpp, specifically the file llama-model.cpp. This contains per-model lookup tables implemented as switch statements, and the code searches them for a matching case.
When this function:

void llama_model::load_arch(llama_model_loader & ml) {
    arch = ml.get_arch();
    if (arch == LLM_ARCH_UNKNOWN) {
        throw std::runtime_error("unknown model architecture: '" + ml.get_arch_name() + "'");
    }
}

is called, if there is no matching entry in that table, arch is set to LLM_ARCH_UNKNOWN, the comparison holds, and the runtime error I am seeing is thrown (it's the only place in the codebase where that string appears).

Slightly oddly, the only pathway to that specific error message comes from llama_rope_type rather than from arch itself - arch contains a lot of architecture information about different models but the return statement if nothing is found is as follows:

    default: throw std::runtime_error("unsupported model architecture: " + arch_name());
}

which is not the error. llama_rope_type holds info on RoPE schemes, and apparently all the Gemmas so far are LLAMA_ROPE_TYPE_NEOX, but if I understand the Gemma4 paper I think it's not exactly the same.


Development

Successfully merging this pull request may close these issues.

Question : Have you tried gemma 4 - 31b ? And can I use the latest llama cpp with this project ?
