fix: support Gemma 4 architecture#9

Merged
t8 merged 4 commits into main from fix/issue-8-gemma4-support
Apr 7, 2026

Conversation

Owner

@t8 t8 commented Apr 7, 2026

Summary

  • Bumps vendor/llama.cpp to 66c4f9ded to pick up upstream Gemma 4 support (ggml-org/llama.cpp#21309 plus seven follow-up patches)
  • Teaches Hypura's expert tensor classifier about Gemma 4's fused ffn_gate_up_exps
  • Adds array-valued metadata handling so per-layer attention.head_count_kv works on Gemma 4
  • Fixes tokenize to pass parse_special=true so chat-template tokens (<|turn>, <|think|>, etc.) are recognized as their vocab IDs instead of being tokenized as raw bytes

Fixes #8.

Why

The error AlexHarrowell reported (unknown model architecture: 'gemma4') came from llama.cpp itself, not Hypura — the vendored submodule was at b5e1212 from March 13, while gemma4 support landed upstream on April 2 in #21309. Hypura classifies tensors by name patterns rather than maintaining its own per-architecture tables, so once upstream recognizes the architecture, almost everything just works. The Hypura-side adaptations:

  1. ExpertTensorType::GateUp — Mixtral keeps gate/up/down as three separate fused expert tensors. Gemma 4 fuses gate+up into one (ffn_gate_up_exps), giving two fused tensors per MoE layer. Without a distinct enum variant, the neuron-cache key for the fused tensor was collapsing to Gate via .unwrap_or().
  2. get_u32_array — Gemma 4 encodes attention.head_count_kv as arr[i32, n_layer] because layers with sliding-window attention have different KV head counts than full-attention layers. The old get_u32 path returned None and fell back to num_heads, over-reserving KV cache headroom.
  3. parse_special=true — discovered while running a real bench against the model. With parse_special=false, Gemma 4's chat-template tokens (<|turn>, <turn|>, <|think|> — gemma4 uses different turn markers than gemma3's <start_of_turn>) get tokenized as raw bytes. The model produced high tok/s but gibberish output ("modelle_modelle_modelle...", "0.5.0.5.0.5..."). Vanilla llama-completion uses parse_special=true and produces coherent output; Hypura now matches.

Test plan

  • cargo build --release clean against new llama.cpp (no internal-header drift in hypura_buft.c)
  • cargo test --release --lib — 41 passed, 0 failed
  • hypura inspect correctly identifies arch=gemma4, MoE 128/8, layers=30, KV heads=8 (validates the array-fallback path; the underlying value is arr[i32,30] = [8,8,8,8,8,2,...])
  • End-to-end inference with gemma-4-26B-A4B-it-Q4_K_M (the lmstudio file from the issue):
    • Placement decision: Sparse MoE mmap: 6% activation (8/128 experts), 1.0 GB active of 15.6 GB total → 15.6 GB GPU | 0 B RAM | 0 B NVMe
    • bench --max-tokens 50: 51 tok/s decode, 577 tok/s prompt eval, 2.9s wall time ✓
    • run --prompt '<|turn>user\nThe capital of France is<turn|>\n<|turn>model\n': "The capital of France is Paris."
    • run --prompt '<|turn>user\nWrite one sentence about cats.<turn|>\n<|turn>model\n': "Cats are fascinating creatures known for their independent personalities and graceful movements."

Notes

  • Tested on M1 Max 32GB. Model fits entirely in unified memory and runs via the SparseMoeMmap path — the fastest mode in Hypura, since the OS page cache handles 6.25% MoE sparsity without any router-interception machinery.
  • The Ollama-pulled gemma4 (8B variant) is a multimodal bundle (vision + audio + projector) — llama_model_load errors on it because the LLM-only loader doesn't consume the multimodal projector tensors. That's a separate, upstream-side limitation; the lmstudio text-only file from the issue doesn't hit it.
  • A follow-up nice-to-have would be auto-applying the model's chat template in hypura run, so users don't need to hand-construct <|turn>user\n...<turn|>\n<|turn>model\n prefixes themselves. Out of scope for this PR.

t8 added 4 commits April 7, 2026 18:02
Picks up gemma4 architecture support merged in upstream
ggml-org/llama.cpp#21309 (LLM_ARCH_GEMMA4 + src/models/gemma4-iswa.cpp)
plus seven follow-up patches:

- vocab byte token handling for BPE detokenizer (#21488)
- convert: set add_bos to True (#21500)
- specialized parser (#21418)
- final logit softcapping (#21390)
- custom newline split (#21406)
- tokenizer fixes (#21343)
- chat template fix (#21326)

373 commits across the gap. No API breakage observed in hypura-sys —
release build and full test suite (41 tests) still pass.

Mixtral-style MoE models keep gate/up/down as three separate fused
expert tensors. Gemma 4 fuses gate+up into a single tensor
(blk.N.ffn_gate_up_exps.weight), leaving two fused tensors per MoE
layer instead of three.

Add a GateUp variant to ExpertTensorType so the per-(layer, expert,
type) neuron cache key stays distinct from plain Gate. Without this,
ExpertTensorType::from_name returned None for the fused tensor and
callers fell back to .unwrap_or(Gate), collapsing the cache key.

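
A rough sketch of the classifier change described above. The enum and `from_name` follow the identifiers used in this commit message, but the exact shape of Hypura's real classifier is an assumption here:

```rust
// Sketch of the expert-tensor classifier with the new fused variant.
// Variant names follow the commit message; the real code may differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ExpertTensorType {
    Gate,   // blk.N.ffn_gate_exps.weight (Mixtral-style)
    Up,     // blk.N.ffn_up_exps.weight
    Down,   // blk.N.ffn_down_exps.weight
    GateUp, // blk.N.ffn_gate_up_exps.weight (Gemma 4 fused gate+up)
}

impl ExpertTensorType {
    fn from_name(name: &str) -> Option<Self> {
        if name.contains("ffn_gate_up_exps") {
            Some(Self::GateUp)
        } else if name.contains("ffn_gate_exps") {
            Some(Self::Gate)
        } else if name.contains("ffn_up_exps") {
            Some(Self::Up)
        } else if name.contains("ffn_down_exps") {
            Some(Self::Down)
        } else {
            None // not an expert tensor; callers must not paper over this
        }
    }
}
```

With the Gate/Up/Down-only enum, the fused name matched no pattern and `.unwrap_or(Gate)` silently merged the fused tensor into Gate's cache slot; a distinct variant keeps the (layer, expert, type) key unique.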
Gemma 4 encodes attention.head_count_kv as a per-layer i32 array
(arr[i32, n_layer]) because layers with sliding-window attention
have different KV head counts than full-attention layers (mostly 8,
sometimes 2). The old get_u32 path returned None for arrays and
fell back to num_heads, over-reserving KV cache headroom.

Add as_u32_array on GgufValue and get_u32_array on GgufFile, then
take the array max for KV reservation. Conservative — over-reserves
rather than under-reserves, which keeps placement decisions safe.

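
A minimal sketch of that fallback, under the assumption that GGUF metadata values are modeled as an enum (the variant and helper names here are illustrative, not Hypura's exact API):

```rust
// Illustrative model of GGUF metadata values; real GGUF has more types.
#[derive(Debug)]
enum GgufValue {
    U32(u32),
    I32Array(Vec<i32>), // e.g. attention.head_count_kv as arr[i32, n_layer]
}

impl GgufValue {
    /// Accept either a scalar or a per-layer array for keys like
    /// attention.head_count_kv; returns None if any element is negative.
    fn as_u32_array(&self) -> Option<Vec<u32>> {
        match self {
            GgufValue::U32(v) => Some(vec![*v]),
            GgufValue::I32Array(a) => a.iter().map(|&v| u32::try_from(v).ok()).collect(),
        }
    }
}

/// KV-cache reservation takes the per-layer maximum: conservative in that
/// it over-reserves for sliding-window layers with fewer KV heads, but it
/// never under-reserves.
fn kv_heads_for_reservation(value: Option<&GgufValue>, num_heads: u32) -> u32 {
    value
        .and_then(GgufValue::as_u32_array)
        .and_then(|heads| heads.into_iter().max())
        .unwrap_or(num_heads) // old behavior: fall back to attention heads
}
```

For a Gemma 4-style value of [8, 8, 8, 8, 8, 2, ...] this reserves for 8 KV heads on every layer instead of falling back to the full attention head count.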
Hypura's tokenize wrapper hardcoded `parse_special=false` when calling
`llama_tokenize`. With this setting, special tokens like Gemma 4's
`<|turn>`, `<turn|>`, and `<|think|>` get tokenized as raw byte
sequences instead of as their actual vocab IDs, so chat-templated
prompts feed nonsense to instruction-tuned models. The model was
recovering enough to produce high tok/s but the output was gibberish
("modelle_modelle_modelle...", "0.5.0.5.0.5...").

Add a `parse_special` parameter to LlamaModel::tokenize and default
the three call sites in the inference paths to pass `true`, matching
upstream `llama-cli`/`llama-completion` behavior.

Verified end-to-end against gemma-4-26B-A4B-it-Q4_K_M:
  Prompt:   <|turn>user\nThe capital of France is<turn|>\n<|turn>model\n
  Output:   "thought\n<channel|>The capital of France is **Paris**."
  Speed:    51 tok/s decode (M1 Max 32GB, SparseMoeMmap path)
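
What the flag changes can be illustrated with a toy tokenizer. This is not Hypura's actual wrapper around `llama_tokenize`, just a sketch of the behavior `parse_special` toggles:

```rust
use std::collections::HashMap;

// Toy tokenizer: with parse_special=true, registered special strings map
// to single vocab IDs; with false they fall through to the byte-level
// fallback (standing in for byte BPE), which is what fed Gemma 4 garbage.
fn tokenize(text: &str, special: &HashMap<&str, u32>, parse_special: bool) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = text;
    'outer: while !rest.is_empty() {
        if parse_special {
            for (tok, id) in special {
                if rest.starts_with(*tok) {
                    out.push(*id);
                    rest = &rest[tok.len()..];
                    continue 'outer;
                }
            }
        }
        // Byte fallback: emit each UTF-8 byte as its own pseudo-token.
        let ch = rest.chars().next().unwrap();
        out.extend(ch.to_string().bytes().map(u32::from));
        rest = &rest[ch.len_utf8()..];
    }
    out
}
```

With a hypothetical special map of {"<|turn>": 200000}, tokenizing "<|turn>user" yields 5 tokens when parse_special is true (one special ID plus four letter bytes) but 11 raw bytes when it is false, so the instruction-tuned model never sees its turn marker in the latter case.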
@AlexHarrowell

Still doesn't work, same error as in #8 - I've reset git to head, pulled, built from source.

The problem is in llama.cpp, specifically the file llama-model.cpp. This contains per-model lookup tables implemented as switch statements, and the code searches them for a matching case.
When this function:

void llama_model::load_arch(llama_model_loader & ml) {
    arch = ml.get_arch();
    if (arch == LLM_ARCH_UNKNOWN) {
        throw std::runtime_error("unknown model architecture: '" + ml.get_arch_name() + "'");
    }
}

is called, if there is no matching entry in that table, arch is set to LLM_ARCH_UNKNOWN, the comparison holds, and the runtime error I am seeing is thrown (it's the only place in the codebase where that string appears).

Slightly oddly, the only pathway to that specific error message comes from llama_rope_type rather than from arch itself - arch contains a lot of architecture information about different models but the return statement if nothing is found is as follows:

    default: throw std::runtime_error("unsupported model architecture: " + arch_name());
}

which is not the error. llama_rope_type holds info on RoPE schemes, and apparently all the Gemmas so far are LLAMA_ROPE_TYPE_NEOX, but if I understand the Gemma4 paper I think it's not exactly the same.


Development

Successfully merging this pull request may close these issues.

Question : Have you tried gemma 4 - 31b ? And can I use the latest llama cpp with this project ?
