Conversation
Picks up gemma4 architecture support merged in upstream ggml-org/llama.cpp#21309 (LLM_ARCH_GEMMA4 + src/models/gemma4-iswa.cpp) plus seven follow-up patches:

- vocab byte token handling for BPE detokenizer (#21488)
- convert: set add_bos to True (#21500)
- specialized parser (#21418)
- final logit softcapping (#21390)
- custom newline split (#21406)
- tokenizer fixes (#21343)
- chat template fix (#21326)

373 commits across the gap. No API breakage observed in hypura-sys — release build and full test suite (41 tests) still pass.
Mixtral-style MoE models keep gate/up/down as three separate fused expert tensors. Gemma 4 fuses gate+up into a single tensor (blk.N.ffn_gate_up_exps.weight), leaving two fused tensors per MoE layer instead of three. Add a GateUp variant to ExpertTensorType so the per-(layer, expert, type) neuron cache key stays distinct from plain Gate. Without this, ExpertTensorType::from_name returned None for the fused tensor and callers fell back to .unwrap_or(Gate), collapsing the cache key.
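A minimal Rust sketch of the shape this takes. The enum and `from_name` bodies here are illustrative reconstructions, not Hypura's actual code; only the `GateUp` variant, the `ffn_gate_up_exps` tensor name, and the `.unwrap_or(Gate)` fallback behavior come from the commit above.

```rust
/// Per-expert tensor kind used in the (layer, expert, type) neuron-cache key.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ExpertTensorType {
    Gate,
    Up,
    Down,
    /// New: Gemma 4's fused gate+up expert tensor (blk.N.ffn_gate_up_exps.weight).
    GateUp,
}

impl ExpertTensorType {
    /// Classify a GGUF tensor name. Before the GateUp arm existed, the fused
    /// name matched nothing, from_name returned None, and callers collapsed
    /// the key to Gate via `.unwrap_or(ExpertTensorType::Gate)`.
    fn from_name(name: &str) -> Option<Self> {
        if name.contains("ffn_gate_up_exps") {
            Some(Self::GateUp)
        } else if name.contains("ffn_gate_exps") {
            Some(Self::Gate)
        } else if name.contains("ffn_up_exps") {
            Some(Self::Up)
        } else if name.contains("ffn_down_exps") {
            Some(Self::Down)
        } else {
            None
        }
    }
}

fn main() {
    // The fused tensor now gets its own variant instead of falling back to Gate,
    // so the per-(layer, expert, type) cache key stays distinct.
    let t = ExpertTensorType::from_name("blk.0.ffn_gate_up_exps.weight");
    assert_eq!(t, Some(ExpertTensorType::GateUp));
}
```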
Gemma 4 encodes attention.head_count_kv as a per-layer i32 array (arr[i32, n_layer]) because layers with sliding-window attention have different KV head counts than full-attention layers (mostly 8, sometimes 2). The old get_u32 path returned None for arrays and fell back to num_heads, over-reserving KV cache headroom. Add as_u32_array on GgufValue and get_u32_array on GgufFile, then take the array max for KV reservation. Conservative — over-reserves rather than under-reserves, which keeps placement decisions safe.
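A sketch of the array fallback. The `GgufValue`/`GgufFile` shapes and the `kv_heads_for_reservation` helper are assumptions about Hypura's internals; only the `as_u32_array`/`get_u32_array` names, the `attention.head_count_kv` key, and the max-based reservation come from the commit above.

```rust
use std::collections::HashMap;

enum GgufValue {
    U32(u32),
    I32(i32),
    Array(Vec<GgufValue>),
    // ... other GGUF value kinds elided
}

impl GgufValue {
    fn as_u32(&self) -> Option<u32> {
        match self {
            GgufValue::U32(v) => Some(*v),
            GgufValue::I32(v) => u32::try_from(*v).ok(),
            _ => None,
        }
    }

    /// New: flatten an integer array value into Vec<u32>.
    fn as_u32_array(&self) -> Option<Vec<u32>> {
        match self {
            GgufValue::Array(items) => items.iter().map(|v| v.as_u32()).collect(),
            _ => None,
        }
    }
}

struct GgufFile {
    metadata: HashMap<String, GgufValue>,
}

impl GgufFile {
    fn get_u32(&self, key: &str) -> Option<u32> {
        self.metadata.get(key).and_then(|v| v.as_u32())
    }

    fn get_u32_array(&self, key: &str) -> Option<Vec<u32>> {
        self.metadata.get(key).and_then(|v| v.as_u32_array())
    }

    /// KV heads used for cache reservation: scalar if present, else the
    /// per-layer array's max (conservative: over-reserves for sliding-window
    /// layers that only have 2 KV heads), else fall back to num_heads.
    fn kv_heads_for_reservation(&self, num_heads: u32) -> u32 {
        self.get_u32("attention.head_count_kv")
            .or_else(|| {
                self.get_u32_array("attention.head_count_kv")
                    .and_then(|a| a.into_iter().max())
            })
            .unwrap_or(num_heads)
    }
}

fn main() {
    // Per-layer KV head counts like Gemma 4's: full-attention layers use 8,
    // sliding-window layers use 2. Reservation takes the max (8).
    let mut metadata = HashMap::new();
    metadata.insert(
        "attention.head_count_kv".to_string(),
        GgufValue::Array(vec![GgufValue::I32(8), GgufValue::I32(8), GgufValue::I32(2)]),
    );
    let gguf = GgufFile { metadata };
    assert_eq!(gguf.kv_heads_for_reservation(16), 8);
}
```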
Hypura's tokenize wrapper hardcoded `parse_special=false` when calling
`llama_tokenize`. With this setting, special tokens like Gemma 4's
`<|turn>`, `<turn|>`, and `<|think|>` get tokenized as raw byte
sequences instead of as their actual vocab IDs, so chat-templated
prompts feed nonsense to instruction-tuned models. The model was
recovering enough to produce high tok/s but the output was gibberish
("modelle_modelle_modelle...", "0.5.0.5.0.5...").
Add a `parse_special` parameter to LlamaModel::tokenize and default
the three call sites in the inference paths to pass `true`, matching
upstream `llama-cli`/`llama-completion` behavior.
Verified end-to-end against gemma-4-26B-A4B-it-Q4_K_M:
Prompt: <|turn>user\nThe capital of France is<turn|>\n<|turn>model\n
Output: "thought\n<channel|>The capital of France is **Paris**."
Speed: 51 tok/s decode (M1 Max 32GB, SparseMoeMmap path)
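A minimal sketch of the wrapper change. The `LlamaModel`/`Token` types and the FFI plumbing below are stand-ins rather than hypura-sys's actual definitions; only the new `parse_special` parameter and the `true` default at the inference call sites come from this commit.

```rust
// Sketch only: real code forwards both flags to llama_tokenize; the FFI call
// is elided here.
type Token = i32;

struct LlamaModel {/* handle to the loaded llama.cpp model elided */}

impl LlamaModel {
    /// Tokenize `text`. With `parse_special = true`, special tokens such as
    /// `<|turn>` and `<|think|>` resolve to their vocab IDs; with `false`
    /// (the old hardcoded behavior) they are split into raw byte tokens and
    /// chat-templated prompts feed nonsense to instruction-tuned models.
    fn tokenize(&self, text: &str, add_bos: bool, parse_special: bool) -> Vec<Token> {
        let _ = (text, add_bos, parse_special);
        Vec::new()
    }
}

fn main() {
    let model = LlamaModel {};
    let prompt = "<|turn>user\nThe capital of France is<turn|>\n<|turn>model\n";
    // Inference paths now pass `true`, matching upstream llama-cli / llama-completion.
    let _tokens = model.tokenize(prompt, true, /* parse_special */ true);
}
```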
Still doesn't work, same error as in #8 - I've reset git to head, pulled, and built from source.

The problem is in llama.cpp, specifically in llama-model.cpp. That file contains lookup tables, written as switch/case statements, with information about a lot of models, and the code searches them for a matching case. `void llama_model::load_arch(llama_model_loader & ml)` is called; if there is no matching entry in that table, `arch` is set to `LLM_ARCH_UNKNOWN`, the comparison holds, and the runtime error I am seeing is thrown (it's the only call in the codebase where that string figures).

Slightly oddly, the only pathway to that specific error message comes from `llama_rope_type` rather than from `arch` itself - `arch` contains a lot of architecture information about different models, but its return statement when nothing is found is `default: throw std::runtime_error("unsupported model architecture: " + arch_name());`, which is not the error I'm getting. `llama_rope_type` holds info on RoPE schemes, and apparently all the Gemmas so far are `LLAMA_ROPE_TYPE_NEOX`, but if I understand the Gemma 4 paper correctly, I think it's not exactly the same.
Summary
- Bump `vendor/llama.cpp` to `66c4f9ded` to pick up upstream Gemma 4 support (ggml-org/llama.cpp#21309 plus seven follow-up patches)
- New `ExpertTensorType::GateUp` variant for Gemma 4's fused `ffn_gate_up_exps` expert tensor
- Array fallback so the per-layer `attention.head_count_kv` works on Gemma 4
- Change `tokenize` to pass `parse_special=true` so chat-template tokens (`<|turn>`, `<|think|>`, etc.) are recognized as their vocab IDs instead of being tokenized as raw bytes

Fixes #8.
Why
The error AlexHarrowell reported (`unknown model architecture: 'gemma4'`) was llama.cpp itself, not Hypura — the vendored submodule was at `b5e1212` from March 13, while gemma4 support landed upstream on April 2 in #21309. Hypura classifies tensors by name patterns rather than maintaining its own per-architecture tables, so once upstream knows the architecture, almost everything just works. The Hypura-side adaptations:

- `ExpertTensorType::GateUp` — Mixtral keeps gate/up/down as three separate fused expert tensors. Gemma 4 fuses gate+up into one (`ffn_gate_up_exps`), giving two fused tensors per MoE layer. Without a distinct enum variant, the neuron-cache key for the fused tensor was collapsing to `Gate` via `.unwrap_or()`.
- `get_u32_array` — Gemma 4 encodes `attention.head_count_kv` as `arr[i32, n_layer]` because layers with sliding-window attention have different KV head counts than full-attention layers. The old `get_u32` path returned `None` and fell back to `num_heads`, over-reserving KV cache headroom.
- `parse_special=true` — discovered while running a real bench against the model. With `parse_special=false`, Gemma 4's chat-template tokens (`<|turn>`, `<turn|>`, `<|think|>` — gemma4 uses different turn markers than gemma3's `<start_of_turn>`) get tokenized as raw bytes. The model produced high tok/s but gibberish output ("modelle_modelle_modelle...", "0.5.0.5.0.5..."). Vanilla `llama-completion` uses `parse_special=true` and produces coherent output; Hypura now matches.

Test plan
- `cargo build --release` clean against new llama.cpp (no internal-header drift in `hypura_buft.c`)
- `cargo test --release --lib` — 41 passed, 0 failed
- `hypura inspect` correctly identifies arch=gemma4, MoE 128/8, layers=30, KV heads=8 (validates the array-fallback path; the underlying value is `arr[i32,30] = [8,8,8,8,8,2,...]`)
- End-to-end against `gemma-4-26B-A4B-it-Q4_K_M` (the lmstudio file from the issue):
  - Sparse MoE mmap: 6% activation (8/128 experts), 1.0 GB active of 15.6 GB total → 15.6 GB GPU | 0 B RAM | 0 B NVMe ✓
  - `bench --max-tokens 50`: 51 tok/s decode, 577 tok/s prompt eval, 2.9s wall time ✓
  - `run --prompt '<|turn>user\nThe capital of France is<turn|>\n<|turn>model\n'`: "The capital of France is Paris." ✓
  - `run --prompt '<|turn>user\nWrite one sentence about cats.<turn|>\n<|turn>model\n'`: "Cats are fascinating creatures known for their independent personalities and graceful movements." ✓

Notes
- The `gemma4` (8B variant) file is a multimodal bundle (vision + audio + projector) — `llama_model_load` errors on it because the LLM-only loader doesn't consume the multimodal projector tensors. That's a separate, upstream-side limitation; the lmstudio text-only file from the issue doesn't hit it.
- A follow-up could apply the chat template automatically in `hypura run`, so users don't need to hand-construct `<|turn>user\n...<turn|>\n<|turn>model\n` prefixes themselves (sketched below). Out of scope for this PR.
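For illustration of that follow-up, a minimal sketch of wrapping a bare prompt in the turn markers used by this PR's test prompts. The function name and hardcoded template are hypothetical; a real implementation would presumably read the chat template from the model rather than hardcode it.

```rust
/// Hypothetical helper: wrap a bare user prompt in Gemma 4's turn markers
/// (as used in this PR's test prompts) so `hypura run` callers don't have to
/// hand-construct the prefix themselves.
fn gemma4_chat_prompt(user_prompt: &str) -> String {
    format!("<|turn>user\n{user_prompt}<turn|>\n<|turn>model\n")
}

fn main() {
    let prompt = gemma4_chat_prompt("The capital of France is");
    assert_eq!(
        prompt,
        "<|turn>user\nThe capital of France is<turn|>\n<|turn>model\n"
    );
}
```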