Eval bug: Gemma 4 fails to load with tensor shape mismatch due to sliding_window_pattern read as uint32_t instead of bool #21434

@hmannekomimi-ctrl2

Description

Name and Version

version: b8661 (b7ad48e)
built with MSVC 19.44.35211.0 for Windows x86_64
CMake options: GGML_CUDA=ON, GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, BUILD_SHARED_LIBS=ON, LLAMA_OPENSSL=ON
CUDA 12.9

Operating systems

Windows

GGML backends

CUDA

Hardware

AMD Ryzen 7 7800X3D + NVIDIA GeForce RTX 4070 (12GB VRAM)

Models

unsloth/gemma-4-26B-A4B-it-GGUF (UD-Q4_K_M)
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF

Problem description & steps to reproduce

Summary

Attempting to load a Gemma 4 model (e.g., gemma-4-26B-A4B-it-GGUF) with a build compiled from source fails with a tensor shape mismatch error:

llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.attn_q.weight' has wrong shape; expected 2816, 8192, got 2816, 4096

However, the same model loads and works correctly in LM Studio v2.11.0, which uses a forked version of llama.cpp.

Root Cause

The GGUF file stores gemma4.attention.sliding_window_pattern as a bool[] array (30 entries). The llama.cpp loader reads it into std::array<uint32_t, LLAMA_MAX_LAYERS> swa_layers via get_key_or_arr. Because each stored bool is 1 byte but each uint32_t is 4 bytes, the raw bytes are misinterpreted: is_swa() returns the wrong result for most layers, which in turn yields incorrect n_embd_head_k and n_embd_k_gqa values and ultimately the tensor shape mismatch above.

Fix

Changing swa_layers from std::array<uint32_t, LLAMA_MAX_LAYERS> to std::array<bool, LLAMA_MAX_LAYERS> and adding the required get_arr<bool, 512> template instantiation resolves the issue. The model then loads correctly and produces the same n_embd_k_gqa values as LM Studio.

First Bad Commit

No response

Relevant log output

Error log (before fix)

$ llama-server -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf --ctx-size 8192 --n-gpu-layers 30 --flash-attn on --cpu-moe
print_info: arch                  = gemma4
print_info: n_embd_k_gqa          = [2048, 4096, 2048, 2048, 2048, 512, 2048, 4096, ...]
print_info: n_embd_v_gqa          = [2048, 4096, 2048, 2048, 2048, 512, 2048, 4096, ...]
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.attn_q.weight' has wrong shape; expected 2816, 8192, got 2816, 4096

Expected output (after fix)

print_info: n_embd_k_gqa          = [2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, ...]
print_info: n_embd_v_gqa          = [2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, ...]
load_tensors: ............................................................
main: model loaded
srv  update_slots: all slots are idle
