Name and Version
pwilkin@SYN-PC-11:/devel/models$ llama-cli --version
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31679 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15839 MiB
Device 1: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15839 MiB
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-alderlake.so
version: 8738 (d6f3030)
built with GNU 15.2.0 for Linux x86_64
pwilkin@SYN-PC-11:/devel/models$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
Operating systems
Linux
GGML backends
CUDA
Hardware
2x 5070 Ti 16 GB
CPU: Intel(R) Core(TM) i7-14700KF
Models
Gemma 31B Q5_K_S (bartowski)
Problem description & steps to reproduce
Running
llama-server -m google_gemma-4-31B-it-Q5_K_M.gguf -c 150000 -a syndatis --mmproj mmproj-google_gemma-4-31B-it-q8_0.gguf --chat-template-file /devel/tools/llama.cpp/models/templates/google-gemma-4-31B-it-interleaved.jinja --host 0.0.0.0 --cache-ram 4096 -ctxcp 4 -np 1 -sm tensor -nkvo
fails with an assertion error. The same model runs fine at a smaller context size without -nkvo.
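A possibly more minimal reproduction (assumption, untested: the mmproj, chat template, alias, and cache flags are not needed to trigger the assert, only the large context plus -sm tensor with -nkvo):

llama-server -m google_gemma-4-31B-it-Q5_K_M.gguf -c 150000 -sm tensor -nkvo

If the assert still fires with this reduced command, that would point at the -sm tensor / -nkvo interaction on multi-GPU rather than the multimodal or caching paths.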
First Bad Commit
No response
Relevant log output
/devel/tools/llama.cpp/ggml/src/ggml-backend-meta.cpp:729: GGML_ASSERT(src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_2) failed