Eval bug: tensor parallelism failing with -nkvo (Gemma 31B) #21686

@pwilkin

Description
Name and Version

pwilkin@SYN-PC-11:/devel/models$ llama-cli --version
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31679 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15839 MiB
Device 1: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15839 MiB
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-alderlake.so
version: 8738 (d6f3030)
built with GNU 15.2.0 for Linux x86_64

pwilkin@SYN-PC-11:/devel/models$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

Operating systems

Linux

GGML backends

CUDA

Hardware

2x 5070 Ti 16 GB
CPU: Intel(R) Core(TM) i7-14700KF

Models

Gemma 31B Q5_K_S (bartowski)

Problem description & steps to reproduce

Running

llama-server -m google_gemma-4-31B-it-Q5_K_M.gguf -c 150000 -a syndatis --mmproj mmproj-google_gemma-4-31B-it-q8_0.gguf --chat-template-file /devel/tools/llama.cpp/models/templates/google-gemma-4-31B-it-interleaved.jinja --host 0.0.0.0 --cache-ram 4096 -ctxcp 4 -np 1 -sm tensor -nkvo

fails with an assertion error. The same model runs fine at a smaller context size and without -nkvo.
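A reduced invocation may be easier to bisect with, assuming the failure depends only on the combination of tensor split mode and -nkvo and not on the multimodal projector, chat template, or server flags (an assumption; only the full command above is confirmed to fail):

```shell
# Hypothetical minimal reproduction: same model and context size, tensor split
# across both GPUs (-sm tensor), KV cache kept off the GPUs (-nkvo); all other
# flags from the original command dropped.
llama-server -m google_gemma-4-31B-it-Q5_K_M.gguf -c 150000 -sm tensor -nkvo
```

If this still trips the GGML_ASSERT, the mmproj and template flags can be ruled out as contributing factors.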

First Bad Commit

No response

Relevant log output

Logs
/devel/tools/llama.cpp/ggml/src/ggml-backend-meta.cpp:729: GGML_ASSERT(src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_2) failed

Metadata

Assignees

No one assigned

Labels

CUDA (Related to the CUDA backend), bug (Something isn't working)
