eval time = 26828.29 ms / 1132 tokens ( 23.70 ms per token, 42.19 tokens per second)
total time = 27013.15 ms / 1146 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 1145, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.226 (> 0.100 thold), f_keep = 0.010
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1145, total state size = 221.211 MiB
/home/kotokin/llama.cpp/ggml/src/ggml-backend-meta.cpp:1276: GGML_ASSERT(split_state.n_segments == 1) failed
[New LWP 45990]
[New LWP 45989]
[New LWP 45988]
[New LWP 45987]
[New LWP 45986]
[New LWP 45985]
[New LWP 45984]
[New LWP 45983]
[New LWP 45982]
[New LWP 45981]
[New LWP 45980]
[New LWP 45979]
[New LWP 45978]
[New LWP 45977]
[New LWP 45976]
[New LWP 45975]
[New LWP 45974]
[New LWP 45973]
[New LWP 45972]
[New LWP 45971]
[New LWP 45970]
[New LWP 45969]
[New LWP 45968]
[New LWP 45967]
[New LWP 45966]
[New LWP 45965]
[New LWP 45964]
[New LWP 45963]
[New LWP 45962]
[New LWP 45961]
[New LWP 45960]
[New LWP 45955]
[New LWP 45954]
[New LWP 45953]
[New LWP 45952]
[New LWP 45951]
[New LWP 45950]
[New LWP 45949]
[New LWP 45948]
[New LWP 45935]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x000076bfc4aa013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x000076bfc4b1ca0f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x000076bfc5619cd3 in ggml_print_backtrace () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#5 0x000076bfc5619e86 in ggml_abort () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#6 0x000076bfc5640713 in ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#7 0x000076bfc52cfe7f in llama_io_write_buffer::write_tensor(ggml_tensor const*, unsigned long, unsigned long) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#8 0x000076bfc5331d1d in llama_memory_recurrent::state_write_data(llama_io_write_i&, std::vector<std::pair<unsigned int, unsigned int>, std::allocator<std::pair<unsigned int, unsigned int> > > const&) const () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#9 0x000076bfc53320b8 in llama_memory_recurrent::state_write(llama_io_write_i&, int, unsigned int) const () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#10 0x000076bfc52bfc9f in llama_context::state_seq_write_data(llama_io_write_i&, int, unsigned int) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#11 0x000076bfc52bfd8d in llama_context::state_seq_get_data(int, unsigned char*, unsigned long, unsigned int) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#12 0x000056ed206a5bef in server_context_impl::get_available_slot(server_task const&) ()
#13 0x000056ed206bed47 in server_context_impl::process_single_task(server_task&&) ()
#14 0x000056ed20741ac7 in server_queue::start_loop(long) ()
#15 0x000056ed206086f7 in main ()
[Inferior 1 (process 45934) detached]
Aborted (core dumped)
common_init_result: added <eos> logit bias = -inf
common_init_result: added <|tool_response> logit bias = -inf
common_init_result: added <turn|> logit bias = -inf
llama_init_from_model: enabling flash_attn since it is required for SPLIT_MODE_TENSOR
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
/home/kotokin/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
[New LWP 46272]
[New LWP 46271]
[New LWP 46270]
[New LWP 46269]
[New LWP 46268]
[New LWP 46267]
[New LWP 46266]
[New LWP 46265]
[New LWP 46264]
[New LWP 46263]
[New LWP 46262]
[New LWP 46261]
[New LWP 46260]
[New LWP 46259]
[New LWP 46258]
[New LWP 46257]
[New LWP 46256]
[New LWP 46255]
[New LWP 46254]
[New LWP 46253]
[New LWP 46252]
[New LWP 46251]
[New LWP 46250]
[New LWP 46249]
[New LWP 46248]
[New LWP 46247]
[New LWP 46246]
[New LWP 46245]
[New LWP 46244]
[New LWP 46243]
[New LWP 46242]
[New LWP 46236]
[New LWP 46235]
[New LWP 46234]
[New LWP 46233]
[New LWP 46232]
[New LWP 46231]
[New LWP 46230]
[New LWP 46229]
[New LWP 46218]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007e885e8a013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007e885e91ca0f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007e885ef54cd3 in ggml_print_backtrace () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007e885ef54e86 in ggml_abort () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007e885ef6bfa0 in ggml_backend_buffer_get_size () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#7 0x00007e885ef7f355 in ggml_backend_meta_alloc_ctx_tensors_from_buft () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#8 0x00007e885f113211 in llama_kv_cache::llama_kv_cache(llama_model const&, ggml_type, ggml_type, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, llama_swa_type, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#9 0x00007e885f120eae in llama_kv_cache_iswa::llama_kv_cache_iswa(llama_model const&, ggml_type, ggml_type, bool, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#10 0x00007e885f150936 in llama_model::create_memory(llama_memory_params const&, llama_cparams const&) const () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#11 0x00007e885f0c690c in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#12 0x00007e885f0c73b1 in llama_init_from_model () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#13 0x000060de66a70f9b in common_init_result::common_init_result(common_params&) ()
#14 0x000060de66a72d8a in common_init_from_params(common_params&) ()
#15 0x000060de6697205e in server_context_impl::load_model(common_params&) ()
#16 0x000060de668b8175 in main ()
[Inferior 1 (process 46217) detached]
Aborted (core dumped)
Name and Version
./llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72375 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24123 MiB
version: 8760 (865ff06)
built with GNU 15.2.0 for Linux x86_64
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
I get strange behavior with the two models listed above. With the 27b, I can load the model successfully and even use it in the built-in llama.cpp webui: the first message succeeds, but on the second one the server aborts with the GGML_ASSERT(split_state.n_segments == 1) failure shown above (a repro sketch follows below).
With the gemma4-26b-4A model, I can't even load the model: it crashes during what looks like the KV-cache allocation attempt, with GGML_ASSERT(buffer) in ggml-backend.cpp.
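For reference, a minimal sketch of how the first crash can be triggered (assumptions not taken from the logs: the server listens at the default http://localhost:8080 with no API key, and the OpenAI-compatible /v1/chat/completions endpoint is used — the built-in webui does the equivalent). Any two consecutive chat messages are enough; per the log above, the abort fires when the server saves the prompt cache while selecting a slot for the second request:

```python
# Hypothetical repro sketch (stdlib only). Assumes llama-server is running
# at its default address and exposes the OpenAI-compatible chat endpoint.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # default address (assumption)

def chat(prompt: str) -> str:
    # POST a single-turn chat request and return the assistant's reply.
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

print(chat("Hello"))        # first message: succeeds
print(chat("Hello again"))  # second message: server aborts while saving the
                            # prompt cache (srv prompt_save in the log above)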
First Bad Commit
No response
Relevant log output
Logs