eval time = 26828.29 ms / 1132 tokens ( 23.70 ms per token, 42.19 tokens per second)
total time = 27013.15 ms / 1146 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 1145, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.226 (> 0.100 thold), f_keep = 0.010
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1145, total state size = 221.211 MiB
/home/kotokin/llama.cpp/ggml/src/ggml-backend-meta.cpp:1276: GGML_ASSERT(split_state.n_segments == 1) failed
[New LWP 45990]
[New LWP 45989]
[New LWP 45988]
[New LWP 45987]
[New LWP 45986]
[New LWP 45985]
[New LWP 45984]
[New LWP 45983]
[New LWP 45982]
[New LWP 45981]
[New LWP 45980]
[New LWP 45979]
[New LWP 45978]
[New LWP 45977]
[New LWP 45976]
[New LWP 45975]
[New LWP 45974]
[New LWP 45973]
[New LWP 45972]
[New LWP 45971]
[New LWP 45970]
[New LWP 45969]
[New LWP 45968]
[New LWP 45967]
[New LWP 45966]
[New LWP 45965]
[New LWP 45964]
[New LWP 45963]
[New LWP 45962]
[New LWP 45961]
[New LWP 45960]
[New LWP 45955]
[New LWP 45954]
[New LWP 45953]
[New LWP 45952]
[New LWP 45951]
[New LWP 45950]
[New LWP 45949]
[New LWP 45948]
[New LWP 45935]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x000076bfc4aa013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x000076bfc4b1ca0f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x000076bfc5619cd3 in ggml_print_backtrace () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#5 0x000076bfc5619e86 in ggml_abort () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#6 0x000076bfc5640713 in ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#7 0x000076bfc52cfe7f in llama_io_write_buffer::write_tensor(ggml_tensor const*, unsigned long, unsigned long) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#8 0x000076bfc5331d1d in llama_memory_recurrent::state_write_data(llama_io_write_i&, std::vector<std::pair<unsigned int, unsigned int>, std::allocator<std::pair<unsigned int, unsigned int> > > const&) const () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#9 0x000076bfc53320b8 in llama_memory_recurrent::state_write(llama_io_write_i&, int, unsigned int) const () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#10 0x000076bfc52bfc9f in llama_context::state_seq_write_data(llama_io_write_i&, int, unsigned int) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#11 0x000076bfc52bfd8d in llama_context::state_seq_get_data(int, unsigned char*, unsigned long, unsigned int) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#12 0x000056ed206a5bef in server_context_impl::get_available_slot(server_task const&) ()
#13 0x000056ed206bed47 in server_context_impl::process_single_task(server_task&&) ()
#14 0x000056ed20741ac7 in server_queue::start_loop(long) ()
#15 0x000056ed206086f7 in main ()
[Inferior 1 (process 45934) detached]
Aborted (core dumped)
common_init_result: added <eos> logit bias = -inf
common_init_result: added <|tool_response> logit bias = -inf
common_init_result: added <turn|> logit bias = -inf
llama_init_from_model: enabling flash_attn since it is required for SPLIT_MODE_TENSOR
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
/home/kotokin/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
[New LWP 46272]
[New LWP 46271]
[New LWP 46270]
[New LWP 46269]
[New LWP 46268]
[New LWP 46267]
[New LWP 46266]
[New LWP 46265]
[New LWP 46264]
[New LWP 46263]
[New LWP 46262]
[New LWP 46261]
[New LWP 46260]
[New LWP 46259]
[New LWP 46258]
[New LWP 46257]
[New LWP 46256]
[New LWP 46255]
[New LWP 46254]
[New LWP 46253]
[New LWP 46252]
[New LWP 46251]
[New LWP 46250]
[New LWP 46249]
[New LWP 46248]
[New LWP 46247]
[New LWP 46246]
[New LWP 46245]
[New LWP 46244]
[New LWP 46243]
[New LWP 46242]
[New LWP 46236]
[New LWP 46235]
[New LWP 46234]
[New LWP 46233]
[New LWP 46232]
[New LWP 46231]
[New LWP 46230]
[New LWP 46229]
[New LWP 46218]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007e885e8a013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007e885e91ca0f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007e885ef54cd3 in ggml_print_backtrace () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007e885ef54e86 in ggml_abort () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007e885ef6bfa0 in ggml_backend_buffer_get_size () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#7 0x00007e885ef7f355 in ggml_backend_meta_alloc_ctx_tensors_from_buft () from /home/kotokin/llama.cpp/build/bin/libggml-base.so.0
#8 0x00007e885f113211 in llama_kv_cache::llama_kv_cache(llama_model const&, ggml_type, ggml_type, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, llama_swa_type, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#9 0x00007e885f120eae in llama_kv_cache_iswa::llama_kv_cache_iswa(llama_model const&, ggml_type, ggml_type, bool, bool, bool, bool, unsigned int, unsigned int, unsigned int, unsigned int, std::function<bool (int)> const&, std::function<int (int)> const&) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#10 0x00007e885f150936 in llama_model::create_memory(llama_memory_params const&, llama_cparams const&) const () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#11 0x00007e885f0c690c in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#12 0x00007e885f0c73b1 in llama_init_from_model () from /home/kotokin/llama.cpp/build/bin/libllama.so.0
#13 0x000060de66a70f9b in common_init_result::common_init_result(common_params&) ()
#14 0x000060de66a72d8a in common_init_from_params(common_params&) ()
#15 0x000060de6697205e in server_context_impl::load_model(common_params&) ()
#16 0x000060de668b8175 in main ()
[Inferior 1 (process 46217) detached]
Aborted (core dumped)
Name and Version
./llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72375 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24123 MiB
version: 8760 (865ff06)
built with GNU 15.2.0 for Linux x86_64
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
I get strange behavior with the two models listed above. With the 27b, I can load the model successfully and even use it in the built-in llama.cpp webui: the first message succeeds, but on the second one the server aborts with the GGML_ASSERT(split_state.n_segments == 1) failure shown above (a repro sketch follows below).
With the gemma4-26b-4A model, I can't even load the model: it crashes during what looks like the KV-cache allocation attempt, with GGML_ASSERT(buffer) in ggml-backend.cpp.
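For reference, a minimal sketch of how the first crash can be triggered (assumptions not taken from the logs: the server listens at the default http://localhost:8080 with no API key, and the OpenAI-compatible /v1/chat/completions endpoint is used — the built-in webui does the equivalent). Any two consecutive chat messages are enough; per the log above, the abort fires when the server saves the prompt cache while selecting a slot for the second request:

```python
# Hypothetical repro sketch (stdlib only). Assumes llama-server is running
# at its default address and exposes the OpenAI-compatible chat endpoint.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # default address (assumption)

def chat(prompt: str) -> str:
    # POST a single-turn chat request and return the assistant's reply.
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

print(chat("Hello"))        # first message: succeeds
print(chat("Hello again"))  # second message: server aborts while saving the
                            # prompt cache (srv prompt_save in the log above)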
First Bad Commit
No response
Relevant log output
Logs