WARNING: radv is not a conformant Vulkan implementation, testing use only.
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-cpu.so
system info: n_threads = 6, n_threads_batch = 6, total_threads = 20
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 19 threads for HTTP server
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-cpu.so
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-cpu.so
srv load_models: Loaded 1 cached model presets
srv load_models: Loaded 7 local model presets from /var/home/bazzite/.lmstudio/models-flat/
srv load_models: Available models (8) (*: custom preset)
srv load_models: GLM-4.7-Flash-UD-IQ3_XXS
srv load_models: Qwen3.5-35B-A3B-APEX-Mini-SystemAnywhere
srv load_models: Qwen3.5-35B-A3B-Claude-Distilled-APEX-I-Mini
srv load_models: gemma-4-26B-A4B-APEX-I-Mini
srv load_models: gemma-4-26B-A4B-heretic-APEX-I-Mini
srv load_models: gemma-4-26B-A4B-it-UD-Q2_K_XL
srv load_models: gemma-4-26B-A4B-it-UD-Q3_K_XL
srv load_models: unsloth/Qwen3.5-4B-GGUF:Q4_K_XL
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://0.0.0.0:8081
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv load: spawning server instance with name=gemma-4-26B-A4B-it-UD-Q3_K_XL on port 48203
srv load: spawning server instance with args:
srv load: /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/llama-server
srv load: --clear-idle
srv load: --host
srv load: 127.0.0.1
srv load: --no-mmproj-offload
srv load: --port
srv load: 48203
srv load: --alias
srv load: gemma-4-26B-A4B-it-UD-Q3_K_XL
srv load: --cache-type-k
srv load: turbo3
srv load: --cache-type-v
srv load: turbo3
srv load: --flash-attn
srv load: on
srv load: --fit-ctx
srv load: 96000
srv load: --fit-target
srv load: 400
srv load: --kv-unified
srv load: --model
srv load: /var/home/bazzite/.lmstudio/models-flat/gemma-4-26B-A4B-it-UD-Q3_K_XL/gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf
srv load: --mmproj
srv load: /var/home/bazzite/.lmstudio/models-flat/gemma-4-26B-A4B-it-UD-Q3_K_XL/mmproj-BF16.gguf
srv load: --n-gpu-layers
srv load: all
srv load: --parallel
srv load: 10
srv log_server_r: done request: POST /models/load 127.0.0.1 200
[48203] WARNING: radv is not a conformant Vulkan implementation, testing use only.
[48203] load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so
[48203] load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-cpu.so
[48203] system info: n_threads = 6, n_threads_batch = 6, total_threads = 20
[48203]
[48203] system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[48203]
[48203] init: using 19 threads for HTTP server
[48203] start: binding port with default address family
[48203] main: loading model
[48203] srv load_model: loading model '/var/home/bazzite/.lmstudio/models-flat/gemma-4-26B-A4B-it-UD-Q3_K_XL/gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf'
[48203] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[48203] llama_params_fit_impl: projected to use 14504 MiB of device memory vs. 14925 MiB of free device memory
[48203] llama_params_fit_impl: will leave 421 >= 400 MiB of free device memory, no changes needed
[48203] llama_params_fit: successfully fit params to free device memory
[48203] llama_params_fit: fitting params to free memory took 1.33 seconds
[48203] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 9070 XT (RADV GFX1201)) (0000:03:00.0) - 14925 MiB free
[48203] llama_model_loader: loaded meta data with 60 key-value pairs and 658 tensors from /var/home/bazzite/.lmstudio/models-flat/gemma-4-26B-A4B-it-UD-Q3_K_XL/gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
[48203] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[48203] llama_model_loader: - kv 0: general.architecture str = gemma4
[48203] llama_model_loader: - kv 1: general.type str = model
[48203] llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
[48203] llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
[48203] llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
[48203] llama_model_loader: - kv 5: general.name str = Gemma-4-26B-A4B-It
[48203] llama_model_loader: - kv 6: general.finetune str = it
[48203] llama_model_loader: - kv 7: general.basename str = Gemma-4-26B-A4B-It
[48203] llama_model_loader: - kv 8: general.quantized_by str = Unsloth
[48203] llama_model_loader: - kv 9: general.size_label str = 26B-A4B
[48203] llama_model_loader: - kv 10: general.license str = apache-2.0
[48203] llama_model_loader: - kv 11: general.license.link str = https://ai.google.dev/gemma/docs/gemm...
[48203] llama_model_loader: - kv 12: general.repo_url str = https://huggingface.co/unsloth
[48203] llama_model_loader: - kv 13: general.base_model.count u32 = 1
[48203] llama_model_loader: - kv 14: general.base_model.0.name str = Gemma 4 26B A4B It
[48203] llama_model_loader: - kv 15: general.base_model.0.organization str = Google
[48203] llama_model_loader: - kv 16: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-4...
[48203] llama_model_loader: - kv 17: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
[48203] llama_model_loader: - kv 18: gemma4.block_count u32 = 30
[48203] llama_model_loader: - kv 19: gemma4.context_length u32 = 262144
[48203] llama_model_loader: - kv 20: gemma4.embedding_length u32 = 2816
[48203] llama_model_loader: - kv 21: gemma4.feed_forward_length u32 = 2112
[48203] llama_model_loader: - kv 22: gemma4.attention.head_count u32 = 16
[48203] llama_model_loader: - kv 23: gemma4.attention.head_count_kv arr[i32,30] = [8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 2, ...
[48203] llama_model_loader: - kv 24: gemma4.rope.freq_base f32 = 1000000.000000
[48203] llama_model_loader: - kv 25: gemma4.rope.freq_base_swa f32 = 10000.000000
[48203] llama_model_loader: - kv 26: gemma4.attention.layer_norm_rms_epsilon f32 = 0.000001
[48203] llama_model_loader: - kv 27: gemma4.expert_count u32 = 128
[48203] llama_model_loader: - kv 28: gemma4.expert_used_count u32 = 8
[48203] llama_model_loader: - kv 29: gemma4.attention.key_length u32 = 512
[48203] llama_model_loader: - kv 30: gemma4.attention.value_length u32 = 512
[48203] llama_model_loader: - kv 31: gemma4.final_logit_softcapping f32 = 30.000000
[48203] llama_model_loader: - kv 32: gemma4.attention.sliding_window u32 = 1024
[48203] llama_model_loader: - kv 33: gemma4.attention.shared_kv_layers u32 = 0
[48203] llama_model_loader: - kv 34: gemma4.embedding_length_per_layer_input u32 = 0
[48203] llama_model_loader: - kv 35: gemma4.attention.sliding_window_pattern arr[bool,30] = [true, true, true, true, true, false,...
[48203] llama_model_loader: - kv 36: gemma4.attention.key_length_swa u32 = 256
[48203] llama_model_loader: - kv 37: gemma4.attention.value_length_swa u32 = 256
[48203] llama_model_loader: - kv 38: gemma4.expert_feed_forward_length u32 = 704
[48203] llama_model_loader: - kv 39: gemma4.rope.dimension_count u32 = 512
[48203] llama_model_loader: - kv 40: gemma4.rope.dimension_count_swa u32 = 256
[48203] llama_model_loader: - kv 41: tokenizer.ggml.model str = gemma4
[48203] llama_model_loader: - kv 42: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
[48203] llama_model_loader: - kv 43: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
[48203] llama_model_loader: - kv 44: tokenizer.ggml.token_type arr[i32,262144] = [3, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
[48203] llama_model_loader: - kv 45: tokenizer.ggml.merges arr[str,514906] = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
[48203] llama_model_loader: - kv 46: tokenizer.ggml.bos_token_id u32 = 2
[48203] llama_model_loader: - kv 47: tokenizer.ggml.eos_token_id u32 = 106
[48203] llama_model_loader: - kv 48: tokenizer.ggml.unknown_token_id u32 = 3
[48203] llama_model_loader: - kv 49: tokenizer.ggml.padding_token_id u32 = 0
[48203] llama_model_loader: - kv 50: tokenizer.ggml.mask_token_id u32 = 4
[48203] llama_model_loader: - kv 51: tokenizer.chat_template str = {%- macro format_parameters(propertie...
[48203] llama_model_loader: - kv 52: tokenizer.ggml.add_space_prefix bool = false
[48203] llama_model_loader: - kv 53: tokenizer.ggml.add_bos_token bool = true
[48203] llama_model_loader: - kv 54: general.quantization_version u32 = 2
[48203] llama_model_loader: - kv 55: general.file_type u32 = 12
[48203] llama_model_loader: - kv 56: quantize.imatrix.file str = gemma-4-26B-A4B-it-GGUF/imatrix_unslo...
[48203] llama_model_loader: - kv 57: quantize.imatrix.dataset str = unsloth_calibration_gemma-4-26B-A4B-i...
[48203] llama_model_loader: - kv 58: quantize.imatrix.entries_count u32 = 295
[48203] llama_model_loader: - kv 59: quantize.imatrix.chunks_count u32 = 141
[48203] llama_model_loader: - type f32: 392 tensors
[48203] llama_model_loader: - type q5_1: 2 tensors
[48203] llama_model_loader: - type q8_0: 206 tensors
[48203] llama_model_loader: - type iq3_xxs: 29 tensors
[48203] llama_model_loader: - type iq4_nl: 28 tensors
[48203] llama_model_loader: - type iq4_xs: 1 tensors
[48203] print_info: file format = GGUF V3 (latest)
[48203] print_info: file type = Q3_K - Medium
[48203] print_info: file size = 11.98 GiB (4.08 BPW)
[48203] load: 0 unused tokens
[48203] load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
[48203] load: printing all EOG tokens:
[48203] load: - 106 ('<turn|>')
[48203] load: - 212 ('</s>')
[48203] load: special tokens cache size = 24
[48203] load: token to piece cache size = 1.9443 MB
[48203] print_info: arch = gemma4
[48203] print_info: vocab_only = 0
[48203] print_info: no_alloc = 0
[48203] print_info: n_ctx_train = 262144
[48203] print_info: n_embd = 2816
[48203] print_info: n_embd_inp = 2816
[48203] print_info: n_layer = 30
[48203] print_info: n_head = 16
[48203] print_info: n_head_kv = [8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 2]
[48203] print_info: n_rot = 512
[48203] print_info: n_swa = 1024
[48203] print_info: is_swa_any = 1
[48203] print_info: n_embd_head_k = 512
[48203] print_info: n_embd_head_v = 512
[48203] print_info: n_gqa = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
[48203] print_info: n_embd_k_gqa = [2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024]
[48203] print_info: n_embd_v_gqa = [2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024, 2048, 2048, 2048, 2048, 2048, 1024]
[48203] print_info: f_norm_eps = 0.0e+00
[48203] print_info: f_norm_rms_eps = 1.0e-06
[48203] print_info: f_clamp_kqv = 0.0e+00
[48203] print_info: f_max_alibi_bias = 0.0e+00
[48203] print_info: f_logit_scale = 0.0e+00
[48203] print_info: f_attn_scale = 1.0e+00
[48203] print_info: n_ff = 2112
[48203] print_info: n_expert = 128
[48203] print_info: n_expert_used = 8
[48203] print_info: n_expert_groups = 0
[48203] print_info: n_group_used = 0
[48203] print_info: causal attn = 1
[48203] print_info: pooling type = -1
[48203] print_info: rope type = 2
[48203] print_info: rope scaling = linear
[48203] print_info: freq_base_train = 1000000.0
[48203] print_info: freq_scale_train = 1
[48203] print_info: freq_base_swa = 10000.0
[48203] print_info: freq_scale_swa = 1
[48203] print_info: n_embd_head_k_swa = 256
[48203] print_info: n_embd_head_v_swa = 256
[48203] print_info: n_rot_swa = 256
[48203] print_info: n_ctx_orig_yarn = 262144
[48203] print_info: rope_yarn_log_mul = 0.0000
[48203] print_info: rope_finetuned = unknown
[48203] print_info: model type = ?B
[48203] print_info: model params = 25.23 B
[48203] print_info: general.name = Gemma-4-26B-A4B-It
[48203] print_info: vocab type = BPE
[48203] print_info: n_vocab = 262144
[48203] print_info: n_merges = 514906
[48203] print_info: BOS token = 2 '<bos>'
[48203] print_info: EOS token = 106 '<turn|>'
[48203] print_info: UNK token = 3 '<unk>'
[48203] print_info: PAD token = 0 '<pad>'
[48203] print_info: MASK token = 4 '<mask>'
[48203] print_info: LF token = 107 '
[48203] '
[48203] print_info: EOG token = 106 '<turn|>'
[48203] print_info: EOG token = 212 '</s>'
[48203] print_info: max token length = 93
[48203] load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
[48203] str: cannot properly format tensor name output with suffix=weight bid=-1 xid=-1
[48203] load_tensors: offloading output layer to GPU
[48203] load_tensors: offloading 29 repeating layers to GPU
[48203] load_tensors: offloaded 31/31 layers to GPU
[48203] load_tensors: CPU_Mapped model buffer size = 748.00 MiB
[48203] load_tensors: Vulkan0 model buffer size = 12264.00 MiB
[48203] .............................................................................
[48203] common_init_result: added <turn|> logit bias = -inf
[48203] common_init_result: added </s> logit bias = -inf
[48203] llama_context: constructing llama_context
[48203] llama_context: n_seq_max = 10
[48203] llama_context: n_ctx = 262144
[48203] llama_context: n_ctx_seq = 262144
[48203] llama_context: n_batch = 2048
[48203] llama_context: n_ubatch = 512
[48203] llama_context: causal_attn = 1
[48203] llama_context: flash_attn = enabled
[48203] llama_context: kv_unified = true
[48203] llama_context: freq_base = 1000000.0
[48203] llama_context: freq_scale = 1
[48203] llama_context: Vulkan_Host output buffer size = 10.00 MiB
[48203] llama_kv_cache_iswa: creating non-SWA KV cache, size = 262144 cells
[48203] llama_kv_cache: Vulkan0 KV buffer size = 1000.13 MiB
[48203] llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
[48203] llama_kv_cache: size = 1000.00 MiB (262144 cells, 5 layers, 10/1 seqs), K (turbo3): 500.00 MiB, V (turbo3): 500.00 MiB
[48203] llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
[48203] llama_kv_cache: attn_rot_k = 0
[48203] llama_kv_cache: attn_rot_v = 0
[48203] llama_kv_cache_iswa: creating SWA KV cache, size = 10752 cells
[48203] llama_kv_cache: Vulkan0 KV buffer size = 410.28 MiB
[48203] llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
[48203] llama_kv_cache: size = 410.16 MiB ( 10752 cells, 25 layers, 10/1 seqs), K (turbo3): 205.08 MiB, V (turbo3): 205.08 MiB
[48203] llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
[48203] llama_kv_cache: attn_rot_k = 0
[48203] llama_kv_cache: attn_rot_v = 0
[48203] sched_reserve: reserving ...
[48203] sched_reserve: resolving fused Gated Delta Net support:
[48203] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[48203] sched_reserve: fused Gated Delta Net (chunked) enabled
[48203] sched_reserve: Vulkan0 compute buffer size = 829.79 MiB
[48203] sched_reserve: Vulkan_Host compute buffer size = 552.54 MiB
[48203] sched_reserve: graph nodes = 2707
[48203] sched_reserve: graph splits = 2
[48203] sched_reserve: reserve took 70.20 ms, sched copies = 1
[48203] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[48203] /home/bazzite/llama-cpp-turboquant/ggml/src/ggml-vulkan/ggml-vulkan.cpp:8972: GGML_ASSERT(Br == pipeline->wg_denoms[0]) failed
[48203] [New LWP 931335]
[48203] [New LWP 931334]
[48203] [New LWP 931333]
[48203] [New LWP 931332]
[48203] [New LWP 931331]
[48203] [New LWP 931330]
[48203] [New LWP 931329]
[48203] [New LWP 931328]
[48203] [New LWP 931327]
[48203] [New LWP 931326]
[48203] [New LWP 931325]
[48203] [New LWP 931324]
[48203] [New LWP 931323]
[48203] [New LWP 931322]
[48203] [New LWP 931321]
[48203] [New LWP 931320]
[48203] [New LWP 931319]
[48203] [New LWP 931318]
[48203] [New LWP 931317]
[48203] [New LWP 931316]
[48203] [New LWP 931315]
[48203] [New LWP 931313]
[48203] [New LWP 931312]
[48203]
[48203] This GDB supports auto-downloading debuginfo from the following URLs:
[48203] <ima:enforcing>
[48203] <https://debuginfod.fedoraproject.org/>
[48203] <ima:ignore>
[48203] Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
[48203] Debuginfod has been disabled.
[48203] To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[48203] [Thread debugging using libthread_db enabled]
[48203] Using host libthread_db library "/lib64/libthread_db.so.1".
[48203] 0x00007f77aa2879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
[48203] #0 0x00007f77aa2879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
[48203] #1 0x00007f77aa27bc3c in __internal_syscall_cancel () from /lib64/libc.so.6
[48203] #2 0x00007f77aa27bc84 in __syscall_cancel () from /lib64/libc.so.6
[48203] #3 0x00007f77aa2ebb8f in wait4 () from /lib64/libc.so.6
[48203] #4 0x00007f77ab0d49bd in ggml_print_backtrace () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-base.so.0
[48203] #5 0x00007f77ab0d4b46 in ggml_abort () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-base.so.0
[48203] #6 0x00007f77a6916d2a in ggml_vk_flash_attn(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so.0
[48203] #7 0x00007f77a692a9fa in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so.0
[48203] #8 0x00007f77a69378e9 in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so.0
[48203] #9 0x00007f77ab0f011b in ggml_backend_graph_compute_async () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-base.so.0
[48203] #10 0x00007f77ab0f4c62 in ggml_backend_sched_compute_splits(ggml_backend_sched*) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-base.so.0
[48203] #11 0x00007f77ab0f5aa0 in ggml_backend_sched_graph_compute_async () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-base.so.0
[48203] #12 0x00007f77aa8a0a42 in llama_context::graph_compute(ggml_cgraph*, bool) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libllama.so.0
[48203] #13 0x00007f77aa89c139 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libllama.so.0
[48203] #14 0x00007f77aa89de47 in llama_context::decode(llama_batch const&) () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libllama.so.0
[48203] #15 0x00007f77aa8a4d81 in llama_decode () from /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libllama.so.0
[48203] #16 0x0000000000673767 in common_init_from_params(common_params&) ()
[48203] #17 0x00000000004f3e06 in server_context_impl::load_model(common_params const&) ()
[48203] #18 0x00000000004cf00c in server_context::load_model(common_params const&) ()
[48203] #19 0x0000000000408427 in main ()
[48203] [Inferior 1 (process 931308) detached]
Name and Version
bazzite@bazzite:/var/home/bazzite/llama.cpp-vulkan/llama-turboquant$ ./llama-cli --version
WARNING: radv is not a conformant Vulkan implementation, testing use only.
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /var/home/bazzite/llama.cpp-vulkan/llama-turboquant/libggml-cpu.so
version: 8814 (8590cbf)
built with GNU 15.2.1 for Linux x86_64
Launch script:
llama-server-launcher.sh
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD Radeon RX 9070 XT, Intel Core i5-14600K
Models
gemma-4-26B-A4B-it-UD-Q3_K_XL
Problem description & steps to reproduce
The model fails to load with the turbo3 KV cache type on the Vulkan backend: the warmup run aborts with GGML_ASSERT(Br == pipeline->wg_denoms[0]) failed at ggml-vulkan.cpp:8972, inside ggml_vk_flash_attn (full backtrace in the log above). On the occasions when the model does load, its output is gibberish.

To reproduce, start llama-server with --cache-type-k turbo3, --cache-type-v turbo3, and --flash-attn on against a Vulkan device; the crash happens during the model warmup, as shown by the command below.
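For convenience, this is the server command the router spawned, reconstructed from the "srv load" lines in the log above (paths are as logged; the port was assigned by the router, so any free port should reproduce):

./llama-server \
  --clear-idle \
  --host 127.0.0.1 \
  --no-mmproj-offload \
  --port 48203 \
  --alias gemma-4-26B-A4B-it-UD-Q3_K_XL \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --flash-attn on \
  --fit-ctx 96000 \
  --fit-target 400 \
  --kv-unified \
  --model /var/home/bazzite/.lmstudio/models-flat/gemma-4-26B-A4B-it-UD-Q3_K_XL/gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf \
  --mmproj /var/home/bazzite/.lmstudio/models-flat/gemma-4-26B-A4B-it-UD-Q3_K_XL/mmproj-BF16.gguf \
  --n-gpu-layers all \
  --parallel 10

Based on the backtrace, the assert fires in the Vulkan flash-attention path, so the combination of the turbo3 cache type with --flash-attn on is presumably the trigger.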
First Bad Commit
No response
Relevant log output
See the full server log and GDB backtrace at the top of this report.