
Misc. bug: memory regression for Qwen3.5-122B-A10B-GGUF model #75

@cbuchner1

Description


Name and Version

Commit 8590cbf of llama-server from the
feature/turboquant-kv-cache branch.

./build/bin/llama-server --version
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 using 4-mag LUT (pre-M5 hardware)
ggml_metal_library_init: turbo3 sparse V dequant enabled (opt-out: TURBO_SPARSE_V=0)
ggml_metal_library_init: loaded in 15.454 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 51539.61 MB
version: 8814 (8590cbf)
built with AppleClang 17.0.0.17000604 for Darwin arm64

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./build/bin/llama-server \
  -m ~/.cache/lm-studio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf \
  -mm ~/.cache/lm-studio/models/unsloth/Qwen3.5-122B-A10B-GGUF/mmproj-F16.gguf \
  --alias "qwen3.5-122b-a10b" \
  --jinja -ngl all -fa on \
  --cache-type-k q8_0 --cache-type-v turbo4 \
  --kv-unified \
  --kv-offload \
  --mmap \
  --batch-size 1024 \
  -np 2 --metrics --host 0.0.0.0 --port 1234 \
  --ctx-size 262144 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00

Problem description & steps to reproduce

The latest commit, 8590cbf (meant to add a Vulkan feature), causes GPU memory usage for inference on an M4 Pro chip to increase significantly, to the point where I can no longer run the unsloth Qwen3.5-122B-A10B-GGUF IQ3_XXS quant.


See the relevant log output section below for the exact error message.

First Bad Commit

8590cbf

Relevant log output

Logs

Comparing the output of both versions, I see errors with the latest commit.

Commit 6a29b5 runs the model just fine:

...
llama_params_fit_impl: projected to use 45591 MiB of device memory vs. 48853 MiB of free device memory

Commit 8590cbf has trouble fitting the same model:

...
llama_params_fit_impl: projected to use 111688 MiB of device memory vs. 48853 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 63858 MiB
llama_params_fit_impl: context size set by user to 262144 -> no change
llama_params_fit: failed to fit params to free device memory: n_gpu_layers already set by user to -2, abort
...
ggml_metal_buffer_init: error: failed to allocate buffer, size = 66096.67 MiB
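The gap between the two memory projections lines up almost exactly with the failed Metal allocation. A quick sanity check on the figures taken from the logs above (the single-buffer interpretation is my own assumption, not something the logs confirm):

```python
# Projected device memory reported by llama_params_fit_impl (MiB)
good_commit = 45591    # projection at commit 6a29b5
bad_commit = 111688    # projection at commit 8590cbf

delta = bad_commit - good_commit
print(delta)  # 66097 MiB of extra projected memory

# The failed Metal allocation was 66096.67 MiB -- essentially the same
# number, which suggests the regression comes from one single oversized
# buffer (plausibly the V cache, given --cache-type-v turbo4) rather than
# from growth spread across many allocations.
```

If that reading is right, the new commit may be allocating the turbo4 V cache at an unquantized (or otherwise inflated) size for the full 262144-token context.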
