
Misc. bug: memory regression for Qwen3.5-122B-A10B-GGUF model #75

@cbuchner1

Description


Name and Version

Commit 8590cbf of llama-server from the
feature/turboquant-kv-cache branch.

./build/bin/llama-server --version
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 using 4-mag LUT (pre-M5 hardware)
ggml_metal_library_init: turbo3 sparse V dequant enabled (opt-out: TURBO_SPARSE_V=0)
ggml_metal_library_init: loaded in 15.454 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 51539.61 MB
version: 8814 (8590cbf)
built with AppleClang 17.0.0.17000604 for Darwin arm64

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./build/bin/llama-server \
  -m ~/.cache/lm-studio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf \
  -mm ~/.cache/lm-studio/models/unsloth/Qwen3.5-122B-A10B-GGUF/mmproj-F16.gguf \
  --alias "qwen3.5-122b-a10b" \
  --jinja -ngl all -fa on \
  --cache-type-k q8_0 --cache-type-v turbo4 \
  --kv-unified \
  --kv-offload \
  --mmap \
  --batch-size 1024 \
  -np 2 --metrics --host 0.0.0.0 --port 1234 \
  --ctx-size 262144 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00

Problem description & steps to reproduce

The latest commit, 8590cbf (meant to add a Vulkan feature), causes GPU memory usage for inference on an M4 Pro chip to increase significantly, to the point where I can no longer run the unsloth Qwen3.5-122B-A10B-GGUF IQ3_XXS quant.


See the relevant log output section below for the exact error message.

First Bad Commit

8590cbf

Relevant log output

Logs

Comparing the output of both versions, I see errors with the latest commit.

Commit 6a29b5 runs the model just fine:

...
llama_params_fit_impl: projected to use 45591 MiB of device memory vs. 48853 MiB of free device memory

Commit 8590cbf has trouble fitting the same model:

...
llama_params_fit_impl: projected to use 111688 MiB of device memory vs. 48853 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 63858 MiB
llama_params_fit_impl: context size set by user to 262144 -> no change
llama_params_fit: failed to fit params to free device memory: n_gpu_layers already set by user to -2, abort
...
ggml_metal_buffer_init: error: failed to allocate buffer, size = 66096.67 MiB
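The gap between the two memory projections lines up almost exactly with the failed Metal allocation. A quick sanity check on the figures taken from the logs above (the single-buffer interpretation is my own assumption, not something the logs confirm):

```python
# Projected device memory reported by llama_params_fit_impl (MiB)
good_commit = 45591    # projection at commit 6a29b5
bad_commit = 111688    # projection at commit 8590cbf

delta = bad_commit - good_commit
print(delta)  # 66097 MiB of extra projected memory

# The failed Metal allocation was 66096.67 MiB -- essentially the same
# number, which suggests the regression comes from one single oversized
# buffer (plausibly the V cache, given --cache-type-v turbo4) rather than
# from growth spread across many allocations.
```

If that reading is right, the new commit may be allocating the turbo4 V cache at an unquantized (or otherwise inflated) size for the full 262144-token context.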
