How to improve llama.cpp hexagon decode performance #21702

PengweiLi · 2026-04-10T03:35:59Z

Title: Hexagon decode performance gap vs Qualcomm proprietary engine

Device: Snapdragon 8 Elite
Model: Qwen3-4B Q4_0
llama.cpp: 12 tok/s
Qualcomm HTP engine: 26 tok/s

Tested optimizations with no improvement:

Flash Attention (prefill improved, decode unchanged)
KV cache quantization
Different thread counts

Question: Is there a known optimization direction or method?
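The comparisons listed above (Flash Attention, KV cache quantization, thread counts) are easier to interpret when each run differs in only one setting. The following is a minimal sketch that builds `llama-bench` command lines for such a sweep. It assumes a recent `llama-bench` build that accepts `-m`, `-p`, `-n`, `-t`, `-fa`, `-ctk`, `-ctv`, `-ngl` and `-o` (check `llama-bench --help` on your build); the model filename and the value ranges are placeholders, and it does not select the Hexagon device, since that depends on how the backend was built.

```python
import shlex
from itertools import product


def build_bench_commands(model_path, threads=(2, 4, 6), flash_attn=(0, 1),
                         cache_types=("f16", "q8_0"), n_gen=128,
                         bench_bin="llama-bench"):
    """Return one llama-bench argv list per (threads, flash-attn, KV type) combination."""
    commands = []
    for t, fa, kv_type in product(threads, flash_attn, cache_types):
        # A quantized V cache needs flash attention in llama.cpp,
        # so skip quantized KV runs when flash attention is off.
        if fa == 0 and kv_type != "f16":
            continue
        commands.append([
            bench_bin,
            "-m", model_path,
            "-p", "0",           # no prompt-processing test
            "-n", str(n_gen),    # decode-only test of n_gen tokens
            "-t", str(t),
            "-fa", str(fa),
            "-ctk", kv_type,
            "-ctv", kv_type,
            "-ngl", "99",        # offload all layers
            "-o", "json",
        ])
    return commands


if __name__ == "__main__":
    # "qwen3-4b-q4_0.gguf" is a placeholder path, not a file named in the thread.
    for cmd in build_bench_commands("qwen3-4b-q4_0.gguf"):
        print(shlex.join(cmd))
```

Running the printed commands one at a time and comparing only the decode (`tg`) rows keeps prefill effects, such as the Flash Attention improvement mentioned above, out of the decode comparison.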

Answered by pauldev-hub

pauldev-hub · 2026-04-14T05:04:17Z

Hey @PengweiLi,

The 12 tok/s vs 26 tok/s gap you're seeing is real and expected — it's not a bug or misconfiguration. Here's why and what you can actually do:

Why the gap exists:

Qualcomm's HTP engine uses highly optimized, proprietary GEMM kernels and direct HTA memory access paths that are baked into their SDK. llama.cpp's Hexagon backend goes through the QNN SDK but doesn't have access to the same low-level kernel optimizations; it's essentially a general-purpose path vs. a purpose-built one.

What you haven't tried that's worth testing:

Q4_K_M instead of Q4_0: The _K variants have better dequantization patterns that map more efficiently onto Hexagon's vector units. Q4_0 on Hexagon often underperforms its K-quant equivalents.

Explicit HTP backend flags: Make sure you're actually hitting HTP and not falling back to CPU for some ops:

--n-gpu-layers 999 --device hexagon

Check the startup logs for any layers that fail to offload.

Batch size tuning: Try --ubatch-size 128 or --ubatch-size 256. Hexagon vector units have a sweet spot for batch sizes during decode.

Watch for memory bandwidth bottleneck: At Q4_0, decode on Hexagon is often memory-bandwidth bound, not compute bound. Switching to a smaller model (Qwen3-1.7B) and comparing the ratio will tell you if you're hitting a bandwidth ceiling.

Realistically, closing the full gap to Qualcomm's proprietary engine from llama.cpp is unlikely without upstream kernel work, but you should be able to get from 12 to 16–18 tok/s with the above.
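The bandwidth check suggested above can be made concrete with a rough estimate. The sketch below assumes that single-stream decode reads every weight once per generated token, that Q4_0 costs 4.5 bits per weight (blocks of 32 weights stored as 16 bytes of 4-bit values plus a 2-byte scale), and that Qwen3-4B has a nominal 4.0 billion parameters. It ignores KV cache traffic and any tensors stored at higher precision, so treat the results as order-of-magnitude figures. The function names are illustrative and not part of llama.cpp.

```python
Q4_0_BITS_PER_WEIGHT = 4.5  # 32 weights per block: 16 bytes of nibbles + 2-byte scale


def implied_bandwidth_gbs(n_params, tokens_per_s,
                          bits_per_weight=Q4_0_BITS_PER_WEIGHT):
    """Weight traffic in GB/s implied by a decode rate (one full weight read per token)."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bytes_per_token * tokens_per_s / 1e9


def inverse_size_scaling(params_large, tps_large, params_small, tps_small):
    """Return the measured speedup divided by the size ratio.

    A value near 1.0 means decode speed scales inversely with model size,
    which is what a memory-bandwidth ceiling would produce.
    """
    speedup = tps_small / tps_large
    size_ratio = params_large / params_small
    return speedup / size_ratio


if __name__ == "__main__":
    qwen3_4b_params = 4.0e9  # nominal parameter count, an assumption
    print(f"{implied_bandwidth_gbs(qwen3_4b_params, 12):.1f}")  # 27.0
    print(f"{implied_bandwidth_gbs(qwen3_4b_params, 26):.1f}")  # 58.5
```

Under these assumptions, 12 tok/s corresponds to about 27 GB/s of weight reads. The 58.5 GB/s figure for 26 tok/s only holds if the Qualcomm engine uses a similar weight footprint, which the thread does not state. Once a decode rate for Qwen3-1.7B has been measured on the same device, passing both measurements to `inverse_size_scaling` gives the ratio described above.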

If this helped you, please mark it as the answer — it helps others in the community who run into the same issue find the solution faster!

3 replies

PengweiLi Apr 15, 2026
Author

Thanks for your answer, I will give it a try according to your suggestion.

PengweiLi Apr 15, 2026
Author

I see that the ggml_hexagon_supported_mul_mat function in the code does not support the Q4_K type, so is _K really better?

pauldev-hub Apr 15, 2026

> I see that the ggml_hexagon_supported_mul_mat function in the code does not support the Q4_K type, so is _K really better?

If Q4_K is not supported by the Hexagon backend, it runs on the CPU: you get Q4_K's better quality but lose all Hexagon speedup. Staying on Q4_0 and accepting the quality trade-off is worth it specifically when your target is Hexagon throughput.
