How to improve llama.cpp hexagon decode performance #21702
Title: Hexagon decode performance gap vs Qualcomm proprietary engine
Device: Snapdragon 8 Elite
Tested optimizations with no improvement:
Question: Is there a known optimization direction or method?
Replies: 1 comment 3 replies
Hey @PengweiLi,
The 12 tok/s vs 26 tok/s gap you're seeing is real and expected; it's not a bug or misconfiguration. Here's why and what you can actually do:
Why the gap exists:
Qualcomm's HTP engine uses highly optimized, proprietary GEMM kernels and direct HTA memory access paths that are baked into their SDK. llama.cpp's Hexagon backend goes through the QNN SDK but doesn't have access to the same low-level kernel optimizations. It's essentially a general-purpose path vs. a purpose-built one.
Explicit HTP backend flags: Make sure you're actually hitting HTP and not falling back to CPU for some ops:
--n-gpu-layers 999 --device hexagon
Watch for memory bandwidth bottleneck: At Q4_0, decode on Hexagon is often memory-bandwidth bound, not compute bound. Switching to a smaller model (Qwen3-1.7B) and comparing the ratio will tell you if you're hitting a bandwidth ceiling.
If this helped you, please mark it as the answer. It helps others in the community who run into the same issue find the solution faster!
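The bandwidth check suggested in this reply can be sketched as rough arithmetic. This is a minimal sketch, not a measurement tool: it assumes decode reads every weight once per token, ignores KV-cache traffic, and treats the whole model as Q4_0 (4.5 bits per weight: 32 4-bit values plus one fp16 scale per block), which real GGUF files only approximate. The 4B parameter count for the larger model is a placeholder, since the thread does not say which model produced the 12 tok/s figure.

```python
# Rough bandwidth-ceiling arithmetic for decode. All model sizes are assumptions.

Q4_0_BITS_PER_WEIGHT = 4.5  # 32 x 4-bit weights + one fp16 scale per 32-weight block


def weight_bytes(n_params, bits_per_weight=Q4_0_BITS_PER_WEIGHT):
    """Approximate bytes of weights streamed from memory per decoded token."""
    return n_params * bits_per_weight / 8


def implied_bandwidth_gb_s(tok_s, n_params):
    """Memory bandwidth (GB/s) needed to sustain tok_s if every weight is read once per token."""
    return tok_s * weight_bytes(n_params) / 1e9


def expected_tok_s_if_bandwidth_bound(tok_s_large, n_params_large, n_params_small):
    """Decode speed a smaller model should reach if throughput scales inversely with size."""
    return tok_s_large * n_params_large / n_params_small


if __name__ == "__main__":
    large = 4.0e9   # hypothetical parameter count of the model tested in the question
    small = 1.7e9   # Qwen3-1.7B, the smaller model suggested for comparison
    print(f"implied bandwidth at 12 tok/s: {implied_bandwidth_gb_s(12, large):.1f} GB/s")
    print(f"bandwidth-bound prediction for the small model: "
          f"{expected_tok_s_if_bandwidth_bound(12, large, small):.1f} tok/s")
```

If the smaller model's measured decode speed lands close to the prediction, the backend is probably limited by memory bandwidth. If it lands well below, per-token overhead or compute in the kernels is the more likely limit.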
What you haven't tried that's worth testing:
Q4_K_M instead of Q4_0 — The _K variants have better dequantization patterns that map more efficiently onto Hexagon's vector units. Q4_0 on Hexagon o…