Hey @PengweiLi ,

The 12 tok/s vs 26 tok/s gap you're seeing is real and expected — it's not a bug or misconfiguration. Here's why and what you can actually do:

Why the gap exists:

Qualcomm's HTP engine uses highly optimized, proprietary GEMM kernels and direct HTA memory access paths that are baked into their SDK. llama.cpp's Hexagon backend goes through the QNN SDK but doesn't have access to the same low-level kernel optimizations — it's essentially a general-purpose path vs. a purpose-built one.

What you haven't tried that's worth testing:

Q4_K_M instead of Q4_0: The _K variants have better dequantization patterns that map more efficiently onto Hexagon's vector units. Q4_0 on Hexagon o…
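
For context on what "dequantization" involves here, below is a minimal scalar sketch of unpacking one Q4_0 block in Python. It assumes the reference ggml layout as I understand it (32 weights per block, one little-endian fp16 scale followed by 16 bytes of packed 4-bit values, with low nibbles holding the first 16 weights and high nibbles the last 16). The function name `dequantize_q4_0_block` is hypothetical, and this is not the Hexagon kernel itself, which is vectorized and may arrange the data differently.

```python
import struct

QK4_0 = 32  # weights per Q4_0 block (assumed from the ggml reference layout)


def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Unpack one 18-byte Q4_0 block into 32 floats (illustrative sketch)."""
    if len(block) != 2 + QK4_0 // 2:
        raise ValueError("a Q4_0 block is 2 scale bytes + 16 data bytes")
    # First two bytes: the per-block scale, stored as little-endian fp16.
    (d,) = struct.unpack_from("<e", block, 0)
    qs = block[2:]
    out = [0.0] * QK4_0
    for j, byte in enumerate(qs):
        # Each nibble is an unsigned 4-bit value; subtracting 8 recenters it
        # to the range [-8, 7] before scaling.
        out[j] = ((byte & 0x0F) - 8) * d
        out[j + QK4_0 // 2] = ((byte >> 4) - 8) * d
    return out


if __name__ == "__main__":
    # Hand-built block: scale 0.5, every byte 0x98 (low nibble 8, high nibble 9).
    block = struct.pack("<e", 0.5) + bytes([0x98]) * 16
    values = dequantize_q4_0_block(block)
    print(values[0], values[16])  # prints: 0.0 0.5
```

The point of the sketch is only that every weight costs a mask or shift, a subtract, and a multiply by a per-block scale; how well that sequence maps onto a given vector unit depends on the block layout, which is why comparing quantization formats on the actual device is worth a test.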

Answer selected by PengweiLi