
UPSTREAM PR #21554: hexagon: optimization for HMX mat_mul #1346

Open
loci-dev wants to merge 6 commits into main from loci/pr-21554-feat-hmx-optimization

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21554

Overview

This PR introduces two additional optimizations for the Hexagon HMX backend:

  1. Enable asynchronous HMX execution
    HMX computations are now executed asynchronously, allowing them to overlap with HVX dequantization and DMA stages within the pipeline. Previously, synchronous HMX calls blocked the main thread and limited parallelism.

  2. Automatic shape search for mat_mul_qk_0_d16a32_out_stationary()
    The auto-tuning logic is extended to the out-stationary pipeline path. This functionality was previously only available for non out-stationary paths.
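The overlap described in point 1 can be sketched as a minimal dedicated worker thread: the main thread enqueues matmul jobs and continues with HVX dequant/DMA while the worker runs them. This is an illustrative producer/consumer sketch under that assumption; `hmx_worker` and its methods are hypothetical names, not the PR's actual identifiers.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Sketch of a dedicated HMX worker: the main thread submits matmul jobs
// and keeps feeding the HVX dequant/DMA stages, overlapping the pipeline.
class hmx_worker {
public:
    hmx_worker() : done(false), th([this] { run(); }) {}
    ~hmx_worker() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        th.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
    void wait_idle() {  // barrier before the matmul output is consumed
        std::unique_lock<std::mutex> lk(m);
        idle_cv.wait(lk, [this] { return jobs.empty() && !busy; });
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [this] { return done || !jobs.empty(); });
            if (done && jobs.empty()) return;
            auto job = std::move(jobs.front());
            jobs.pop();
            busy = true;
            lk.unlock();
            job();        // the HMX matmul tile would execute here
            lk.lock();
            busy = false;
            idle_cv.notify_all();
        }
    }
    std::mutex m;
    std::condition_variable cv, idle_cv;
    std::queue<std::function<void()>> jobs;
    bool busy = false;
    bool done;
    std::thread th;
};
```

The `wait_idle()` barrier stands in for whatever synchronization point the backend uses before reading the result; the key property is that `submit()` returns immediately instead of blocking the main thread.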

Additional Information

  • Improved auto-tuning strategy
    The previous strategy maximized mc * nc, effectively reducing the number of DMA calls. While this works well for FP16 matmul, it does not accurately model the cost of quantized matmul.

    In quantized matmul:

    • Weight tensors require both dequantization and shuffling
    • Activation tensors require only shuffling

    Profiling on 8 Elite Gen 5 indicates that loading quantized weights is approximately 1.5× more expensive than loading activations. Although this is a rough estimate, it produces good enough results.
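A cost model along these lines could weight weight-tile loads 1.5× heavier than activation-tile loads when searching the (mc, nc) tile shape, instead of simply maximizing mc * nc. The sketch below is an illustrative assumption: the function name, the candidate-list interface, and the exact formula are invented for this example, not taken from the PR.

```cpp
#include <utility>

// Illustrative tile-shape search: estimate per-output-element load cost with
// quantized weights (dequant + shuffle) weighted 1.5x heavier than
// activations (shuffle only), and pick the cheapest (mc, nc) candidate.
static std::pair<int, int> pick_tile_shape(int M, int N, int K,
                                           const int mcs[], int n_mc,
                                           const int ncs[], int n_nc) {
    const float W_COST = 1.5f;  // per weight element loaded (assumed ratio)
    const float A_COST = 1.0f;  // per activation element loaded
    float best_cost = 1e30f;
    std::pair<int, int> best_shape{mcs[0], ncs[0]};
    for (int i = 0; i < n_mc; i++) {
        for (int j = 0; j < n_nc; j++) {
            const int mc = mcs[i], nc = ncs[j];
            const int m_blocks = (M + mc - 1) / mc;
            const int n_blocks = (N + nc - 1) / nc;
            // Each (nc x K) weight panel is reloaded once per row block of
            // the activations; each (mc x K) activation panel once per
            // column block of the weights.
            const float loads =
                W_COST * (float)m_blocks * n_blocks * nc * K +
                A_COST * (float)m_blocks * n_blocks * mc * K;
            const float cost = loads / ((float)M * N);
            if (cost < best_cost) { best_cost = cost; best_shape = {mc, nc}; }
        }
    }
    return best_shape;
}
```

With this weighting the search still favors large tiles (fewer reloads), but ties break toward enlarging mc, since that divides the more expensive weight traffic.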

Benchmark on 8 Elite Gen 5

Master

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 123.54 ± 0.28 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.70 ± 0.04 |

Commit a521c91 (HMX Async)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 130.68 ± 0.12 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.67 ± 0.06 |

Commit ef501f8 (HMX async and auto-tuning)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 138.56 ± 0.75 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.92 ± 0.07 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Used for adding tests and logs, and for creating scripts to filter logs.

njsyw1997 and others added 6 commits April 11, 2026 15:36
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
Store the boolean in a local variable to avoid loading the atomic twice
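The atomic-load caching mentioned in the last commit is a common micro-optimization: read the flag once into a local and branch on that value, rather than re-reading the atomic at every use. The sketch below is generic; the flag name and the two counters are hypothetical, not the PR's code.

```cpp
#include <atomic>

// Hypothetical flag and counters; only the load-once pattern is the point.
std::atomic<bool> hmx_async_enabled{true};
int staged  = 0;
int flushed = 0;

// Before: two atomic loads of the same flag within one function.
void process_tile_before() {
    if (hmx_async_enabled.load(std::memory_order_relaxed)) staged++;
    if (hmx_async_enabled.load(std::memory_order_relaxed)) flushed++;
}

// After: load once and reuse the local. Besides saving a load, both
// branches are now guaranteed to see the same snapshot of the flag,
// even if another thread flips it mid-call.
void process_tile_after() {
    const bool async_on = hmx_async_enabled.load(std::memory_order_relaxed);
    if (async_on) staged++;
    if (async_on) flushed++;
}
```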
@loci-review

loci-review Bot commented Apr 12, 2026

No meaningful performance changes were detected across 126809 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

💬 Questions? Tag @loci-dev

3 participants