
UPSTREAM PR #21554: hexagon: optimization for HMX mat_mul #1346

Open
loci-dev wants to merge 6 commits into main from loci/pr-21554-feat-hmx-optimization

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21554

Overview

This PR introduces two additional optimizations for the Hexagon HMX backend:

  1. Enable asynchronous HMX execution
    HMX computations are now executed asynchronously, allowing them to overlap with HVX dequantization and DMA stages within the pipeline. Previously, synchronous HMX calls blocked the main thread and limited parallelism.

  2. Automatic shape search for mat_mul_qk_0_d16a32_out_stationary()
    The auto-tuning logic is extended to the out-stationary pipeline path. This functionality was previously only available for non out-stationary paths.
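The overlap described in point 1 can be sketched as a minimal dedicated worker thread: the main thread enqueues matmul jobs and continues with HVX dequant/DMA while the worker runs them. This is an illustrative producer/consumer sketch under that assumption; `hmx_worker` and its methods are hypothetical names, not the PR's actual identifiers.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Sketch of a dedicated HMX worker: the main thread submits matmul jobs
// and keeps feeding the HVX dequant/DMA stages, overlapping the pipeline.
class hmx_worker {
public:
    hmx_worker() : done(false), th([this] { run(); }) {}
    ~hmx_worker() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        th.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
    void wait_idle() {  // barrier before the matmul output is consumed
        std::unique_lock<std::mutex> lk(m);
        idle_cv.wait(lk, [this] { return jobs.empty() && !busy; });
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [this] { return done || !jobs.empty(); });
            if (done && jobs.empty()) return;
            auto job = std::move(jobs.front());
            jobs.pop();
            busy = true;
            lk.unlock();
            job();        // the HMX matmul tile would execute here
            lk.lock();
            busy = false;
            idle_cv.notify_all();
        }
    }
    std::mutex m;
    std::condition_variable cv, idle_cv;
    std::queue<std::function<void()>> jobs;
    bool busy = false;
    bool done;
    std::thread th;
};
```

The `wait_idle()` barrier stands in for whatever synchronization point the backend uses before reading the result; the key property is that `submit()` returns immediately instead of blocking the main thread.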

Additional Information

  • Improved auto-tuning strategy
    The previous strategy maximized mc * nc, effectively reducing the number of DMA calls. While this works well for FP16 matmul, it does not accurately model the cost of quantized matmul.

    In quantized matmul:

    • Weight tensors require both dequantization and shuffling
    • Activation tensors require only shuffling

    Profiling on 8 Elite Gen 5 indicates that loading quantized weights is approximately 1.5× more expensive than loading activations. Although this is a rough estimate, it produces good enough results.
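A cost model along these lines could weight weight-tile loads 1.5× heavier than activation-tile loads when searching the (mc, nc) tile shape, instead of simply maximizing mc * nc. The sketch below is an illustrative assumption: the function name, the candidate-list interface, and the exact formula are invented for this example, not taken from the PR.

```cpp
#include <utility>

// Illustrative tile-shape search: estimate per-output-element load cost with
// quantized weights (dequant + shuffle) weighted 1.5x heavier than
// activations (shuffle only), and pick the cheapest (mc, nc) candidate.
static std::pair<int, int> pick_tile_shape(int M, int N, int K,
                                           const int mcs[], int n_mc,
                                           const int ncs[], int n_nc) {
    const float W_COST = 1.5f;  // per weight element loaded (assumed ratio)
    const float A_COST = 1.0f;  // per activation element loaded
    float best_cost = 1e30f;
    std::pair<int, int> best_shape{mcs[0], ncs[0]};
    for (int i = 0; i < n_mc; i++) {
        for (int j = 0; j < n_nc; j++) {
            const int mc = mcs[i], nc = ncs[j];
            const int m_blocks = (M + mc - 1) / mc;
            const int n_blocks = (N + nc - 1) / nc;
            // Each (nc x K) weight panel is reloaded once per row block of
            // the activations; each (mc x K) activation panel once per
            // column block of the weights.
            const float loads =
                W_COST * (float)m_blocks * n_blocks * nc * K +
                A_COST * (float)m_blocks * n_blocks * mc * K;
            const float cost = loads / ((float)M * N);
            if (cost < best_cost) { best_cost = cost; best_shape = {mc, nc}; }
        }
    }
    return best_shape;
}
```

With this weighting the search still favors large tiles (fewer reloads), but ties break toward enlarging mc, since that divides the more expensive weight traffic.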

Benchmark on 8 Elite Gen 5

Master

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 123.54 ± 0.28 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.70 ± 0.04 |

Commit a521c91 (HMX Async)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 130.68 ± 0.12 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.67 ± 0.06 |

Commit ef501f8 (HMX async and auto-tuning)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 138.56 ± 0.75 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.92 ± 0.07 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Used for adding tests and logs, and for creating scripts to filter logs.

njsyw1997 and others added 6 commits April 11, 2026 15:36
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
Store the boolean in a local variable to avoid loading the atomic twice
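The atomic-load caching mentioned in the last commit is a common micro-optimization: read the flag once into a local and branch on that value, rather than re-reading the atomic at every use. The sketch below is generic; the flag name and the two counters are hypothetical, not the PR's code.

```cpp
#include <atomic>

// Hypothetical flag and counters; only the load-once pattern is the point.
std::atomic<bool> hmx_async_enabled{true};
int staged  = 0;
int flushed = 0;

// Before: two atomic loads of the same flag within one function.
void process_tile_before() {
    if (hmx_async_enabled.load(std::memory_order_relaxed)) staged++;
    if (hmx_async_enabled.load(std::memory_order_relaxed)) flushed++;
}

// After: load once and reuse the local. Besides saving a load, both
// branches are now guaranteed to see the same snapshot of the flag,
// even if another thread flips it mid-call.
void process_tile_after() {
    const bool async_on = hmx_async_enabled.load(std::memory_order_relaxed);
    if (async_on) staged++;
    if (async_on) flushed++;
}
```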
@loci-review

loci-review Bot commented Apr 12, 2026

No meaningful performance changes were detected across 126809 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

💬 Questions? Tag @loci-dev

3 participants