Replies: 3 comments
-
The CUDA code already does MUL_MAT_ID this way (see …).
-
Adding a SYCL data point to this thread, since the discussion so far has been CUDA-focused. I was benchmarking Qwen3-30B-A3B Q4_K_M on an Intel Arc B70 Pro and noticed OpenVINO 2026.1 hitting ~60 t/s on decode while llama.cpp SYCL was at ~44 t/s. Prefill was the other way around, with llama.cpp roughly 5x faster. The decode gap surprised me, so I had Claude investigate. Claude wrote a prototype port of the idea, though using a simpler work-group-per-expert design rather than the …
178 lines, additive, prefill unaffected. That tells me the decode win is real and worth pursuing properly. Two questions:
Hardware: Intel Arc B70 Pro 32 GB, oneAPI 2025.3. Model: unsloth Qwen3-30B-A3B-Instruct-2507 Q4_K_M.
-
Quick follow-up, with honest context first: I bought an Arc B70 Pro recently, llama.cpp is what makes it actually useful to me day-to-day, and I ended up spending some Claude tokens investigating why decode was so much slower than OpenVINO on the same card. The result of that investigation is something I'd like to give back to the community if you want it. I had Claude rewrite the prototype to match the CUDA …
+48.8% decode, prefill unaffected. For reference, OpenVINO 2026.1 on the same GPU reaches 60.34 t/s tg128. The code is Claude-written; I know that's a problem under AGENTS.md. If you want the branch as a reference, or as a starting point for someone doing a human-authored version, I'm happy to share it. If you don't want it at all, totally fine; I'll just keep running it on my own machine.
-
Currently, transformers is refactoring eager MoE into `torch._grouped_gemm` and supports compiling the full CUDA graph; see huggingface/transformers#42697. The `mul_mat_ids` operator requires a CPU synchronization to select experts. I think we can refactor `mul_mat_ids` into `grouped_gemm` too. One discussion is in #12859.

Implementation
The `grouped_gemm` version in transformers is abstracted below. I removed some comments and code for simplicity.
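A minimal NumPy sketch of that abstraction (my own reconstruction, not the actual transformers code; the function name and weight layout are assumptions). It strings together exactly the primitives named in this thread: `histogram`, `argsort`, `cumsum`, `gather`, then one GEMM per expert group standing in for `grouped_gemm`:

```python
# Hypothetical sketch of grouped-GEMM MoE dispatch, not the transformers code.
import numpy as np

def moe_grouped_gemm(x, expert_ids, expert_weights):
    """x: (tokens, d_in); expert_ids: (tokens,); expert_weights: (E, d_in, d_out)."""
    n_experts = expert_weights.shape[0]
    counts = np.bincount(expert_ids, minlength=n_experts)  # histogram: tokens per expert
    order = np.argsort(expert_ids, kind="stable")          # argsort: group tokens by expert
    offsets = np.concatenate(([0], np.cumsum(counts)))     # cumsum: group boundaries
    xg = x[order]                                          # gather: permute tokens into groups
    out = np.empty((x.shape[0], expert_weights.shape[2]), dtype=x.dtype)
    for e in range(n_experts):                             # grouped_gemm: one GEMM per group
        lo, hi = offsets[e], offsets[e + 1]                # empty groups are a no-op,
        out[lo:hi] = xg[lo:hi] @ expert_weights[e]         # which is what allows dynamic experts
    y = np.empty_like(out)
    y[order] = out                                         # scatter back to original token order
    return y
```

The point of the grouping is that every step stays on-device and data-independent in shape, so nothing forces the CPU round-trip that `mul_mat_ids` needs to pick experts.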
Since `argsort` and `cumsum` are already supported in llama.cpp, only `histogram`, `gather`, and `grouped_gemm` need to be implemented.

Plus, with `grouped_gemm` supported, models with dynamic experts like meituan-longcat can be added.