Replies: 3 comments
-
The CUDA code already does MUL_MAT_ID this way (see …).
-
Adding a SYCL data point to this thread, since the discussion so far has been CUDA-focused. I was benchmarking Qwen3-30B-A3B Q4_K_M on an Intel Arc B70 Pro and noticed OpenVINO 2026.1 hitting ~60 t/s on decode while llama.cpp SYCL was at ~44 t/s. Prefill was the other way around, with llama.cpp roughly 5x faster. The decode gap surprised me, so I had Claude investigate. Claude wrote a prototype port of the idea, though using a simpler work-group-per-expert design rather than the …
178 lines, additive, prefill unaffected. That tells me the decode win is real and worth pursuing properly. Two questions:
Hardware: Intel Arc B70 Pro 32 GB, oneAPI 2025.3. Model: unsloth Qwen3-30B-A3B-Instruct-2507 Q4_K_M.
-
Quick follow-up, with honest context first: I bought an Arc B70 Pro recently, llama.cpp is what makes it actually useful to me day-to-day, and I ended up spending some Claude tokens investigating why decode was so much slower than OpenVINO on the same card. The result of that investigation is something I'd like to give back to the community if you want it. I had Claude rewrite the prototype to match the CUDA …
+48.8% decode, prefill unaffected. For reference, OpenVINO 2026.1 on the same GPU reaches 60.34 t/s tg128. The code is Claude-written; I know that's a problem under AGENTS.md. If you want the branch as a reference, or as a starting point for someone doing a human-authored version, I'm happy to share it. If you don't want it at all, totally fine; I'll just keep running it on my own machine.
-
Currently, transformers is refactoring eager MoE into `torch._grouped_gemm` and supports compiling the full CUDA graph; see huggingface/transformers#42697. The `mul_mat_ids` operator requires a CPU synchronization to select experts. I think we can refactor `mul_mat_ids` into `grouped_gemm` too. One discussion is in #12859.

Implementation
The `grouped_gemm` version in transformers is abstracted below. I removed some comments and code for simplicity.
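A minimal NumPy sketch of that abstraction (my own reconstruction, not the actual transformers code; the function name and weight layout are assumptions). It strings together exactly the primitives named in this thread: `histogram`, `argsort`, `cumsum`, `gather`, then one GEMM per expert group standing in for `grouped_gemm`:

```python
# Hypothetical sketch of grouped-GEMM MoE dispatch, not the transformers code.
import numpy as np

def moe_grouped_gemm(x, expert_ids, expert_weights):
    """x: (tokens, d_in); expert_ids: (tokens,); expert_weights: (E, d_in, d_out)."""
    n_experts = expert_weights.shape[0]
    counts = np.bincount(expert_ids, minlength=n_experts)  # histogram: tokens per expert
    order = np.argsort(expert_ids, kind="stable")          # argsort: group tokens by expert
    offsets = np.concatenate(([0], np.cumsum(counts)))     # cumsum: group boundaries
    xg = x[order]                                          # gather: permute tokens into groups
    out = np.empty((x.shape[0], expert_weights.shape[2]), dtype=x.dtype)
    for e in range(n_experts):                             # grouped_gemm: one GEMM per group
        lo, hi = offsets[e], offsets[e + 1]                # empty groups are a no-op,
        out[lo:hi] = xg[lo:hi] @ expert_weights[e]         # which is what allows dynamic experts
    y = np.empty_like(out)
    y[order] = out                                         # scatter back to original token order
    return y
```

The point of the grouping is that every step stays on-device and data-independent in shape, so nothing forces the CPU round-trip that `mul_mat_ids` needs to pick experts.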
Since `argsort` and `cumsum` are already supported in llama.cpp, only `histogram`, `gather`, and `grouped_gemm` need to be implemented.

Plus, with `grouped_gemm` supported, models with dynamic experts like meituan-longcat can be added.