RTX 5090 (CUDA) vs Radeon AI PRO R9700 (Vulkan) — Qwen3.5-35B-A3B MoE Q4_K_XL llama-bench results #19890
Replies: 9 comments 72 replies
-
I made a custom mix that should be optimized for speed on the Vulkan backend, because it uses only legacy quant types like q8_0/q4_0/q4_1, which (I'm pretty sure) have better Vulkan kernels. If you're interested, give it a try and see whether using the right quant types for the Vulkan backend improves anything: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF/resolve/main/Qwen3.5-35B-A3B-Q4_0.gguf Comparisons against a few similar quants from unsloth and others show this recipe's perplexity is better than theirs. (I'll test the one you listed above too.)
Thanks! EDIT: Also, the UD-Q4_K_XL quant you used seems to have a recipe bug: it uses mxfp4 for attn and some other tensors.
Some more discussion here: https://www.reddit.com/r/LocalLLaMA/comments/1resggh/
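For reference on why the legacy types are Vulkan-friendly: they use simple fixed-size blocks, and the effective bits per weight falls straight out of the block layout. A quick sketch (block sizes are ggml's standard on-disk formats for these types):

```python
# Effective bits per weight for ggml's legacy quant block formats.
# Block layouts: q4_0 = fp16 scale + 16 nibble bytes; q4_1 adds an fp16 min;
# q8_0 = fp16 scale + 32 int8 values. All quantize 32 weights per block.
GGUF_BLOCKS = {
    "q4_0": (18, 32),  # (bytes per block, weights per block)
    "q4_1": (20, 32),
    "q8_0": (34, 32),
}

def bits_per_weight(qtype: str) -> float:
    nbytes, nweights = GGUF_BLOCKS[qtype]
    return nbytes * 8 / nweights

for qtype in GGUF_BLOCKS:
    print(f"{qtype}: {bits_per_weight(qtype):.2f} bits/weight")
    # q4_0: 4.50, q4_1: 5.00, q8_0: 8.50
```

The uniform, scale-plus-nibbles layout is what makes the legacy dequant kernels so simple compared with the super-block structure of the K-quants.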
-
**Follow-up: Unsloth UD-Q4_K_XL vs ubergarm Q4_0 on AMD Radeon AI PRO R9700 (Vulkan)**

Thanks @ubergarm for the suggestion! I ran your Q4_0 mix.

**System**

**Models**

**Prompt Processing (PP) [t/s]**

**Token Generation (TG) [t/s]**

**Analysis**

Interesting split result:

Bottom line: each quant has its strength on the R9700 with Vulkan. The Q4_0 custom mix wins in PP throughput; Q4_K_XL wins in TG. The Q4_0 mix also retains higher precision in the attention and shared-expert layers (Q8_0), which may translate to better output quality despite the lower TG speed. Worth considering depending on whether you prioritize speed or quality in your workflow. Hope this data is useful! Happy to run more tests if needed.
-
Do the R9700 cards not support ROCm yet for testing?
-
This was an out-of-the-box comparison: the goal was to see what you get when you simply plug in each card and run llama.cpp with the least possible setup friction.
-
Mind if I ask how on earth you guys are getting 100+ t/s out of this model on Vulkan? Running the Unsloth UD_Q4_K_XL GGUF on my R9700 in an i5-13400F rig with 128 GB DDR4, I can't get more than 85 t/s with my main card (in the x16 slot) or 67 t/s with the second (in the x4-from-chipset slot). Admittedly, it kinda feels like something screwy is going on with the difference between the two slots, but still... this is my config: Running on Ubuntu 24.04 with the latest Mesa drivers. I find myself begging for GATED_DELTA_NET and speculative decoding just to get decent performance, but you guys seem to be getting the eval rates I'm after without any of that, and I'm just... confused.
-
I ran a detailed performance study of Qwen3-30B-A3B Q4_K_M (a 30B-class MoE model) on a single AMD Radeon AI PRO R9700 (RDNA4, 32 GB, Vulkan backend). The headline number: 183 tokens/s decode at 86% of the card's theoretical bandwidth limit. Below you'll find the full results, a per-layer bandwidth model explaining where the numbers come from, and everything you need to reproduce this yourself. If you have corrections, suggestions, or results from other GPUs, please share: I'd love to build a cross-GPU comparison table.

## Qwen3-30B-A3B Q4_K_M Benchmark on AMD Radeon AI PRO R9700 (RDNA4) — Vulkan

### Summary

183 tokens/s decode (86% of the theoretical bandwidth limit) and 3,033 tokens/s prefill for Qwen3-30B-A3B Q4_K_M on a single AMD Radeon AI PRO R9700 (RDNA4, gfx1201, 32 GB VRAM) using the llama.cpp Vulkan backend with flash attention enabled. MoE routing activates only ~5.9B effective parameters per token (out of 30.53B total), so this 30B-class model decodes at speeds comparable to a dense ~6B model. All 48 layers offloaded to GPU. No multi-GPU.

### Results

#### Text Generation (Decode)

5 runs each, mean ± std. dev. shown.
With flash attention ON, decode speed drops only 6.6% from ctx=128 to ctx=4096; without FA, the drop is 11.9%. This is consistent with the expected reduction in memory-bandwidth pressure from flash attention.

#### Prompt Processing (Prefill)

Prefill throughput plateaus between pp512 and pp1024. The slight drop from 3,033 to 3,009 tokens/s (FA ON) is likely due to increased L2 cache pressure at longer sequences: at pp1024, the combined KV cache and intermediate activations begin to exceed the GPU's 4 MB L2 cache, causing more VRAM round-trips.

### Benchmark Methodology

All benchmarks were executed with `llama-bench`. Each configuration was repeated 5 times; mean and standard deviation are reported.

System state during testing:
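As a side note on the context-length scaling discussed above, the growing KV-cache traffic per decoded token can be sketched numerically. The 48 layers match the post; the 4 KV heads x head_dim 128 and fp16 cache are illustrative assumptions, not values read from the GGUF:

```python
# Per-token KV-cache read vs. context length (rough model).
# LAYERS matches the post; KV_HEADS, HEAD_DIM, and the cache dtype
# are illustrative guesses, not read from the model file.
LAYERS = 48
KV_HEADS = 4
HEAD_DIM = 128
DTYPE_BYTES = 2  # fp16 K and V entries

def kv_read_bytes(ctx: int) -> int:
    # Each decoded token attends over `ctx` cached positions, reading
    # both the K and the V vector for every layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * ctx

for ctx in (128, 4096):
    print(f"ctx={ctx:5d}: {kv_read_bytes(ctx) / 1e6:7.1f} MB per token")
```

Under these assumptions the KV read grows from ~13 MB per token at ctx=128 to ~400 MB at ctx=4096, a low-double-digit percentage on top of the expert-weight reads, which is roughly the right ballpark for the single-digit decode slowdowns reported above.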
### Exact Reproduction Steps

#### 1. Hardware

#### 2. Software

#### 3. Model
#### 4. Build llama.cpp

```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout c5a778891ba0ddbd4cbb507c823f970595b1adc2
mkdir build_vulkan && cd build_vulkan
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
```

#### 5. Run the benchmark
Check your device list first:

```shell
./bin/llama-bench --list-devices
```

On our system, the output is:

**Decode benchmark (Flash Attention OFF):**

```shell
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -t 1 \
  -dev Vulkan1 \
  -fa 0 \
  -p 0 \
  -n 128,256,512,1024,2048,4096 \
  -r 5
```

**Decode benchmark (Flash Attention ON):**

```shell
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -t 1 \
  -dev Vulkan1 \
  -fa 1 \
  -p 0 \
  -n 128,256,512,1024,2048,4096 \
  -r 5
```

**Prefill benchmark:**

```shell
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -t 1 \
  -dev Vulkan1 \
  -fa 1 \
  -p 128,256,512,1024 \
  -n 0 \
  -r 5
```

**Full combined benchmark (decode + prefill, FA ON and OFF):**

```shell
# Flash Attention OFF
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 -t 1 -dev Vulkan1 -fa 0 \
  -p 128,256,512,1024 -n 128,256,512,1024,2048,4096 -r 5

# Flash Attention ON
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 -t 1 -dev Vulkan1 -fa 1 \
  -p 128,256,512,1024 -n 128,256,512,1024,2048,4096 -r 5
```

#### 6. Parameter explanation
### Raw llama-bench Output

#### Flash Attention OFF (5 repeats)

#### Flash Attention ON (5 repeats)

#### Notes
### Approximate Bandwidth Model

The R9700 has ~640 GB/s peak memory bandwidth. For Qwen3-30B-A3B with top-8 MoE routing, the per-token VRAM read can be estimated as follows. All values assume Q4_K_M quantization at ~4.86 BPW (0.6075 bytes/param average). This is a weighted average: Q4_K_M uses q4_K blocks for most tensors (~4.5 BPW) and q6_K blocks for the attn_v and FFN down_proj tensors (~6.5 BPW); the overall average depends on the model architecture's ratio of these tensor types.

86% bandwidth utilization is a strong result for a Vulkan backend. The remaining ~14% accounts for attention KV-cache reads, kernel dispatch, synchronization, non-GEMV computation (softmax, SiLU, RMSNorm, RoPE, MoE routing), and memory-controller overhead.

The key insight: despite the model's 30.53B total parameters, MoE routing means only ~5.9B (attention + shared expert + 8 routed experts) are read per token, making its bandwidth requirements comparable to a dense ~6B model.
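The bandwidth-bound ceiling implied by this model can be reproduced in a few lines. The peak bandwidth and active-parameter count are the post's estimates; the main uncertainty is the average bits per weight of the *active* tensors, so the script computes the ceiling for two plausible values:

```python
# Bandwidth-bound decode ceiling from the post's estimates. The main
# uncertainty is the average bits-per-weight of the active tensors,
# so the ceiling is computed for two plausible values.
GB = 1e9
PEAK_BW = 640 * GB       # R9700 theoretical peak memory bandwidth
ACTIVE_PARAMS = 5.9e9    # params read per decoded token (post's estimate)

def decode_ceiling(bits_per_weight: float) -> float:
    """Upper limit on decode tokens/s if weight reads saturate VRAM bandwidth."""
    bytes_per_token = ACTIVE_PARAMS * bits_per_weight / 8
    return PEAK_BW / bytes_per_token

for bpw in (4.5, 4.86):  # plain q4_K blocks vs. the model-wide Q4_K_M average
    print(f"{bpw:.2f} BPW -> ceiling ~{decode_ceiling(bpw):.0f} tok/s")
```

Either way the measured 183 tok/s sits close to the weight-read ceiling; the exact utilization figure depends on which BPW you assume for the active tensors and on the KV-cache traffic not modeled here.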
-
Claude sent me here. I'm wondering what tok/s generation AMD R9700 users are seeing with the Qwen 3.5 27B dense model. On paper the NVIDIA 3090 should be fast, but it gets high 20s to low 30s. An RTX 6000 Pro doubles that number (and I assume runs at about the same speed as a 5090); that's based on a 100-200k context size. When driving these models from Claude Code, 25k of context is used up before you even start with the agent's prompt.
-
@digitalscream hi again. This may be a dumb question, but I'm often hitting more than 262K context on my kilocode local LLM setup with the qwen3.5 and qwen3-coder-next models (I got another R9700).
-
~50% less cost yet only ~10% less capable than RTX 5090... "For anyone running local LLMs, doing LoRA fine-tuning, or generating AI video, the 32 GB frame buffer alone justifies the cost for serious workloads, and ROCm 7.2 support is finally first-class. Hitting 91.8% of RTX 5090 CUDA performance on identical fine-tuning workloads on a consumer desktop — versus NVIDIA's flagship on a 120-core datacenter server — is a remarkable result that reframes the whole AMD-vs-NVIDIA AI conversation. If you use Windows only and won't touch WSL2, buy something else. If you're on Linux or willing to run WSL2, this is the most exciting prosumer AI GPU released in years."
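Taking the quoted figures at face value, the value proposition is easy to quantify. A tiny sketch (prices are normalized placeholders derived from the "~50% less cost" claim; substitute actual street prices for a real comparison):

```python
# Relative performance-per-dollar from the quoted figures. Prices are
# normalized placeholders from the "~50% less cost" claim, not real prices.
r9700_price, rtx5090_price = 0.5, 1.0   # normalized: R9700 at half the cost
r9700_perf, rtx5090_perf = 0.918, 1.0   # "91.8% of RTX 5090" on the workload

r9700_value = r9700_perf / r9700_price
rtx5090_value = rtx5090_perf / rtx5090_price
print(f"perf per normalized dollar: R9700 {r9700_value:.2f} vs 5090 {rtx5090_value:.2f}")
```

Roughly 1.8x the performance per dollar under those assumptions, which is the arithmetic behind the "reframes the conversation" take.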
-
RTX 5090 (CUDA) vs AMD Radeon AI PRO R9700 (Vulkan) — llama-bench results
Sharing my llama-bench comparison between the NVIDIA RTX 5090 and AMD Radeon AI PRO R9700 on a MoE model. Hopefully useful for anyone considering the R9700 for local LLM inference.
System Specs
Benchmark Parameters
Model: Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf (~19 GB). Settings covered: GPU offload (-ngl), flash attention (-fa), KV cache types (-ctk/-ctv).

Results: Prompt Processing (PP) [t/s]
Higher is better.
Results: Token Generation (TG) [t/s]
Higher is better.
Observations