RTX 5090 (CUDA) vs Radeon AI PRO R9700 (Vulkan) — Qwen3.5-35B-A3B MoE Q4_K_XL llama-bench results #19890
Replies: 9 comments 72 replies
-
I made a custom mix that should be optimized for speed on the Vulkan backend, because it uses only legacy quant types like q8_0/q4_0/q4_1, which (I'm pretty sure) have better Vulkan kernels. If you're interested, give it a try and see whether using the right quant types for the Vulkan backend improves anything: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF/resolve/main/Qwen3.5-35B-A3B-Q4_0.gguf Comparisons against a few similar quants from unsloth and others show this recipe's perplexity is better than theirs. (I'll test the one you listed above too.)
Thanks! EDIT: Also, the UD-Q4_K_XL quant you used seems to have a recipe bug: it uses mxfp4 for attn and some other tensors.
Some more discussion here: https://www.reddit.com/r/LocalLLaMA/comments/1resggh/
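For reference on why the legacy types are Vulkan-friendly: they use simple fixed-size blocks, and the effective bits per weight falls straight out of the block layout. A quick sketch (block sizes are ggml's standard on-disk formats for these types):

```python
# Effective bits per weight for ggml's legacy quant block formats.
# Block layouts: q4_0 = fp16 scale + 16 nibble bytes; q4_1 adds an fp16 min;
# q8_0 = fp16 scale + 32 int8 values. All quantize 32 weights per block.
GGUF_BLOCKS = {
    "q4_0": (18, 32),  # (bytes per block, weights per block)
    "q4_1": (20, 32),
    "q8_0": (34, 32),
}

def bits_per_weight(qtype: str) -> float:
    nbytes, nweights = GGUF_BLOCKS[qtype]
    return nbytes * 8 / nweights

for qtype in GGUF_BLOCKS:
    print(f"{qtype}: {bits_per_weight(qtype):.2f} bits/weight")
    # q4_0: 4.50, q4_1: 5.00, q8_0: 8.50
```

The uniform, scale-plus-nibbles layout is what makes the legacy dequant kernels so simple compared with the super-block structure of the K-quants.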
-
**Follow-up: Unsloth UD-Q4_K_XL vs ubergarm Q4_0 on AMD Radeon AI PRO R9700 (Vulkan)**

Thanks @ubergarm for the suggestion! I ran your Q4_0 mix.

**System**

**Models**

**Prompt Processing (PP) [t/s]**

**Token Generation (TG) [t/s]**

**Analysis**

Interesting split result:

Bottom line: each quant has its strength on the R9700 with Vulkan. The Q4_0 custom mix wins in PP throughput; Q4_K_XL wins in TG. The Q4_0 mix also retains higher precision in the attention and shared-expert layers (Q8_0), which may translate to better output quality despite the lower TG speed. Worth considering depending on whether you prioritize speed or quality in your workflow. Hope this data is useful! Happy to run more tests if needed.
-
Do the R9700 cards not support ROCm yet for testing?
-
This was an out-of-the-box comparison: the goal was to see what you get when you simply plug in each card and run llama.cpp with the least possible setup friction.
-
Mind if I ask how on earth you guys are getting 100+ t/s out of this model on Vulkan? Running the Unsloth UD_Q4_K_XL GGUF on my R9700 in an i5-13400F rig with 128 GB DDR4, I can't get more than 85 t/s with my main card (in the x16 slot) or 67 t/s with the second (in the x4-from-chipset slot). Admittedly, it kinda feels like something screwy is going on with the difference between the two slots, but still... this is my config: Running on Ubuntu 24.04 with the latest Mesa drivers. I find myself begging for GATED_DELTA_NET and speculative decoding just to get decent performance, but you guys seem to be getting the eval rates I'm after without any of that, and I'm just... confused.
-
I ran a detailed performance study of Qwen3-30B-A3B Q4_K_M (a 30B-class MoE model) on a single AMD Radeon AI PRO R9700 (RDNA4, 32 GB, Vulkan backend). The headline number: 183 tokens/s decode at 86% of the card's theoretical bandwidth limit. Below you'll find the full results, a per-layer bandwidth model explaining where the numbers come from, and everything you need to reproduce this yourself. If you have corrections, suggestions, or results from other GPUs, please share: I'd love to build a cross-GPU comparison table.

## Qwen3-30B-A3B Q4_K_M Benchmark on AMD Radeon AI PRO R9700 (RDNA4) — Vulkan

### Summary

183 tokens/s decode (86% of the theoretical bandwidth limit) and 3,033 tokens/s prefill for Qwen3-30B-A3B Q4_K_M on a single AMD Radeon AI PRO R9700 (RDNA4, gfx1201, 32 GB VRAM) using the llama.cpp Vulkan backend with flash attention enabled. MoE routing activates only ~5.9B effective parameters per token (out of 30.53B total), so this 30B-class model decodes at speeds comparable to a dense ~6B model. All 48 layers offloaded to GPU. No multi-GPU.

### Results

#### Text Generation (Decode)

5 runs each, mean ± std. dev. shown.
With flash attention ON, decode speed drops only 6.6% from ctx=128 to ctx=4096; without FA, the drop is 11.9%. This is consistent with the expected reduction in memory-bandwidth pressure from flash attention.

#### Prompt Processing (Prefill)

Prefill throughput plateaus between pp512 and pp1024. The slight drop from 3,033 to 3,009 tokens/s (FA ON) is likely due to increased L2 cache pressure at longer sequences: at pp1024, the combined KV cache and intermediate activations begin to exceed the GPU's 4 MB L2 cache, causing more VRAM round-trips.

### Benchmark Methodology

All benchmarks were executed with `llama-bench`. Each configuration was repeated 5 times; mean and standard deviation are reported.

System state during testing:
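As a side note on the context-length scaling discussed above, the growing KV-cache traffic per decoded token can be sketched numerically. The 48 layers match the post; the 4 KV heads x head_dim 128 and fp16 cache are illustrative assumptions, not values read from the GGUF:

```python
# Per-token KV-cache read vs. context length (rough model).
# LAYERS matches the post; KV_HEADS, HEAD_DIM, and the cache dtype
# are illustrative guesses, not read from the model file.
LAYERS = 48
KV_HEADS = 4
HEAD_DIM = 128
DTYPE_BYTES = 2  # fp16 K and V entries

def kv_read_bytes(ctx: int) -> int:
    # Each decoded token attends over `ctx` cached positions, reading
    # both the K and the V vector for every layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * ctx

for ctx in (128, 4096):
    print(f"ctx={ctx:5d}: {kv_read_bytes(ctx) / 1e6:7.1f} MB per token")
```

Under these assumptions the KV read grows from ~13 MB per token at ctx=128 to ~400 MB at ctx=4096, a low-double-digit percentage on top of the expert-weight reads, which is roughly the right ballpark for the single-digit decode slowdowns reported above.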
### Exact Reproduction Steps

#### 1. Hardware

#### 2. Software

#### 3. Model
#### 4. Build llama.cpp

```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout c5a778891ba0ddbd4cbb507c823f970595b1adc2
mkdir build_vulkan && cd build_vulkan
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
```

#### 5. Run the benchmark
Check your device list first:

```shell
./bin/llama-bench --list-devices
```

On our system, the output is:

**Decode benchmark (Flash Attention OFF):**

```shell
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -t 1 \
  -dev Vulkan1 \
  -fa 0 \
  -p 0 \
  -n 128,256,512,1024,2048,4096 \
  -r 5
```

**Decode benchmark (Flash Attention ON):**

```shell
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -t 1 \
  -dev Vulkan1 \
  -fa 1 \
  -p 0 \
  -n 128,256,512,1024,2048,4096 \
  -r 5
```

**Prefill benchmark:**

```shell
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -t 1 \
  -dev Vulkan1 \
  -fa 1 \
  -p 128,256,512,1024 \
  -n 0 \
  -r 5
```

**Full combined benchmark (decode + prefill, FA ON and OFF):**

```shell
# Flash Attention OFF
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 -t 1 -dev Vulkan1 -fa 0 \
  -p 128,256,512,1024 -n 128,256,512,1024,2048,4096 -r 5

# Flash Attention ON
./bin/llama-bench \
  -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 -t 1 -dev Vulkan1 -fa 1 \
  -p 128,256,512,1024 -n 128,256,512,1024,2048,4096 -r 5
```

#### 6. Parameter explanation
### Raw llama-bench Output

#### Flash Attention OFF (5 repeats)

#### Flash Attention ON (5 repeats)

#### Notes
### Approximate Bandwidth Model

The R9700 has ~640 GB/s peak memory bandwidth. For Qwen3-30B-A3B with top-8 MoE routing, the per-token VRAM read can be estimated as follows. All values assume Q4_K_M quantization at ~4.86 BPW (0.6075 bytes/param average). This is a weighted average: Q4_K_M uses q4_K blocks for most tensors (~4.5 BPW) and q6_K blocks for the attn_v and FFN down_proj tensors (~6.5 BPW); the overall average depends on the model architecture's ratio of these tensor types.

86% bandwidth utilization is a strong result for a Vulkan backend. The remaining ~14% accounts for attention KV-cache reads, kernel dispatch, synchronization, non-GEMV computation (softmax, SiLU, RMSNorm, RoPE, MoE routing), and memory-controller overhead.

The key insight: despite the model's 30.53B total parameters, MoE routing means only ~5.9B (attention + shared expert + 8 routed experts) are read per token, making its bandwidth requirements comparable to a dense ~6B model.
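The bandwidth-bound ceiling implied by this model can be reproduced in a few lines. The peak bandwidth and active-parameter count are the post's estimates; the main uncertainty is the average bits per weight of the *active* tensors, so the script computes the ceiling for two plausible values:

```python
# Bandwidth-bound decode ceiling from the post's estimates. The main
# uncertainty is the average bits-per-weight of the active tensors,
# so the ceiling is computed for two plausible values.
GB = 1e9
PEAK_BW = 640 * GB       # R9700 theoretical peak memory bandwidth
ACTIVE_PARAMS = 5.9e9    # params read per decoded token (post's estimate)

def decode_ceiling(bits_per_weight: float) -> float:
    """Upper limit on decode tokens/s if weight reads saturate VRAM bandwidth."""
    bytes_per_token = ACTIVE_PARAMS * bits_per_weight / 8
    return PEAK_BW / bytes_per_token

for bpw in (4.5, 4.86):  # plain q4_K blocks vs. the model-wide Q4_K_M average
    print(f"{bpw:.2f} BPW -> ceiling ~{decode_ceiling(bpw):.0f} tok/s")
```

Either way the measured 183 tok/s sits close to the weight-read ceiling; the exact utilization figure depends on which BPW you assume for the active tensors and on the KV-cache traffic not modeled here.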
-
Claude sent me here. I'm wondering what tok/s generation AMD R9700 users are seeing with the Qwen 3.5 27B dense model. On paper the NVIDIA 3090 should be fast, but it gets high 20s to low 30s. An RTX 6000 Pro doubles that number (and I assume runs at about the same speed as a 5090); that's based on a 100-200k context size. When driving these models from Claude Code, 25k of context is used up before you even start with the agent's prompt.
-
@digitalscream hi again. This may be a dumb question, but I'm often hitting more than 262K context on my kilocode local LLM setup with the qwen3.5 and qwen3-coder-next models (I got another R9700).
-
~50% less cost yet only ~10% less capable than RTX 5090... "For anyone running local LLMs, doing LoRA fine-tuning, or generating AI video, the 32 GB frame buffer alone justifies the cost for serious workloads, and ROCm 7.2 support is finally first-class. Hitting 91.8% of RTX 5090 CUDA performance on identical fine-tuning workloads on a consumer desktop — versus NVIDIA's flagship on a 120-core datacenter server — is a remarkable result that reframes the whole AMD-vs-NVIDIA AI conversation. If you use Windows only and won't touch WSL2, buy something else. If you're on Linux or willing to run WSL2, this is the most exciting prosumer AI GPU released in years."
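Taking the quoted figures at face value, the value proposition is easy to quantify. A tiny sketch (prices are normalized placeholders derived from the "~50% less cost" claim; substitute actual street prices for a real comparison):

```python
# Relative performance-per-dollar from the quoted figures. Prices are
# normalized placeholders from the "~50% less cost" claim, not real prices.
r9700_price, rtx5090_price = 0.5, 1.0   # normalized: R9700 at half the cost
r9700_perf, rtx5090_perf = 0.918, 1.0   # "91.8% of RTX 5090" on the workload

r9700_value = r9700_perf / r9700_price
rtx5090_value = rtx5090_perf / rtx5090_price
print(f"perf per normalized dollar: R9700 {r9700_value:.2f} vs 5090 {rtx5090_value:.2f}")
```

Roughly 1.8x the performance per dollar under those assumptions, which is the arithmetic behind the "reframes the conversation" take.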
-
RTX 5090 (CUDA) vs AMD Radeon AI PRO R9700 (Vulkan) — llama-bench results
Sharing my llama-bench comparison between the NVIDIA RTX 5090 and AMD Radeon AI PRO R9700 on a MoE model. Hopefully useful for anyone considering the R9700 for local LLM inference.
System Specs
Benchmark Parameters
Model: Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf (~19 GB). Settings covered: GPU offload (-ngl), flash attention (-fa), KV cache types (-ctk/-ctv).

Results: Prompt Processing (PP) [t/s]
Higher is better.
Results: Token Generation (TG) [t/s]
Higher is better.
Observations