Name and Version
b8738 and later
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library), llama-server
Command line
./llama-server \
--alias local-ai \
-m /models/google_gemma-4-31B-it-Q4_K_S.gguf \
--host 0.0.0.0 \
--port 8080 \
-np 1 \
-ngl 99 \
-c 131072 \
-n 16000 \
--mmproj /models/mmproj-google_gemma-4-31B-it-bf16.gguf \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--cache-ram 2048 \
-ctxcp 2 \
--flash-attn on \
--jinja \
--chat-template-file /templates/google-gemma-4-31B-it-interleaved.jinja \
--no-prefill-assistant \
--chat-template-kwargs '{"enable_thinking":true}' \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
--metrics
Problem description & steps to reproduce
On a single RTX 3090, builds before b8738 use less VRAM, both at idle and under load, than b8738 and later. Idle VRAM before b8738 is 22726 MiB; on b8738+ it rises to 23172 MiB. A stress test with a tuned full context passes before b8738 and fails on b8738+.
According to an AI-assisted investigation, the suspected cause is that ggml_cuda_init() initializes NCCL communicators even on single-GPU runs, which consumes unnecessary VRAM. The suggested fix is to initialize NCCL only when more than one GPU is present.
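If that analysis is accurate, the fix would amount to a device-count guard before any communicator setup. The sketch below is hypothetical and not taken from llama.cpp/ggml source: the cuda_init_comms() name, the g_nccl_comms state, and the premise that NCCL communicators are created unconditionally are all assumptions carried over from the (unverified) AI analysis.

```cpp
// Hypothetical sketch only -- not actual llama.cpp/ggml source. It assumes,
// per the unverified AI analysis, that init code unconditionally creates
// NCCL communicators. The proposed fix: skip NCCL on single-GPU runs.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>
#include <cstdio>

static std::vector<ncclComm_t> g_nccl_comms; // assumed global state

static void cuda_init_comms() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess) {
        return;
    }

    // Guard: NCCL communicators are only useful for multi-GPU runs.
    // Creating them on a single GPU allocates communicator buffers in
    // VRAM for no benefit.
    if (device_count < 2) {
        return;
    }

    std::vector<int> devices(device_count);
    for (int i = 0; i < device_count; ++i) {
        devices[i] = i;
    }

    g_nccl_comms.resize(device_count);
    // ncclCommInitAll creates one communicator per listed device.
    if (ncclCommInitAll(g_nccl_comms.data(), device_count, devices.data()) != ncclSuccess) {
        g_nccl_comms.clear();
        fprintf(stderr, "NCCL init failed\n");
    }
}
```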
To reproduce: run a build from before b8738 and measure idle VRAM, then repeat on b8738 or later. Verified here at least with a single RTX 3090 on this setup.
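Any device-wide reading (e.g. nvidia-smi) gives the numbers compared above. Purely as an illustration, and not part of llama.cpp, a minimal CUDA probe run while llama-server sits idle reports the same figure, since cudaMemGetInfo() is device-wide and therefore includes the server's allocations:

```cpp
// Standalone VRAM probe (illustrative only). Build with:
//   nvcc -o vram_probe vram_probe.cu
// Run while llama-server is idle; used = total - free covers all processes
// on the device, including the server.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    const double mib = 1024.0 * 1024.0;
    printf("VRAM used: %.0f MiB / %.0f MiB total\n",
           (total_bytes - free_bytes) / mib, total_bytes / mib);
    return 0;
}
```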
First Bad Commit
d6f3030047f85a98b009189e76f441fe818ea44d (b8738)
Relevant log output
Idle VRAM before b8738: 22726 MiB. Idle VRAM on b8738+: 23172 MiB.