CUDA: initialize NCCL comms lazily#21746

Open
JohannesGaessler wants to merge 2 commits into ggml-org:master from JohannesGaessler:cuda-lazy-nccl

Conversation

@JohannesGaessler
Contributor

@JohannesGaessler JohannesGaessler commented Apr 10, 2026

Fixes #21692 .
Fixes #21719 .

On master, NCCL comms are created for all GPUs eagerly and unconditionally. However, it seems that NCCL comms consume several hundred MB of VRAM per device. Also, if only a subset of the visible CUDA devices is used, this causes the AllReduce to hang indefinitely on master. This PR makes it so that NCCL comms are created lazily and cached per vector of device ids. The VRAM for them is never freed until the program terminates, but this is an issue with the CUDA backend in general.

Requirements

@JohannesGaessler JohannesGaessler requested a review from a team as a code owner April 10, 2026 21:53
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 10, 2026
Comment on lines +1154 to +1164
#ifdef GGML_USE_NCCL
static std::map<std::vector<int>, std::vector<ncclComm_t>> comms;

static std::vector<ncclComm_t> ggml_cuda_get_nccl_comms(const std::vector<int> & devs) {
if (comms.find(devs) == comms.end()) {
comms[devs].resize(devs.size());
NCCL_CHECK(ncclCommInitAll(comms[devs].data(), devs.size(), devs.data()));
}
return comms[devs];
}
#endif // GGML_USE_NCCL
Member


I think this is not thread-safe

Contributor Author


It is not, but the way NCCL comms are used in ggml_backend_cuda_allreduce_tensor is, to my knowledge, also not thread-safe anyway.

Contributor Author


I added a mutex both for the creation and usage of NCCL comms.

@tha80

tha80 commented Apr 11, 2026

I can confirm that this PR fixes the problem for me. 👍

@Rotatingxenomorph

Rotatingxenomorph commented Apr 11, 2026

Fixed for -sm row/layer with 2 GPUs. Too bad tensor steals your VRAM, because I went from 15 t/s to 20 t/s even without NCCL installed (not sure I even can on Ubuntu 25.04), but life is unfair.

@jmig1109

Hi guys, as a user with a mixed-GPU setup (two 3090s and an RTX Pro 4000 Blackwell), I tried this PR and noticed that it forces my two 3090s to allocate ~230 MB of VRAM to map P2P memory to each other, even when the router process has no models loaded. The Blackwell card stays at 0.
I have no coding background and no pretensions here, but while trying to find a fix for my setup with Gemini, it suggested that making the P2P checks lazy as well (only running them when the worker process loads a model across those specific GPUs) would prevent wasting this VRAM on the router.



6 participants