CUDA: initialize NCCL comms lazily#21746

Open
JohannesGaessler wants to merge 2 commits into ggml-org:master from JohannesGaessler:cuda-lazy-nccl

Conversation

@JohannesGaessler
Contributor

@JohannesGaessler JohannesGaessler commented Apr 10, 2026

Fixes #21692 .
Fixes #21719 .

On master, NCCL comms are created for all GPUs eagerly and unconditionally. However, it seems that NCCL comms consume several hundred MB of VRAM per device. Also, if only a subset of the visible CUDA devices is used, this causes the AllReduce to hang indefinitely on master. This PR makes it so that NCCL comms are created lazily and cached per vector of device ids. The VRAM for them is never freed until the program terminates, but this is an issue with the CUDA backend in general.

Requirements

@JohannesGaessler JohannesGaessler requested a review from a team as a code owner April 10, 2026 21:53
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 10, 2026
Comment on lines +1154 to +1164
#ifdef GGML_USE_NCCL
static std::map<std::vector<int>, std::vector<ncclComm_t>> comms;

static std::vector<ncclComm_t> ggml_cuda_get_nccl_comms(const std::vector<int> & devs) {
if (comms.find(devs) == comms.end()) {
comms[devs].resize(devs.size());
NCCL_CHECK(ncclCommInitAll(comms[devs].data(), devs.size(), devs.data()));
}
return comms[devs];
}
#endif // GGML_USE_NCCL
Member


I think this is not thread-safe

Contributor Author


It is not, but the way NCCL comms are used in ggml_backend_cuda_allreduce_tensor is, to my knowledge, also not thread-safe anyway.

Contributor Author


I added a mutex both for the creation and usage of NCCL comms.

@tha80

tha80 commented Apr 11, 2026

I can confirm that this PR fixes the problem for me. 👍

@Rotatingxenomorph

Rotatingxenomorph commented Apr 11, 2026

Fixed for -sm row/layer with 2 GPUs. Too bad tensor steals your VRAM, because I went from 15 t/s to 20 t/s even without NCCL installed (not sure I even can on Ubuntu 25.04), but life is unfair.

@jmig1109

Hi guys, as a user with a mixed-GPU setup (two 3090s and an RTX Pro 4000 Blackwell), I tried this PR and noticed that it forces my two 3090s to allocate ~230 MB of VRAM to map P2P memory to each other, even when the router process has no models loaded. The Blackwell card stays at 0.
I have no coding background and no pretensions here, but while trying to find a fix for my setup with Gemini, it suggested that making the P2P checks lazy as well (only running them when the worker process loads a model across those specific GPUs) would prevent wasting this VRAM on the router.



6 participants