CUDA: initialize NCCL comms lazily#21746
JohannesGaessler wants to merge 2 commits into ggml-org:master
Conversation
```cpp
#ifdef GGML_USE_NCCL
// One set of NCCL comms per vector of device ids, created lazily on first use.
static std::map<std::vector<int>, std::vector<ncclComm_t>> comms;

static std::vector<ncclComm_t> ggml_cuda_get_nccl_comms(const std::vector<int> & devs) {
    if (comms.find(devs) == comms.end()) {
        comms[devs].resize(devs.size());
        NCCL_CHECK(ncclCommInitAll(comms[devs].data(), devs.size(), devs.data()));
    }
    return comms[devs];
}
#endif // GGML_USE_NCCL
```
I think this is not thread-safe
It is not, but the way NCCL comms are used in ggml_backend_cuda_allreduce_tensor is, to my knowledge, not thread-safe either.
I added a mutex both for the creation and usage of NCCL comms.
I can confirm that this PR fixes the problem for me. 👍
Fixed for -sm row/layer with 2 GPUs. Too bad -sm tensor steals your VRAM, because I went from 15 t/s to 20 t/s even without NCCL installed (not sure I even can install it on Ubuntu 25.04), but life is unfair.
Hi guys, as a user with a mixed-GPU setup (two 3090s and an RTX Pro 4000 Blackwell), I tried this PR and noticed that it forces my two 3090s to allocate ~230 MB of VRAM to map P2P memory to each other, even when the router process has no models loaded. The Blackwell card stays at 0.
Fixes #21692 .
Fixes #21719 .
On master, NCCL comms are created for all GPUs eagerly and unconditionally. However, it seems that NCCL comms consume several hundred MB of VRAM per device. Also, if only a subset of the visible CUDA devices is used, this causes the AllReduce to hang indefinitely on master. This PR makes it so that NCCL comms are created lazily and per vector of device ids. The VRAM for them is never freed until the program terminates, but this is an issue with the CUDA backend in general.