NCCL reorganization for NCCL 2.19+ including modern NCCL implementations with new group semantics and changes to what config.blocking seems to do #1007
Conversation
… can reproduce an error in their configuration with a particular combination of NCCL and CUDA (NCCL 2.25 and CUDA 12.8): a segmentation fault when running with periodic boundary conditions and domain decomposition in the periodic direction. I originally thought this was a NCCL versioning issue, but I can't reproduce it on Perlmutter with the latest CUDA 13 and NCCL 2.29 installation we are using there. Nevertheless, it does seem like we might be doing something incorrectly, given some digging around in the NCCL documentation. I don't know if clusters can configure NCCL so that it behaves differently, but the essential problem seems to be the following:

1. In newer versions of NCCL, group semantics appear intended to be stricter. In our current periodic boundary condition flow, ncclGroupStart and ncclGroupEnd are performed per edge; because we use non-blocking communication, a ncclGroupEnd may return a ncclInProgress status, and calling ncclGroupStart while the status is still ncclInProgress makes the system error out. The solution is to reorganize the periodic boundary conditions: loop over all edges, collect the communications needed, and then post a single set of communication calls inside a single ncclGroupStart/ncclGroupEnd block.

2. Our barrier call for the NCCL barrier doesn't synchronize the MPI ranks. I am not sure this is strictly necessary, but it has made debugging exactly where the program gets stuck easier (since I can flush the output to the screen more easily and see more precisely which ranks fail where).

3. More generally, NCCL's direction of wanting to enforce strict group semantics so that it can optimize all the communication at once (instead of serializing the groupStart and groupEnd calls to organize communication sequences) is forcing an experiment in no longer embedding CUDA copy calls inside ncclGroupStart and ncclGroupEnd calls in the NCCL communicator (since everything uses the same CUDA stream, and there are potential issues now if NCCL is not implicitly serializing the communication sequence). Indeed, my latest experiment is to no longer use a single CUDA stream and instead rigidly set up boundaries between the communication calls for species vs. fields.

Just reordering the communication in periodic boundary conditions to use a single set of group calls allows Vlasov simulations to get past initial conditions; 2 and 3 are my current attempts to get the system to take time steps in case there is some rigidity to what NCCL expects (depending on how it's configured or other details of the cluster). Fingers crossed.
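For concreteness, a minimal sketch of the reorganization in point 1; the edge arrays (num_edges, send_buf, recv_buf, nbr_rank, count) and the stream are illustrative placeholders rather than the actual variables in our communicator code:

```c
/* Old flow: one group per edge. With non-blocking communication,
 * ncclGroupEnd() can return ncclInProgress, and opening the next edge's
 * group while that status is pending errors out on newer NCCL. */
for (int e = 0; e < num_edges; ++e) {
  ncclGroupStart();
  ncclSend(send_buf[e], count[e], ncclDouble, nbr_rank[e], comm, stream);
  ncclRecv(recv_buf[e], count[e], ncclDouble, nbr_rank[e], comm, stream);
  ncclGroupEnd(); /* may still be ncclInProgress when the next iteration starts */
}

/* New flow: loop over all edges, collect the communications, and post them
 * inside a single ncclGroupStart/ncclGroupEnd block. */
ncclGroupStart();
for (int e = 0; e < num_edges; ++e) {
  ncclSend(send_buf[e], count[e], ncclDouble, nbr_rank[e], comm, stream);
  ncclRecv(recv_buf[e], count[e], ncclDouble, nbr_rank[e], comm, stream);
}
ncclGroupEnd();
cudaStreamSynchronize(stream);
```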
…as I can tell that on Delta, posting empty group calls (which we have been doing for a while with periodic boundary conditions, and which seem to not be an issue on Perlmutter with NCCL 2.29) causes the system to seg fault. Is this something with Delta's NCCL configuration? A bug in NCCL 2.25 that they fixed? Unclear. Let's see if getting rid of empty group calls helps though (since interior ranks always think they're touching an edge in velocity space in the phase space communicator).
…ike empty group calls. Perhaps this error persists into newer versions of NCCL but is more silent, or perhaps Nvidia did not intend for this to be a bug. Regardless, we know who should be communicating, and we can make sure no empty group calls are made. Restore the extend comm to use the parent CUDA stream. I want to go backwards and revert some of the other experimentation too, I think, but at the very least the refactor of periodic boundary conditions is likely safer long term because it collects all of the communication together instead of creating a group per edge. And if there's no performance penalty, we should probably do all the communication setup, make single group calls, and then finish the communication, simply because NCCL and CUDA share a CUDA stream: Nvidia may further optimize this behavior, and if we're not careful things will start failing due to race conditions.
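A minimal sketch of the guard this implies (has_nccl_ops is an illustrative flag, set while collecting the edge communications):

```c
/* Only open a group if this rank actually has NCCL sends/recvs to post;
 * interior ranks with no periodic neighbors never touch
 * ncclGroupStart/ncclGroupEnd, so no empty group is ever posted. */
if (has_nccl_ops) {
  ncclGroupStart();
  /* ... ncclSend/ncclRecv for the collected edges ... */
  ncclGroupEnd();
  cudaStreamSynchronize(stream);
}
```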
…d the machine scripts for Delta. There's still something wrong (we can run now, but the communication cost is absolute garbage, which makes me think I'm doing something wrong with *which* of the three MPIs that seem to be internal to the nvhpc 25.3 library they've built is being used). But progress nevertheless.
… not sure this is right, but if I use the cray-mpich that Delta has loaded by default, I get access to srun, which seems strictly correct, though the parallel performance is still total shite (spending a factor of 10x more time communicating compared to Perlmutter for the same simulation).
communication with NCCL; it is only a flag for a host-side enqueue that necessitates the use of ncclCommGetAsyncError to determine whether there were any errors in the communication. This ncclCommGetAsyncError is horrendously expensive on Delta. By deleting these calls and flipping the flag, the cost of a multi-GPU simulation on 4 GPUs drops by a factor of 50 (communication goes from taking 99.99+ percent of the run time to taking 25% of it). The other refactors are still necessary in this branch because of how modern NCCL expects us to use groups (group semantics genuinely changed from NCCL 2.18 to NCCL 2.19), but this particular flag also seems to have a rather pernicious performance impact on modern NCCL implementations (and maybe was being ignored in older NCCL implementations?).
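For reference, a sketch of the two configurations being compared here; the helper names are illustrative, but ncclConfig_t, config.blocking, and ncclCommGetAsyncError() are the actual NCCL interfaces involved:

```c
#include <nccl.h>

/* Non-blocking communicator (the old setting): group and collective calls
 * may return ncclInProgress, which forces polling via ncclCommGetAsyncError(). */
static ncclResult_t init_comm_nonblocking(ncclComm_t *comm, int nranks, int rank,
  ncclUniqueId id)
{
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0; /* this branch flips this to 1 and drops the polling below */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}

/* The polling loop that config.blocking = 0 requires after ncclGroupEnd()
 * (and after collectives) -- the call that is so expensive on Delta. */
static void finish_group(ncclComm_t comm)
{
  ncclResult_t res = ncclGroupEnd();
  while (res == ncclInProgress)
    ncclCommGetAsyncError(comm, &res);
}
```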
I merged this into the sheath AI branch and here's a comparison between a restart before and after. This is done on Perlmutter, interactive node, 4 GPUs. [Run command and timing screenshots omitted: "Before" and "With the NCCL fix".] Not clear if there is a performance gain. However, we see that the dt are not exactly the same...
@Antoinehoff is there a reason the restart time is different in the two cases? Presumably the data is the same before and after the restart, but the restart time itself is different, and that alone could explain a small variation in the size of the time step.
I don't understand why the restart does not occur at the same time 🤔 |
Are these two different simulations, or are you restarting the same simulation two different times? We don't precisely write out data at time t = t_end (so if the turbulence has some small variation in the potential that leads to a small change to the time step, you could write out the final frame at 200.005 versus 200.009 or something like that).
Both runs are restarts from the exact same frame. I would have assumed that the time displayed there was the time of the frame, which should not change upon a restart, right?
Okay, that seems like something with the actual restart functionality then, or with what's stored via I/O, because that time is supposed to be read from the metadata in the output file, and I don't know why the time there would be different restart to restart.
I'll try a multiblock regression test. It'd be good to check if the unit tests still work. |
Many of our production simulations on certain clusters, such as Perlmutter, are utilizing NCCL 2.18. The most recent version of NCCL is 2.29 and certain clusters such as Delta (machine files added in this PR) only have more modern NCCL implementations such as NCCL 2.25.
To my understanding: in NCCL 2.18 and earlier, the library provided implicit communicator-level serialization. Operations submitted to a communicator were internally queued and matched across ranks, even across separate ncclGroupStart/ncclGroupEnd boundaries, so per-edge groups posted in separate group calls would still be matched correctly across ranks. This implicit cross-group serialization is gone in NCCL 2.19+. Further, the config.blocking flag we were setting caused all ncclGroupEnd calls to return ncclInProgress (and even collective calls like allreduce and broadcast returned ncclInProgress). If ncclInProgress is returned, we must poll ncclCommGetAsyncError() until ncclSuccess is returned, and this call-and-wait is expensive.

We have thus rewritten significant chunks of nccl_comm.c to do the following:
We have consolidated groups in array_per_sync. All multi-rank periodic send/recv operations are batched into a single ncclGroupStart/ncclGroupEnd call, and self-periodic cases (same-rank copies) are handled before the group with local buffer copies only. With per-direction groups, a send in rank A's lower-x group and the matching recv in rank B's upper-x group could be split across different groups, and under the new NCCL 2.19+ group semantics the ranks deadlock waiting for matches that never arrive within the same group. We have also introduced an empty-group guard: the consolidated group is itself wrapped in an if (has_nccl_ops) check so that interior ranks with no periodic neighbors skip ncclGroupStart/ncclGroupEnd entirely. This avoids a segfault that NCCL 2.25 produces when one rank posts an empty group on the same communicator on which other ranks have actual operations.

Because CUDA operations and NCCL operations share the same stream, we have moved CUDA operations outside of the ncclGroupStart and ncclGroupEnd calls. All buffer preparation (range computation, resize, copy_to_buffer) happens in Phase 1 before ncclGroupStart; only ncclRecv/ncclSend calls appear between ncclGroupStart and ncclGroupEnd; and buffer readout (copy_from_buffer) happens after cudaStreamSynchronize. This change is due to a note in the documentation: "Caution: When called inside a group, stream operations (like ncclAllReduce) can return without having enqueued the operation on the stream. Stream operations like cudaStreamSynchronize can therefore be called only after ncclGroupEnd returns." (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/groups.html). While it may be unnecessary to move all operations which use the stream outside of ncclGroupStart and ncclGroupEnd, it makes the structure of the communication transparent and avoids Nvidia hardening any other stream and group semantics on us in the future.
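A minimal sketch of the resulting structure of array_per_sync (variable and helper names such as fill_send_buffer/drain_recv_buffer are placeholders standing in for the actual copy_to_buffer/copy_from_buffer machinery):

```c
/* Phase 1: all CUDA work (range computation, buffer resize, copy into the
 * send buffers) happens before the group is opened. */
for (int e = 0; e < num_edges; ++e)
  fill_send_buffer(e, stream); /* stands in for copy_to_buffer */

/* Phase 2: only ncclSend/ncclRecv appear inside the group, and the group is
 * skipped entirely when this rank has nothing to communicate. */
if (has_nccl_ops) {
  ncclGroupStart();
  for (int e = 0; e < num_edges; ++e) {
    if (nbr_rank[e] == my_rank) continue; /* same-rank copies were done locally */
    ncclSend(send_buf[e], count[e], ncclDouble, nbr_rank[e], comm, stream);
    ncclRecv(recv_buf[e], count[e], ncclDouble, nbr_rank[e], comm, stream);
  }
  ncclGroupEnd();

  /* Phase 3: synchronize the shared stream only after ncclGroupEnd returns,
   * then read out the receive buffers. */
  cudaStreamSynchronize(stream);
  for (int e = 0; e < num_edges; ++e)
    drain_recv_buffer(e, stream); /* stands in for copy_from_buffer */
}
```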
We have introduced a proper barrier on the MPI communicator inside the NCCL communicator (a minimal sketch is at the end of this description). Because of these stricter group semantics, and comments in the current documentation like "Although NCCL group allows different operations to be issued in one shot, users still need to guarantee the same issuing order of the operations among different GPUs no matter whether the operations are issued to the same or different communicators." (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/groups.html), we wish to properly barrier between array_per_sync and array_sync, since interior ranks return early from array_per_sync and can race ahead to array_sync and submit new NCCL group operations on the communicator while edge ranks are still executing their per_sync group.

We have switched to config.blocking=1 for performance reasons and deleted all calls to ncclCommGetAsyncError(), given just how expensive this call is.

Unit tests pass and Vlasov multi-GPU tests are running fine, but I am tagging in the GK team for testing on Perlmutter and other multi-GPU machines to make sure production GK simulations are unaffected by these changes (and perhaps you can revisit running on Perlmutter with NCCL 2.29).
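Finally, a minimal sketch of the barrier mentioned above (the MPI communicator and stream are whatever the NCCL communicator object actually stores; the function name here is illustrative):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Finish this rank's queued GPU work, then hold all ranks at the same point
 * so no rank can open a new NCCL group while others are still in per_sync. */
static void nccl_comm_barrier(MPI_Comm mpi_comm, cudaStream_t stream)
{
  cudaStreamSynchronize(stream);
  MPI_Barrier(mpi_comm);
}
```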