The drops always happen, and increasing the world size aggravates the problem. For the same data, the drops occur on different steps, so the problem is data-agnostic.
Diagnosis is via the profiler; the problem is not deterministically reproducible.
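One way to check the "same data, different steps" observation is to locate the drop steps in each run and compare them. The following is a minimal sketch, assuming the drops show up in a logged per-step metric such as throughput; the helper `find_drop_steps`, its `window` and `threshold` parameters, and the sample values are all hypothetical and not taken from the actual runs.

```python
from statistics import median


def find_drop_steps(values, window=5, threshold=0.5):
    """Return the step indices whose value is below `threshold` times
    the median of the preceding `window` steps (hypothetical helper)."""
    drops = []
    for step in range(window, len(values)):
        baseline = median(values[step - window:step])
        if values[step] < threshold * baseline:
            drops.append(step)
    return drops


# Made-up per-step metric values for two runs over the same data.
run_a = [100, 101, 99, 100, 102, 100, 40, 100, 101, 99, 100, 100]
run_b = [100, 99, 101, 100, 100, 101, 100, 100, 100, 35, 100, 101]

drops_a = find_drop_steps(run_a)
drops_b = find_drop_steps(run_b)

# Steps where both runs drop; an empty list means no drop step is shared,
# which is what a data-agnostic problem would look like.
shared = sorted(set(drops_a) & set(drops_b))
print(drops_a, drops_b, shared)
```

A median baseline is used here so that a single earlier drop inside the window does not pull the baseline down; the window size and threshold would need tuning against real logs.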
256 and 512 GPU runs:
4 GPU run:
Compile is the only difference
All runs were on JUPITER; the same behavior can be seen locally