Add RCCL and training warmup for HYBRID_SHARD stability #83
Conversation
oyazdanb commented on Feb 3, 2026
- Add warmup.py with RCCL communicator warmup, manual param sync, and training collectives warmup functions
- Integrate RCCL warmup in build_fsdp_model before FSDP init
- Add training warmup in main() before training loop
- Add TORCH_DIST_INIT_TIMEOUT support for dist.init_process_group and dist.new_group calls
- Add cuda.synchronize() before gradient clipping to prevent race
- Improve exception handling in finally block
- Update set_env_variables.sh with DOCKER_ENV_VARS array for automatic env propagation to Docker containers
- Remove duplicate logging in local_launch.sh (logs now only in logs/ directory via master_launch.sh)
- Add warmup config params to shampoo_opt_multi_node.yaml
Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
This pull request adds RCCL communicator and training warmup functionality to improve stability of HYBRID_SHARD distributed training, particularly addressing race conditions in inter-node RDMA setup during FSDP initialization. The PR also enhances timeout handling, exception management in cleanup code, and improves environment variable management in shell scripts.
Changes:
- Adds warmup.py module with functions for RCCL communicator warmup, manual parameter synchronization, and training collectives warmup
- Integrates RCCL warmup before FSDP initialization and training warmup before the main training loop in fsdp_trainer.py
- Adds configurable timeout support using the TORCH_DIST_INIT_TIMEOUT environment variable for distributed initialization
- Refactors shell scripts to centralize environment variable configuration with automatic Docker environment propagation
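The warmup.py diff itself is not reproduced in this excerpt. As a rough illustration only, a communicator warmup of the kind described usually boils down to issuing a few tiny collectives so RCCL/NCCL establishes its inter-node connections eagerly rather than lazily on the first training step (the function name and defaults below are assumptions, not the PR's actual code):

```python
import torch
import torch.distributed as dist

def warmup_rccl_communicators_sketch(device: torch.device, iterations: int = 5) -> None:
    """Issue small all-reduces so communicators (and inter-node RDMA paths)
    are created before FSDP wrapping, instead of lazily on the first step."""
    if not (dist.is_available() and dist.is_initialized()):
        return
    t = torch.ones(1, device=device)
    for _ in range(iterations):
        dist.all_reduce(t)       # creates/warms the default (world) communicator
    torch.cuda.synchronize()     # make sure the collectives actually completed
    dist.barrier()               # rendezvous: every rank is warmed up
```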
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/aorta/utils/warmup.py | New module providing RCCL communicator warmup, manual parameter sync, and training collectives warmup functions |
| src/aorta/utils/__init__.py | Exports the new warmup functions to the public API |
| src/aorta/training/fsdp_trainer.py | Integrates warmup functions, adds timeout configuration, improves exception handling, and adds cuda.synchronize() before gradient clipping |
| scripts/multi_node/set_env_variables.sh | Centralizes NCCL/RCCL environment variables with DOCKER_ENV_VARS array and build_docker_env_flags helper function |
| scripts/multi_node/local_launch.sh | Updates to use centralized environment variables and removes duplicate logging |
| config/multi_node/shampoo_opt_multi_node.yaml | Adds warmup configuration parameters (rccl_warmup_iterations, training_warmup_steps, skip flags) |
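The YAML keys map onto config fields referenced later in this review (fsdp_cfg.skip_training_warmup, fsdp_cfg.training_warmup_steps). A hypothetical sketch of the corresponding config fields with assumed defaults (the actual FSDPConfig definition is not shown in this excerpt):

```python
from dataclasses import dataclass

@dataclass
class FSDPConfigSketch:
    # Warmup-related fields only; names mirror the YAML keys listed above.
    rccl_warmup_iterations: int = 5   # collectives issued during communicator warmup
    training_warmup_steps: int = 2    # full fwd/bwd/optimizer steps before training
    skip_rccl_warmup: bool = False
    skip_training_warmup: bool = False
```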
```python
# Ensure GPU operations are complete before FSDP wrapping
# This helps prevent race conditions with inter-node communicators
torch.cuda.synchronize()
dist.barrier()
```
The dist.barrier() call on line 303 is executed unconditionally for all sharding strategies, not just HYBRID_SHARD. While this is likely safe, it adds unnecessary synchronization overhead for non-HYBRID_SHARD configurations. Consider moving this barrier inside the HYBRID_SHARD if-block (after line 298) to avoid unnecessary synchronization when using other sharding strategies.
Suggested change:
```diff
-# Ensure GPU operations are complete before FSDP wrapping
-# This helps prevent race conditions with inter-node communicators
-torch.cuda.synchronize()
-dist.barrier()
+# Ensure all HYBRID_SHARD ranks reach this point before FSDP wrapping
+dist.barrier()
+# Ensure GPU operations are complete before FSDP wrapping
+# This helps prevent race conditions with inter-node communicators
+torch.cuda.synchronize()
```
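To make the reviewer's intent concrete, a minimal sketch of gating the barrier on the sharding strategy (the surrounding variable name is an assumption, not code from this PR):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import ShardingStrategy

# Hypothetical sketch: only HYBRID_SHARD pays for the extra rendezvous.
torch.cuda.synchronize()
if sharding_strategy == ShardingStrategy.HYBRID_SHARD:  # `sharding_strategy` is assumed
    # Inter-node replicate groups exist only under HYBRID_SHARD,
    # so only then is the pre-wrap barrier needed.
    dist.barrier()
```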
```bash
    local flags=""
    for var in "${DOCKER_ENV_VARS[@]}"; do
        local value="${!var}"
        flags+=" -e ${var}=${value}"
```
The build_docker_env_flags function does not quote the environment variable values when building the flags string. If any environment variable contains spaces or special characters (e.g., NCCL_SOCKET_IFNAME=enp49s0f0np0,fenic0), this could cause issues with shell word splitting when the flags are used in the docker exec command. Consider escaping the values, for example with printf '%q' as in the suggested change below.
Suggested change:
```diff
-        flags+=" -e ${var}=${value}"
+        flags+=" -e ${var}=$(printf '%q' "$value")"
```
```bash
    HSA_ENABLE_SDMA
    # Protocol/Channels
    NCCL_PROTO
    NCCL_MIN_NCHANNELS
```
The DOCKER_ENV_VARS array includes NCCL_MIN_NCHANNELS (line 91), but the actual export statement for NCCL_MIN_NCHANNELS is commented out (line 44). This means the build_docker_env_flags function will pass an undefined or empty value for NCCL_MIN_NCHANNELS to the Docker container. Either uncomment the export on line 44 or remove NCCL_MIN_NCHANNELS from the DOCKER_ENV_VARS array.
Suggested change:
```diff
-    NCCL_MIN_NCHANNELS
```
```python
    Args:
        model: The FSDP-wrapped model
        replicate_group: Inter-node replicate process group for broadcasting
    """
```
The function calls dist.get_rank() without first checking if distributed is initialized. If distributed training is not initialized, this will raise an exception. Consider adding a check with dist.is_initialized() at the beginning of the function, or document that this function must only be called when distributed is initialized.
| """ | |
| """ | |
| if not dist.is_available() or not dist.is_initialized(): | |
| log.warning( | |
| "manual_sync_params called but torch.distributed is not initialized; " | |
| "skipping parameter synchronization." | |
| ) | |
| return |
```python
        dist.barrier()

        log.debug("Warmup step %d complete (rank=%d, loss=%.4f)", warmup_step, rank, loss.item())

    # Reset optimizer state after warmup to not affect actual training
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    dist.barrier()
```
The function calls dist.barrier() on lines 241 and 248 without checking if distributed is initialized. While line 198 conditionally gets the rank, the barrier calls will fail if distributed is not initialized. Consider adding checks before these barrier calls or documenting that this function requires distributed training to be initialized.
Suggested change:
```diff
-        dist.barrier()
+        if dist.is_available() and dist.is_initialized():
+            dist.barrier()

         log.debug("Warmup step %d complete (rank=%d, loss=%.4f)", warmup_step, rank, loss.item())

     # Reset optimizer state after warmup to not affect actual training
     optimizer.zero_grad(set_to_none=True)
     torch.cuda.synchronize()
-    dist.barrier()
+    if dist.is_available() and dist.is_initialized():
+        dist.barrier()
```
```python
# Training warmup: run a few forward/backward/optimizer steps to warm up collectives
if not fsdp_cfg.skip_training_warmup and fsdp_cfg.training_warmup_steps > 0:
```
The training warmup logic uses fsdp_cfg.skip_training_warmup and fsdp_cfg.training_warmup_steps even when dist_mode is "ddp" (lines 909-910). This means the warmup will be skipped or use default values for DDP mode, which may not be the desired behavior. Consider either: (1) adding warmup configuration to DDPConfig, (2) checking dist_mode before applying warmup, or (3) documenting that warmup is only intended for FSDP mode.
Suggested change:
```diff
-# Training warmup: run a few forward/backward/optimizer steps to warm up collectives
-if not fsdp_cfg.skip_training_warmup and fsdp_cfg.training_warmup_steps > 0:
+# Training warmup: run a few forward/backward/optimizer steps to warm up collectives.
+# This warmup is currently configured via FSDPConfig and is only applied in non-DDP modes.
+if dist_mode != "ddp" and not fsdp_cfg.skip_training_warmup and fsdp_cfg.training_warmup_steps > 0:
```
@oyazdanb please take a look at this review comment. I'm not sure if the code design is as you intended.
```diff
@@ -136,18 +155,16 @@ if [ "${ENABLE_ROCPROF}" = "true" ]; then
     log "Running with rocprofv3 kernel tracing inside Docker"
     ${DOCKER_EXEC} bash -c "rocprofv3 ${ROCPROF_ARGS} -d ${ROCPROF_DIR} -- \
         ${BASE_CMD} ${BASE_OVERRIDES} \
-        --override training.output_dir=${OUTPUT_DIR_DOCKER}" \
-        2>&1 | tee -a "${LOG_FILE}"
+        --override training.output_dir=${OUTPUT_DIR_DOCKER}" 2>&1
     fi
 else
     log "Running inside Docker container"
     log "Command: ${BASE_CMD} ${BASE_OVERRIDES} --override training.output_dir=${OUTPUT_DIR_DOCKER}"
     ${DOCKER_EXEC} bash -c "${BASE_CMD} ${BASE_OVERRIDES} \
-        --override training.output_dir=${OUTPUT_DIR_DOCKER}" \
-        2>&1 | tee -a "${LOG_FILE}"
+        --override training.output_dir=${OUTPUT_DIR_DOCKER}" 2>&1
```
The expansion of DOCKER_EXEC here is vulnerable to shell command injection because it is built from untrusted inputs (the DOCKER_CONTAINER CLI argument and environment-derived values in DOCKER_ENV_FLAGS) and then expanded unquoted in a command context. An attacker who can influence DOCKER_CONTAINER or one of the environment variables (e.g., by including characters like ;, && or $(...)) can break out of the intended docker exec invocation and execute arbitrary additional commands on the host. To mitigate this, avoid constructing a full command string in DOCKER_EXEC and instead pass docker exec and its arguments as separate, properly quoted words (or an array), and ensure DOCKER_CONTAINER is strictly validated/whitelisted to be a simple container name.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| """ | ||
| rank = dist.get_rank() | ||
| # Use a larger tensor for more thorough warmup | ||
| warmup_tensor = torch.ones(8192, device=device, dtype=torch.float32) |
Instead of magic numbers, use something like:
```python
# At module level
WARMUP_TENSOR_SIZE = 8192  # 32 KB per GPU (float32)
# Chosen to exercise memory channels without excessive overhead

# In the function:
warmup_tensor = torch.ones(WARMUP_TENSOR_SIZE, device=device, dtype=torch.float32)
```
```bash
# -----------------------------------------------------------------------------
export NCCL_TIMEOUT_MS=12000        # 12 second timeout (legacy, not used by PyTorch)
export NCCL_TIMEOUT=100             # 100 second timeout - first backward can be slow due to JIT/init
export TORCH_DIST_INIT_TIMEOUT=150  # Match collective timeout for consistency
```
NCCL_TIMEOUT and TORCH_DIST_INIT_TIMEOUT are set to different values. Is there a reason we have different values and different environment variables for the same thing? Which one is actually effective in our code?
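For reference, a plausible sketch of how the PR's TORCH_DIST_INIT_TIMEOUT could be consumed on the Python side; the actual wiring in fsdp_trainer.py is not shown in this excerpt, so treat the pattern below as an assumption:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Assumed pattern: the repo-specific TORCH_DIST_INIT_TIMEOUT feeds the
# `timeout` argument of init_process_group / new_group. NCCL_TIMEOUT would
# only matter if some other tooling reads it.
init_timeout = timedelta(seconds=int(os.environ.get("TORCH_DIST_INIT_TIMEOUT", "150")))
dist.init_process_group(backend="nccl", timeout=init_timeout)
```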
```python
    # Final synchronization with extra delay
    torch.cuda.synchronize()
    dist.barrier()
    torch.cuda.synchronize()
```
There are duplicate sync/barrier calls here; one set should be removed.
```python
        log.debug("Warmup step %d complete (rank=%d, loss=%.4f)", warmup_step, rank, loss.item())

    # Reset optimizer state after warmup to not affect actual training
    optimizer.zero_grad(set_to_none=True)
```
Issue: zero_grad() only clears gradients, not optimizer state (momentum, learning rate schedule, etc.). Optimizers like Adam/Shampoo maintain state that is updated during warmup steps.
Problems:
- Warmup steps consume learning-rate schedule steps
- Momentum buffers are initialized with warmup gradients
- The Shampoo preconditioner is updated with warmup data
There are three options:
1. Save the optimizer state before warmup and restore it afterwards:
```python
import copy

# Snapshot optimizer state before warmup (deep copy so warmup cannot mutate it)
optimizer_state_backup = copy.deepcopy(optimizer.state_dict())
# ... run warmup ...
# Restore the initial state (but this is tricky with FSDP-sharded optimizer state)
optimizer.load_state_dict(optimizer_state_backup)
```
2. Create a fresh optimizer after warmup.
3. Document that warmup affects optimizer state.
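A minimal, self-contained sketch of option 2 with a toy model (AdamW stands in for whatever optimizer the trainer actually builds; this is not code from the PR):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# ... run warmup steps with `optimizer` ...

# Option 2: recreate the optimizer so momentum/second-moment state accumulated
# during the warmup steps is discarded before real training begins.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
```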
```python
def warmup_training_collectives(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    dataloader,
```
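The rest of the function is truncated in this excerpt. For orientation, a minimal sketch of what such a warmup loop typically looks like; parameter names beyond the visible signature, the loss function, and the batch layout are all assumptions:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

def warmup_training_collectives_sketch(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    dataloader,
    warmup_steps: int = 2,
) -> None:
    """Run a few real forward/backward/step iterations so FSDP all-gather /
    reduce-scatter paths and optimizer collectives are exercised before the
    measured training loop starts."""
    device = next(model.parameters()).device
    data_iter = iter(dataloader)
    for warmup_step in range(warmup_steps):
        inputs, targets = next(data_iter)           # assumed (inputs, targets) batches
        inputs, targets = inputs.to(device), targets.to(device)
        loss = F.mse_loss(model(inputs), targets)   # placeholder loss
        loss.backward()                             # triggers gradient reduce-scatter
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    if dist.is_available() and dist.is_initialized():
        dist.barrier()                              # all ranks finish warmup together
```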