This repository was archived by the owner on Jan 22, 2026. It is now read-only.
timeout=0  # seconds waited so far
while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
do
  timeout=$((timeout + 1))
  if [ "$timeout" -gt 240 ]; then  # give up after roughly 4 minutes
    echo "All nodes not joined within 4 minutes. Terminating. Recommend rerun."
    exit 1
  fi
  log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, will check again in 1 second"
  sleep 1
  lines=$(uniq "$HOST_FILE_PATH" | wc -l)
done
Should a node fail during startup, the master and the other workers will spin until the overall job timeout kills them. You can fail faster by limiting the join time, as the loop above does.
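One detail of the join loop worth checking: `uniq` only collapses adjacent duplicate lines, so a node that registers twice with other registrations in between would be counted twice. Below is a minimal sketch of a stricter count, assuming the host file holds one line per registration. The function name `count_joined_nodes` is hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): count distinct
# lines in the host file. sort -u removes duplicates even when they
# are not adjacent, which plain uniq does not.
count_joined_nodes() {
  local host_file="$1"
  # A missing file means no node has joined yet.
  [ -f "$host_file" ] || { echo 0; return; }
  # tr strips the padding some wc implementations add.
  sort -u "$host_file" | wc -l | tr -d ' '
}
```

In the loop, `lines=$(count_joined_nodes "$HOST_FILE_PATH")` would be a drop-in replacement for the `uniq` pipeline.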
For TCP, you'll want an appropriate set of flags. The last one is key: without it, the MPI network can get a packet from an IP it doesn't expect, causing all kinds of problems:
--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0
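As a sketch of how these flags fit into a launch command, the function below assembles an `mpirun` invocation in a Bash array and prints it. The function name `build_mpirun_cmd`, the host file path, the process count, and the application path are all placeholder assumptions. `eth0` should be whatever interface the container actually exposes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an mpirun command line for a TCP-only run.
# Arguments: host file, number of processes, then the application and its args.
# Flags used:
#   --mca pml ob1                  select the ob1 point-to-point layer
#   --mca btl tcp,self             TCP between processes, self for a process sending to itself
#   --mca btl_tcp_if_include eth0  restrict TCP traffic to the eth0 interface
build_mpirun_cmd() {
  local host_file="$1" num_procs="$2"
  shift 2
  local cmd=(
    mpirun
    --hostfile "$host_file"
    -np "$num_procs"
    --mca pml ob1
    --mca btl tcp,self
    --mca btl_tcp_if_include eth0
    "$@"
  )
  # Join the array with spaces so the command can be logged or inspected.
  printf '%s\n' "${cmd[*]}"
}
```

Calling `build_mpirun_cmd /tmp/hostfile 4 ./my_app` prints the assembled command rather than running it, which makes it easy to check the flags in the job log first.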
A small modification makes the job status reflect whatever the application returned:
<user's logic>
RESULT_CODE=$?  # exit status of the user's application
sleep 2
log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
echo "$RESULT_CODE" > "$AWS_BATCH_EXIT_CODE_FILE"
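The same pattern can be wrapped in a small function so that the exit code is captured immediately after the application finishes, with nothing in between to overwrite `$?`. This is a sketch under assumptions: `run_and_record` is a hypothetical name, and the exit code file is passed as an argument rather than read from `$AWS_BATCH_EXIT_CODE_FILE`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, then write its exit status to a file.
# Arguments: exit code file, then the command and its arguments.
run_and_record() {
  local exit_code_file="$1"
  shift
  local result_code=0
  # "|| result_code=$?" records a failure without tripping `set -e`.
  "$@" || result_code=$?
  echo "$result_code" > "$exit_code_file"
  # Propagate the application's status to the caller as well.
  return "$result_code"
}
```

For example, `run_and_record "$AWS_BATCH_EXIT_CODE_FILE" ./my_app` would leave the status of the placeholder `./my_app` in the file.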