A few trivial changes we've noted

1. Consider a timeout while nodes are joining, e.g.:
  ```
  # Assumes $lines was already initialized earlier in the script.
  timeout=0
  while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
  do
    timeout=$((timeout + 1))
    if [ "$timeout" -gt 240 ]; then
      echo "Not all nodes joined within 4 minutes. Terminating. Recommend rerun."
      exit 1
    fi
    log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, will check again in 1 second"
    sleep 1
    # sort -u counts distinct hosts even when duplicate entries are not adjacent
    lines=$(sort -u "$HOST_FILE_PATH" | wc -l)
  done
  ```
Should a node fail during startup, the master and the other workers will spin until the overall job timeout kills them. Limiting the join time gets rid of the stuck job sooner.

2. For TCP, you'll want an appropriate set of flags: `--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0`. The last one is key: without it, the MPI network can get a packet from an IP it doesn't expect, causing all kinds of problems.

3. A small modification so that the job status reflects whatever the application returned:
```
  <user's logic>
  RESULT_CODE=$?
  sleep 2
  log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "$RESULT_CODE" > "$AWS_BATCH_EXIT_CODE_FILE"
```
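
For what it's worth, items 2 and 3 could be combined along these lines. This is only a rough sketch, not the script's actual code: `MPI_TCP_FLAGS` and `run_and_record` are names made up for illustration, `eth0` is assumed to be the right interface on your instances, and the `mpirun` line in the trailing comment (including `./my_app` and one process per node) is a guess at how the application might be launched.

```bash
#!/usr/bin/env bash
# Sketch only: MPI_TCP_FLAGS and run_and_record are hypothetical names.

# TCP transport flags from item 2; eth0 is an assumption about the instance.
MPI_TCP_FLAGS=(--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0)

# Run a command, write its exit status to the given file, and return it.
run_and_record() {
  local exit_code_file="$1"
  shift
  local result_code=0
  # Capture the status right away; any later command would overwrite $?.
  "$@" || result_code=$?
  echo "$result_code" > "$exit_code_file"
  return "$result_code"
}

# Possible usage (not executed here; ./my_app is a placeholder):
# run_and_record "$AWS_BATCH_EXIT_CODE_FILE" \
#   mpirun "${MPI_TCP_FLAGS[@]}" --hostfile "$HOST_FILE_PATH" \
#   -np "$AWS_BATCH_JOB_NUM_NODES" ./my_app
```

The point of the helper is that the status is captured before `sleep` or `log` runs, so the value written to the exit code file is the application's own.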
  
