I think multiple processes are racing to write checkpoints to the same file. Each process should probably save its checkpoints to its own directory, e.g. `model_dir/<hvd.local_rank()>/`.
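A minimal sketch of what I mean, assuming the rank is passed in as a plain integer. The helper name `per_rank_model_dir` is made up for illustration and is not part of Horovod. Note that `hvd.local_rank()` is only unique within one machine, so on a multi-node job that writes to a shared filesystem, `hvd.rank()` would probably be the safer choice.

```python
import os


def per_rank_model_dir(model_dir: str, local_rank: int) -> str:
    """Return a checkpoint directory specific to one process, creating it if needed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    rank_dir = os.path.join(model_dir, str(local_rank))
    # exist_ok=True keeps this safe if the directory is already there.
    os.makedirs(rank_dir, exist_ok=True)
    return rank_dir


# With Horovod this would be called roughly as follows (not executed here):
#   import horovod.tensorflow as hvd
#   hvd.init()
#   model_dir = per_rank_model_dir("model_dir", hvd.local_rank())
```

Each process then writes under its own subdirectory, so no two processes should touch the same checkpoint file.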