Background information
I am working on a project, part of which runs multiple prun commands in parallel from multiple processes to launch multiple tasks. Some of these commands use the --add-hostfile option to extend an existing DVM.
What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)
55536ef
What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)
openpmix/openpmix@bde8038
Please describe the system on which you are running
- Operating system/version: AlmaLinux 8.7 (Stone Smilodon)
- Computer hardware: aarch64
- Network type:
Details of the problem
Steps to reproduce
- Get two nodes.
- Create a hostfile containing one node and an add_hostfile containing the other node.
- Use hostfile to start the DVM: prte --report-uri dvm.uri --hostfile hostfile
- Spawn two processes using Python multiprocessing and run two prun commands in parallel, one with the --add-hostfile option and one without, as follows:
from multiprocessing import Pool
import subprocess

def run(x):
    print(x)
    # Run the prun command through the shell; each command redirects its own stdout to a file
    process = subprocess.Popen(x, stdout=subprocess.PIPE, shell=True)
    output, error = process.communicate()

prun_commands = ["prun --display allocation --dvm-uri file:dvm.uri --map-by ppr:2:node -n 2 hostname > out0", "prun --display allocation --dvm-uri file:dvm.uri --add-hostfile add_hostfile --map-by ppr:2:node -n 2 hostname > out1"]

if __name__ == "__main__":
    # Launch both prun commands concurrently from two worker processes
    with Pool(2) as p:
        p.map(run, prun_commands)
This produces the following errors:
[st-master:2320852] PMIx_Spawn failed (-25): UNREACHABLE
[st-master:2320851] PMIx_Spawn failed (-25): UNREACHABLE
If the --add-hostfile option is not given, both processes run without error.
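To see which of the two concurrent launches fails and with what exit status, the reproducer can be varied to keep each command's return code and stderr instead of discarding them. This is a minimal sketch: run_capture and run_all are hypothetical helper names, not part of PRRTE. It uses a thread pool instead of multiprocessing.Pool, on the assumption that each worker only needs to wait on its own child prun process, so the prun commands still run concurrently.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_capture(cmd):
    """Run one shell command and return (command, exit code, stderr text)."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return cmd, result.returncode, result.stderr


def run_all(commands):
    """Run all commands concurrently; results come back in input order."""
    # Threads are enough here: each worker just blocks waiting on its child process
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_capture, commands))
```

For example, looping over run_all(prun_commands) and printing each exit code and stderr shows which launch reported the PMIx_Spawn failure. Because the prun commands redirect stdout to out0 and out1, only stderr is captured here.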
While debugging, I can see that the daemon is launched on the added node, but it throws the following error during initialization and then terminates the PMIx server.
A process or daemon was unable to complete a TCP connection
to another process:
Local host: st27
Remote host: 192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
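One way to follow up on the firewall hint is to test, from the added node (st27), whether a plain TCP connection to the remote host 192.168.0.254 can be opened at all. This is a sketch under stated assumptions: can_connect is a hypothetical helper, and the port the daemon actually tries to reach is not shown in the log above, so it must be supplied by the caller.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and unreachable hosts all raise OSError subclasses
        return False
```

For example, can_connect("192.168.0.254", port) run on st27 returning False for the relevant port would point to filtering between the nodes rather than to the concurrent prun launches.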
@hppritcha @rhc54 Please advise.