Bug 6993 - mpirun fails with "An ORTE daemon has unexpectedly failed after launch..."
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 19.05.x
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2019-05-09 10:59 MDT by Jordan Hayes
Modified: 2019-09-27 17:30 MDT
Linux Distro: CentOS


Description Jordan Hayes 2019-05-09 10:59:54 MDT
Installed Slurm from Github, branch slurm-19.05, commit 7a6fb06ea6837414df657ce33066de8f38ef496f

Outside of Slurm, mpirun works:
  mpirun -np 4 -H c38:2,c39:2 mpiTest

Srun directly works:
  srun -p batch -N 2 -n 2 mpiTest

Inside sbatch or srun, mpirun does not work:
  sbatch -p batch -N 2 -n 2 --wrap 'mpirun mpiTest'
OR
  srun -p batch -N 2 -n 2 --pty bash -l
  mpirun mpiTest

Both of the above result in the following error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

My logs for installing Slurm and OMPI as well as the sample program mpiTest.c are here:
  https://cluster.hpcc.ucr.edu/~jhayes/slurm/19.05.0/

Any help with this is much appreciated.
Also, you can add University of California to your list of sites ;)

Thanks!
Comment 1 Jordan Hayes 2019-05-10 11:16:44 MDT
I recompiled and installed Slurmctld and Slurmd with the supported RC:
    https://download.schedmd.com/slurm/slurm-19.05.0-0rc1.tar.bz2

The issue still persists.

However, I have found something interesting (from within a srun session):
    mpirun -mca plm rsh a.out   # Works
    mpirun -mca plm slurm a.out # Produces the same error
Comment 2 Jordan Hayes 2019-05-10 11:41:31 MDT
I ran mpirun under strace. It looks like OMPI launches srun, and the srun child then exits with status 255, but I am still unsure why.

strace mpirun -mca plm slurm a.out
...
stat("/opt/linux/centos/7.x/x86_64/pkgs/slurm/19.05.0/bin/srun", {st_mode=S_IFREG|0755, st_size=690448, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fab93bbca10) = 30289
setpgid(30289, 30289)                   = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30289, si_uid=1384, si_status=255, si_utime=0, si_stime=0} ---
sendto(3, "\21", 1, 0, NULL, 0)         = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
recvfrom(4, 0x7fab91432fe0, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 30289
wait4(-1, 0x7ffe09c7c9f0, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, 0) = 0 (Timeout)
open("/opt/linux/centos/7.x/x86_64/pkgs/openmpi/3.1.4-slurm-19.05.0/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 27
ioctl(27, TCGETS, 0x7ffe09c7c740)       = -1 ENOTTY (Inappropriate ioctl for device)
brk(NULL)                               = 0x1183000
brk(0x11ac000)                          = 0x11ac000
fstat(27, {st_mode=S_IFREG|0644, st_size=4147, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fab93bd0000
read(27, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(27, "", 8192)                      = 0
close(27)                               = 0
munmap(0x7fab93bd0000, 8192)            = 0
brk(NULL)                               = 0x11ac000
brk(NULL)                               = 0x11ac000
brk(0x1198000)                          = 0x1198000
brk(NULL)                               = 0x1198000
write(2, "--------------------------------"..., 518--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
...
Comment 3 Jordan Hayes 2019-05-10 14:05:55 MDT
Ah, OK.
I believe I have found the cause:

mpirun --mca plm_base_verbose 100 -mca plm slurm a.out
[c38:05395] mca: base: components_register: registering framework plm components
[c38:05395] mca: base: components_register: found loaded component slurm
[c38:05395] mca: base: components_register: component slurm register function successful
[c38:05395] mca: base: components_open: opening plm components
[c38:05395] mca: base: components_open: found loaded component slurm
[c38:05395] mca: base: components_open: component slurm open function successful
[c38:05395] mca:base:select: Auto-selecting plm components
[c38:05395] mca:base:select:(  plm) Querying component [slurm]
[c38:05395] [[INVALID],INVALID] plm:slurm: available for selection
[c38:05395] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[c38:05395] mca:base:select:(  plm) Selected component [slurm]
[c38:05395] plm:base:set_hnp_name: initial bias 5395 nodename hash 3205578609
[c38:05395] plm:base:set_hnp_name: final jobfam 38259
[c38:05395] [[38259,0],0] plm:base:receive start comm
[c38:05395] [[38259,0],0] plm:base:setup_job
[c38:05395] [[38259,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[c38:05395] [[38259,0],0] plm:base:setup_vm
[c38:05395] [[38259,0],0] plm:base:setup_vm creating map
[c38:05395] [[38259,0],0] plm:base:setup_vm add new daemon [[38259,0],1]
[c38:05395] [[38259,0],0] plm:base:setup_vm assigning new daemon [[38259,0],1] to node c39
[c38:05395] [[38259,0],0] plm:slurm: launching on nodes c39
[c38:05395] [[38259,0],0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=c39 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "2507341824" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "c[2:38-39]@0(2)" -mca orte_hnp_uri "2507341824.0;tcp://10.102.11.38,10.104.11.38:49214" --mca plm_base_verbose "100"                              
srun: unrecognized option '--cpu_bind=none'
srun: unrecognized option '--cpu_bind=none'
Try "srun --help" for more information
[c38:05395] [[38259,0],0] plm:slurm: srun returned non-zero exit status (65280) from launching the per-node daemon
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[c38:05395] [[38259,0],0] plm:base:orted_cmd sending orted_exit commands
[c38:05395] [[38259,0],0] plm:base:receive stop comm
[c38:05395] mca: base: close: component slurm closed
[c38:05395] mca: base: close: unloading component slurm

It looks like srun in Slurm 19.05.0 no longer accepts the --cpu_bind option.
Would it be a bad idea to patch OMPI so that it does not pass this argument?
Is the argument necessary, and is there a newer equivalent in Slurm 19.05.0?
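
The 65280 in the plm:slurm message and the 255 in the earlier strace output are probably the same failure seen at two levels. A minimal sketch, assuming 65280 is a raw POSIX wait status of the kind waitpid returns (the log does not say so explicitly):

```python
raw_status = 65280  # value printed by plm:slurm in the verbose log above

# In a POSIX wait status the exit code sits in bits 8-15, and the low
# 7 bits hold the terminating signal (0 means the process exited normally).
exit_code = (raw_status >> 8) & 0xFF
term_signal = raw_status & 0x7F

print(exit_code, term_signal)
```

Under that assumption srun exited on its own with code 255 rather than being killed by a signal, which fits an argument-parsing failure such as the unrecognized option shown in the log.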
Comment 4 Jordan Hayes 2019-05-10 14:27:38 MDT
Ah ha!

In "orte/mca/plm/slurm/plm_slurm_module.c" from OMPI it seems like a typo:

    Line ~281: opal_argv_append(&argc, &argv, "--cpu_bind=none");

Should be
    Line ~281: opal_argv_append(&argc, &argv, "--cpu-bind=none");

OMPI has an underscore instead of a dash.
I will create the appropriate patch on my side.
Consider this ticket closed.
Thanks
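
The one-character fix described above could be applied to a local source tree with a small script. This is only a sketch: patch_cpu_bind_flag is a hypothetical helper name, and it assumes the string appears in the file exactly as quoted in this comment.

```python
OLD_FLAG = '"--cpu_bind=none"'  # underscore spelling rejected by srun in 19.05
NEW_FLAG = '"--cpu-bind=none"'  # dash spelling proposed in this comment


def patch_cpu_bind_flag(source_text: str) -> str:
    """Return source_text with the underscore spelling replaced by the dash spelling."""
    return source_text.replace(OLD_FLAG, NEW_FLAG)


# Demonstrate on the single line quoted from plm_slurm_module.c.
line = 'opal_argv_append(&argc, &argv, "--cpu_bind=none");'
patched = patch_cpu_bind_flag(line)
print(patched)
```

To patch a real checkout, one would read orte/mca/plm/slurm/plm_slurm_module.c, pass its contents through the function, write the result back, and review the diff before rebuilding OMPI.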
Comment 5 Moe Jette 2019-05-15 14:43:11 MDT
Bug in OpenMPI code:

In "orte/mca/plm/slurm/plm_slurm_module.c" from OMPI it seems like a typo:

    Line ~281: opal_argv_append(&argc, &argv, "--cpu_bind=none");

Should be
    Line ~281: opal_argv_append(&argc, &argv, "--cpu-bind=none");

OMPI has an underscore instead of a dash.
I will create the appropriate patch on my side.
Consider this ticket closed.
Comment 6 Carl Ponder 2019-09-27 16:41:07 MDT
This is horrible! Do you know when it will be fixed in an OpenMPI release?
Comment 7 Jordan Hayes 2019-09-27 17:30:20 MDT
https://github.com/open-mpi/ompi/issues/6743