Summary: | mpirun fails with "An ORTE daemon has unexpectedly failed after launch..." | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jordan Hayes <jhayes> |
Component: | Other | Assignee: | Jacob Jenson <jacob> |
Status: | RESOLVED INVALID | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | CC: | CPonder, hayesjordan |
Version: | 19.05.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7311 | ||
Site: | -Other- | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | CentOS |
Machine Name: | | CLE Version: | |
Version Fixed: | | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Jordan Hayes
2019-05-09 10:59:54 MDT
I recompiled and installed slurmctld and slurmd with the supported RC: https://download.schedmd.com/slurm/slurm-19.05.0-0rc1.tar.bz2. The issue still persists. However, I have found something interesting (from within an srun session):

```
mpirun -mca plm rsh a.out    # Works
mpirun -mca plm slurm a.out  # Produces the same error
```

I tried an strace, which looks like OMPI calls srun and then dies, but I am still unsure why:

```
strace mpirun -mca plm slurm a.out
...
stat("/opt/linux/centos/7.x/x86_64/pkgs/slurm/19.05.0/bin/srun", {st_mode=S_IFREG|0755, st_size=690448, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fab93bbca10) = 30289
setpgid(30289, 30289)                   = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30289, si_uid=1384, si_status=255, si_utime=0, si_stime=0} ---
sendto(3, "\21", 1, 0, NULL, 0)         = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
recvfrom(4, 0x7fab91432fe0, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 30289
wait4(-1, 0x7ffe09c7c9f0, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, 0) = 0 (Timeout)
open("/opt/linux/centos/7.x/x86_64/pkgs/openmpi/3.1.4-slurm-19.05.0/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 27
ioctl(27, TCGETS, 0x7ffe09c7c740)       = -1 ENOTTY (Inappropriate ioctl for device)
brk(NULL)                               = 0x1183000
brk(0x11ac000)                          = 0x11ac000
fstat(27, {st_mode=S_IFREG|0644, st_size=4147, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fab93bd0000
read(27, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(27, "", 8192)                      = 0
close(27)                               = 0
munmap(0x7fab93bd0000, 8192)            = 0
brk(NULL)                               = 0x11ac000
brk(NULL)                               = 0x11ac000
brk(0x1198000)                          = 0x1198000
brk(NULL)                               = 0x1198000
write(2, "--------------------------------"..., 518
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
...
```

Ah, OK. I believe I have found the cause:

```
mpirun --mca plm_base_verbose 100 -mca plm slurm a.out
[c38:05395] mca: base: components_register: registering framework plm components
[c38:05395] mca: base: components_register: found loaded component slurm
[c38:05395] mca: base: components_register: component slurm register function successful
[c38:05395] mca: base: components_open: opening plm components
[c38:05395] mca: base: components_open: found loaded component slurm
[c38:05395] mca: base: components_open: component slurm open function successful
[c38:05395] mca:base:select: Auto-selecting plm components
[c38:05395] mca:base:select:( plm) Querying component [slurm]
[c38:05395] [[INVALID],INVALID] plm:slurm: available for selection
[c38:05395] mca:base:select:( plm) Query of component [slurm] set priority to 75
[c38:05395] mca:base:select:( plm) Selected component [slurm]
[c38:05395] plm:base:set_hnp_name: initial bias 5395 nodename hash 3205578609
[c38:05395] plm:base:set_hnp_name: final jobfam 38259
[c38:05395] [[38259,0],0] plm:base:receive start comm
[c38:05395] [[38259,0],0] plm:base:setup_job
[c38:05395] [[38259,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[c38:05395] [[38259,0],0] plm:base:setup_vm
[c38:05395] [[38259,0],0] plm:base:setup_vm creating map
[c38:05395] [[38259,0],0] plm:base:setup_vm add new daemon [[38259,0],1]
[c38:05395] [[38259,0],0] plm:base:setup_vm assigning new daemon [[38259,0],1] to node c39
[c38:05395] [[38259,0],0] plm:slurm: launching on nodes c39
[c38:05395] [[38259,0],0] plm:slurm: final top-level argv:
    srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=c39 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "2507341824" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "c[2:38-39]@0(2)" -mca orte_hnp_uri "2507341824.0;tcp://10.102.11.38,10.104.11.38:49214" --mca plm_base_verbose "100"
srun: unrecognized option '--cpu_bind=none'
srun: unrecognized option '--cpu_bind=none'
Try "srun --help" for more information
[c38:05395] [[38259,0],0] plm:slurm: srun returned non-zero exit status (65280) from launching the per-node daemon
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
--------------------------------------------------------------------------
[c38:05395] [[38259,0],0] plm:base:orted_cmd sending orted_exit commands
[c38:05395] [[38259,0],0] plm:base:receive stop comm
[c38:05395] mca: base: close: component slurm closed
[c38:05395] mca: base: close: unloading component slurm
```

It looks like Slurm 19.05.0 no longer has a cpu_bind option for srun? I can patch OMPI to not pass the cpu_bind argument, or would that be a bad idea? Is this argument necessary? Is there a more recent alternative in Slurm 19.05.0?

Ah ha!
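As a side note on the numbers in that log: the 65280 reported by the slurm plm is a raw wait(2) status, and decoding it (here with Python's `os` module as a quick illustration) shows it is simply srun exiting with code 255, matching the `si_status=255` in the SIGCHLD line of the strace:

```python
import os

# 65280 is the raw wait(2) status the plm:slurm message reports.
# The child's exit code lives in the high byte: 65280 == 255 << 8.
status = 65280
assert os.WIFEXITED(status)           # srun exited normally, not killed by a signal
assert os.WEXITSTATUS(status) == 255  # the exit code srun returned
print(status >> 8)                    # 255
```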
In "orte/mca/plm/slurm/plm_slurm_module.c" from OMPI it seems to be a typo. Line ~281:

```c
opal_argv_append(&argc, &argv, "--cpu_bind=none");
```

should be:

```c
opal_argv_append(&argc, &argv, "--cpu-bind=none");
```

OMPI has an underscore where Slurm now expects a dash. I will create the appropriate patch on my side. Consider this ticket closed. Thanks.

This is horrible! Do you know when it will be fixed in an OpenMPI release? |
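For anyone hitting this, the failure mode is easy to reproduce outside Slurm: any parser that matches option names literally treats `--cpu_bind` and `--cpu-bind` as different options. A minimal Python sketch (a stand-in for srun's option table, not its real parser):

```python
import argparse

# Hypothetical stand-in for srun's option table: Slurm 19.05
# registers the option as --cpu-bind, spelled with a dash.
parser = argparse.ArgumentParser(prog="srun")
parser.add_argument("--cpu-bind")

# The dash spelling parses fine...
args = parser.parse_args(["--cpu-bind=none"])
assert args.cpu_bind == "none"

# ...but the underscore spelling that OMPI passes is a different
# option name entirely, so the parser rejects it.
try:
    parser.parse_args(["--cpu_bind=none"])
except SystemExit:
    print("unrecognized option '--cpu_bind=none'")
```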