Installed Slurm from GitHub, branch slurm-19.05, commit 7a6fb06ea6837414df657ce33066de8f38ef496f.

Outside of Slurm, mpirun works:

    mpirun -np 4 -H c38:2,c39:2 mpiTest

srun directly works:

    srun -p batch -N 2 -n 2 mpiTest

Inside sbatch or srun, it does not work:

    sbatch -p batch -N 2 -n 2 --wrap 'mpirun mpiTest'

or

    srun -p batch -N 2 -n 2 --pty bash -l
    mpirun mpiTest

Both of the above result in the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
--------------------------------------------------------------------------

My logs for installing Slurm and OMPI, as well as the sample program mpiTest.c, are here: https://cluster.hpcc.ucr.edu/~jhayes/slurm/19.05.0/

Any help with this is much appreciated. Also, you can add University of California, Riverside to your list of sites ;)

Thanks!
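For anyone trying to reproduce without pulling the linked sources: mpiTest is presumably a minimal MPI hello-world along these lines (a sketch under that assumption, not the actual mpiTest.c from the URL above):

    /* Hypothetical stand-in for mpiTest.c (the real source is at the URL
     * above); a minimal MPI hello-world printing rank, size, and host. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("Hello from rank %d of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }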
I recompiled and installed slurmctld and slurmd with the supported RC: https://download.schedmd.com/slurm/slurm-19.05.0-0rc1.tar.bz2

The issue still persists. However, I have found something interesting (from within an srun session):

    mpirun -mca plm rsh a.out     # works
    mpirun -mca plm slurm a.out   # produces the same error
Tried an strace. It looks like OMPI forks srun, which immediately dies: the cloned child (pid 30289, the srun binary stat()ed just before) exits with status 255, and mpirun then prints the help text. I am still unsure why.

strace mpirun -mca plm slurm a.out
...
stat("/opt/linux/centos/7.x/x86_64/pkgs/slurm/19.05.0/bin/srun", {st_mode=S_IFREG|0755, st_size=690448, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fab93bbca10) = 30289
setpgid(30289, 30289) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30289, si_uid=1384, si_status=255, si_utime=0, si_stime=0} ---
sendto(3, "\21", 1, 0, NULL, 0) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
recvfrom(4, 0x7fab91432fe0, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 30289
wait4(-1, 0x7ffe09c7c9f0, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, 0) = 0 (Timeout)
open("/opt/linux/centos/7.x/x86_64/pkgs/openmpi/3.1.4-slurm-19.05.0/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 27
ioctl(27, TCGETS, 0x7ffe09c7c740) = -1 ENOTTY (Inappropriate ioctl for device)
brk(NULL) = 0x1183000
brk(0x11ac000) = 0x11ac000
fstat(27, {st_mode=S_IFREG|0644, st_size=4147, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fab93bd0000
read(27, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(27, "", 8192) = 0
close(27) = 0
munmap(0x7fab93bd0000, 8192) = 0
brk(NULL) = 0x11ac000
brk(NULL) = 0x11ac000
brk(0x1198000) = 0x1198000
brk(NULL) = 0x1198000
write(2, "--------------------------------"..., 518
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
...
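One small cross-check on the numbers: si_status=255 in the SIGCHLD above is the same failure ORTE later reports as raw status 65280 (0xFF00), since WEXITSTATUS extracts bits 8-15 of the raw wait status. A standalone illustration (hypothetical, not part of OMPI):

    /* Decode a raw wait4() status: 65280 (0xFF00) is a normal exit with
     * status 255, matching si_status=255 in the SIGCHLD above.
     * Standalone illustration, not OMPI code. */
    #include <stdio.h>
    #include <sys/wait.h>

    int main(void)
    {
        int status = 65280;              /* raw status; 65280 >> 8 == 255 */
        if (WIFEXITED(status))
            printf("child exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }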
Ah, OK. I believe I have found the cause:

mpirun --mca plm_base_verbose 100 -mca plm slurm a.out

[c38:05395] mca: base: components_register: registering framework plm components
[c38:05395] mca: base: components_register: found loaded component slurm
[c38:05395] mca: base: components_register: component slurm register function successful
[c38:05395] mca: base: components_open: opening plm components
[c38:05395] mca: base: components_open: found loaded component slurm
[c38:05395] mca: base: components_open: component slurm open function successful
[c38:05395] mca:base:select: Auto-selecting plm components
[c38:05395] mca:base:select:( plm) Querying component [slurm]
[c38:05395] [[INVALID],INVALID] plm:slurm: available for selection
[c38:05395] mca:base:select:( plm) Query of component [slurm] set priority to 75
[c38:05395] mca:base:select:( plm) Selected component [slurm]
[c38:05395] plm:base:set_hnp_name: initial bias 5395 nodename hash 3205578609
[c38:05395] plm:base:set_hnp_name: final jobfam 38259
[c38:05395] [[38259,0],0] plm:base:receive start comm
[c38:05395] [[38259,0],0] plm:base:setup_job
[c38:05395] [[38259,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[c38:05395] [[38259,0],0] plm:base:setup_vm
[c38:05395] [[38259,0],0] plm:base:setup_vm creating map
[c38:05395] [[38259,0],0] plm:base:setup_vm add new daemon [[38259,0],1]
[c38:05395] [[38259,0],0] plm:base:setup_vm assigning new daemon [[38259,0],1] to node c39
[c38:05395] [[38259,0],0] plm:slurm: launching on nodes c39
[c38:05395] [[38259,0],0] plm:slurm: final top-level argv:
    srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=c39 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "2507341824" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "c[2:38-39]@0(2)" -mca orte_hnp_uri "2507341824.0;tcp://10.102.11.38,10.104.11.38:49214" --mca plm_base_verbose "100"
srun: unrecognized option '--cpu_bind=none'
srun: unrecognized option '--cpu_bind=none'
Try "srun --help" for more information
[c38:05395] [[38259,0],0] plm:slurm: srun returned non-zero exit status (65280) from launching the per-node daemon
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
--------------------------------------------------------------------------
[c38:05395] [[38259,0],0] plm:base:orted_cmd sending orted_exit commands
[c38:05395] [[38259,0],0] plm:base:receive stop comm
[c38:05395] mca: base: close: component slurm closed
[c38:05395] mca: base: close: unloading component slurm

It looks like Slurm 19.05.0 no longer has a cpu_bind option for srun? I could patch OMPI not to pass the cpu_bind argument, or would that be a bad idea? Is this argument necessary? Is there a more recent alternative in Slurm 19.05.0?
Ah ha! In "orte/mca/plm/slurm/plm_slurm_module.c" from OMPI it seems like a typo. Line ~281:

    opal_argv_append(&argc, &argv, "--cpu_bind=none");

should be:

    opal_argv_append(&argc, &argv, "--cpu-bind=none");

OMPI has an underscore where Slurm expects a dash. I will create the appropriate patch on my side. Consider this ticket closed. Thanks!
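For anyone carrying this locally until an upstream fix lands, a minimal sketch of the one-line patch (hunk context is approximate, reconstructed from the snippet above rather than copied from a specific OMPI release):

    --- a/orte/mca/plm/slurm/plm_slurm_module.c
    +++ b/orte/mca/plm/slurm/plm_slurm_module.c
    @@ (around line 281) @@
    -    opal_argv_append(&argc, &argv, "--cpu_bind=none");
    +    opal_argv_append(&argc, &argv, "--cpu-bind=none");

After applying, rebuild and reinstall OMPI so that mpirun launches orted via srun with the dash spelling that Slurm 19.05 accepts.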
This is horrible! Do you know when it will be fixed in an OpenMPI release?
https://github.com/open-mpi/ompi/issues/6743