Summary: | mpirun fails with "An ORTE daemon has unexpectedly failed after launch..." | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jordan Hayes <jhayes> |
Component: | Other | Assignee: | Jacob Jenson <jacob> |
Status: | RESOLVED INVALID | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | CC: | CPonder, hayesjordan |
Version: | 19.05.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7311 | ||
Site: | -Other- | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | CentOS |
Machine Name: | | CLE Version: | |
Version Fixed: | | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Jordan Hayes
2019-05-09 10:59:54 MDT
I recompiled and installed slurmctld and slurmd with the supported RC: https://download.schedmd.com/slurm/slurm-19.05.0-0rc1.tar.bz2. The issue still persists. However, I have found something interesting (from within an srun session):

```
mpirun -mca plm rsh a.out    # Works
mpirun -mca plm slurm a.out  # Produces the same error
```

I tried an strace, which looks like OMPI calls srun and then dies, but I am still unsure why:

```
strace mpirun -mca plm slurm a.out
...
stat("/opt/linux/centos/7.x/x86_64/pkgs/slurm/19.05.0/bin/srun", {st_mode=S_IFREG|0755, st_size=690448, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fab93bbca10) = 30289
setpgid(30289, 30289)                   = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30289, si_uid=1384, si_status=255, si_utime=0, si_stime=0} ---
sendto(3, "\21", 1, 0, NULL, 0)         = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
recvfrom(4, 0x7fab91432fe0, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 30289
wait4(-1, 0x7ffe09c7c9f0, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}], 5, 0) = 0 (Timeout)
open("/opt/linux/centos/7.x/x86_64/pkgs/openmpi/3.1.4-slurm-19.05.0/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 27
ioctl(27, TCGETS, 0x7ffe09c7c740)       = -1 ENOTTY (Inappropriate ioctl for device)
brk(NULL)                               = 0x1183000
brk(0x11ac000)                          = 0x11ac000
fstat(27, {st_mode=S_IFREG|0644, st_size=4147, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fab93bd0000
read(27, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(27, "", 8192)                      = 0
close(27)                               = 0
munmap(0x7fab93bd0000, 8192)            = 0
brk(NULL)                               = 0x11ac000
brk(NULL)                               = 0x11ac000
brk(0x1198000)                          = 0x1198000
brk(NULL)                               = 0x1198000
write(2, "--------------------------------"..., 518
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
...
```

Ah, OK. I believe I have found the cause:

```
mpirun --mca plm_base_verbose 100 -mca plm slurm a.out
[c38:05395] mca: base: components_register: registering framework plm components
[c38:05395] mca: base: components_register: found loaded component slurm
[c38:05395] mca: base: components_register: component slurm register function successful
[c38:05395] mca: base: components_open: opening plm components
[c38:05395] mca: base: components_open: found loaded component slurm
[c38:05395] mca: base: components_open: component slurm open function successful
[c38:05395] mca:base:select: Auto-selecting plm components
[c38:05395] mca:base:select:( plm) Querying component [slurm]
[c38:05395] [[INVALID],INVALID] plm:slurm: available for selection
[c38:05395] mca:base:select:( plm) Query of component [slurm] set priority to 75
[c38:05395] mca:base:select:( plm) Selected component [slurm]
[c38:05395] plm:base:set_hnp_name: initial bias 5395 nodename hash 3205578609
[c38:05395] plm:base:set_hnp_name: final jobfam 38259
[c38:05395] [[38259,0],0] plm:base:receive start comm
[c38:05395] [[38259,0],0] plm:base:setup_job
[c38:05395] [[38259,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[c38:05395] [[38259,0],0] plm:base:setup_vm
[c38:05395] [[38259,0],0] plm:base:setup_vm creating map
[c38:05395] [[38259,0],0] plm:base:setup_vm add new daemon [[38259,0],1]
[c38:05395] [[38259,0],0] plm:base:setup_vm assigning new daemon [[38259,0],1] to node c39
[c38:05395] [[38259,0],0] plm:slurm: launching on nodes c39
[c38:05395] [[38259,0],0] plm:slurm: final top-level argv:
    srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=c39 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "2507341824" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "c[2:38-39]@0(2)" -mca orte_hnp_uri "2507341824.0;tcp://10.102.11.38,10.104.11.38:49214" --mca plm_base_verbose "100"
srun: unrecognized option '--cpu_bind=none'
srun: unrecognized option '--cpu_bind=none'
Try "srun --help" for more information
[c38:05395] [[38259,0],0] plm:slurm: srun returned non-zero exit status (65280) from launching the per-node daemon
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
--------------------------------------------------------------------------
[c38:05395] [[38259,0],0] plm:base:orted_cmd sending orted_exit commands
[c38:05395] [[38259,0],0] plm:base:receive stop comm
[c38:05395] mca: base: close: component slurm closed
[c38:05395] mca: base: close: unloading component slurm
```

It looks like Slurm 19.05.0 no longer has a cpu_bind option for srun? I can patch OMPI to not pass the cpu_bind argument, or would that be a bad idea? Is this argument necessary? Is there a more recent alternative in Slurm 19.05.0?

Ah ha!
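As a side note on the numbers in that log: the 65280 reported by the slurm plm is a raw wait(2) status, and decoding it (here with Python's `os` module as a quick illustration) shows it is simply srun exiting with code 255, matching the `si_status=255` in the SIGCHLD line of the strace:

```python
import os

# 65280 is the raw wait(2) status the plm:slurm message reports.
# The child's exit code lives in the high byte: 65280 == 255 << 8.
status = 65280
assert os.WIFEXITED(status)           # srun exited normally, not killed by a signal
assert os.WEXITSTATUS(status) == 255  # the exit code srun returned
print(status >> 8)                    # 255
```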
In "orte/mca/plm/slurm/plm_slurm_module.c" from OMPI it seems to be a typo. Line ~281:

```c
opal_argv_append(&argc, &argv, "--cpu_bind=none");
```

should be:

```c
opal_argv_append(&argc, &argv, "--cpu-bind=none");
```

OMPI has an underscore where Slurm now expects a dash. I will create the appropriate patch on my side. Consider this ticket closed. Thanks.

This is horrible! Do you know when it will be fixed in an OpenMPI release? |
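For anyone hitting this, the failure mode is easy to reproduce outside Slurm: any parser that matches option names literally treats `--cpu_bind` and `--cpu-bind` as different options. A minimal Python sketch (a stand-in for srun's option table, not its real parser):

```python
import argparse

# Hypothetical stand-in for srun's option table: Slurm 19.05
# registers the option as --cpu-bind, spelled with a dash.
parser = argparse.ArgumentParser(prog="srun")
parser.add_argument("--cpu-bind")

# The dash spelling parses fine...
args = parser.parse_args(["--cpu-bind=none"])
assert args.cpu_bind == "none"

# ...but the underscore spelling that OMPI passes is a different
# option name entirely, so the parser rejects it.
try:
    parser.parse_args(["--cpu_bind=none"])
except SystemExit:
    print("unrecognized option '--cpu_bind=none'")
```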