Summary: | srun SLURM_NNODES error when running multiple tasks on a single node | ||
---|---|---|---|
Product: | Slurm | Reporter: | Kaylea Nelson <kaylea.nelson> |
Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | ||
Version: | 17.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Yale | ||
Attachments: | slurm.conf |
Description
Kaylea Nelson
2018-03-26 15:25:22 MDT
Comment 1
Alejandro Sanchez <alex@schedmd.com>

Hi Kaylea. I can't reproduce just by executing your same requests:

alex@ibiza:~/t$ srun --pty --ntasks=4 --nodes=1 bash
alex@ibiza:~/t$ srun mpi/mpi_hello
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
alex@ibiza:~/t$

What type of MPI implementation are you using? For my testing, I'm using OpenMPI 3.1 and PMIx 2.1 with Slurm 17.11 current HEAD.

I see this commit, available since 17.11.1, that might be related:
https://github.com/SchedMD/slurm/commit/46378a3fa7c79bfa666d887ae2c3bdd

with this RELEASE_NOTES note:

NOTE: srun will now only read in the environment variables SLURM_JOB_NODES and SLURM_JOB_NODELIST instead of SLURM_NNODES and SLURM_NODELIST. These latter variables have been obsolete for some time; please update any scripts still using them.

Is your MPI implementation relying upon any of these variables, or are you setting any node-related input environment variables affecting the request?

Thanks.

Comment 2
Kaylea Nelson

We are getting this error/warning with all of the OpenMPI versions we have tested (versions commonly used on our clusters):

- 2.1.1 (built with Intel 15.0.2)
- 1.10.3 (built with GCC 5.4.0)
- 1.8.1 (built with Intel 14.0.2)

All versions were built with the --with-pmi flag, and we set MpiDefault=pmi2 in our slurm.conf.

Thanks!
Kaylea

---------------
Kaylea Nelson, PhD | kaylea.nelson@yale.edu
Computational Research Support Analyst
Yale Center for Research Computing <http://research.computing.yale.edu/>
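The variable rename called out in the RELEASE_NOTES excerpt can be bridged in job scripts with a small fallback shim. This is a minimal sketch with illustrative values (node name `c01` and the `unset` guard are stand-ins, not taken from the reporter's site):

```shell
#!/bin/sh
# Fallback shim for the SLURM_NNODES/SLURM_NODELIST -> SLURM_JOB_* rename:
# prefer the new names and fall back to the obsolete ones if only they exist.
unset SLURM_JOB_NODES SLURM_JOB_NODELIST   # force the fallback path for this demo
SLURM_NNODES=1          # --nodes=1, as in this bug's reproducer
SLURM_NODELIST="c01"    # hypothetical node name
NODES="${SLURM_JOB_NODES:-$SLURM_NNODES}"
NODELIST="${SLURM_JOB_NODELIST:-$SLURM_NODELIST}"
echo "nodes=$NODES nodelist=$NODELIST"
```

Scripts updated this way keep working on both pre- and post-17.11.1 srun environments during a transition.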
Comment 3
Alejandro Sanchez

Is your MPI implementation relying upon any of these variables, or are you setting any node-related input environment variables affecting the request? Can you attach your slurm.conf?

Thanks.

Comment 4
Kaylea Nelson

Created attachment 6484 [details]
slurm.conf

I don't think so, just regular installations of those versions of OpenMPI. I can reproduce it with a simple hello world test.
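For context, the MPI setting the reporter describes would appear in slurm.conf roughly as follows; this is an illustrative excerpt, not the contents of attachment 6484 (only the MpiDefault line is confirmed by the report):

```
# Illustrative slurm.conf fragment -- only MpiDefault=pmi2 is stated in this bug
MpiDefault=pmi2
```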
(See the attached slurm.conf.)

Comment 5
Alejandro Sanchez

I can now reproduce on an emulated system with 2 sockets. We'll come back to you with an answer.

Comment 6
Alejandro Sanchez

Hi Kaylea. It turns out I could reproduce this in 17.11.4 but not in 17.11.5. The commit fixing it was:
https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c

So either upgrade to 17.11.5, or apply this patch directly:
https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c.patch
and recompile the srun command.

I'm going to tag this as a duplicate of bug 4769. Please reopen if you have further issues after running with the patch applied. Thanks.

*** This bug has been marked as a duplicate of bug 4769 ***
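The remediation in the final comment amounts to fetching the fix commit as a raw patch and rebuilding srun. A minimal sketch, assuming an already-configured Slurm 17.11.4 autotools source tree; the download and build commands are shown as comments (paths and make targets are assumptions, not from the bug report):

```shell
#!/bin/sh
# GitHub serves a raw patch for any commit URL when ".patch" is appended,
# which is what the suggested patch link relies on.
COMMIT_URL="https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c"
PATCH_URL="${COMMIT_URL}.patch"
echo "$PATCH_URL"
# In the Slurm source tree one would then roughly run (not executed here):
#   curl -LO "$PATCH_URL"
#   patch -p1 < de842149f0fc2188f6504e0c.patch
#   make -C src/srun && sudo make -C src/srun install
```

Upgrading the whole installation to 17.11.5 avoids the local rebuild entirely and is the simpler path where feasible.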