If I launch an interactive job where I ask for multiple tasks on a single node, I get the following error when I try to launch parallel code:

[kln26@grace1 ~]$ srun --pty --ntasks=4 --nodes=1 bash
[kln26@c38n01 tests]$ srun c_mpitest
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (4 != 1).

Depending on the code, this is just a warning, but other codes will fail to run. It appears that somewhere the number of nodes is getting incorrectly set to the number of tasks. We don't get this error if the allocation spans multiple nodes. Also, jobs run properly if we launch the executable with "mpirun" instead of "srun".

We are seeing this on two clusters, running 17.11.3-2 and 17.11.2, respectively.

Thanks,
Kaylea
Hi Kaylea. I can't reproduce just by executing your same requests:

alex@ibiza:~/t$ srun --pty --ntasks=4 --nodes=1 bash
alex@ibiza:~/t$ srun mpi/mpi_hello
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
alex@ibiza:~/t$

What type of MPI implementation are you using? For my testing, I'm using OpenMPI 3.1 and PMIx 2.1 with Slurm 17.11 current HEAD.

I see this commit, available since 17.11.1, that might be related:

https://github.com/SchedMD/slurm/commit/46378a3fa7c79bfa666d887ae2c3bdd

with this RELEASE_NOTES note:

NOTE: srun will now only read in the environment variables SLURM_JOB_NODES
and SLURM_JOB_NODELIST instead of SLURM_NNODES and SLURM_NODELIST. These
latter variables have been obsolete for some time; please update any
scripts still using them.

Is your MPI implementation relying upon any of these variables, or are you setting any node-related input environment variables affecting the request?

Thanks.
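[Editor's note: a quick way to answer the question above is to print, inside the interactive allocation, the four node-related variables named in the thread and see whether anything in the environment is exporting a stale value. This is a sketch added for readers, not part of the original exchange.]

```shell
# Print each node-related SLURM variable mentioned in the RELEASE_NOTES
# quote above; "<unset>" means nothing in the environment provides it.
for v in SLURM_NNODES SLURM_NODELIST SLURM_JOB_NODES SLURM_JOB_NODELIST; do
    eval "echo \"$v=\${$v:-<unset>}\""
done
```

Run this inside the `srun --pty ... bash` shell; if SLURM_NNODES reports the task count rather than the node count, the stale variable is the culprit.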
We are getting this error/warning with the OpenMPI versions we have tested (and that are commonly used on our clusters):

- 2.1.1 (built with Intel 15.0.2)
- 1.10.3 (built with GCC 5.4.0)
- 1.8.1 (built with Intel 14.0.2)

All versions were built with the --with-pmi flag, and we set MpiDefault=pmi2 in our slurm.conf.

Thanks!
Kaylea

---------------
Kaylea Nelson, PhD | kaylea.nelson@yale.edu
Computational Research Support Analyst
Yale Center for Research Computing <http://research.computing.yale.edu/>
Is your MPI implementation relying upon any of these variables, or are you setting any node-related input environment variables affecting the request?

Can you attach your slurm.conf?

Thanks
Created attachment 6484 [details]
slurm.conf

I don't think so, just regular installations of those versions of OpenMPI. I can reproduce with a simple hello-world test. (See attached for slurm.conf.)
I can reproduce now with an emulated system with 2 sockets. We'll come back to you with an answer.
Hi Kaylea. It turns out I could reproduce in 17.11.4 but not in 17.11.5. I found the commit fixing this was:

https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c

So either upgrade to 17.11.5, or directly apply this patch:

https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c.patch

and recompile the srun command. I'm going to tag this as a duplicate of bug 4769. Please reopen if you have further issues after running with the patch applied.

Thanks.

*** This bug has been marked as a duplicate of bug 4769 ***