Bug 4985

Summary: srun SLURM_NNODES error when running multiple tasks on a single node
Product: Slurm Reporter: Kaylea Nelson <kaylea.nelson>
Component: Scheduling    Assignee: Alejandro Sanchez <alex>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.3   
Hardware: Linux   
OS: Linux   
Site: Yale
Attachments: slurm.conf

Description Kaylea Nelson 2018-03-26 15:25:22 MDT
If I launch an interactive job where I ask for multiple tasks on a single node, I get the following error when I try to launch parallel code:

[kln26@grace1 ~]$ srun --pty --ntasks=4 --nodes=1 bash
[kln26@c38n01 tests]$ srun c_mpitest
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (4 != 1).

Depending on the code, this is just a warning, but other codes will fail to run.

It appears that somewhere the number of nodes is being incorrectly set to the number of tasks. We don't get this error if the allocation spans multiple nodes. Also, jobs run properly if we launch the executable with "mpirun" instead of "srun".
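For example, the relevant variables can be inspected from inside the allocation with something like:

echo "NNODES=$SLURM_NNODES JOB_NUM_NODES=$SLURM_JOB_NUM_NODES NTASKS=$SLURM_NTASKS"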

We are seeing this on two clusters, running 17.11.3-2 and 17.11.2, respectively.

Thanks,
Kaylea
Comment 1 Alejandro Sanchez 2018-03-27 02:57:13 MDT
Hi Kaylea. I can't reproduce this just by running the same requests:

alex@ibiza:~/t$ srun --pty --ntasks=4 --nodes=1 bash
alex@ibiza:~/t$ srun mpi/mpi_hello
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
alex@ibiza:~/t$

What type of MPI implementation are you using? For my testing, I'm using OpenMPI 3.1 and PMIx 2.1 with Slurm 17.11 current HEAD.

I see this commit available since 17.11.1 that might be related:

https://github.com/SchedMD/slurm/commit/46378a3fa7c79bfa666d887ae2c3bdd

with this RELEASE_NOTES note:

NOTE: srun will now only read in the environment variables SLURM_JOB_NODES and
      SLURM_JOB_NODELIST instead of SLURM_NNODES and SLURM_NODELIST.  These
      latter variables have been obsolete for some time please update any
      scripts still using them.

Is your MPI implementation relying upon any of these variables or are you setting any node-related input environment variables affecting the request?
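For instance, something like this, run from the shell where you launch srun, would show whether any of those variables are already set:

env | grep -E '^SLURM_(NNODES|NODELIST|JOB_NODES|JOB_NODELIST)='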

Thanks.
Comment 2 Kaylea Nelson 2018-03-27 08:57:49 MDT
We are getting this error/warning with the OpenMPI versions we have tested
(and that are commonly used on our clusters):

2.1.1 (built with Intel 15.0.2)
1.10.3 (built with GCC 5.4.0)
1.8.1 (built with Intel 14.0.2).

All versions were built with the --with-pmi flag, and we set
MpiDefault=pmi2 in our slurm.conf.
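The PMI side can be double-checked with something like:

srun --mpi=list
ompi_info | grep -i pmi

(srun --mpi=list shows the MPI plugin types Slurm supports, and ompi_info should show whether the OpenMPI build picked up PMI support.)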

Thanks!
Kaylea
Comment 3 Alejandro Sanchez 2018-03-27 08:59:34 MDT
Is your MPI implementation relying upon any of these variables or are you setting any node-related input environment variables affecting the request?

Can you attach your slurm.conf?

Thanks
Comment 4 Kaylea Nelson 2018-03-27 09:07:45 MDT
Created attachment 6484 [details]
slurm.conf

I don't think so; just regular installations of those versions of OpenMPI.
I can reproduce with a simple hello world test (see attached for
slurm.conf).
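A minimal reproduction looks something like this (mpi_hello.c here is just a placeholder for any standard MPI hello-world source):

$ mpicc -o mpi_hello mpi_hello.c          # any simple MPI hello world
$ srun --pty --ntasks=4 --nodes=1 bash    # interactive allocation on one node
$ srun ./mpi_hello                        # launched inside the allocation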

Comment 5 Alejandro Sanchez 2018-03-28 09:08:20 MDT
I can now reproduce this on an emulated system with 2 sockets. We'll get back to you with an answer.
Comment 7 Alejandro Sanchez 2018-03-28 11:20:12 MDT
Hi Kaylea. It turns out I could reproduce this in 17.11.4 but not in 17.11.5. The commit that fixes it is:

https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c

So either upgrade to 17.11.5 or directly apply this patch:

https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c.patch

and recompile the srun command. I'm going to tag this as a duplicate of bug 4769. Please reopen if you have further issues after running with the patch applied. Thanks.
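If you go the patch route, roughly something like this against an already-configured 17.11 source tree should be enough (paths assume the standard Slurm source layout):

$ wget https://github.com/SchedMD/slurm/commit/de842149f0fc2188f6504e0c.patch
$ patch -p1 < de842149f0fc2188f6504e0c.patch    # run from the top of the Slurm source tree
$ make -C src/srun && make -C src/srun install  # rebuild and reinstall just srun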

*** This bug has been marked as a duplicate of bug 4769 ***