Bug 4769

Summary: SLURM_JOB_NUM_NODES incorrectly set in allocation done by srun
Product: Slurm
Component: Scheduling
Version: 17.11.3
Version Fixed: 17.11.5
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: Normal
Reporter: Valentin Plugaru <valentin.plugaru>
Assignee: Alejandro Sanchez <alex>
CC: jess, kaylea.nelson, valentin.plugaru
Hardware: Linux
OS: Linux
Site: University of Luxembourg

Description Valentin Plugaru 2018-02-12 09:36:40 MST
Dear SchedMD,

After the upgrade from 17.02.9 to 17.11.3-2, we're seeing the following problem with the SLURM_NNODES vs SLURM_JOB_NUM_NODES environment variables when creating a job allocation directly with srun:

$ srun -N 2 --ntasks-per-node 28 -p admin --time=5:0 --pty bash -i
$ env | grep SLURM | grep NODES
SLURM_NNODES=2
SLURM_STEP_NUM_NODES=2
SLURM_JOB_NUM_NODES=56

Trying to run an MPI hello world in this context then brings the following messages:

$ srun ./hellompi
srun: Warning: can't honor --ntasks-per-node set to 28 which doesn't match the requested tasks 56 with the number of requested nodes 56. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (56 != 2).

As per https://slurm.schedmd.com/sbatch.html, SLURM_NNODES is the legacy variable, but it's the one that gets set correctly in this case.
When running a simple printenv within a batch script submitted with sbatch, both variables are set correctly, and the MPI hello world runs fine.
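Until the bug is resolved, an interactive session could guard against this mismatch by checking the two variables for consistency before launching MPI steps. The sketch below is hypothetical, not an official SchedMD workaround; the hard-coded values merely reproduce the mismatch observed above for illustration.

```shell
# Simulate the inconsistent environment seen after
# `srun -N 2 --ntasks-per-node 28 ... --pty bash -i` on 17.11.3.
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=56

# If the two node counts disagree, trust the legacy SLURM_NNODES
# (the correctly set one in this case) and realign SLURM_JOB_NUM_NODES
# before running any job step.
if [ "$SLURM_JOB_NUM_NODES" != "$SLURM_NNODES" ]; then
    echo "mismatch: SLURM_JOB_NUM_NODES=$SLURM_JOB_NUM_NODES, SLURM_NNODES=$SLURM_NNODES" >&2
    SLURM_JOB_NUM_NODES="$SLURM_NNODES"
    export SLURM_JOB_NUM_NODES
fi
echo "SLURM_JOB_NUM_NODES=$SLURM_JOB_NUM_NODES"
```

With the mismatched values above, the check fires and the subsequent `srun` step would see a node count consistent with the allocation.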

Can you please investigate this issue?

Kind regards,
Valentin
Comment 3 Alejandro Sanchez 2018-03-08 06:22:06 MST
Hi Valentin. I've been able to reproduce and fix the problem in the following commit, which will be available in Slurm 17.11.5 and later:

https://github.com/SchedMD/slurm/commit/de842149f0fc2

Before the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=16 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
srun: Warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 16 with the number of requested nodes 16. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (16 != 2).
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
alex@ibiza:~$

After the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=2 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
alex@ibiza:~$
Comment 4 Valentin Plugaru 2018-03-08 12:31:59 MST
Dear Alejandro,

Thank you for the quick fix, we're looking forward to the 17.11.5 release.

Kind regards,
Valentin
Comment 5 Alejandro Sanchez 2018-03-28 11:20:12 MDT
*** Bug 4985 has been marked as a duplicate of this bug. ***