Bug 4769 - SLURM_JOB_NUM_NODES incorrectly set in allocation done by srun
Summary: SLURM_JOB_NUM_NODES incorrectly set in allocation done by srun
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.3
Hardware: Linux
Priority: Normal
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Duplicates: 4985
Depends on:
Blocks:
 
Reported: 2018-02-12 09:36 MST by Valentin Plugaru
Modified: 2018-03-28 11:20 MDT
CC List: 3 users

See Also:
Site: University of Luxembourg
Version Fixed: 17.11.5


Description Valentin Plugaru 2018-02-12 09:36:40 MST
Dear SchedMD,

After upgrading from 17.02.9 to 17.11.3-2, we're seeing the following problem with the SLURM_NNODES and SLURM_JOB_NUM_NODES environment variables when creating a job allocation directly with srun:

$ srun -N 2 --ntasks-per-node 28 -p admin --time=5:0 --pty bash -i
$ env | grep SLURM | grep NODES
SLURM_NNODES=2
SLURM_STEP_NUM_NODES=2
SLURM_JOB_NUM_NODES=56
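
A quick way to spot the inconsistency inside such an allocation (an illustrative one-liner, not part of the original report) is to compare the two variables directly; on an unaffected system it prints nothing:

$ [ "$SLURM_NNODES" = "$SLURM_JOB_NUM_NODES" ] || echo "mismatch: NNODES=$SLURM_NNODES JOB_NUM_NODES=$SLURM_JOB_NUM_NODES"
mismatch: NNODES=2 JOB_NUM_NODES=56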

Trying to run an MPI hello world in this context then produces the following messages:

$ srun ./hellompi
srun: Warning: can't honor --ntasks-per-node set to 28 which doesn't match the requested tasks 56 with the number of requested nodes 56. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (56 != 2).

As per https://slurm.schedmd.com/sbatch.html, SLURM_NNODES is the legacy variable, yet it is the one that is set correctly in this case.
When running a simple printenv within a batch script submitted with sbatch, both variables are set correctly, and the MPI hello world runs fine.
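
For reference, a minimal batch script along these lines (an illustrative sketch; the script name is hypothetical, and the node count, tasks per node, partition and walltime are simply the values from the srun test above) shows both variables agreeing when the allocation is created by sbatch:

$ cat check_nodes.sbatch
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --ntasks-per-node 28
#SBATCH -p admin
#SBATCH --time=5:0
# Under sbatch, both variables report the allocated node count (2 here).
env | grep SLURM | grep NODES
$ sbatch check_nodes.sbatch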

Can you please investigate this issue?

Kind regards,
Valentin
Comment 3 Alejandro Sanchez 2018-03-08 06:22:06 MST
Hi Valentin. I've been able to reproduce the problem and have fixed it in the following commit, which will be available in Slurm 17.11.5 and later:

https://github.com/SchedMD/slurm/commit/de842149f0fc2

Before the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=16 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
srun: Warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 16 with the number of requested nodes 16. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (16 != 2).
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
alex@ibiza:~$

After the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=2 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
alex@ibiza:~$
Comment 4 Valentin Plugaru 2018-03-08 12:31:59 MST
Dear Alejandro,

Thank you for the quick fix; we're looking forward to the 17.11.5 release.

Kind regards,
Valentin
Comment 5 Alejandro Sanchez 2018-03-28 11:20:12 MDT
*** Bug 4985 has been marked as a duplicate of this bug. ***