Dear SchedMD,

After the upgrade from 17.02.9 to 17.11.3-2 we're seeing the following problem with the SLURM_NNODES vs SLURM_JOB_NUM_NODES environment variables when creating a job allocation directly with srun:

$ srun -N 2 --ntasks-per-node 28 -p admin --time=5:0 --pty bash -i
$ env | grep SLURM | grep NODES
SLURM_NNODES=2
SLURM_STEP_NUM_NODES=2
SLURM_JOB_NUM_NODES=56

Trying to run an MPI hello world in this context then produces the following messages:

$ srun ./hellompi
srun: Warning: can't honor --ntasks-per-node set to 28 which doesn't match the requested tasks 56 with the number of requested nodes 56. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (56 != 2).

As per https://slurm.schedmd.com/sbatch.html SLURM_NNODES is the legacy variable, but it's the one getting correctly set in this case. When running a simple printenv within a batch script submitted with sbatch, both variables are set correctly, and the MPI hello world runs fine.

Can you please investigate this issue?

Kind regards,
Valentin
Hi Valentin.

I've been able to reproduce and fix the problem in the following commit, which will be available in Slurm 17.11.5 and later:

https://github.com/SchedMD/slurm/commit/de842149f0fc2

Before the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=16 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
srun: Warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 16 with the number of requested nodes 16. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (16 != 2).
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
alex@ibiza:~$

After the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=2 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
alex@ibiza:~$
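
In both transcripts above, the wrong SLURM_JOB_NUM_NODES value equals nodes times tasks per node (2 x 28 = 56 and 2 x 8 = 16), while SLURM_NNODES holds the real node count. To check whether a given installation shows this symptom, the two variables can be compared from inside an interactive allocation. The following is only a minimal sketch: the function name check_slurm_node_vars and its return codes are hypothetical and not part of Slurm, and it only reports the mismatch, it does not work around it.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Slurm): compare the legacy
# SLURM_NNODES variable with SLURM_JOB_NUM_NODES in the current environment.
# Return codes are an assumption of this sketch:
#   0 = both set and equal, 1 = both set but different, 2 = one or both unset.
check_slurm_node_vars() {
    legacy="${SLURM_NNODES:-}"
    current="${SLURM_JOB_NUM_NODES:-}"

    # Outside an allocation (or with a stripped environment) there is nothing to compare.
    if [ -z "$legacy" ] || [ -z "$current" ]; then
        echo "SLURM_NNODES and/or SLURM_JOB_NUM_NODES not set"
        return 2
    fi

    # The symptom reported in this ticket: the two node counts disagree.
    if [ "$legacy" != "$current" ]; then
        echo "mismatch: SLURM_NNODES=$legacy SLURM_JOB_NUM_NODES=$current"
        return 1
    fi

    echo "ok: SLURM_NNODES and SLURM_JOB_NUM_NODES both report $legacy"
    return 0
}
```

Sourcing this file in the shell started by "srun ... --pty bash -i" and then calling check_slurm_node_vars should print the mismatch line on an affected system and the ok line once the two variables agree.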
Dear Alejandro, Thank you for the quick fix, we're looking forward to the 17.11.5 release. Kind regards, Valentin
*** Bug 4985 has been marked as a duplicate of this bug. ***