Summary: | SLURM_JOB_NUM_NODES incorrectly set in allocation done by srun | ||
---|---|---|---|
Product: | Slurm | Reporter: | Valentin Plugaru <valentin.plugaru> |
Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | Normal | CC: | jess, kaylea.nelson, valentin.plugaru |
Version: | 17.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | University of Luxembourg | Alineos Sites: | --- |
Bull/Atos Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | | Version Fixed: | 17.11.5 |
Target Release: | --- | DevPrio: | --- |
Description
Valentin Plugaru
2018-02-12 09:36:40 MST
Hi Valentin.

I've been able to reproduce and fix the problem in the following commit, which will be available in Slurm 17.11.5 and later:

https://github.com/SchedMD/slurm/commit/de842149f0fc2

Before the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=16 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
srun: Warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 16 with the number of requested nodes 16. Ignoring --ntasks-per-node.
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (16 != 2).
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
alex@ibiza:~$

After the patch:

alex@ibiza:~$ srun -N 2 --ntasks-per-node 8 --pty bash -i
alex@ibiza:~$ env | grep NODES
SLURM_NNODES=2
SLURM_JOB_NUM_NODES=2 <---
SLURM_STEP_NUM_NODES=2
alex@ibiza:~$ srun /home/alex/t/mpi/xthi2
Hello from rank 1, thread 0, on ibiza. (core affinity = 4)
Hello from rank 2, thread 0, on ibiza. (core affinity = 1)
Hello from rank 3, thread 0, on ibiza. (core affinity = 5)
Hello from rank 4, thread 0, on ibiza. (core affinity = 2)
Hello from rank 5, thread 0, on ibiza. (core affinity = 6)
Hello from rank 6, thread 0, on ibiza. (core affinity = 3)
Hello from rank 7, thread 0, on ibiza. (core affinity = 7)
Hello from rank 8, thread 0, on ibiza. (core affinity = 0)
Hello from rank 11, thread 0, on ibiza. (core affinity = 5)
Hello from rank 9, thread 0, on ibiza. (core affinity = 4)
Hello from rank 12, thread 0, on ibiza. (core affinity = 2)
Hello from rank 10, thread 0, on ibiza. (core affinity = 1)
Hello from rank 14, thread 0, on ibiza. (core affinity = 3)
Hello from rank 13, thread 0, on ibiza. (core affinity = 6)
Hello from rank 0, thread 0, on ibiza. (core affinity = 0)
Hello from rank 15, thread 0, on ibiza. (core affinity = 7)
alex@ibiza:~$

Dear Alejandro,

Thank you for the quick fix, we're looking forward to the 17.11.5 release.

Kind regards,
Valentin
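
For sites that cannot upgrade to 17.11.5 right away, a small guard in the interactive shell or job script could at least flag the symptom shown above, where SLURM_JOB_NUM_NODES (16, which matches the task count of 2 nodes x 8 tasks per node) disagrees with SLURM_NNODES (2). The following is only a minimal sketch: the function name check_slurm_node_counts is hypothetical and not part of Slurm, and it assumes that the two variables are expected to agree inside an allocation, as they do in the "After the patch" output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): compare the node-count variables
# exported inside an allocation and report a mismatch like the one in this bug.
check_slurm_node_counts() {
    nnodes="${SLURM_NNODES:-}"
    job_nodes="${SLURM_JOB_NUM_NODES:-}"

    # Outside an allocation neither variable is set; nothing to compare.
    if [ -z "$nnodes" ] || [ -z "$job_nodes" ]; then
        echo "node-count variables unset (not inside a Slurm allocation?)" >&2
        return 2
    fi

    # Assumption: both variables should hold the same node count.
    if [ "$nnodes" != "$job_nodes" ]; then
        echo "mismatch: SLURM_NNODES=$nnodes SLURM_JOB_NUM_NODES=$job_nodes" >&2
        return 1
    fi

    echo "ok: $nnodes node(s)"
    return 0
}
```

Calling check_slurm_node_counts right after the allocation starts, before launching any job step, returns 1 for the values seen before the patch and 0 for the values seen after it. It does not work around the bug; it only makes the inconsistency visible early.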