Hi,

A user has just discovered that the recent upgrade of Slurm to 20.11 broke his MPI code. That user uses mpiexec from OpenMPI because srun does not offer all of the same capabilities. He uses "--map-by ppr:12:socket" because his code works best when a fixed number of processes is placed on each socket. The bug can be seen with this trivial example:

[mboisson@cedar1]$ salloc --exclusive --nodes=2 --time=1:00:00 --account=def-mboisson --mem=0
salloc: Pending job allocation 57435524
salloc: job 57435524 queued and waiting for resources
salloc: job 57435524 has been allocated resources
salloc: Granted job allocation 57435524
[mboisson@cdr768]$ mpiexec --map-by ppr:12:socket hostname | sort | uniq -c
     24 cdr768.int.cedar.computecanada.ca
     12 cdr774.int.cedar.computecanada.ca

This happens irrespective of the version of mpiexec that I use (tested OpenMPI 2.1.1 and 4.0.3). It goes without saying that I have 2 sockets per node, hence there should be 48 processes started, not 36.
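For reference, the expected rank count under "--map-by ppr:N:socket" is simply nodes × sockets per node × N. A quick shell sketch of that arithmetic (expected_ranks is a hypothetical helper for illustration, not part of Open MPI or Slurm):

```shell
# expected_ranks: how many processes "--map-by ppr:P:socket" should launch
# across the whole allocation.
# args: <nodes> <sockets_per_node> <procs_per_socket>
expected_ranks() {
  echo $(( $1 * $2 * $3 ))
}

expected_ranks 2 2 12   # 2 nodes x 2 sockets x 12 ranks/socket -> 48
```

With 2 nodes and 2 sockets per node this gives 48, which is why the 36 ranks observed above (24 + 12) indicate a bug.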
More information. CPU bindings are all messed up with Slurm 20.11 and OpenMPI:

[mboisson@cedar1 def-mboisson]$ cat test.sh
#!/bin/bash
mpiexec --map-by ppr:12:socket --bind-to core:overload-allowed hostname | sort | uniq -c
mpiexec -n 64 --report-bindings numactl --show | grep physcpubind | sort | uniq -c
[mboisson@cedar1 def-mboisson]$ sbatch --nodes=2 --time=1:00:00 --account=def-mboisson --mem=0 --ntasks=64 test.sh
[mboisson@cedar1 def-mboisson]$ cat slurm-57436990.out
     24 cdr1311.int.cedar.computecanada.ca
     12 cdr1313.int.cedar.computecanada.ca
     32 physcpubind: 0
     32 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Same thing with Slurm 20.02 on Graham:

$ cat slurm-41965236.out | grep -v "socket"
     24 gra535
     24 gra66
     32 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
     32 physcpubind: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
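As a side note for anyone reproducing this: the "| sort | uniq -c" idiom in the scripts above prefixes each distinct output line with its occurrence count, which is how the per-node rank tallies are produced. A small simulation with made-up hostnames in place of real mpiexec output:

```shell
# uniq -c only merges adjacent duplicates, hence the sort in front.
# Counts here: 2 for cdr768, 1 for cdr774 (hostnames are placeholders).
printf 'cdr774\ncdr768\ncdr768\n' | sort | uniq -c
```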
See other comments on the OpenMPI GitHub: https://github.com/open-mpi/ompi-www/pull/342
It appears that setting the environment variable SLURM_WHOLE=1 resolves the problem. However, there does not appear to be any documentation available that explains the effects of setting this variable:

What are the effects of always setting SLURM_WHOLE=1?
What functionality is not available when SLURM_WHOLE=1 is set?
Should SLURM_WHOLE=1 be set only if --exclusive is specified?
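For context, the workaround currently in use looks roughly like this (a sketch only; ./app is a placeholder, and whether exporting the variable unconditionally is safe is exactly what the questions above are asking):

```
#!/bin/bash
# Workaround sketch: export SLURM_WHOLE=1 in the job script so that
# subsequent mpiexec launches behave as before the 20.11 upgrade.
export SLURM_WHOLE=1
mpiexec --map-by ppr:12:socket ./app
```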
Maxime,

Thanks for opening the ticket, but it's actually a duplicate of Bug 10383. We're aware of the discussion under the pull request to openmpi-www updating the Slurm FAQ there; in fact, I'm the author of it.

Martin,

>But does not appear to be any documentation available that explains the effects of setting this variable

Yes - this was missing, but it got fixed in Bug 10430.

>What functionality is not available when SLURM_WHOLE=1 is set?
>Should SLURM_WHOLE=1 be set only if --exclusive is specified?

The brief answer is that the variable follows the convention for srun input variables, and setting it is equivalent to the `srun --whole` option. It does not disable any functionality; rather, it creates a step with access to all job resources - not only those requested for the step. I think you may find the discussion under Bug 10383 interesting and more detailed.

cheers,
Marcin

*** This bug has been marked as a duplicate of bug 10383 ***
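In other words, the following two invocations should be equivalent (a sketch only; it requires a Slurm installation to actually run, and ./app is a placeholder):

```
# Give the step access to all resources of the job, not just those
# requested for the step itself:
srun --whole ./app

# Same effect via the corresponding input environment variable:
SLURM_WHOLE=1 srun ./app
```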
Mmm, Marcin, then bug 10383 is mistitled. We are not using UCX and we encounter problems. This is why I created this one.
>Mmm, Marcin, then bug 10383 is mistitled. We are not using UCX and we encounter problems. This is why I created this one.

Understood. I'm glad you reached out to us and that we were able to match the two reports. The initial "errors" experienced in Bug 10383 were in fact different, but since they have the same root cause we prefer to merge them. It's one issue, with different symptoms depending on the case.