Summary: | IntelMPI jobs bound to wrong cpus when --nodes >= 128 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Bjørn-Helge Mevik <b.h.mevik> |
Component: | Other | Assignee: | Alejandro Sanchez <alex> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | alex |
Version: | 17.02.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Sigma2 Norway | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | C code for test program, Main slurm config file, Nodes config file, cgroup config file |
Hi. Would you mind attaching your slurm.conf and cgroup.conf files? I'd be interested in your complete salloc request you executed for your test. Thanks.

I'm lowering the severity to 3 if you don't mind, since this isn't a high-impact problem that is causing sporadic outages. See: https://www.schedmd.com/support.php

Created attachment 5387 [details]
Main slurm config file
Created attachment 5388 [details]
Nodes config file
Created attachment 5389 [details]
cgroup config file
(In reply to Alejandro Sanchez from comment #1)
> Hi. Would you mind attaching your slurm.conf and cgroup.conf files? I'd be
> interested in your complete salloc request you executed for your test.

Sure! Config files added now (slurm.conf, slurmnodes.conf and cgroups.conf).

The salloc command we used (with a varying number of nodes in the list) was:

    salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodelist=c[15-18,20-25,27-36]-[1-12],c38-[1-12],c19-[1-4]

(this was with 256 nodes). We have also tried --nodes instead, like this:

    salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodes=128

with the same results.

We have also discovered a strange warning that we get only when specifying from 33 to 127 nodes (inclusive), for instance:

    319 (0) $ salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodes=33
    salloc: Granted job allocation 8028
    336 (0) $ module load intel/2017a
    337 (0) $ mpirun hostname > _log
    srun: Warning: can't honor --ntasks-per-node set to 32 which doesn't match the requested tasks 33 with the number of requested nodes 33. Ignoring --ntasks-per-node.
    338 (0) $ wc -l _log
    1056 _log

However, the number of lines in _log is always correct (32 * --nodes). The warning goes away when --nodes <= 32 or --nodes >= 128.

Thanks for the attached files. Looking at your configuration, I'd suggest switching to:

    TaskPlugin=task/affinity,task/cgroup   in slurm.conf

and

    TaskAffinity=no   in cgroup.conf

This uses the task/affinity plugin for setting the affinity of the tasks (which is better than and different from task/cgroup) and uses the task/cgroup plugin to fence the job into the specified memory, GPUs, etc., thus combining the best of both pieces. I see you already have ConstrainCores=yes in cgroup.conf, which is fine. I'd also launch the steps with srun when possible. Could you please try that and see if things improve?

(In reply to Alejandro Sanchez from comment #7)
> Thanks for the attached files. Looking at your configuration, I'd suggest
> switching to:
>
> TaskPlugin=task/affinity,task/cgroup in slurm.conf
> and
> TaskAffinity=no in cgroup.conf
>
> This uses the task/affinity plugin for setting the affinity of the tasks
> (which is better and different than task/cgroup) and uses the task/cgroup
> plugin to fence job into the specified memory, gpus, etc., thus combining
> the best of both pieces. I see you already have ConstrainCores=yes in
> cgroup.conf, which is fine.

Thanks, we'll test this.

> I'd also launch the steps with srun when possible. Could you please try that
> and see if things are improved?

We are looking into that, but so far, using srun has led to programs starting and running more slowly than using mpirun.

(In reply to Alejandro Sanchez from comment #7)
> TaskPlugin=task/affinity,task/cgroup in slurm.conf
> and
> TaskAffinity=no in cgroup.conf

We have tested this now, and it seems to work very well. The strange binding problems with IntelMPI's mpirun are gone. Also, for OpenMPI, using srun is just as fast as using mpirun. We will have to investigate more to find out why using srun with IntelMPI is somewhat slower (at least at startup) than using mpirun with IntelMPI, but at least we have a setup we can live with, and which is reasonably easy for the users. We've also set MpiDefault=pmix, so users don't have to specify this all the time. Thanks!

Glad to hear. Closing the bug.
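For reference, the configuration change agreed on in this thread amounts to the following fragments. Only the keys discussed in the ticket (TaskPlugin, TaskAffinity, ConstrainCores, MpiDefault) come from the report itself; the surrounding layout and comments are illustrative:

```
# slurm.conf -- bind tasks with task/affinity, contain them with task/cgroup
TaskPlugin=task/affinity,task/cgroup
MpiDefault=pmix        # optional; saves users from passing --mpi=pmix to srun

# cgroup.conf -- leave CPU binding to task/affinity, keep cgroup containment
TaskAffinity=no
ConstrainCores=yes
```

With this split, task/affinity computes and sets the per-task CPU masks while task/cgroup only fences the job into its allocated cores, memory, and devices.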
Created attachment 5385 [details]
C code for test program

We are running Slurm 17.02.7 on CentOS 7.4. We have observed a problem with MPI rank to CPU binding when running Intel MPI jobs using mpirun from within a Slurm allocation (mpirun -bootstrap slurm).

Our Slurm cgroups configuration on the compute nodes uses TaskAffinity=true (which is needed for OpenMPI). We have observed that in order for IntelMPI's mpirun to work, we need to prohibit Slurm from binding processes by setting SLURM_CPU_BIND=none. If we don't do that, all ranks are bound to the first CPU of a compute node.

When we do export SLURM_CPU_BIND=none, we get correct bindings on allocations of up to and including 127 compute nodes (32 ppn), i.e., the binding is done internally by the Intel MPI library. However, on allocations of 128 compute nodes and up, ranks from 1024 and up have incorrect binding: they are forced to run on the first CPU of each compute node only.

We tested this using the attached C program (affinity_test.c). We compile the program using the Intel Compiler, and we execute it as follows:

    $ salloc <...>
    $ module load intel/2017a
    $ mpiicc affinity_test.c -o affinity_test
    $ which mpirun
    /cluster/software/impi/2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27/bin64/mpirun
    $ mpirun --version
    Intel(R) MPI Library for Linux* OS, Version 2017 Update 1 Build 20161016 (id: 16418)
    [...]
    $ export SLURM_CPU_BIND=none
    $ mpirun ./affinity_test

For jobs of up to and including 127 compute nodes the program yields the following output (subsequent ranks are bound to subsequent cores of a compute node):

    rank 0/4096 ncpus 2 mask 1000000000000000000000000000000010000000000000000000000000000000
    rank 1/4096 ncpus 2 mask 0100000000000000000000000000000001000000000000000000000000000000
    rank 2/4096 ncpus 2 mask 0010000000000000000000000000000000100000000000000000000000000000
    [...]
    rank 4061/4064 ncpus 2 mask 0000000000000000000000000000010000000000000000000000000000000100
    rank 4062/4064 ncpus 2 mask 0000000000000000000000000000001000000000000000000000000000000010
    rank 4063/4064 ncpus 2 mask 0000000000000000000000000000000100000000000000000000000000000001

For jobs over 127 compute nodes the output is similar up to 1024 ranks, but is wrong afterwards (all ranks are bound to the two hardware threads of the first core only):

    rank 1020/4096 ncpus 2 mask 0000000000000000000000000000100000000000000000000000000000001000
    rank 1021/4096 ncpus 2 mask 0000000000000000000000000000010000000000000000000000000000000100
    rank 1022/4096 ncpus 2 mask 0000000000000000000000000000001000000000000000000000000000000010
    rank 1023/4096 ncpus 2 mask 0000000000000000000000000000000100000000000000000000000000000001
    rank 1024/4096 ncpus 1 mask 1000000000000000000000000000000000000000000000000000000000000000
    rank 1025/4096 ncpus 1 mask 0000000000000000000000000000000010000000000000000000000000000000
    rank 1026/4096 ncpus 1 mask 1000000000000000000000000000000000000000000000000000000000000000

Could you help us solve this issue?