Bug 4268 - IntelMPI jobs bound to wrong cpus when --nodes >= 128
Summary: IntelMPI jobs bound to wrong cpus when --nodes >= 128
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
 
Reported: 2017-10-17 03:18 MDT by Bjørn-Helge Mevik
Modified: 2017-10-18 07:10 MDT
CC List: 1 user

Site: Sigma2 Norway


Attachments
C code for test program (796 bytes, text/x-csrc), 2017-10-17 03:18 MDT, Bjørn-Helge Mevik
Main slurm config file (4.56 KB, text/plain), 2017-10-17 07:22 MDT, Bjørn-Helge Mevik
Nodes config file (1.44 KB, text/plain), 2017-10-17 07:23 MDT, Bjørn-Helge Mevik
cgroup config file (803 bytes, text/plain), 2017-10-17 07:23 MDT, Bjørn-Helge Mevik

Description Bjørn-Helge Mevik 2017-10-17 03:18:08 MDT
Created attachment 5385 [details]
C code for test program

We are running Slurm 17.02.7 on CentOS 7.4.

We have observed a problem with MPI rank-to-CPU binding when running Intel MPI jobs using mpirun from within a Slurm allocation (mpirun -bootstrap slurm). Our Slurm cgroup configuration on the compute nodes uses TaskAffinity=true (which is needed for OpenMPI). We have found that in order for Intel MPI's mpirun to work, we need to prevent Slurm from binding processes by setting SLURM_CPU_BIND=none; if we don't, all ranks are bound to the first CPU of each compute node. When we do export SLURM_CPU_BIND=none, we get correct bindings on allocations of up to and including 127 compute nodes (32 ppn), i.e., the binding is done internally by the Intel MPI library.

However, on allocations of 128 or more compute nodes, ranks 1024 and above get incorrect bindings: they are forced onto the first CPU of each compute node only. We tested this with the attached C program (affinity_test.c).
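
For reference, the gist of such a test is just to print each rank's affinity mask. The following is a minimal sketch only (the actual attachment may differ), assuming MPI plus the Linux sched_getaffinity(2) call, and printing one character per configured CPU in the same format as the output shown below:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, nset = 0, pos = 0;
    long ncpus;
    cpu_set_t mask;
    char bits[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* CPUs configured on this node (64 in the output below). */
    ncpus = sysconf(_SC_NPROCESSORS_CONF);

    /* Affinity mask currently applied to this MPI task. */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_getaffinity");

    /* One character per CPU: '1' if this task may run there. */
    for (i = 0; i < ncpus && pos < (int)sizeof(bits) - 1; i++) {
        if (CPU_ISSET(i, &mask)) {
            bits[pos++] = '1';
            nset++;
        } else {
            bits[pos++] = '0';
        }
    }
    bits[pos] = '\0';

    printf("rank %d/%d ncpus %d mask\n%s\n", rank, size, nset, bits);

    MPI_Finalize();
    return 0;
}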

We compile the program using Intel Compiler, and we execute it as follows:

$ salloc <...>
$ module load intel/2017a
$ mpiicc affinity_test.c -o affinity_test
$ which mpirun
/cluster/software/impi/2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27/bin64/mpirun
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2017 Update 1 Build 20161016 (id: 16418)
[...]
$ export SLURM_CPU_BIND=none
$ mpirun ./affinity_test

For jobs of up to and including 127 compute nodes the program yields the following output (successive ranks are bound to successive cores of each compute node):

rank 0/4096 ncpus 2 mask
1000000000000000000000000000000010000000000000000000000000000000
rank 1/4096 ncpus 2 mask
0100000000000000000000000000000001000000000000000000000000000000
rank 2/4096 ncpus 2 mask
0010000000000000000000000000000000100000000000000000000000000000
[...]
rank 4061/4064 ncpus 2 mask
0000000000000000000000000000010000000000000000000000000000000100
rank 4062/4064 ncpus 2 mask
0000000000000000000000000000001000000000000000000000000000000010
rank 4063/4064 ncpus 2 mask
0000000000000000000000000000000100000000000000000000000000000001

For jobs on more than 127 compute nodes the output is similar for the first 1024 ranks, but wrong afterwards (all ranks from 1024 on are confined to the two hardware threads of the first core only):

rank 1020/4096 ncpus 2 mask
0000000000000000000000000000100000000000000000000000000000001000
rank 1021/4096 ncpus 2 mask
0000000000000000000000000000010000000000000000000000000000000100
rank 1022/4096 ncpus 2 mask
0000000000000000000000000000001000000000000000000000000000000010
rank 1023/4096 ncpus 2 mask
0000000000000000000000000000000100000000000000000000000000000001
rank 1024/4096 ncpus 1 mask
1000000000000000000000000000000000000000000000000000000000000000
rank 1025/4096 ncpus 1 mask
0000000000000000000000000000000010000000000000000000000000000000
rank 1026/4096 ncpus 1 mask
1000000000000000000000000000000000000000000000000000000000000000

Could you help us to solve this issue?
Comment 1 Alejandro Sanchez 2017-10-17 04:28:13 MDT
Hi. Would you mind attaching your slurm.conf and cgroup.conf files? I'd also be interested in the complete salloc request you executed for your test. Thanks.
Comment 2 Alejandro Sanchez 2017-10-17 04:30:02 MDT
I'm lowering the severity to 3 if you don't mind, since this isn't a high-impact problem that is causing sporadic outages. See:
https://www.schedmd.com/support.php
Comment 3 Bjørn-Helge Mevik 2017-10-17 07:22:37 MDT
Created attachment 5387 [details]
Main slurm config file
Comment 4 Bjørn-Helge Mevik 2017-10-17 07:23:04 MDT
Created attachment 5388 [details]
Nodes config file
Comment 5 Bjørn-Helge Mevik 2017-10-17 07:23:22 MDT
Created attachment 5389 [details]
cgroup config file
Comment 6 Bjørn-Helge Mevik 2017-10-17 07:39:46 MDT
(In reply to Alejandro Sanchez from comment #1)
> Hi. Would you mind attaching your slurm.conf and cgroup.conf files? I'd also
> be interested in the complete salloc request you executed for your test.

Sure! Config files added now (slurm.conf, slurmnodes.conf and cgroups.conf).

The salloc command we used (with a varying number of nodes in the list), was:

salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodelist=c[15-18,20-25,27-36]-[1-12],c38-[1-12],c19-[1-4]

(this was with 256 nodes).  We have also tried --nodes instead, like this:

salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodes=128

with the same results.

We have also discovered a strange warning that we get only when specifying from 33 to 127 nodes (inclusive), for instance:

319 (0) $ salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodes=33
salloc: Granted job allocation 8028
336 (0) $ module load intel/2017a
337 (0) $ mpirun hostname > _log
srun: Warning: can't honor --ntasks-per-node set to 32 which doesn't match the requested tasks 33 with the number of requested nodes 33. Ignoring --ntasks-per-node.
338 (0) $ wc -l _log
1056 _log

but the number of lines in _log is always correct (32 * --nodes).  The warning goes away when --nodes <= 32 or --nodes >= 128.
Comment 7 Alejandro Sanchez 2017-10-17 07:40:41 MDT
Thanks for the attached files. Looking at your configuration, I'd suggest switching to:

TaskPlugin=task/affinity,task/cgroup in slurm.conf
and
TaskAffinity=no in cgroup.conf

This uses the task/affinity plugin to set task affinity (which it does better than task/cgroup) and the task/cgroup plugin to fence jobs into the specified memory, GPUs, etc., thus combining the best of both plugins. I see you already have ConstrainCores=yes in cgroup.conf, which is fine.
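
For illustration, the suggested change amounts to excerpts along these lines (only the parameters mentioned above; all other settings stay at the site's current values):

slurm.conf:
    TaskPlugin=task/affinity,task/cgroup

cgroup.conf:
    ConstrainCores=yes
    TaskAffinity=no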

I'd also launch the steps with srun when possible. Could you please try that and see if things are improved?
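
A possible launch sketch for that (the --mpi plugin value and the PMI library path below are assumptions; they depend on the Intel MPI build and on where the site installs Slurm's PMI library):

$ salloc --account=nn9999k --time=10:0:0 --ntasks-per-node=32 --nodes=128
$ export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so   # hypothetical path to Slurm's PMI library
$ srun --mpi=pmi2 ./affinity_test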
Comment 8 Bjørn-Helge Mevik 2017-10-17 07:50:23 MDT
(In reply to Alejandro Sanchez from comment #7)
> Thanks for the attached files. Looking at your configuration, I'd suggest
> switching to:
> 
> TaskPlugin=task/affinity,task/cgroup in slurm.conf
> and
> TaskAffinity=no in cgroup.conf
> 
> This uses the task/affinity plugin to set task affinity (which it does
> better than task/cgroup) and the task/cgroup plugin to fence jobs into
> the specified memory, GPUs, etc., thus combining the best of both
> plugins. I see you already have ConstrainCores=yes in cgroup.conf,
> which is fine.

Thanks, we'll test this.
 
> I'd also launch the steps with srun when possible. Could you please try that
> and see if things are improved?

We are looking into that, but so far, using srun has led to programs starting and running slower than using mpirun.
Comment 9 Bjørn-Helge Mevik 2017-10-18 07:08:41 MDT
(In reply to Alejandro Sanchez from comment #7)

> TaskPlugin=task/affinity,task/cgroup in slurm.conf
> and
> TaskAffinity=no in cgroup.conf

We have tested this now, and it seems to work very well.  The strange binding problems with IntelMPI's mpirun are gone.  Also, for OpenMPI, using srun is just as fast as using mpirun.

We will have to investigate further to find out why srun with Intel MPI is somewhat slower (at least at startup) than mpirun with Intel MPI, but at least we have a setup we can live with, and one that is reasonably easy for the users.

We've also set MpiDefault=pmix, so users don't have to specify this all the time.
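
For context, setting MpiDefault=pmix in slurm.conf simply changes what a plain srun does; the application name below is a placeholder:

# without MpiDefault=pmix, users would have to spell out the plugin:
$ srun --mpi=pmix ./my_mpi_app
# with MpiDefault=pmix set, a plain srun is enough:
$ srun ./my_mpi_app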

Thanks!
Comment 10 Alejandro Sanchez 2017-10-18 07:10:20 MDT
Glad to hear. Closing the bug.