Ticket 1049 - Core binding with srun vs mpirun
Summary: Core binding with srun vs mpirun
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.03.6
Hardware: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-08-18 12:45 MDT by Kilian Cavalotti
Modified: 2014-11-28 02:05 MST
CC: 2 users

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Kilian Cavalotti 2014-08-18 12:45:47 MDT
Hi,

I'm experimenting with core binding and observe different behavior depending on whether a task is executed through srun or mpirun. I know that the recommended way is to use srun, but we have users who like mpirun very much and want to be able to use specific mpirun options, so I'm trying to support both ways.

We're using Open MPI 1.6.5 (Slurm integration works fine), and the Slurm config contains:
>> slurm.conf:
TaskPlugin=task/cgroup
>> cgroup.conf:
TaskAffinity=yes # requires hwloc
ConstrainCores=yes


When I submit using srun, I get correct task binding (i.e. each MPI process gets its own core):

$ salloc -n 4 --ntasks-per-node=2 --cpu_bind=verbose 
salloc: Granted job allocation 180663

$ srun bash -c "cat /proc/self/status | grep Cpus_allowed_list"
cpu_bind=NULL - sh-0-2, task  3  1 [38121]: mask 0x4
cpu_bind=NULL - sh-0-2, task  2  0 [38120]: mask 0x1
cpu_bind=NULL - sh-0-1, task  0  0 [38575]: mask 0x1
cpu_bind=NULL - sh-0-1, task  1  1 [38576]: mask 0x4
Cpus_allowed_list:      2
Cpus_allowed_list:      0
Cpus_allowed_list:      0
Cpus_allowed_list:      2


But when I use mpirun instead, all the processes seem constrained to the same core on each node:

$ mpirun --display-map bash -c "cat /proc/self/status | grep Cpus_allowed_list"

 ========================   JOB MAP   ========================

 Data for node: sh-0-1  Num procs: 2
        Process OMPI jobid: [56586,1] Process rank: 0
        Process OMPI jobid: [56586,1] Process rank: 1

 Data for node: sh-0-2  Num procs: 2
        Process OMPI jobid: [56586,1] Process rank: 2
        Process OMPI jobid: [56586,1] Process rank: 3

 =============================================================
Cpus_allowed_list:      0
Cpus_allowed_list:      0
Cpus_allowed_list:      0
Cpus_allowed_list:      0


Could you please advise on why this happens? 

Thanks!
Comment 1 David Bigagli 2014-08-19 05:23:01 MDT
Hi,
  if you don't use srun, then Slurm has no control over how mpirun binds tasks to cores, as this is mpirun-specific. You may want to look into the mpirun-specific options -bind-to-core and -bind-to-socket.
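For example, something along these lines should show what binding mpirun applies by itself (./your_app is just a placeholder here, and --report-bindings simply prints the binding mpirun chose):

$ mpirun --bind-to-core --report-bindings ./your_app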

David
Comment 2 Kilian Cavalotti 2014-08-19 05:38:09 MDT
Hi David, 

(In reply to David Bigagli from comment #1)
>   if you don't use srun then Slurm has no control over how mpirun binds
> tasks to cores as this is mpirun specific. You may want to look into mpirun
> specific options -bind-to-core and bind-to-socket.

The thing is that, on the contrary, it seems mpirun-launched processes *are* bound to a CPU core even though no CPU-binding option is used. I would expect Cpus_allowed_list to contain all the CPUs Slurm allocated on the nodes, not just 0.

If I try the --bind-to-core option, mpirun tells me that it does not get enough CPUs on the nodes to do it:

$ salloc -n 4 --ntasks-per-node=2 --cpu_bind=verbose
salloc: Granted job allocation 180728
$ mpirun --bind-to-core  --display-map bash -c "cat /proc/self/status | grep Cpus_allowed_list"

 ========================   JOB MAP   ========================

 Data for node: sh-1-9  Num procs: 2
        Process OMPI jobid: [39315,1] Process rank: 0
        Process OMPI jobid: [39315,1] Process rank: 1

 Data for node: sh-1-11 Num procs: 2
        Process OMPI jobid: [39315,1] Process rank: 2
        Process OMPI jobid: [39315,1] Process rank: 3

 =============================================================
--------------------------------------------------------------------------
Not enough processors were found on the local host to meet the requested
binding action:

  Local host:        sh-1-9
  Action requested:  bind-to-core
  Application name:  /bin/bash

Please revise the request and try again.
--------------------------------------------------------------------------
Cpus_allowed_list:      0
Cpus_allowed_list:      0
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error
on node sh-1-9. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start
Comment 3 David Bigagli 2014-08-19 05:46:11 MDT
I don't know how mpirun binds to cores by itself; I assume it uses processor affinity. I will give it a try.

I don't think mpirun reads the environment variables set by Slurm. On the other hand, if Open MPI is compiled with the Slurm option then it will integrate with the srun launch mechanism.

David
Comment 4 Kilian Cavalotti 2014-08-19 05:50:21 MDT
(In reply to David Bigagli from comment #3)
> I don't know how mpirun binds to cores by itself, I assume it uses the
> processor affinity. I will give it a try.

Thanks.

> I don't think mpirun reads the environment variables set by Slurm.

It does, since just running "mpirun --display-map" showed that it spawned 2 processes on each of the 2 nodes that "salloc -n 4 --ntasks-per-node=2" allocated.

> Ont he
> other hand if Open MPI is compiled with the Slurm option then it will
> integrate with srun launch mechanism.

I know that srun is the preferred way and that it works; I'm just trying to understand why the regular mpirun way doesn't, because https://www.open-mpi.org/faq/?category=slurm says it should work (although it doesn't say anything about CPU binding).

Thanks for looking into this!
Comment 5 David Bigagli 2014-08-19 05:53:51 MDT
I see, you are right, mpirun is Slurm-aware. Why doesn't it bind correctly then? :-)
Let me look into this and get back to you.

David
Comment 6 David Bigagli 2014-08-19 08:14:19 MDT
Hi,
  this is somehow related to cpusets and mpirun. If I use TaskPlugin=task/affinity,
I get consistent results from your two experiments:

david@prometeo ~/slurm/work>salloc -n 4 --ntasks-per-node=2 --cpu_bind=verbose
salloc: Granted job allocation 83806
83806->david@prometeo ~/slurm/work>srun bash -c "cat /proc/self/status | grep Cpus_allowed_list"
cpu_bind=MASK - prometeo, task  3  1 [26909]: mask 0x3 set
cpu_bind=MASK - prometeo, task  0  0 [26911]: mask 0x3 set
Cpus_allowed_list:      0-1
cpu_bind=MASK - prometeo, task  2  0 [26908]: mask 0x3 set
cpu_bind=MASK - prometeo, task  1  1 [26912]: mask 0x3 set
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
83806->david@prometeo ~/slurm/work> mpirun --display-map bash -c "cat /proc/self/status | grep Cpus_allowed_list"

 ========================   JOB MAP   ========================

 Data for node: regor1  Num procs: 2
        Process OMPI jobid: [60459,1] Process rank: 0
        Process OMPI jobid: [60459,1] Process rank: 1

 Data for node: regor2  Num procs: 2
        Process OMPI jobid: [60459,1] Process rank: 2
        Process OMPI jobid: [60459,1] Process rank: 3

 =============================================================
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1

with OpenMPI 1.6.5 and 1.7.3.

When I switch to TaskPlugin=task/cgroup, I see exactly what you see:
all tasks are constrained to CPU 0.
When I look at the cgroup hierarchy I can see it is set up correctly,
with 2 tasks on each of the 2 nodes and 2 CPUs (0-1) per node, and the correct
process IDs in the tasks file, yet all 4 tasks run on CPU 0.

David
Comment 7 David Bigagli 2014-08-19 08:22:20 MDT
There is some information about this, although from a few years back:

https://groups.google.com/forum/#!msg/slurm-devel/P4rZFx2gIIQ/TxacDULLE2oJ

David
Comment 8 Kilian Cavalotti 2014-08-19 09:10:33 MDT
(In reply to David Bigagli from comment #6)
> When I switch to TaskPlugin=task/cgroup then I see exactly what you see,
> all tasks are constrained to CPU 0.
> When I look at the cgroup hierarchy I can see it is set up correctly
> with 2 tasks on 2 nodes each having 2 CPUS each, 0-1, with the correct
> process ids in the tasks file yet all 4 tasks run on cpu 0.

So are you saying that's a bug with the task/cgroup plugin?
The 2009 conversation you pointed to does not seem to relate to cgroups.
Comment 9 David Bigagli 2014-08-19 09:19:33 MDT
I think it has to do with how mpirun uses srun to launch the orted daemon.
I understand that cgroup uses cpusets, so I was assuming the problems were related.
I don't have a definitive conclusion yet; I was just reporting what I found. The first workaround for now is to use task affinity instead of cgroups.
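That is, switching the core binding over to the affinity plugin in slurm.conf:

TaskPlugin=task/affinity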

David
Comment 10 Kilian Cavalotti 2014-08-19 09:29:48 MDT
(In reply to David Bigagli from comment #9)
> I think it has to do with how mpirun uses srun to launch the orted daemon.
> I understand that cgroup uses cpusets so I was assuming the problems were
> related.
> I don't have a definitive conclusion yet I was just telling what I found.
> The first workaround for now is to use task affinity instead of cgroups.

I see, thanks for the clarification.
We kind of need the cgroups task plugin to enforce memory limits, and it works great when users submit MPI jobs with srun instead of mpirun.

But for the sake of completeness, we would also like to support our users who prefer using mpirun (because it offers them a larger variety of tunables), and allow them to submit jobs that way, while not being constrained to just one core.

Thanks for looking into this!
Comment 11 David Bigagli 2014-08-19 09:33:54 MDT
Yes, I understand the use case.
Another possibility is to stack the plugins so that we use the affinity plugin
to bind to cores and the cgroup plugin for memory. I am putting together an example
for you.

David
Comment 12 David Bigagli 2014-08-19 09:56:46 MDT
Here is the configuration I tested.

In slurm.conf:

TaskPlugin=task/affinity,task/cgroup

In cgroup.conf:

TaskAffinity=no

In this way the cores are managed by the affinity plugin, while the memory is managed by the cgroup plugin.
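For completeness, since the cgroup plugin should keep enforcing your memory limits, cgroup.conf would look roughly like this (the ConstrainRAMSpace line is only an illustration of the memory side; keep whatever you already have there):

TaskAffinity=no        # CPU binding is left to the task/affinity plugin
ConstrainCores=yes     # keep the cpuset fence around the allocated cores
ConstrainRAMSpace=yes  # enforce the job memory limits through the memory cgroup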

David
Comment 13 Kilian Cavalotti 2014-08-19 10:42:38 MDT
(In reply to David Bigagli from comment #12)
> Here is the configuration I tested.
> 
> In slurm.conf:
> 
> TaskPlugin=task/affinity,task/cgroup
> 
> In cgroup.conf:
> 
> TaskAffinity=no
> 
> In this way the cores are being managed by the affinity plugin while the
> memory by the cgroup.

Thanks, that works great!

If you ever get to the bottom of this, I'm still interested in the reason why it doesn't work as expected with affinity managed by the cgroup plugin.

Thanks!
Comment 14 David Bigagli 2014-08-19 11:19:19 MDT
Very good.

David
Comment 15 Kilian Cavalotti 2014-08-19 12:36:39 MDT
(In reply to David Bigagli from comment #6)

> When I switch to TaskPlugin=task/cgroup then I see exactly what you see,
> all tasks are constrained to CPU 0.
> When I look at the cgroup hierarchy I can see it is set up correctly
> with 2 tasks on 2 nodes each having 2 CPUS each, 0-1, with the correct
> process ids in the tasks file yet all 4 tasks run on cpu 0.

I observe the same thing, but then it seems that the processes' affinity is set *on top* of the cgroup CPU binding.

For instance, using mpirun and the original cgroup-only settings, I get:

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  9492  9491  9491 ?           -1 Sl       0   0:00 slurmstepd: [181597.4]
 9492  9498  9498  9491 ?           -1 S    215845   0:00  \_ /share/sw/free/openmpi/1.6.5/intel/13sp1up1/bin/orted -mca ess slurm -m
 9498  9502  9498  9491 ?           -1 S    215845   0:00      \_ sleep 1000
 9498  9503  9498  9491 ?           -1 S    215845   0:00      \_ sleep 1000

And you're right, the cgroup is set correctly:
# cd /cgroup/cpuset/slurm/uid_215845/job_181597/step_4
# cat cgroup.procs
9492
9498
9502
9503
# cat cpuset.cpus
0,2

BUT, if I try to get the current affinity of processes, it's different:
[root@sh-0-1 step_4]# taskset -p 9492
pid 9492's current affinity mask: 5
[root@sh-0-1 step_4]# taskset -p 9498
pid 9498's current affinity mask: 1
[root@sh-0-1 step_4]# taskset -p 9503
pid 9503's current affinity mask: 1
[root@sh-0-1 step_4]# taskset -p 9502
pid 9502's current affinity mask: 1

So it looks like something sets a task affinity on top of the cgroup CPU bindings.

Moreover, I can set the task affinity to what it's supposed to be:
# taskset -pc 0,2 9502
pid 9502's current affinity list: 0
pid 9502's new affinity list: 0,2
# taskset -p 9502
pid 9502's current affinity mask: 5

But if I try to overstep the cgroup CPU binding, it fails and only allows the CPUs that the cgroup permits:
# taskset -pc 0,2,4,5 9502
pid 9502's current affinity list: 0,2
pid 9502's new affinity list: 0,2

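For reference, here is the same check as a quick loop over every process in the step's cpuset cgroup (same paths as above, run as root on the compute node; just a convenience sketch):

cd /cgroup/cpuset/slurm/uid_215845/job_181597/step_4
cat cpuset.cpus                      # CPUs the cgroup allows (0,2 here)
for pid in $(cat cgroup.procs); do
    taskset -cp "$pid"               # scheduler affinity actually set on each process
done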
That's why I'm under the impression that there may be an extra task-affinity step going on that may not be necessary inside cgroups.
Comment 16 David Bigagli 2014-08-20 05:24:44 MDT
Hi Kilian,
          thanks for your analysis. I just wonder: does the solution work for you, or did you find a problem with it?

David
Comment 17 Kilian Cavalotti 2014-08-20 05:30:09 MDT
Hi David, 

(In reply to David Bigagli from comment #16)
> Hi Kilian,
>           thanks for your analysis. I just wonder does the solution works
> for you or you found a problem with it?

The workaround you proposed (stacking two task plugins, task/cgroup for memory-limit enforcement and task/affinity for CPU binding) works well, but I'm not entirely satisfied with it. It seems like CPU binding should work with the task/cgroup plugin alone. That's why I'm trying to understand what's going on here, and I believe it could be beneficial for other users too.
Comment 18 David Bigagli 2014-08-20 05:36:59 MDT
Thanks, I understand now. It is on my list to better understand how mpirun
under Slurm interacts with cgroups. I will let you know my findings.

David
Comment 19 Kilian Cavalotti 2014-08-20 05:59:00 MDT
(In reply to David Bigagli from comment #18)
> Thanks I understand now. It is on my list to understand better how the mpirun
> with Slurm interacts with cgroups. Will let you know my findings.

That would be great, thank you!
Comment 20 David Bigagli 2014-08-20 08:55:34 MDT
Hi,
   so the problem is indeed described in that thread, in the first email.
Since mpirun uses srun to launch the orted, orted is bound to the first core
on each machine and so are its children. 

Open MPI 1.8.1 fixes the issue by adding --cpu_bind=none to the srun options.
I installed and tried it, and now it works as expected.

In 1.7 this is the srun executed by mpirun:

david    22230 22229  0 13:18 pts/11   00:00:00 srun --kill-on-bad-exit --ntasks=2 orted -mca orte_ess_jobid 3543072768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3543072768.0;tcp://192.168.1.78:59415" -mca oob tcp
david    22233 22230  0 13:18 pts/11   00:00:00 srun --kill-on-bad-exit --ntasks=2 orted -mca orte_ess_jobid 3543072768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3543072768.0;tcp://192.168.1.78:59415" -mca oob tcp

this is in 1.8.1:

david    28431 28429  0 13:39 pts/11   00:00:00 srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --ntasks=2 orted -mca orte_ess_jobid 3942055936 
                                                                                            ^^^^^^^^^^^^^^^
-mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3942055936.0;tcp://192.168.1.78:51381"
david    28434 28431  0 13:39 pts/11   00:00:00 srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --ntasks=2 orted -mca orte_ess_jobid 3942055936 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3942055936.0;tcp://192.168.1.78:51381"

There is even a nice comment in the OMPI code
openmpi-1.8.1/orte/mca/plm/slurm/plm_slurm_module.c

   /* ensure the orteds are not bound to a single processor,
     * just in case the TaskAffinity option is set by default.
     * This will *not* release the orteds from any cpu-set
     * constraint, but will ensure it doesn't get
     * bound to only one processor
     */
    opal_argv_append(&argc, &argv, "--cpu_bind=none");

David
Comment 21 Kilian Cavalotti 2014-08-20 09:06:14 MDT
(In reply to David Bigagli from comment #20)
> Hi,
>    so the problem is indeed described in that thread, in the first email.
> Since mpirun uses srun to launch the orted, orted is bound to the first core
> on each machine and so are its children. 
> 
> Version 1.8.1 fixes the issue by adding --cpu_bind=none to the srun options.
> I installed it and tried it and now it works as expected.

> There is even a nice comment in the OMPI code
> openmpi-1.8.1/orte/mca/plm/slurm/plm_slurm_module.c

Oh, excellent, I understand now. Thanks a lot for getting to the bottom of this, much appreciated.

So I'll be able to advise users to move to Open MPI 1.8 if they prefer to use mpirun.

Thanks!
Comment 22 Rémi Palancher 2014-11-28 02:05:51 MST
There's another workaround when you're stuck on old versions of Open MPI: you can also use the MCA parameter plm_slurm_args to add the srun parameter yourself, like this:

$ mpirun --mca plm_slurm_args '--cpu_bind=none'

It is also possible to set it in the openmpi-mca-params.conf file or via an environment variable.
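For example (the file locations below are the usual Open MPI defaults, and ./my_app is just a placeholder):

# system-wide in $prefix/etc/openmpi-mca-params.conf, or per-user in ~/.openmpi/mca-params.conf:
plm_slurm_args = --cpu_bind=none

# or per job, through the corresponding environment variable:
$ export OMPI_MCA_plm_slurm_args='--cpu_bind=none'
$ mpirun ./my_app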