Hi,

I'm experimenting with core binding and observe different behavior when a task is executed through srun vs. mpirun. I know that the recommended way is to use srun, but we have users who like mpirun very much and want to be able to use specific mpirun options, so I'm trying to let them use either way. We're using Open MPI 1.6.5 (Slurm integration works fine), and the Slurm config contains:

>> slurm.conf:
TaskPlugin=task/cgroup

>> cgroup.conf:
TaskAffinity=yes      # requires hwloc
ConstrainCores=yes

When I submit using srun, I get correct task binding (i.e. each MPI process gets its own core):

$ salloc -n 4 --ntasks-per-node=2 --cpu_bind=verbose
salloc: Granted job allocation 180663
$ srun bash -c "cat /proc/self/status | grep Cpus_allowed_list"
cpu_bind=NULL - sh-0-2, task  3  1 [38121]: mask 0x4
cpu_bind=NULL - sh-0-2, task  2  0 [38120]: mask 0x1
cpu_bind=NULL - sh-0-1, task  0  0 [38575]: mask 0x1
cpu_bind=NULL - sh-0-1, task  1  1 [38576]: mask 0x4
Cpus_allowed_list:      2
Cpus_allowed_list:      0
Cpus_allowed_list:      0
Cpus_allowed_list:      2

But when I use mpirun instead, all the processes seem constrained to the same core on each node:

$ mpirun --display-map bash -c "cat /proc/self/status | grep Cpus_allowed_list"

 ========================   JOB MAP   ========================

 Data for node: sh-0-1  Num procs: 2
        Process OMPI jobid: [56586,1] Process rank: 0
        Process OMPI jobid: [56586,1] Process rank: 1

 Data for node: sh-0-2  Num procs: 2
        Process OMPI jobid: [56586,1] Process rank: 2
        Process OMPI jobid: [56586,1] Process rank: 3

 =============================================================
Cpus_allowed_list:      0
Cpus_allowed_list:      0
Cpus_allowed_list:      0
Cpus_allowed_list:      0

Could you please advise on why this happens? Thanks!
Hi,
if you don't use srun, then Slurm has no control over how mpirun binds tasks to cores, as the binding is mpirun-specific. You may want to look into the mpirun-specific options --bind-to-core and --bind-to-socket.

David
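For illustration, a minimal sketch of what such invocations could look like (the --report-bindings flag and the executable name are placeholders for the example, not taken from this ticket):

$ mpirun -np 4 --bind-to-core   --report-bindings ./my_app    # one core per rank
$ mpirun -np 4 --bind-to-socket --report-bindings ./my_app    # one socket per rank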
Hi David,

(In reply to David Bigagli from comment #1)
> if you don't use srun, then Slurm has no control over how mpirun binds
> tasks to cores, as the binding is mpirun-specific. You may want to look
> into the mpirun-specific options --bind-to-core and --bind-to-socket.

The thing is that, on the contrary, it seems mpirun-launched processes *are* bound to a CPU core even though no CPU-binding option is used. I would expect Cpus_allowed_list to contain all the CPUs Slurm allocated on the nodes, not just 0.

If I try the --bind-to-core option, mpirun tells me that it doesn't get enough CPUs on the nodes to do it:

$ salloc -n 4 --ntasks-per-node=2 --cpu_bind=verbose
salloc: Granted job allocation 180728
$ mpirun --bind-to-core --display-map bash -c "cat /proc/self/status | grep Cpus_allowed_list"

 ========================   JOB MAP   ========================

 Data for node: sh-1-9   Num procs: 2
        Process OMPI jobid: [39315,1] Process rank: 0
        Process OMPI jobid: [39315,1] Process rank: 1

 Data for node: sh-1-11  Num procs: 2
        Process OMPI jobid: [39315,1] Process rank: 2
        Process OMPI jobid: [39315,1] Process rank: 3

 =============================================================
--------------------------------------------------------------------------
Not enough processors were found on the local host to meet the requested
binding action:

  Local host:        sh-1-9
  Action requested:  bind-to-core
  Application name:  /bin/bash

Please revise the request and try again.
--------------------------------------------------------------------------
Cpus_allowed_list:      0
Cpus_allowed_list:      0
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error on node sh-1-9. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start
I don't know how mpirun binds to cores by itself; I assume it uses processor affinity. I will give it a try. I don't think mpirun reads the environment variables set by Slurm. On the other hand, if Open MPI is compiled with the Slurm option then it will integrate with the srun launch mechanism.

David
(In reply to David Bigagli from comment #3)
> I don't know how mpirun binds to cores by itself; I assume it uses
> processor affinity. I will give it a try.

Thanks.

> I don't think mpirun reads the environment variables set by Slurm.

It does, since just running "mpirun --display-map" showed that it spawned 2 processes on each of the 2 nodes that "salloc -n 4 --ntasks-per-node=2" allocated.

> On the other hand, if Open MPI is compiled with the Slurm option then it
> will integrate with the srun launch mechanism.

I know that srun is the preferred way and that it works; I'm just trying to understand why the regular mpirun way doesn't, because https://www.open-mpi.org/faq/?category=slurm says it should work (although it doesn't say anything about CPU binding).

Thanks for looking into this!
I see, you are right, mpirun is Slurm-aware. Why doesn't it bind correctly then? :-) Let me look into this and get back to you.

David
Hi,
this is somehow related to cpusets and mpirun. If I use TaskPlugin=task/affinity, I get consistent results from your 2 experiments:

david@prometeo ~/slurm/work>salloc -n 4 --ntasks-per-node=2 --cpu_bind=verbose
salloc: Granted job allocation 83806
83806->david@prometeo ~/slurm/work>srun bash -c "cat /proc/self/status | grep Cpus_allowed_list"
cpu_bind=MASK - prometeo, task  3  1 [26909]: mask 0x3 set
cpu_bind=MASK - prometeo, task  0  0 [26911]: mask 0x3 set
Cpus_allowed_list:      0-1
cpu_bind=MASK - prometeo, task  2  0 [26908]: mask 0x3 set
cpu_bind=MASK - prometeo, task  1  1 [26912]: mask 0x3 set
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1

83806->david@prometeo ~/slurm/work> mpirun --display-map bash -c "cat /proc/self/status | grep Cpus_allowed_list"

 ========================   JOB MAP   ========================

 Data for node: regor1  Num procs: 2
        Process OMPI jobid: [60459,1] Process rank: 0
        Process OMPI jobid: [60459,1] Process rank: 1

 Data for node: regor2  Num procs: 2
        Process OMPI jobid: [60459,1] Process rank: 2
        Process OMPI jobid: [60459,1] Process rank: 3

 =============================================================
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1
Cpus_allowed_list:      0-1

This is with Open MPI 1.6.5 and 1.7.3.

When I switch to TaskPlugin=task/cgroup, I see exactly what you see: all tasks are constrained to CPU 0. When I look at the cgroup hierarchy, I can see it is set up correctly, with 2 tasks on each of the 2 nodes, each having 2 CPUs (0-1), and with the correct process IDs in the tasks file, yet all 4 tasks run on CPU 0.

David
There is some information about this, although from a few years back:

https://groups.google.com/forum/#!msg/slurm-devel/P4rZFx2gIIQ/TxacDULLE2oJ

David
(In reply to David Bigagli from comment #6)
> When I switch to TaskPlugin=task/cgroup, I see exactly what you see:
> all tasks are constrained to CPU 0. When I look at the cgroup hierarchy,
> I can see it is set up correctly, with 2 tasks on each of the 2 nodes,
> each having 2 CPUs (0-1), and with the correct process IDs in the tasks
> file, yet all 4 tasks run on CPU 0.

So are you saying that's a bug in the task/cgroup plugin? The 2009 conversation you pointed to does not seem to relate to cgroups.
I think it has to do with how mpirun uses srun to launch the orted daemon. I understand that the cgroup plugin uses cpusets, so I was assuming the problems were related. I don't have a definitive conclusion yet; I was just reporting what I found. The first workaround for now is to use task affinity instead of cgroups.

David
(In reply to David Bigagli from comment #9)
> I think it has to do with how mpirun uses srun to launch the orted daemon.
> I understand that the cgroup plugin uses cpusets, so I was assuming the
> problems were related. I don't have a definitive conclusion yet; I was
> just reporting what I found. The first workaround for now is to use task
> affinity instead of cgroups.

I see, thanks for the clarification. We do need the cgroup task plugin to enforce memory limits, and it works great when users submit MPI jobs with srun instead of mpirun. But for the sake of completeness, we would also like to support the users who prefer mpirun (because it offers them a larger variety of tunables) and allow them to submit jobs that way, without their processes being constrained to a single core.

Thanks for looking into this!
Yes, I understand the use case. Another possibility is to stack the plugins, so that we use task/affinity to bind to cores and task/cgroup for memory. I am putting together an example for you.

David
Here is the configuration I tested.

In slurm.conf:

TaskPlugin=task/affinity,task/cgroup

In cgroup.conf:

TaskAffinity=no

In this way the cores are managed by the affinity plugin while the memory is managed by the cgroup plugin.

David
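For completeness, a sketch of what the memory-enforcement side of cgroup.conf could look like under this stacked setup (the specific Constrain* settings below are illustrative assumptions, not something prescribed in this ticket):

cgroup.conf:
  TaskAffinity=no          # core binding is handled by task/affinity instead
  ConstrainCores=yes       # keep tasks inside the allocated cores via cpuset
  ConstrainRAMSpace=yes    # enforce the job's memory limit via the memory cgroup
  ConstrainSwapSpace=yes   # also cap swap usage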
(In reply to David Bigagli from comment #12)
> Here is the configuration I tested.
>
> In slurm.conf:
>
> TaskPlugin=task/affinity,task/cgroup
>
> In cgroup.conf:
>
> TaskAffinity=no
>
> In this way the cores are managed by the affinity plugin while the memory
> is managed by the cgroup plugin.

Thanks, that works great! If you ever get to the bottom of this, I'm still interested in the reason why it doesn't work as expected when affinity is managed by the cgroup plugin. Thanks!
Very good. David
(In reply to David Bigagli from comment #9)
> When I switch to TaskPlugin=task/cgroup, I see exactly what you see:
> all tasks are constrained to CPU 0. When I look at the cgroup hierarchy,
> I can see it is set up correctly, with 2 tasks on each of the 2 nodes,
> each having 2 CPUs (0-1), and with the correct process IDs in the tasks
> file, yet all 4 tasks run on CPU 0.

I observe the same thing, but it looks like the processes' affinity is set *on top of* the cgroup CPU binding. For instance, using mpirun and the original cgroup-only settings, I get:

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  9492  9491  9491 ?           -1 Sl       0   0:00 slurmstepd: [181597.4]
 9492  9498  9498  9491 ?           -1 S   215845   0:00  \_ /share/sw/free/openmpi/1.6.5/intel/13sp1up1/bin/orted -mca ess slurm -m
 9498  9502  9498  9491 ?           -1 S   215845   0:00      \_ sleep 1000
 9498  9503  9498  9491 ?           -1 S   215845   0:00      \_ sleep 1000

And you're right, the cgroup is set correctly:

# cd /cgroup/cpuset/slurm/uid_215845/job_181597/step_4
# cat cgroup.procs
9492
9498
9502
9503
# cat cpuset.cpus
0,2

BUT, if I look at the current affinity of the processes, it's different:

[root@sh-0-1 step_4]# taskset -p 9492
pid 9492's current affinity mask: 5
[root@sh-0-1 step_4]# taskset -p 9498
pid 9498's current affinity mask: 1
[root@sh-0-1 step_4]# taskset -p 9503
pid 9503's current affinity mask: 1
[root@sh-0-1 step_4]# taskset -p 9502
pid 9502's current affinity mask: 1

So it looks like something sets a task affinity on top of the cgroup CPU binding. Moreover, I can set the task affinity back to what it's supposed to be:

# taskset -pc 0,2 9502
pid 9502's current affinity list: 0
pid 9502's new affinity list: 0,2
# taskset -p 9502
pid 9502's current affinity mask: 5

But if I try to overstep the cgroup CPU binding, it fails and only allows the CPUs that the cgroup permits:

# taskset -pc 0,2,4,5 9502
pid 9502's current affinity list: 0,2
pid 9502's new affinity list: 0,2

That's why I'm under the impression that there may be an extra task-affinity step going on that may not be necessary inside cgroups.
Hi Kilian,
thanks for your analysis. I just wonder: does the solution work for you, or have you found a problem with it?

David
Hi David,

(In reply to David Bigagli from comment #16)
> Hi Kilian,
> thanks for your analysis. I just wonder: does the solution work for you,
> or have you found a problem with it?

The workaround you proposed (stacking 2 task plugins, task/cgroup for memory-limit enforcement and task/affinity for CPU binding) works well, but I'm not entirely satisfied with it. It seems like CPU binding should work with the task/cgroup plugin alone. That's why I'm trying to understand what's going on here, and I believe it could be beneficial for other users too.
Thanks, I understand now. It is on my list to better understand how mpirun with Slurm interacts with cgroups. I will let you know my findings.

David
(In reply to David Bigagli from comment #18)
> Thanks, I understand now. It is on my list to better understand how mpirun
> with Slurm interacts with cgroups. I will let you know my findings.

That would be great, thank you!
Hi,
so the problem is indeed described in that thread, in the first email. Since mpirun uses srun to launch the orted, orted is bound to the first core on each machine, and so are its children.

Version 1.8.1 fixes the issue by adding --cpu_bind=none to the srun options. I installed it and tried it, and now it works as expected.

In 1.7, this is the srun executed by mpirun:

david    22230 22229  0 13:18 pts/11   00:00:00 srun --kill-on-bad-exit --ntasks=2 orted -mca orte_ess_jobid 3543072768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3543072768.0;tcp://192.168.1.78:59415" -mca oob tcp
david    22233 22230  0 13:18 pts/11   00:00:00 srun --kill-on-bad-exit --ntasks=2 orted -mca orte_ess_jobid 3543072768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3543072768.0;tcp://192.168.1.78:59415" -mca oob tcp

and this is in 1.8.1 (note the --cpu_bind=none):

david    28431 28429  0 13:39 pts/11   00:00:00 srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --ntasks=2 orted -mca orte_ess_jobid 3942055936 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3942055936.0;tcp://192.168.1.78:51381"
david    28434 28431  0 13:39 pts/11   00:00:00 srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --ntasks=2 orted -mca orte_ess_jobid 3942055936 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "3942055936.0;tcp://192.168.1.78:51381"

There is even a nice comment in the OMPI code, in openmpi-1.8.1/orte/mca/plm/slurm/plm_slurm_module.c:

    /* ensure the orteds are not bound to a single processor,
     * just in case the TaskAffinity option is set by default.
     * This will *not* release the orteds from any cpu-set
     * constraint, but will ensure it doesn't get
     * bound to only one processor */
    opal_argv_append(&argc, &argv, "--cpu_bind=none");

David
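As an aside, one way to check what any given Open MPI version does here (a simple diagnostic sketch, not something from this ticket) is to look at the srun command that mpirun spawns while a job is running and see whether --cpu_bind=none is present:

$ mpirun sleep 60 &
$ ps -ef | grep "[s]run.*orted"    # fixed versions show --cpu_bind=none in the srun arguments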
(In reply to David Bigagli from comment #20)
> Hi,
> so the problem is indeed described in that thread, in the first email.
> Since mpirun uses srun to launch the orted, orted is bound to the first
> core on each machine, and so are its children.
>
> Version 1.8.1 fixes the issue by adding --cpu_bind=none to the srun options.
> I installed it and tried it, and now it works as expected.
>
> There is even a nice comment in the OMPI code, in
> openmpi-1.8.1/orte/mca/plm/slurm/plm_slurm_module.c

Oh, excellent, I understand now. Thanks a lot for getting to the bottom of this, much appreciated. So I'll be able to advise users to move to Open MPI 1.8 if they prefer to use mpirun.

Thanks!
There's another workaround when you're stuck with older versions of Open MPI: you can also use the MCA parameter plm_slurm_args to add the srun option yourself, like this:

$ mpirun --mca plm_slurm_args '--cpu_bind=none'

It is also possible to set it in the file openmpi-mca-params.conf or via an environment variable.
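For reference, the two other methods mentioned would look roughly like this (the file location and the application name are illustrative assumptions; Open MPI typically reads $HOME/.openmpi/mca-params.conf or <install-prefix>/etc/openmpi-mca-params.conf):

# in openmpi-mca-params.conf:
plm_slurm_args = --cpu_bind=none

# or as an environment variable before launching:
$ export OMPI_MCA_plm_slurm_args="--cpu_bind=none"
$ mpirun ./my_app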