Bug 1111 - Slurm/Intel MPI integration: mpirun -np 16 translated to srun -n 1
Summary: Slurm/Intel MPI integration: mpirun -np 16 translated to srun -n 1
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.03.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
Reported: 2014-09-18 11:51 MDT by Kilian Cavalotti
Modified: 2014-09-23 09:01 MDT
Site: Stanford


Description Kilian Cavalotti 2014-09-18 11:51:39 MDT
Hi,

I noticed a weird thing in the Slurm / Intel MPI integration. Not sure if it's a Slurm or Intel MPI problem, but I thought it was worth reporting.

I'm trying to run the Amber14 benchmarks (http://ambermd.org/Amber14_Benchmark_Suite.tar.bz2). The script they provide uses mpirun in its simplest form (mpirun -np $NCPUS ...).

I've submitted the following script to our scheduler:
-- 8< --------------------------------------------------
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16

ml load amber/14-intel

cd /scratch/users/kilian/test

./run_bench_CPU+GPU.sh
-- 8< --------------------------------------------------

When using OpenMPI, everything is fine: mpirun spawns the right number of MPI tasks, and their CPU bindings let them use all the CPUs on the machine.

When using Intel MPI (Version 5.0 Build 20140507), mpirun launches 16 tasks, but they're all bound to the same CPU core. 

I get the following process tree:
-- 8< --------------------------------------------------
    1 39267 39266 39266 ?           -1 Sl       0   0:00 slurmstepd: [338824]
39267 39271 39271 39266 ?           -1 S    215845   0:00  \_ /bin/bash /var/spool/slurmd/job338824/slurm_script
39271 39279 39271 39266 ?           -1 S    215845   0:00      \_ /bin/bash ./run_bench_CPU+GPU.sh
39279 39280 39271 39266 ?           -1 S    215845   0:00          \_ /bin/sh /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/mpirun -np 16 /share/sw/licensed/amber/amber14-intel/bin/pmemd.MPI -O -i mdin
39280 39308 39271 39266 ?           -1 S    215845   0:00              \_ mpiexec.hydra -machinefile /tmp/slurm_kilian.39280 -np 16 /share/sw/licensed/amber/amber14-intel/bin/pmemd.MPI -O -i mdin.CPU -o mdout. -inf mdinfo. -x mdcrd. -r restrt.
39308 39309 39309 39266 ?           -1 Sl   215845   0:00                  \_ /usr/bin/srun --nodelist sh-7-23 -N 1 -n 1 /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/pmi_proxy --control-port 10.210.46
39309 39310 39309 39266 ?           -1 S    215845   0:00                      \_ /usr/bin/srun --nodelist sh-7-23 -N 1 -n 1 /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/pmi_proxy --control-port 10.21
   
-- 8< --------------------------------------------------

"mpirun -np 16" calls "mpiexec.hydra -np 16" which in turns calls "srun -N 1 -n 1", and that's obviously the reason why the 16 MPI tasks are bound to the same CPU.

I don't know enough about PMI and Slurm/Intel MPI integration to understand who is responsible for what, but that seems to be a problem.


I know the recommended way is to use srun instead of mpirun, but I can't really modify the script (or, more exactly, I want users to be able to run it as is), so I was wondering if you had ideas about why the parameters passed from mpiexec.hydra to srun are wrong. 
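
(For reference, a minimal sketch of the srun-based launch I'd fall back to if the script could be changed; the PMI library path and the pmemd.MPI arguments are illustrative and would need to match the local install:)
-- 8< --------------------------------------------------
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16

ml load amber/14-intel

# Point Intel MPI at Slurm's PMI library so the tasks can be launched with srun
# directly (path is illustrative; use the libpmi.so from the local Slurm build).
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

cd /scratch/users/kilian/test

# Let srun start all 16 tasks; Slurm then handles the CPU binding itself.
srun -n 16 pmemd.MPI -O -i mdin.CPU -o mdout
-- 8< --------------------------------------------------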

Thanks!
Comment 1 David Bigagli 2014-09-19 05:27:48 MDT
Hi Kilian,
          this is probably the same as bug 1049. We don't have the Intel MPI source code, so we can't make it add --cpu_bind=none to the srun command line.

David
Comment 2 Kilian Cavalotti 2014-09-19 06:39:27 MDT
Hi David,

(In reply to David Bigagli from comment #1)
>           this is probably the same as bug 1049. We don't have the Intel MPI
> source code, so we can't make it add --cpu_bind=none to the srun command line.

Right, it looks the same, although it's a different MPI. I opened an issue with Intel for this, so we'll see what they come up with, but I thought you guys should know about this too.

Thanks!
Comment 3 David Bigagli 2014-09-19 06:42:03 MDT
Yes, thanks for the heads-up. My question would be: if you use the affinity plugin, does the issue go away?
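
(For reference, a minimal sketch of the slurm.conf change I have in mind; the commented line shows the cgroup alternative:)
---------------------------
# slurm.conf (excerpt): use the affinity plugin for task binding
# instead of cgroups.
#TaskPlugin=task/cgroup
TaskPlugin=task/affinity
---------------------------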

David
Comment 4 David Bigagli 2014-09-19 06:47:35 MDT
We can try to set SLURM_CPU_BIND=none in the mpirun environment. Let me bring up the Intel MPI environment and give it a try.

David
Comment 5 Kilian Cavalotti 2014-09-19 11:07:54 MDT
(In reply to David Bigagli from comment #4)
> We can try to set SLURM_CPU_BIND=none in the mpirun environment. Let me
> bring up the Intel MPI environment and give it a try.

Setting SLURM_CPU_BIND=none doesn't seem to change much:
-- 8< --------------------------------------------------------
kilian@sh-5-33:~$ salloc -w sh-5-33 -N 1 -n 16 -p test
salloc: Granted job allocation 339314
kilian@sh-5-33:~$ module load intel/2015
kilian@sh-5-33:~$ export SLURM_CPU_BIND=none
kilian@sh-5-33:~$ mpirun -v -np 16 hostname
host: sh-5-33

==================================================================================================
mpiexec options:
----------------
  Base path: /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/
  Launcher: slurm
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
[...]
[mpiexec@sh-5-33.local] Launch arguments: /usr/bin/srun --nodelist sh-5-33 -N 1 -n 1 /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/pmi_proxy --control-port 10.210.47.125:44957 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk user --launcher slurm --demux poll --iface ib0 --pgid 0 --enable-stdin 1 --retries 10 --control-code 1316521041 --usize -2 --proxy-id -1
-- 8< --------------------------------------------------------

Notice the "srun ... -N 1 -n 1".

I also tried with the task/affinity plugin, and got the same "srun -N 1 -n 1" result.
Comment 6 David Bigagli 2014-09-19 11:12:31 MDT
Let me get back to you on this one while I investigate a bit more.

David
Comment 7 David Bigagli 2014-09-22 05:43:33 MDT
Hi Kilian,
         so you had Matteo Renzi as a guest for dinner. :-)
I can reproduce your problem: with the taskset command I can see all MPI programs
bound to 1 CPU, including the pmi_proxy.
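
(A quick sketch of that check, assuming the MPI test program is the ./loop binary from the script below:)
---------------------------
# Print the CPU affinity list of each MPI task and of pmi_proxy;
# with task/cgroup they all report the same single CPU.
for pid in $(pgrep -f loop) $(pgrep pmi_proxy); do
    taskset -cp "$pid"
done
---------------------------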

However, if I don't use cgroups and switch to the task/affinity plugin,
I correctly see tasks bound to CPUs as requested with --cpus-per-task=1.

Also, if I use cgroups and set "export SLURM_CPU_BIND=none" in the batch script:

---------------------------
#!/bin/sh
export SLURM_CPU_BIND=none
for ((i = 1; i <= 1; i++))
do
 mpirun ./loop
done
---------------------------

then I see the processes bound correctly as well.

David
Comment 8 David Bigagli 2014-09-23 09:01:55 MDT
Information provided.

David