| Summary: | Slurm/Intel MPI integration: mpirun -np 16 translated to srun -n 1 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Scheduling | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 14.03.7 | CC: | da |
| Hardware: | Linux | OS: | Linux |
| Site: | Stanford | | |
Description
Kilian Cavalotti
2014-09-18 11:51:39 MDT
David Bigagli (comment #1):

Hi Kilian,

this is probably the same as bug 1049. We don't have the Intel MPI source code to set that --cpu_bind=node on the srun command line.

David

Kilian Cavalotti (comment #2):

Hi David,

(In reply to David Bigagli from comment #1)
> this is probably the same as bug 1049. We don't have the MPI intel
> source code to set that --cpu_bind=node to the srun command line.

Right, it looks the same, although it's a different MPI. I opened an issue with Intel for this, so we'll see what they come up with, but I thought you guys should know about this too.

Thanks!

David Bigagli (comment #3):

Yes, thanks for the heads up. My question would be: if you use the affinity plugin, does the issue go away?

David

David Bigagli (comment #4):

We can try to set SLURM_CPU_BIND=none in the mpirun environment. Let me bring up the Intel MPI environment and give it a try.

David

Kilian Cavalotti (comment #5):

(In reply to David Bigagli from comment #4)
> We can try to set SLURM_CPU_BIND=none in the mpirun environment. Let me
> bring up the Intel MPI environment and give it a try.

Setting SLURM_CPU_BIND=none doesn't seem to change much:

-- 8< --------------------------------------------------------
kilian@sh-5-33:~$ salloc -w sh-5-33 -N 1 -n 16 -p test
salloc: Granted job allocation 339314
kilian@sh-5-33:~$ module load intel/2015
kilian@sh-5-33:~$ export SLURM_CPU_BIND=none
kilian@sh-5-33:~$ mpirun -v -np 16 hostname
host: sh-5-33

==================================================================================================
mpiexec options:
----------------
  Base path: /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/
  Launcher: slurm
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
  [...]

[mpiexec@sh-5-33.local] Launch arguments: /usr/bin/srun --nodelist sh-5-33 -N 1 -n 1 /share/sw/licensed/intel/pstudio_xe_cluster-2015/composer_xe_2015.0.090/mpirt/bin/intel64/pmi_proxy --control-port 10.210.47.125:44957 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk user --launcher slurm --demux poll --iface ib0 --pgid 0 --enable-stdin 1 --retries 10 --control-code 1316521041 --usize -2 --proxy-id -1
-- 8< --------------------------------------------------------

Notice the "srun ... -N 1 -n 1".

I also tried with the task/affinity plugin, and got the same "srun -N 1 -n 1" result.

David Bigagli (comment #6):

Let me get back to you on this one, I'll investigate it a bit more.

David

David Bigagli (comment #7):

Hi Kilian,

so you had Matteo Renzi as a guest for dinner. :-)

I can reproduce your problem: with the taskset command I can see all MPI programs bound to 1 CPU, including the pmi_proxy. However, if I don't use cgroups and switch to the task/affinity plugin, I correctly see tasks bound to CPUs as I requested with --cpus-per-task=1.

Also, if I use cgroups and set export SLURM_CPU_BIND=none in the batch script:

---------------------------
#!/bin/sh

export SLURM_CPU_BIND=none

for ((i = 1; i <= 1; i++))
do
    mpirun ./loop
done
---------------------------

then I see the processes bound correctly as well.

David

David Bigagli (comment #8):

Information provided.

David
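Editor's note: for anyone reproducing this, below is a minimal batch-script sketch of the taskset check David describes in comment #7. It is not from the original report; the module name and the ./loop test binary are placeholders, and it assumes a cluster running the task/cgroup plugin, submitted with sbatch.

---------------------------
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 16

# Sketch only: check whether the MPI ranks inherit the pmi_proxy's
# single-CPU binding. intel/2015 and ./loop are placeholders.
module load intel/2015

# Intel MPI's Hydra launcher bootstraps via "srun -N 1 -n 1 ... pmi_proxy".
mpirun -np 16 ./loop &
sleep 5                     # give the ranks time to start

# Print the CPU affinity of the proxy and of every rank; with task/cgroup
# and no workaround, they all report the same single CPU.
for pid in $(pgrep -u "$USER" -f 'pmi_proxy|loop'); do
    taskset -cp "$pid"
done

wait
---------------------------

Adding export SLURM_CPU_BIND=none before the mpirun line, as in comment #7, should instead show the ranks spread across the allocated CPUs.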