Ticket 6452 - Task affinity using srun with Intel MPI
Summary: Task affinity using srun with Intel MPI
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 19.05.x
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-05 03:25 MST by Karsten Kutzer
Modified: 2019-03-07 06:45 MST

See Also:
Site: Lenovo


Description Karsten Kutzer 2019-02-05 03:25:19 MST
We are trying to get task affinity working with Intel MPI and srun. We set up everything as described at https://slurm.schedmd.com/mpi_guide.html#intel_srun.

We have a test case with mpiexec.hydra that shows that task affinity works in general, but when trying to do the same with srun, the task affinity information is not passed to Intel MPI.
We tried with Intel MPI 2018.5 (test cases below) and with Intel MPI 2019.3 (a pre-release version), and there was no difference. We also tried with PMI enabled and disabled, which did not change the picture (the test case below was run with PMI enabled).

# 1) test case: mpi task affinity using mpiexec.hydra:
I ran the following test with these environment variables:
I_MPI_PIN=1
I_MPI_PIN_DOMAIN=omp
I_MPI_FABRICS=shm:ofi
I_MPI_PLATFORM=skx
I_MPI_EXTRA_FILE_SYSTEM=on
I_MPI_PIN_MODE=pm
I_MPI_HYDRA_IFACE=ib0
I_MPI_DEBUG=5
I_MPI_PIN_ORDER=spread
I_MPI_PIN_CELL=core
I_MPI_ROOT=/dss/dssfs02/opt/intel/impi/2019.3.190131
KMP_AFFINITY=verbose,scatter,granularity=fine
OMP_NUM_THREADS=7
 
Using Intel MPI (for just one node):
> mpiexec.hydra -hosts i08r01c01s02opa -n 5 -ppn 5 placementtest-mpi.intel
…
[0] MPI startup(): Rank    Pid      Node name                Pin cpu
[0] MPI startup(): 0       599964   i08r01c01s02.sng.lrz.de  {0,1,2,3,4,5,6}
[0] MPI startup(): 1       599965   i08r01c01s02.sng.lrz.de  {7,8,9,10,11,12,13}
[0] MPI startup(): 2       599966   i08r01c01s02.sng.lrz.de  {14,15,16,17,18,19,20}
[0] MPI startup(): 3       599967   i08r01c01s02.sng.lrz.de  {21,22,23,24,25,26,27}
[0] MPI startup(): 4       599968   i08r01c01s02.sng.lrz.de  {28,29,30,31,32,33,34}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=,hfi1_0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=5:0 0,1 7,2 14,3 21,4 28
[0] MPI startup(): I_MPI_PLATFORM=skx
    omp_get_num_threads:      7
    omp_get_max_threads:      7
    mpi_tasks          :      5
Hostname        |   cpu |  Rank |Thread
i08r01c01s02.sn | 00000 | 00000 | 00000
i08r01c01s02.sn | 00001 | 00000 | 00001
i08r01c01s02.sn | 00002 | 00000 | 00002
i08r01c01s02.sn | 00003 | 00000 | 00003
i08r01c01s02.sn | 00004 | 00000 | 00004
i08r01c01s02.sn | 00005 | 00000 | 00005
i08r01c01s02.sn | 00006 | 00000 | 00006
i08r01c01s02.sn | 00007 | 00001 | 00000
i08r01c01s02.sn | 00008 | 00001 | 00001
i08r01c01s02.sn | 00009 | 00001 | 00002
i08r01c01s02.sn | 00010 | 00001 | 00003
i08r01c01s02.sn | 00011 | 00001 | 00004
i08r01c01s02.sn | 00012 | 00001 | 00005
i08r01c01s02.sn | 00013 | 00001 | 00006
i08r01c01s02.sn | 00014 | 00002 | 00000
i08r01c01s02.sn | 00015 | 00002 | 00001
i08r01c01s02.sn | 00016 | 00002 | 00002
i08r01c01s02.sn | 00017 | 00002 | 00003
i08r01c01s02.sn | 00018 | 00002 | 00004
i08r01c01s02.sn | 00019 | 00002 | 00005
i08r01c01s02.sn | 00020 | 00002 | 00006
i08r01c01s02.sn | 00021 | 00003 | 00000
i08r01c01s02.sn | 00024 | 00003 | 00001
i08r01c01s02.sn | 00022 | 00003 | 00002
i08r01c01s02.sn | 00025 | 00003 | 00003
i08r01c01s02.sn | 00023 | 00003 | 00004
i08r01c01s02.sn | 00026 | 00003 | 00005
i08r01c01s02.sn | 00027 | 00003 | 00006
i08r01c01s02.sn | 00028 | 00004 | 00000
i08r01c01s02.sn | 00029 | 00004 | 00001
i08r01c01s02.sn | 00030 | 00004 | 00002
i08r01c01s02.sn | 00031 | 00004 | 00003
i08r01c01s02.sn | 00032 | 00004 | 00004
i08r01c01s02.sn | 00033 | 00004 | 00005
i08r01c01s02.sn | 00034 | 00004 | 00006
 
The pinning with Intel MPI looks correct and as expected, and no cores are oversubscribed. The Intel OpenMP runtime also receives the proper affinity from MPI:
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 21-27
OMP: Info #156: KMP_AFFINITY: 7 available OS procs
OMP: Info #158: KMP_AFFINITY: Nonuniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (7 total cores) 
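As a cross-check independent of the I_MPI_DEBUG output, the mask the launcher actually installed can be read back from the kernel with a tiny helper (a sketch; this helper script is not part of the original test):

```python
# print_mask.py: print the CPU affinity mask of the current process,
# as reported by the Linux kernel (what sched_getaffinity returns).
import os

def current_mask():
    # CPUs this process may run on; pid 0 means "the calling process".
    return sorted(os.sched_getaffinity(0))

if __name__ == "__main__":
    print(f"pid {os.getpid()}: allowed CPUs {current_mask()}")
```

Launched under the same command line (e.g. `mpiexec.hydra -n 5 -ppn 5 python3 print_mask.py`), each rank should report the same set shown in the "Pin cpu" column above.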

# 2) test case: using srun to launch the job:
> srun -l -w i08r01c01s02 -p bm -N 1 -n 5 placementtest-mpi.intel
0: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=acfce0
1: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=140fce0
2: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=1882ce0
3: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c08ce0
4: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=23c7ce0
…
0: [0] MPI startup(): Rank    Pid      Node name                Pin cpu
0: [0] MPI startup(): 0       601714   i08r01c01s02.sng.lrz.de  +1
0: [0] MPI startup(): 1       601715   i08r01c01s02.sng.lrz.de  +1
0: [0] MPI startup(): 2       601716   i08r01c01s02.sng.lrz.de  +1
0: [0] MPI startup(): 3       601717   i08r01c01s02.sng.lrz.de  +1
0: [0] MPI startup(): 4       601718   i08r01c01s02.sng.lrz.de  +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:ofi
0: [0] MPI startup(): I_MPI_PLATFORM=skx
0:     omp_get_num_threads:      7
0:     omp_get_max_threads:      7
0:     mpi_tasks          :      5
0: Hostname        |   cpu |  Rank |Thread
0: i08r01c01s02.sn | 00000 | 00000 | 00000   <----
0: i08r01c01s02.sn | 00024 | 00000 | 00001
0: i08r01c01s02.sn | 00001 | 00000 | 00002
0: i08r01c01s02.sn | 00025 | 00000 | 00003
0: i08r01c01s02.sn | 00002 | 00000 | 00004
0: i08r01c01s02.sn | 00026 | 00000 | 00005
0: i08r01c01s02.sn | 00003 | 00000 | 00006
1: i08r01c01s02.sn | 00000 | 00001 | 00000
1: i08r01c01s02.sn | 00024 | 00001 | 00001
1: i08r01c01s02.sn | 00001 | 00001 | 00002
1: i08r01c01s02.sn | 00025 | 00001 | 00003
1: i08r01c01s02.sn | 00002 | 00001 | 00004
1: i08r01c01s02.sn | 00026 | 00001 | 00005
1: i08r01c01s02.sn | 00003 | 00001 | 00006
2: i08r01c01s02.sn | 00000 | 00002 | 00000   <----
2: i08r01c01s02.sn | 00024 | 00002 | 00001
2: i08r01c01s02.sn | 00001 | 00002 | 00002
2: i08r01c01s02.sn | 00025 | 00002 | 00003
2: i08r01c01s02.sn | 00002 | 00002 | 00004
2: i08r01c01s02.sn | 00026 | 00002 | 00005
2: i08r01c01s02.sn | 00003 | 00002 | 00006
3: i08r01c01s02.sn | 00000 | 00003 | 00000   <----
3: i08r01c01s02.sn | 00024 | 00003 | 00001
3: i08r01c01s02.sn | 00001 | 00003 | 00002
3: i08r01c01s02.sn | 00025 | 00003 | 00003
3: i08r01c01s02.sn | 00002 | 00003 | 00004
3: i08r01c01s02.sn | 00026 | 00003 | 00005
3: i08r01c01s02.sn | 00003 | 00003 | 00006
4: i08r01c01s02.sn | 00000 | 00004 | 00000   <----
4: i08r01c01s02.sn | 00024 | 00004 | 00001
4: i08r01c01s02.sn | 00001 | 00004 | 00002
4: i08r01c01s02.sn | 00025 | 00004 | 00003
4: i08r01c01s02.sn | 00002 | 00004 | 00004
4: i08r01c01s02.sn | 00026 | 00004 | 00005
4: i08r01c01s02.sn | 00003 | 00004 | 00006
 
Many threads are oversubscribed, as highlighted. Intel MPI does not receive the affinity settings from srun, and consequently no affinity mask is set for OpenMP:
1: OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
1: OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-95
1: OMP: Info #156: KMP_AFFINITY: 96 available OS procs
1: OMP: Info #157: KMP_AFFINITY: Uniform topology
1: OMP: Info #179: KMP_AFFINITY: 2 packages x 24 cores/pkg x 2 threads/core (48 total cores)
Comment 1 Felip Moll 2019-02-05 08:59:31 MST
Hi Karsten,

First of all, I changed the severity of this ticket to sev-4. Sev-2 is for really
urgent and critical matters. At most I would call this a sev-3, but the
machine is not yet in production.

You can read more about our policies here:
https://www.schedmd.com/support.php

Regarding the bug:

I need you to provide me with your latest slurm.conf and cgroup.conf. I would
also need to see the output of the following command:

srun --mpi=list

Have you exported the I_MPI_PMI_LIBRARY variable with the correct path? Note that you
can also use pmi2, pmix, and so on.


Here's an example of how to run an MPI job:

module load intel/<whatever version> : provides the Intel binaries, libraries, the I_MPI_PMI_LIBRARY setting, and so on.

srun approach:
------------------
#!/bin/bash
#SBATCH --job-name=Hello_MPI
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=14
 
srun ./hello_mpi


Alternatively, use the mpirun approach:
--------------------------------------------
#!/bin/bash
#SBATCH --job-name=Hello_MPI
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=14
 
unset I_MPI_PMI_LIBRARY

# avoid binding all tasks to a single CPU core
export SLURM_CPU_BIND=none
 
mpirun -np $SLURM_NTASKS ./hello_mpi



Please provide me with the requested info and let's see what's missing.
Comment 2 Karsten Kutzer 2019-02-05 09:22:51 MST
Hi Felip,

thanks for responding so quickly. Sorry for the wrong severity; I will use 4 in the future…

I will try to run MPI as you suggested, but I want to get the other information you asked for out now…

The environment variable was set like this:
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

i01r01c01s01:~ # file /usr/lib64/libpmi.so
/usr/lib64/libpmi.so: symbolic link to libpmi.so.0.0.0
i01r01c01s01:~ # file  /usr/lib64/libpmi.so.0.0.0
/usr/lib64/libpmi.so.0.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=61569b106d772725dea52c9cbe8bb1e39b2d5942, not stripped

Here is the output you asked for:
sl01:/etc/slurm # cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes

ConstrainCores=yes
TaskAffinity=yes

ConstrainSwapSpace=yes # ???
AllowedSwapSpace=0     # ???

ConstrainRAMSpace=yes
MaxRAMPercent=90


sl01:/etc/slurm # cat slurm.conf
# Script:        /etc/slurm/slurm.conf
#
# Maintainer:    pmayes@lenovo.com
#      Modified: 2018-06-19
# Last modified: 2018-07-19
# Last modified: 2018-12-10
#
# Description:
# Main Slurm configuration file for SuperMUC-NG

#
# Basics
#
ClusterName=supermucng
ControlMachine=sl01
ControlAddr=sl01opa
#BackupController=sl02
#BackupAddr=sl02opa
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge

#
# State and Control
#
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm
SwitchType=switch/none
MpiDefault=pmi2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
# PrivateData=accounts,jobs,nodes,reservations,usage,users ???????????????????????
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu_bind=no --mpi=none $SHELL"

PropagateResourceLimits=ALL # ???????????????????????
#PropagateResourceLimitsExcept=CPU,NPROC,NOFILE,AS # FMoll
#PropagateResourceLimitsExcept=CPU,NPROC,AS # FMoll

JobSubmitPlugins=lua

#
# Prologs and Epilogs
#
Prolog=/etc/slurm/scripts/Prolog
Epilog=/etc/slurm/scripts/Epilog
SrunProlog=/etc/slurm/scripts/SrunProlog
SrunEpilog=/etc/slurm/scripts/SrunEpilog
TaskProlog=/etc/slurm/scripts/TaskProlog
TaskEpilog=/etc/slurm/scripts/TaskEpilog
PrologFlags=Alloc,Contain # FMoll

#
# Node Health Check
#
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=600
HealthCheckNodeState=IDLE

#TaskPlugin=task/cgroup,task/affinity ???????????????????????
#TaskPlugin=task/affinity
#
# Timers
#
SlurmctldTimeout=420 # CHK, was 300
SlurmdTimeout=420 # CHK, was 300
ResumeTimeout=420 # CHK, wasn't set before
BatchStartTimeout=20 # FMoll
CompleteWait=15 # FMoll
PrologEpilogTimeout=120 # FMoll
#MessageTimeout=30 # FMoll
MessageTimeout=60 # CHK, wasn't set before
MinJobAge=600 # FMoll
InactiveLimit=0
#KillWait=30
KillWait=2
Waittime=0
TCPTimeout=15 # CHK
KillOnBadExit=1
UnkillableStepTimeout=120 # Added by Peter to see if it improves the "Kill task failed" draining issue
UnkillableStepProgram=/etc/slurm/UnkillableStepProgram.sh

#
# Scheduling
#
MaxJobCount=15000
SchedulerType=sched/backfill
FastSchedule=1
#SchedulerAuth=
SelectType=select/cons_res # FMoll
#SelectType=select/linear ???????????????????????
SelectTypeParameters=CR_CPU_Memory # FMoll
PriorityType=priority/multifactor
PriorityWeightAge=1000000
PriorityWeightJobSize=500000
PriorityWeightPartition=500000
PriorityMaxAge=14-0
SchedulerParameters=bf_window=10080,default_queue_depth=10000,bf_interval=30,bf_resolution=1800,bf_max_job_test=3000,bf_max_job_user=800,bf_continue # FMoll
#
#
# Launching
#
LaunchParameters=send_gids
#
# Logging
#
# DebugFlags=NO_CONF_HASH stops the constant warnings about slurm.conf not being the
# same everywhere. This is because we include /etc/slurm.specific.conf, which is
# different on every node
#
DebugFlags=NO_CONF_HASH # ,Energy
SlurmctldDebug=debug
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug
#SlurmdDebug=info
#SlurmdLogFile=/var/log/slurm/slurmd.%n.log
#include /etc/slurm/slurm.specific.conf
include /etc/slurm.specific.conf
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completion.txt

#
# Accounting
#
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
# AccountingStorageEnforce=associations,qos,limits  # ???????????????????????
AcctGatherEnergyType=acct_gather_energy/xcc
AcctGatherNodeFreq=30

# EnforcePartLimits=ALL ???????????????????????

#
# Topology
#
TopologyPlugin=topology/tree
TreeWidth=22 # FMoll
#TreeWidth=7000 # CHK

#
# Compute Nodes
#

PartitionName=DEFAULT Default=NO OverSubscribe=EXCLUSIVE State=UP

PartitionName=bm Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12]
#PartitionName=bm Nodes=i[01-08]r[01-11]c[01-06]s[01-12]

PartitionName=test        Nodes=i01r[01-11]c[01-06]s[01-12]       AllowQOS=test   MinNodes=1  MaxNodes=16  MaxTime=00:30:00 Default=YES

PartitionName=fat         Nodes=f01r[01-02]c[01-06]s[01-12]       AllowQOS=fat           MinNodes=1  MaxNodes=128 MaxTime=48:00:00 PriorityJobFactor=0
PartitionName=micro       Nodes=i[01-02]r[01-11]c[01-06]s[01-12]  AllowQOS=micro         MinNodes=1  MaxNodes=16  MaxTime=48:00:00 PriorityJobFactor=0
PartitionName=general     Nodes=i[02-06]r[01-11]c[01-06]s[01-12]  AllowQOS=general       MinNodes=17 MaxNodes=792 MaxTime=48:00:00 PriorityJobFactor=70
PartitionName=large       Nodes=i[03-08]r[01-11]c[01-06]s[01-12]  AllowQOS=large         MinNodes=793 MaxNodes=3168 MaxTime=12:00:00 PriorityJobFactor=100
#PartitionName=tmp1        Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=tmp1          AllowGroups=vip
#PartitionName=tmp2        Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=tmp2          AllowGroups=vip
#PartitionName=tmp3        Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=tmp3          AllowGroups=vip

#NodeName=DEFAULT CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=96322 Features=hot,thin,PROJECT1,PROJECT2,SCRATCH
NodeName=DEFAULT CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=88258

#
# Thin Node Island 1
#
NodeName=DEFAULT Features=i01,hot,thin,work,scratch,dss
NodeName=i01r01c[01-06]s[01-12] NodeAddr=172.16.192.[1-72]
NodeName=i01r02c[01-06]s[01-12] NodeAddr=172.16.192.[81-152]
NodeName=i01r03c[01-06]s[01-12] NodeAddr=172.16.192.[161-232]
NodeName=i01r04c[01-06]s[01-12] NodeAddr=172.16.193.[1-72]
NodeName=i01r05c[01-06]s[01-12] NodeAddr=172.16.193.[81-152]
NodeName=i01r06c[01-06]s[01-12] NodeAddr=172.16.193.[161-232]
NodeName=i01r07c[01-06]s[01-12] NodeAddr=172.16.194.[1-72]
NodeName=i01r08c[01-06]s[01-12] NodeAddr=172.16.194.[81-152]
NodeName=i01r09c[01-06]s[01-12] NodeAddr=172.16.194.[161-232]
NodeName=i01r10c[01-06]s[01-12] NodeAddr=172.16.195.[1-72]
NodeName=i01r11c[01-06]s[01-12] NodeAddr=172.16.195.[81-152]
#
# Thin Node Island 2
#
NodeName=DEFAULT Features=i02,hot,thin,work,scratch,dss
NodeName=i02r01c[01-06]s[01-12] NodeAddr=172.16.196.[1-72]
NodeName=i02r02c[01-06]s[01-12] NodeAddr=172.16.196.[81-152]
NodeName=i02r03c[01-06]s[01-12] NodeAddr=172.16.196.[161-232]
NodeName=i02r04c[01-06]s[01-12] NodeAddr=172.16.197.[1-72]
NodeName=i02r05c[01-06]s[01-12] NodeAddr=172.16.197.[81-152]
NodeName=i02r06c[01-06]s[01-12] NodeAddr=172.16.197.[161-232]
NodeName=i02r07c[01-06]s[01-12] NodeAddr=172.16.198.[1-72]
NodeName=i02r08c[01-06]s[01-12] NodeAddr=172.16.198.[81-152]
NodeName=i02r09c[01-06]s[01-12] NodeAddr=172.16.198.[161-232]
NodeName=i02r10c[01-06]s[01-12] NodeAddr=172.16.199.[1-72]
NodeName=i02r11c[01-06]s[01-12] NodeAddr=172.16.199.[81-152]
#
# Thin Node Island 3
#
NodeName=DEFAULT Features=i03,hot,thin,work,scratch,dss
NodeName=i03r01c[01-06]s[01-12] NodeAddr=172.16.200.[1-72]
NodeName=i03r02c[01-06]s[01-12] NodeAddr=172.16.200.[81-152]
NodeName=i03r03c[01-06]s[01-12] NodeAddr=172.16.200.[161-232]
NodeName=i03r04c[01-06]s[01-12] NodeAddr=172.16.201.[1-72]
NodeName=i03r05c[01-06]s[01-12] NodeAddr=172.16.201.[81-152]
NodeName=i03r06c[01-06]s[01-12] NodeAddr=172.16.201.[161-232]
NodeName=i03r07c[01-06]s[01-12] NodeAddr=172.16.202.[1-72]
NodeName=i03r08c[01-06]s[01-12] NodeAddr=172.16.202.[81-152]
NodeName=i03r09c[01-06]s[01-12] NodeAddr=172.16.202.[161-232]
NodeName=i03r10c[01-06]s[01-12] NodeAddr=172.16.203.[1-72]
NodeName=i03r11c[01-06]s[01-12] NodeAddr=172.16.203.[81-152]
#
# Thin Node Island 4
#
NodeName=DEFAULT Features=i04,hot,thin,work,scratch,dss
NodeName=i04r01c[01-06]s[01-12] NodeAddr=172.16.204.[1-72]
NodeName=i04r02c[01-06]s[01-12] NodeAddr=172.16.204.[81-152]
NodeName=i04r03c[01-06]s[01-12] NodeAddr=172.16.204.[161-232]
NodeName=i04r04c[01-06]s[01-12] NodeAddr=172.16.205.[1-72]
NodeName=i04r05c[01-06]s[01-12] NodeAddr=172.16.205.[81-152]
NodeName=i04r06c[01-06]s[01-12] NodeAddr=172.16.205.[161-232]
NodeName=i04r07c[01-06]s[01-12] NodeAddr=172.16.206.[1-72]
NodeName=i04r08c[01-06]s[01-12] NodeAddr=172.16.206.[81-152]
NodeName=i04r09c[01-06]s[01-12] NodeAddr=172.16.206.[161-232]
NodeName=i04r10c[01-06]s[01-12] NodeAddr=172.16.207.[1-72]
NodeName=i04r11c[01-06]s[01-12] NodeAddr=172.16.207.[81-152]

#
# Thin Node Island 5
#
NodeName=DEFAULT Features=i05,cold,thin,work,scratch,dss
NodeName=i05r01c[01-06]s[01-12] NodeAddr=172.16.208.[1-72]
NodeName=i05r02c[01-06]s[01-12] NodeAddr=172.16.208.[81-152]
NodeName=i05r03c[01-06]s[01-12] NodeAddr=172.16.208.[161-232]
NodeName=i05r04c[01-06]s[01-12] NodeAddr=172.16.209.[1-72]
NodeName=i05r05c[01-06]s[01-12] NodeAddr=172.16.209.[81-152]
NodeName=i05r06c[01-06]s[01-12] NodeAddr=172.16.209.[161-232]
NodeName=i05r07c[01-06]s[01-12] NodeAddr=172.16.210.[1-72]
NodeName=i05r08c[01-06]s[01-12] NodeAddr=172.16.210.[81-152]
NodeName=i05r09c[01-06]s[01-12] NodeAddr=172.16.210.[161-232]
NodeName=i05r10c[01-06]s[01-12] NodeAddr=172.16.211.[1-72]
NodeName=i05r11c[01-06]s[01-12] NodeAddr=172.16.211.[81-152]
#
# Thin Node Island 6
#
NodeName=DEFAULT Features=i06,cold,thin,work,scratch,dss
NodeName=i06r01c[01-06]s[01-12] NodeAddr=172.16.212.[1-72]
NodeName=i06r02c[01-06]s[01-12] NodeAddr=172.16.212.[81-152]
NodeName=i06r03c[01-06]s[01-12] NodeAddr=172.16.212.[161-232]
NodeName=i06r04c[01-06]s[01-12] NodeAddr=172.16.213.[1-72]
NodeName=i06r05c[01-06]s[01-12] NodeAddr=172.16.213.[81-152]
NodeName=i06r06c[01-06]s[01-12] NodeAddr=172.16.213.[161-232]
NodeName=i06r07c[01-06]s[01-12] NodeAddr=172.16.214.[1-72]
NodeName=i06r08c[01-06]s[01-12] NodeAddr=172.16.214.[81-152]
NodeName=i06r09c[01-06]s[01-12] NodeAddr=172.16.214.[161-232]
NodeName=i06r10c[01-06]s[01-12] NodeAddr=172.16.215.[1-72]
NodeName=i06r11c[01-06]s[01-12] NodeAddr=172.16.215.[81-152]
#
# Thin Node Island 7
#
NodeName=DEFAULT Features=i07,cold,thin,work,scratch,dss
NodeName=i07r01c[01-06]s[01-12] NodeAddr=172.16.216.[1-72]
NodeName=i07r02c[01-06]s[01-12] NodeAddr=172.16.216.[81-152]
NodeName=i07r03c[01-06]s[01-12] NodeAddr=172.16.216.[161-232]
NodeName=i07r04c[01-06]s[01-12] NodeAddr=172.16.217.[1-72]
NodeName=i07r05c[01-06]s[01-12] NodeAddr=172.16.217.[81-152]
NodeName=i07r06c[01-06]s[01-12] NodeAddr=172.16.217.[161-232]
NodeName=i07r07c[01-06]s[01-12] NodeAddr=172.16.218.[1-72]
NodeName=i07r08c[01-06]s[01-12] NodeAddr=172.16.218.[81-152]
NodeName=i07r09c[01-06]s[01-12] NodeAddr=172.16.218.[161-232]
NodeName=i07r10c[01-06]s[01-12] NodeAddr=172.16.219.[1-72]
NodeName=i07r11c[01-06]s[01-12] NodeAddr=172.16.219.[81-152]
#
# Thin Node Island 8
#
NodeName=DEFAULT Features=i08,cold,thin,work,scratch,dss
NodeName=i08r01c[01-06]s[01-12] NodeAddr=172.16.220.[1-72]
NodeName=i08r02c[01-06]s[01-12] NodeAddr=172.16.220.[81-152]
NodeName=i08r03c[01-06]s[01-12] NodeAddr=172.16.220.[161-232]
NodeName=i08r04c[01-06]s[01-12] NodeAddr=172.16.221.[1-72]
NodeName=i08r05c[01-06]s[01-12] NodeAddr=172.16.221.[81-152]
NodeName=i08r06c[01-06]s[01-12] NodeAddr=172.16.221.[161-232]
NodeName=i08r07c[01-06]s[01-12] NodeAddr=172.16.222.[1-72]
NodeName=i08r08c[01-06]s[01-12] NodeAddr=172.16.222.[81-152]
NodeName=i08r09c[01-06]s[01-12] NodeAddr=172.16.222.[161-232]
NodeName=i08r10c[01-06]s[01-12] NodeAddr=172.16.223.[1-72]
NodeName=i08r11c[01-06]s[01-12] NodeAddr=172.16.223.[81-152]
#
# Fat Node Island
#
NodeName=DEFAULT RealMemory=773697
NodeName=DEFAULT Features=f01,fat,work,scratch,dss
NodeName=f01r01c[01-06]s[01-12] NodeAddr=172.16.224.[1-72]
NodeName=f01r02c[01-06]s[01-12] NodeAddr=172.16.224.[81-152]


sl01:/etc/slurm # srun --mpi=list
srun: MPI types are...
srun: pmi2
srun: none
srun: openmpi




Best regards

Karsten Kutzer
Sr. Sales Engineer High Performance Computing


Comment 3 Felip Moll 2019-02-05 11:32:13 MST
Karsten:

Please ensure this is the Slurm library and not one coming from another package (just to be sure):
 export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

An 'ldd /usr/lib64/libpmi.so' will show whether it is linked against Slurm.

>sl01:/etc/slurm # srun --mpi=list
>srun: MPI types are...
>srun: pmi2
>srun: none
>srun: openmpi

You can also try the --mpi=pmi2 switch when running a job, but I see this is already your default
in slurm.conf, so there is really no need to.

Next, I see you have not set the task plugin correctly in slurm.conf. Please set:

TaskPlugin=task/cgroup,task/affinity

and in cgroup.conf comment out this line or set it to No:

# TaskAffinity=yes


Restart slurmctld and retry.
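For clarity, the relevant fragments after these changes would look like this (a sketch of just the lines discussed, not your full files):

```
# slurm.conf
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
#TaskAffinity=yes    <- commented out; task/affinity now handles binding
```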


What's in /etc/slurm.specific.conf ?

I also suggest disabling AcctGatherNodeFreq=30; it will just create noise and is not needed. The real
XCC gathering frequency for jobs is set in acct_gather.conf.

 
Check also this:
 TreeWidth=22 # FMoll

This number should be set to the square root of the number of nodes in the cluster for systems with no more than
2500 nodes, or to the cube root for larger systems.
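That rule of thumb is easy to apply in a couple of lines (a sketch; the node count below is an estimate derived from the slurm.conf in comment 2, not an authoritative figure):

```python
import math

def recommended_treewidth(nodes: int) -> int:
    # Guidance above: square root of the node count for clusters of up
    # to 2500 nodes, cube root for larger systems; round up.
    root = nodes ** (1 / 3) if nodes > 2500 else math.sqrt(nodes)
    return math.ceil(root)

# Rough count from the config above: 8 thin islands of 11*6*12 nodes
# plus one fat island of 2*6*12 nodes.
nodes = 8 * 11 * 6 * 12 + 2 * 6 * 12   # 6480
print(nodes, recommended_treewidth(nodes))
```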


Make these changes and tell me how it goes.
Comment 5 Felip Moll 2019-02-11 09:08:41 MST
I want to add a little bit more information here.

To run with Intel MPI and pmi2, you should set:

I_MPI_PMI_LIBRARY=<path_to_slurm>/libpmi2.so
srun --mpi=pmi2 ...

You can even unset I_MPI_PMI_LIBRARY and Intel MPI will fall back
to its default (which I think is some internal pmi2 implementation, not sure):

unset I_MPI_PMI_LIBRARY
srun --mpi=pmi2 ...

Then, for the affinity, please follow my comment 3:

So... please set slurm.conf:

TaskPlugin=task/cgroup,task/affinity

and in cgroup.conf comment out this line or set it to No:

# TaskAffinity=yes

and keep this value:

ConstrainCores=yes


Restart slurmctld/slurmd and retry.
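Putting comments 3 and 5 together, a minimal batch script for the 5-task x 7-thread case from the description might look like this (a sketch; the libpmi2.so path and the binary name are taken from this ticket and may differ on your system):

```shell
#!/bin/bash
#SBATCH --job-name=placementtest
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=7

export OMP_NUM_THREADS=7
# Site-specific path; must point at the libpmi2.so shipped with Slurm.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

# --cpu-bind=verbose makes srun print the mask it assigns to each task.
srun --mpi=pmi2 --cpu-bind=verbose ./placementtest-mpi.intel
```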
Comment 6 Felip Moll 2019-02-19 06:09:45 MST
Hi Karsten,

Did you apply the suggested changes in the end? Is everything working properly now? If everything is fine, can I close the ticket?

Thanks
Comment 8 Felip Moll 2019-03-07 06:45:40 MST
I am closing this issue for now. If you make any progress and
find that it is still not working, just reopen it.

Regards