We are trying to get task affinity running with Intel MPI and srun. We set everything up as described at https://slurm.schedmd.com/mpi_guide.html#intel_srun. We have a test case with mpiexec.hydra that shows that task affinity works in general, but when we try to do the same with srun, the task affinity information is not passed to Intel MPI. We tried with Intel MPI 2018.5 (test cases below) and with Intel MPI 2019.3 (pre-release version); there was no difference. We also tried with PMI enabled and disabled, which did not change the picture (the test case below was run with PMI enabled).

# 1) test case: MPI task affinity using mpiexec.hydra

I run the test with the following environment variables set:

I_MPI_PIN=1
I_MPI_PIN_DOMAIN=omp
I_MPI_FABRICS=shm:ofi
I_MPI_PLATFORM=skx
I_MPI_EXTRA_FILE_SYSTEM=on
I_MPI_PIN_MODE=pm
I_MPI_HYDRA_IFACE=ib0
I_MPI_DEBUG=5
I_MPI_PIN_ORDER=spread
I_MPI_PIN_CELL=core
I_MPI_ROOT=/dss/dssfs02/opt/intel/impi/2019.3.190131
KMP_AFFINITY=verbose,scatter,granularity=fine
OMP_NUM_THREADS=7

Using Intel MPI (for just one node):

> mpiexec.hydra -hosts i08r01c01s02opa -n 5 -ppn 5 placementtest-mpi.intel
…
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 599964 i08r01c01s02.sng.lrz.de {0,1,2,3,4,5,6}
[0] MPI startup(): 1 599965 i08r01c01s02.sng.lrz.de {7,8,9,10,11,12,13}
[0] MPI startup(): 2 599966 i08r01c01s02.sng.lrz.de {14,15,16,17,18,19,20}
[0] MPI startup(): 3 599967 i08r01c01s02.sng.lrz.de {21,22,23,24,25,26,27}
[0] MPI startup(): 4 599968 i08r01c01s02.sng.lrz.de {28,29,30,31,32,33,34}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=,hfi1_0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=5:0 0,1 7,2 14,3 21,4 28
[0] MPI startup(): I_MPI_PLATFORM=skx
omp_get_num_threads: 7
omp_get_max_threads: 7
mpi_tasks : 5
Hostname        | cpu   | Rank  | Thread
i08r01c01s02.sn | 00000 | 00000 | 00000
i08r01c01s02.sn | 00001 | 00000 | 00001
i08r01c01s02.sn | 00002 | 00000 | 00002
i08r01c01s02.sn | 00003 | 00000 | 00003
i08r01c01s02.sn | 00004 | 00000 | 00004
i08r01c01s02.sn | 00005 | 00000 | 00005
i08r01c01s02.sn | 00006 | 00000 | 00006
i08r01c01s02.sn | 00007 | 00001 | 00000
i08r01c01s02.sn | 00008 | 00001 | 00001
i08r01c01s02.sn | 00009 | 00001 | 00002
i08r01c01s02.sn | 00010 | 00001 | 00003
i08r01c01s02.sn | 00011 | 00001 | 00004
i08r01c01s02.sn | 00012 | 00001 | 00005
i08r01c01s02.sn | 00013 | 00001 | 00006
i08r01c01s02.sn | 00014 | 00002 | 00000
i08r01c01s02.sn | 00015 | 00002 | 00001
i08r01c01s02.sn | 00016 | 00002 | 00002
i08r01c01s02.sn | 00017 | 00002 | 00003
i08r01c01s02.sn | 00018 | 00002 | 00004
i08r01c01s02.sn | 00019 | 00002 | 00005
i08r01c01s02.sn | 00020 | 00002 | 00006
i08r01c01s02.sn | 00021 | 00003 | 00000
i08r01c01s02.sn | 00024 | 00003 | 00001
i08r01c01s02.sn | 00022 | 00003 | 00002
i08r01c01s02.sn | 00025 | 00003 | 00003
i08r01c01s02.sn | 00023 | 00003 | 00004
i08r01c01s02.sn | 00026 | 00003 | 00005
i08r01c01s02.sn | 00027 | 00003 | 00006
i08r01c01s02.sn | 00028 | 00004 | 00000
i08r01c01s02.sn | 00029 | 00004 | 00001
i08r01c01s02.sn | 00030 | 00004 | 00002
i08r01c01s02.sn | 00031 | 00004 | 00003
i08r01c01s02.sn | 00032 | 00004 | 00004
i08r01c01s02.sn | 00033 | 00004 | 00005
i08r01c01s02.sn | 00034 | 00004 | 00006

The pinning with Intel MPI looks OK and as expected, and no cores are oversubscribed. The Intel OpenMP runtime also receives the proper affinity from MPI:

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 21-27
OMP: Info #156: KMP_AFFINITY: 7 available OS procs
OMP: Info #158: KMP_AFFINITY: Nonuniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (7 total cores)

# 2) test case: using srun to launch the job

> srun -l -w i08r01c01s02 -p bm -N 1 -n 5 placementtest-mpi.intel
0: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=acfce0
1: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=140fce0
2: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=1882ce0
3: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c08ce0
4: [-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=23c7ce0
…
0: [0] MPI startup(): Rank Pid Node name Pin cpu
0: [0] MPI startup(): 0 601714 i08r01c01s02.sng.lrz.de +1
0: [0] MPI startup(): 1 601715 i08r01c01s02.sng.lrz.de +1
0: [0] MPI startup(): 2 601716 i08r01c01s02.sng.lrz.de +1
0: [0] MPI startup(): 3 601717 i08r01c01s02.sng.lrz.de +1
0: [0] MPI startup(): 4 601718 i08r01c01s02.sng.lrz.de +1
0: [0] MPI startup(): I_MPI_DEBUG=5
0: [0] MPI startup(): I_MPI_FABRICS=shm:ofi
0: [0] MPI startup(): I_MPI_PLATFORM=skx
0: omp_get_num_threads: 7
0: omp_get_max_threads: 7
0: mpi_tasks : 5
0: Hostname        | cpu   | Rank  | Thread
0: i08r01c01s02.sn | 00000 | 00000 | 00000 <----
0: i08r01c01s02.sn | 00024 | 00000 | 00001
0: i08r01c01s02.sn | 00001 | 00000 | 00002
0: i08r01c01s02.sn | 00025 | 00000 | 00003
0: i08r01c01s02.sn | 00002 | 00000 | 00004
0: i08r01c01s02.sn | 00026 | 00000 | 00005
0: i08r01c01s02.sn | 00003 | 00000 | 00006
1: i08r01c01s02.sn | 00000 | 00001 | 00000
1: i08r01c01s02.sn | 00024 | 00001 | 00001
1: i08r01c01s02.sn | 00001 | 00001 | 00002
1: i08r01c01s02.sn | 00025 | 00001 | 00003
1: i08r01c01s02.sn | 00002 | 00001 | 00004
1: i08r01c01s02.sn | 00026 | 00001 | 00005
1: i08r01c01s02.sn | 00003 | 00001 | 00006
2: i08r01c01s02.sn | 00000 | 00002 | 00000 <----
2: i08r01c01s02.sn | 00024 | 00002 | 00001
2: i08r01c01s02.sn | 00001 | 00002 | 00002
2: i08r01c01s02.sn | 00025 | 00002 | 00003
2: i08r01c01s02.sn | 00002 | 00002 | 00004
2: i08r01c01s02.sn | 00026 | 00002 | 00005
2: i08r01c01s02.sn | 00003 | 00002 | 00006
3: i08r01c01s02.sn | 00000 | 00003 | 00000 <----
3: i08r01c01s02.sn | 00024 | 00003 | 00001
3: i08r01c01s02.sn | 00001 | 00003 | 00002
3: i08r01c01s02.sn | 00025 | 00003 | 00003
3: i08r01c01s02.sn | 00002 | 00003 | 00004
3: i08r01c01s02.sn | 00026 | 00003 | 00005
3: i08r01c01s02.sn | 00003 | 00003 | 00006
4: i08r01c01s02.sn | 00000 | 00004 | 00000 <----
4: i08r01c01s02.sn | 00024 | 00004 | 00001
4: i08r01c01s02.sn | 00001 | 00004 | 00002
4: i08r01c01s02.sn | 00025 | 00004 | 00003
4: i08r01c01s02.sn | 00002 | 00004 | 00004
4: i08r01c01s02.sn | 00026 | 00004 | 00005
4: i08r01c01s02.sn | 00003 | 00004 | 00006

Many threads are oversubscribed, as marked with "<----". Intel MPI does not receive the affinity settings from srun, and consequently no affinity mask is set for OpenMP:

1: OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
1: OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-95
1: OMP: Info #156: KMP_AFFINITY: 96 available OS procs
1: OMP: Info #157: KMP_AFFINITY: Uniform topology
1: OMP: Info #179: KMP_AFFINITY: 2 packages x 24 cores/pkg x 2 threads/core (48 total cores)
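As a side note, the CPU mask each task actually inherits from the launcher can also be read straight from the kernel, independent of Intel MPI's debug output. This is a generic Linux check, not part of the placementtest binary above; the srun invocation shown in the comment mirrors the one from test case 2:

```shell
# Print the CPU affinity mask the kernel has set for the current process.
# Running one instance per task under the launcher shows the inherited masks, e.g.:
#   srun -l -w i08r01c01s02 -N 1 -n 5 bash -c 'grep Cpus_allowed_list /proc/self/status'
# With working affinity, each task should report a distinct, narrow CPU list.
grep Cpus_allowed_list /proc/self/status
```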
Hi Karsten,

First of all, I changed the severity of this bug to sev-4. Sev-2 is for really urgent and critical matters. At most I would say this is a sev-3, but the machine is not yet in production. You can read more about our policies here: https://www.schedmd.com/support.php

Regarding the bug itself: I need you to provide me with the latest slurm.conf and cgroup.conf. I would also need to see the output of the following command:

srun --mpi=list

Have you exported the I_MPI_PMI_LIBRARY variable with the correct path? Note you can also use pmi2, pmix, and so on.

Here's an example of how to run an MPI job:

module load intel/<whatever version>   # includes intel binaries, libraries, the I_MPI_PMI_LIBRARY setting, and so on

srun approach:
------------------
#!/bin/bash
#SBATCH --job-name=Hello_MPI
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=14

srun ./hello_mpi

Alternatively, use the mpirun approach:
--------------------------------------------
#!/bin/bash
#SBATCH --job-name=Hello_MPI
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=14

unset I_MPI_PMI_LIBRARY
# avoid binding all tasks to a single CPU core
export SLURM_CPU_BIND=none
mpirun -np $SLURM_NTASKS ./hello_mpi

Please provide me with the requested info and let's see what's missing.
Hi Felip,

thanks for responding so quickly. Sorry for the wrong severity; I will use 4 in the future. I will try to run MPI as you suggested, but I want to get the other information you asked for out now.

The environment variable was set like this:

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

i01r01c01s01:~ # file /usr/lib64/libpmi.so
/usr/lib64/libpmi.so: symbolic link to libpmi.so.0.0.0
i01r01c01s01:~ # file /usr/lib64/libpmi.so.0.0.0
/usr/lib64/libpmi.so.0.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=61569b106d772725dea52c9cbe8bb1e39b2d5942, not stripped

Here is the output you asked for:

sl01:/etc/slurm # cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=yes
ConstrainSwapSpace=yes  # ???
AllowedSwapSpace=0      # ???
ConstrainRAMSpace=yes
MaxRAMPercent=90

sl01:/etc/slurm # cat slurm.conf
# Script:        /etc/slurm/slurm.conf
#
# Maintainer:    pmayes@lenovo.com
# Modified:      2018-06-19
# Last modified: 2018-07-19
# Last modified: 2018-12-10
#
# Description:
# Main Slurm configuration file for SuperMUC-NG
#
# Basics
#
ClusterName=supermucng
ControlMachine=sl01
ControlAddr=sl01opa
#BackupController=sl02
#BackupAddr=sl02opa
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#
# State and Control
#
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm
SwitchType=switch/none
MpiDefault=pmi2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
# PrivateData=accounts,jobs,nodes,reservations,usage,users ???????????????????????
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu_bind=no --mpi=none $SHELL"
PropagateResourceLimits=ALL # ???????????????????????
#PropagateResourceLimitsExcept=CPU,NPROC,NOFILE,AS # FMoll
#PropagateResourceLimitsExcept=CPU,NPROC,AS # FMoll
JobSubmitPlugins=lua
#
# Prologs and Epilogs
#
Prolog=/etc/slurm/scripts/Prolog
Epilog=/etc/slurm/scripts/Epilog
SrunProlog=/etc/slurm/scripts/SrunProlog
SrunEpilog=/etc/slurm/scripts/SrunEpilog
TaskProlog=/etc/slurm/scripts/TaskProlog
TaskEpilog=/etc/slurm/scripts/TaskEpilog
PrologFlags=Alloc,Contain # FMoll
#
# Node Health Check
#
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=600
HealthCheckNodeState=IDLE
#TaskPlugin=task/cgroup,task/affinity ???????????????????????
#TaskPlugin=task/affinity
#
# Timers
#
SlurmctldTimeout=420 # CHK, was 300
SlurmdTimeout=420 # CHK, was 300
ResumeTimeout=420 # CHK, wasn't set before
BatchStartTimeout=20 # FMoll
CompleteWait=15 # FMoll
PrologEpilogTimeout=120 # FMoll
#MessageTimeout=30 # FMoll
MessageTimeout=60 # CHK, wasn't set before
MinJobAge=600 # FMoll
InactiveLimit=0
#KillWait=30
KillWait=2
Waittime=0
TCPTimeout=15 # CHK
KillOnBadExit=1
UnkillableStepTimeout=120 # Added by Peter to see if it improves the "Kill task failed" draining issue
UnkillableStepProgram=/etc/slurm/UnkillableStepProgram.sh
#
# Scheduling
#
MaxJobCount=15000
SchedulerType=sched/backfill
FastSchedule=1
#SchedulerAuth=
SelectType=select/cons_res # FMoll
#SelectType=select/linear ???????????????????????
SelectTypeParameters=CR_CPU_Memory # FMoll
PriorityType=priority/multifactor
PriorityWeightAge=1000000
PriorityWeightJobSize=500000
PriorityWeightPartition=500000
PriorityMaxAge=14-0
SchedulerParameters=bf_window=10080,default_queue_depth=10000,bf_interval=30,bf_resolution=1800,bf_max_job_test=3000,bf_max_job_user=800,bf_continue # FMoll
#
#
# Launching
#
LaunchParameters=send_gids
#
# Logging
#
# DebugFlags=NO_CONF_HASH stops the constant warnings about slurm.conf not being the
# same everywhere. This is because we include /etc/slurm.specific.conf, which is
# different on every node
# DebugFlags=NO_CONF_HASH # ,Energy
SlurmctldDebug=debug
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug
#SlurmdDebug=info
#SlurmdLogFile=/var/log/slurm/slurmd.%n.log
#include /etc/slurm/slurm.specific.conf
include /etc/slurm.specific.conf
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completion.txt
#
# Accounting
#
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
# AccountingStorageEnforce=associations,qos,limits # ???????????????????????
AcctGatherEnergyType=acct_gather_energy/xcc
AcctGatherNodeFreq=30
# EnforcePartLimits=ALL ???????????????????????
#
# Topology
#
TopologyPlugin=topology/tree
TreeWidth=22 # FMoll
#TreeWidth=7000 # CHK
#
# Compute Nodes
#
PartitionName=DEFAULT Default=NO OverSubscribe=EXCLUSIVE State=UP
PartitionName=bm Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12]
#PartitionName=bm Nodes=i[01-08]r[01-11]c[01-06]s[01-12]
PartitionName=test Nodes=i01r[01-11]c[01-06]s[01-12] AllowQOS=test MinNodes=1 MaxNodes=16 MaxTime=00:30:00 Default=YES
PartitionName=fat Nodes=f01r[01-02]c[01-06]s[01-12] AllowQOS=fat MinNodes=1 MaxNodes=128 MaxTime=48:00:00 PriorityJobFactor=0
PartitionName=micro Nodes=i[01-02]r[01-11]c[01-06]s[01-12] AllowQOS=micro MinNodes=1 MaxNodes=16 MaxTime=48:00:00 PriorityJobFactor=0
PartitionName=general Nodes=i[02-06]r[01-11]c[01-06]s[01-12] AllowQOS=general MinNodes=17 MaxNodes=792 MaxTime=48:00:00 PriorityJobFactor=70
PartitionName=large Nodes=i[03-08]r[01-11]c[01-06]s[01-12] AllowQOS=large MinNodes=793 MaxNodes=3168 MaxTime=12:00:00 PriorityJobFactor=100
#PartitionName=tmp1 Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=tmp1 AllowGroups=vip
#PartitionName=tmp2 Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=tmp2 AllowGroups=vip
#PartitionName=tmp3 Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=tmp3 AllowGroups=vip
#NodeName=DEFAULT CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=96322 Features=hot,thin,PROJECT1,PROJECT2,SCRATCH
NodeName=DEFAULT CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=88258
#
# Thin Node Island 1
#
NodeName=DEFAULT Features=i01,hot,thin,work,scratch,dss
NodeName=i01r01c[01-06]s[01-12] NodeAddr=172.16.192.[1-72]
NodeName=i01r02c[01-06]s[01-12] NodeAddr=172.16.192.[81-152]
NodeName=i01r03c[01-06]s[01-12] NodeAddr=172.16.192.[161-232]
NodeName=i01r04c[01-06]s[01-12] NodeAddr=172.16.193.[1-72]
NodeName=i01r05c[01-06]s[01-12] NodeAddr=172.16.193.[81-152]
NodeName=i01r06c[01-06]s[01-12] NodeAddr=172.16.193.[161-232]
NodeName=i01r07c[01-06]s[01-12] NodeAddr=172.16.194.[1-72]
NodeName=i01r08c[01-06]s[01-12] NodeAddr=172.16.194.[81-152]
NodeName=i01r09c[01-06]s[01-12] NodeAddr=172.16.194.[161-232]
NodeName=i01r10c[01-06]s[01-12] NodeAddr=172.16.195.[1-72]
NodeName=i01r11c[01-06]s[01-12] NodeAddr=172.16.195.[81-152]
#
# Thin Node Island 2
#
NodeName=DEFAULT Features=i02,hot,thin,work,scratch,dss
NodeName=i02r01c[01-06]s[01-12] NodeAddr=172.16.196.[1-72]
NodeName=i02r02c[01-06]s[01-12] NodeAddr=172.16.196.[81-152]
NodeName=i02r03c[01-06]s[01-12] NodeAddr=172.16.196.[161-232]
NodeName=i02r04c[01-06]s[01-12] NodeAddr=172.16.197.[1-72]
NodeName=i02r05c[01-06]s[01-12] NodeAddr=172.16.197.[81-152]
NodeName=i02r06c[01-06]s[01-12] NodeAddr=172.16.197.[161-232]
NodeName=i02r07c[01-06]s[01-12] NodeAddr=172.16.198.[1-72]
NodeName=i02r08c[01-06]s[01-12] NodeAddr=172.16.198.[81-152]
NodeName=i02r09c[01-06]s[01-12] NodeAddr=172.16.198.[161-232]
NodeName=i02r10c[01-06]s[01-12] NodeAddr=172.16.199.[1-72]
NodeName=i02r11c[01-06]s[01-12] NodeAddr=172.16.199.[81-152]
#
# Thin Node Island 3
#
NodeName=DEFAULT Features=i03,hot,thin,work,scratch,dss
NodeName=i03r01c[01-06]s[01-12] NodeAddr=172.16.200.[1-72]
NodeName=i03r02c[01-06]s[01-12] NodeAddr=172.16.200.[81-152]
NodeName=i03r03c[01-06]s[01-12] NodeAddr=172.16.200.[161-232]
NodeName=i03r04c[01-06]s[01-12] NodeAddr=172.16.201.[1-72]
NodeName=i03r05c[01-06]s[01-12] NodeAddr=172.16.201.[81-152]
NodeName=i03r06c[01-06]s[01-12] NodeAddr=172.16.201.[161-232]
NodeName=i03r07c[01-06]s[01-12] NodeAddr=172.16.202.[1-72]
NodeName=i03r08c[01-06]s[01-12] NodeAddr=172.16.202.[81-152]
NodeName=i03r09c[01-06]s[01-12] NodeAddr=172.16.202.[161-232]
NodeName=i03r10c[01-06]s[01-12] NodeAddr=172.16.203.[1-72]
NodeName=i03r11c[01-06]s[01-12] NodeAddr=172.16.203.[81-152]
#
# Thin Node Island 4
#
NodeName=DEFAULT Features=i04,hot,thin,work,scratch,dss
NodeName=i04r01c[01-06]s[01-12] NodeAddr=172.16.204.[1-72]
NodeName=i04r02c[01-06]s[01-12] NodeAddr=172.16.204.[81-152]
NodeName=i04r03c[01-06]s[01-12] NodeAddr=172.16.204.[161-232]
NodeName=i04r04c[01-06]s[01-12] NodeAddr=172.16.205.[1-72]
NodeName=i04r05c[01-06]s[01-12] NodeAddr=172.16.205.[81-152]
NodeName=i04r06c[01-06]s[01-12] NodeAddr=172.16.205.[161-232]
NodeName=i04r07c[01-06]s[01-12] NodeAddr=172.16.206.[1-72]
NodeName=i04r08c[01-06]s[01-12] NodeAddr=172.16.206.[81-152]
NodeName=i04r09c[01-06]s[01-12] NodeAddr=172.16.206.[161-232]
NodeName=i04r10c[01-06]s[01-12] NodeAddr=172.16.207.[1-72]
NodeName=i04r11c[01-06]s[01-12] NodeAddr=172.16.207.[81-152]
#
# Thin Node Island 5
#
NodeName=DEFAULT Features=i05,cold,thin,work,scratch,dss
NodeName=i05r01c[01-06]s[01-12] NodeAddr=172.16.208.[1-72]
NodeName=i05r02c[01-06]s[01-12] NodeAddr=172.16.208.[81-152]
NodeName=i05r03c[01-06]s[01-12] NodeAddr=172.16.208.[161-232]
NodeName=i05r04c[01-06]s[01-12] NodeAddr=172.16.209.[1-72]
NodeName=i05r05c[01-06]s[01-12] NodeAddr=172.16.209.[81-152]
NodeName=i05r06c[01-06]s[01-12] NodeAddr=172.16.209.[161-232]
NodeName=i05r07c[01-06]s[01-12] NodeAddr=172.16.210.[1-72]
NodeName=i05r08c[01-06]s[01-12] NodeAddr=172.16.210.[81-152]
NodeName=i05r09c[01-06]s[01-12] NodeAddr=172.16.210.[161-232]
NodeName=i05r10c[01-06]s[01-12] NodeAddr=172.16.211.[1-72]
NodeName=i05r11c[01-06]s[01-12] NodeAddr=172.16.211.[81-152]
#
# Thin Node Island 6
#
NodeName=DEFAULT Features=i06,cold,thin,work,scratch,dss
NodeName=i06r01c[01-06]s[01-12] NodeAddr=172.16.212.[1-72]
NodeName=i06r02c[01-06]s[01-12] NodeAddr=172.16.212.[81-152]
NodeName=i06r03c[01-06]s[01-12] NodeAddr=172.16.212.[161-232]
NodeName=i06r04c[01-06]s[01-12] NodeAddr=172.16.213.[1-72]
NodeName=i06r05c[01-06]s[01-12] NodeAddr=172.16.213.[81-152]
NodeName=i06r06c[01-06]s[01-12] NodeAddr=172.16.213.[161-232]
NodeName=i06r07c[01-06]s[01-12] NodeAddr=172.16.214.[1-72]
NodeName=i06r08c[01-06]s[01-12] NodeAddr=172.16.214.[81-152]
NodeName=i06r09c[01-06]s[01-12] NodeAddr=172.16.214.[161-232]
NodeName=i06r10c[01-06]s[01-12] NodeAddr=172.16.215.[1-72]
NodeName=i06r11c[01-06]s[01-12] NodeAddr=172.16.215.[81-152]
#
# Thin Node Island 7
#
NodeName=DEFAULT Features=i07,cold,thin,work,scratch,dss
NodeName=i07r01c[01-06]s[01-12] NodeAddr=172.16.216.[1-72]
NodeName=i07r02c[01-06]s[01-12] NodeAddr=172.16.216.[81-152]
NodeName=i07r03c[01-06]s[01-12] NodeAddr=172.16.216.[161-232]
NodeName=i07r04c[01-06]s[01-12] NodeAddr=172.16.217.[1-72]
NodeName=i07r05c[01-06]s[01-12] NodeAddr=172.16.217.[81-152]
NodeName=i07r06c[01-06]s[01-12] NodeAddr=172.16.217.[161-232]
NodeName=i07r07c[01-06]s[01-12] NodeAddr=172.16.218.[1-72]
NodeName=i07r08c[01-06]s[01-12] NodeAddr=172.16.218.[81-152]
NodeName=i07r09c[01-06]s[01-12] NodeAddr=172.16.218.[161-232]
NodeName=i07r10c[01-06]s[01-12] NodeAddr=172.16.219.[1-72]
NodeName=i07r11c[01-06]s[01-12] NodeAddr=172.16.219.[81-152]
#
# Thin Node Island 8
#
NodeName=DEFAULT Features=i08,cold,thin,work,scratch,dss
NodeName=i08r01c[01-06]s[01-12] NodeAddr=172.16.220.[1-72]
NodeName=i08r02c[01-06]s[01-12] NodeAddr=172.16.220.[81-152]
NodeName=i08r03c[01-06]s[01-12] NodeAddr=172.16.220.[161-232]
NodeName=i08r04c[01-06]s[01-12] NodeAddr=172.16.221.[1-72]
NodeName=i08r05c[01-06]s[01-12] NodeAddr=172.16.221.[81-152]
NodeName=i08r06c[01-06]s[01-12] NodeAddr=172.16.221.[161-232]
NodeName=i08r07c[01-06]s[01-12] NodeAddr=172.16.222.[1-72]
NodeName=i08r08c[01-06]s[01-12] NodeAddr=172.16.222.[81-152]
NodeName=i08r09c[01-06]s[01-12] NodeAddr=172.16.222.[161-232]
NodeName=i08r10c[01-06]s[01-12] NodeAddr=172.16.223.[1-72]
NodeName=i08r11c[01-06]s[01-12] NodeAddr=172.16.223.[81-152]
#
# Fat Node Island
#
NodeName=DEFAULT RealMemory=773697
NodeName=DEFAULT Features=f01,fat,work,scratch,dss
NodeName=f01r01c[01-06]s[01-12] NodeAddr=172.16.224.[1-72]
NodeName=f01r02c[01-06]s[01-12] NodeAddr=172.16.224.[81-152]

sl01:/etc/slurm # srun --mpi=list
srun: MPI types are...
srun: pmi2
srun: none
srun: openmpi

Best regards,
Karsten Kutzer
Sr. Sales Engineer High Performance Computing
kkutzer@lenovo.com
Lenovo Global Technology Germany GmbH, Meitnerstrasse 9, 70563 Stuttgart
Karsten:

Please ensure this is the Slurm library and does not come from another package (just to be sure):

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

An 'ldd /usr/lib64/libpmi.so' will show whether it is linked to Slurm.

> sl01:/etc/slurm # srun --mpi=list
> srun: MPI types are...
> srun: pmi2
> srun: none
> srun: openmpi

You can also try the --mpi=pmi2 switch when running a job, but I see this is already your default in slurm.conf, so there is really no need to.

Then, I see you have not set the task plugin correctly in slurm.conf. Please set:

TaskPlugin=task/cgroup,task/affinity

and in cgroup.conf comment out this line or set it to No:

# TaskAffinity=yes

Restart slurmctld and retry.

What's in /etc/slurm.specific.conf?

I also suggest disabling AcctGatherNodeFreq=30; it will just create noise and is not needed. The real XCC frequency for jobs is set in acct_gather.conf.

Check also this:

TreeWidth=22 # FMoll

This number has to be set to the square root of the number of nodes in the cluster for systems with no more than 2500 nodes, or the cube root for larger systems.

Make these changes and tell me how it goes.
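The TreeWidth rule of thumb above can be checked with a quick one-liner; the node count below is a made-up example (484 nodes), not the size of this cluster:

```shell
# Recommended TreeWidth = ceil(sqrt(node count)) for clusters of up to
# ~2500 nodes (cube root beyond that). Hypothetical 484-node cluster:
NODES=484
awk -v n="$NODES" 'BEGIN { w = sqrt(n); print (w == int(w) ? w : int(w) + 1) }'
```

For 484 nodes this prints 22; any non-square count rounds up, e.g. 500 nodes gives 23.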
I want to add a little more information here. To run with Intel MPI and pmi2, you should set:

I_MPI_PMI_LIBRARY=<path_to_slurm>/libpmi2.so
srun --mpi=pmi2 ...

You can even unset I_MPI_PMI_LIBRARY, and Intel MPI will fall back to its default (which I think is some internal pmi2 implementation, but I am not sure):

unset I_MPI_PMI_LIBRARY
srun --mpi=pmi2 ...

Then, for the affinity, please follow my comment 3. So, in slurm.conf please set:

TaskPlugin=task/cgroup,task/affinity

and in cgroup.conf comment out this line or set it to No:

# TaskAffinity=yes

and keep this value:

ConstrainCores=yes

Restart slurmctld/slurmd and retry.
Hi Karsten,

Did you finally apply the suggested changes? Is everything working properly now? If everything is fine, can I close the bug?

Thanks
I am closing this issue for now. If you make any progress and find that it is still not working, just reopen it.

Regards