Bug 7681 - pinning issues with Intel MPI
Summary: pinning issues with Intel MPI
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
 
Reported: 2019-09-03 11:31 MDT by Felip Moll
Modified: 2019-09-06 03:01 MDT
CC List: 2 users

See Also:
Site: LRZ


Description Felip Moll 2019-09-03 11:31:49 MDT
We are facing multiple pinning issues with Intel MPI when running the SLURM bootstrap mechanism on SuperMUC NG.

This could be a bug in SLURM, but it could also be a bug in our MPI library, since the problem really lies in the interfacing between the two.

On SNG there is no SSH, so the only bootstrap mechanism is SLURM: the flow goes from our mpirun -> SLURM bootstrap -> back into our library with an affinity mask.

What we actually see as a result is incorrect pinning of MPI ranks, especially at larger node counts.

So far it is not clear where the bugs actually sit, which is why I wanted to get all involved parties on board for a discussion meeting.


Anatoliy from our IMPI engineering team can give you many more details about it.
Comment 1 Felip Moll 2019-09-03 11:56:29 MDT
Michael, Anatoliy,

I am creating this bug to keep track of the issue you say is happening at LRZ. I think creating a bug is the best way to work it out.
Please, can you describe the problem exactly, show execution examples, and demonstrate how pinning is not working?
We also need your slurm.conf and cgroup.conf.
Can you reproduce it consistently?

Intel MPI (mpirun) translates the environment variables of the allocation created by Slurm and calls srun with modified parameters. You can run mpirun with verbose flags to see how srun is being called. We have seen issues before with this translation, where mpirun does not seem to parse these environment variables correctly depending on how it is called.

See srun calls within an allocation with:

mpirun --allow-run-as-root -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 uptime
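
For Intel MPI's hydra-based mpirun, an alternative (a hedged suggestion; the exact output depends on your IMPI release) is to enable hydra debug output, which prints the srun command line that hydra constructs:

I_MPI_HYDRA_DEBUG=1 I_MPI_DEBUG=5 mpirun uptime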


e.g. bug 7311
Another Lenovo ticket with previous work on this: bug 6452

I'll wait for more info.
Comment 2 Rozanov, Anatoliy 2019-09-04 03:10:01 MDT
Hello,

Some details about this bug:

We observe an incorrect affinity mask when combining Intel MPI with the Slurm option --cpus-per-task.

My batch file:
> #!/bin/bash
> #
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --cpus-per-task=48
> #SBATCH -p micro
> #SBATCH --time=1:00

> module load mpi.intel/2019

> mpirun numactl --show

Output:
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> cpubind: 0
> nodebind: 0
> membind: 0 1
> policy: default
> preferred node: current
> physcpubind: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> cpubind: 0
> nodebind: 0
> membind: 0 1

Expected output:
> ...
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> ...
> physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> ...

You are right, mpirun calls srun to launch hydra_pmi_proxy on the remote nodes, and then each hydra_pmi_proxy launches the MPI processes. This srun call looks like:
> srun -N 1 -n 1 --nodelist remote_hostname --input none hydra_pmi_proxy

Then hydra_pmi_proxy calls fork ntasks-per-node times to launch the MPI processes.

As I understand it, after calling srun we get just 48 CPUs for hydra_pmi_proxy and its child processes. Is there any way to call srun and keep the correct affinity mask for the MPI processes?


Another issue is related to --cpus-per-task and srun.
For example, I set --cpus-per-task=10, --nodes=1, --ntasks-per-node=2 and run `srun numactl --show`.
I expected something like:
> ...
> physcpubind: 0 1 2 3 4 5 6 7 8 9
> ...
> physcpubind: 10 11 12 13 14 15 16 17 18 19
> ...

But I see an affinity mask with all CPUs for each rank:
> ...
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> ...
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> ...
Comment 3 Rozanov, Anatoliy 2019-09-04 03:11:18 MDT
slurm.conf and cgroup.conf:

$ cat /etc/slurm/slurm.conf
> # Script:        /etc/slurm/slurm.conf      
> #                                           
> # Maintainer:    pmayes@lenovo.com          
> #                                           
> # Description: Main Slurm configuration file for SuperMUC-NG
> 
> #
> # Basics
> #       
> ClusterName=supermucng
> SlurmctldHost=sl01(172.16.226.80)
> SlurmctldHost=sl02(172.16.226.81)
> SlurmUser=slurm                  
> SlurmctldPort=6817               
> SlurmdPort=6818                  
> AuthType=auth/munge              
> 
> #
> # State and Control
> #                  
> StateSaveLocation=/var/spool/slurm/ctld
> SlurmdSpoolDir=/var/spool/slurm        
> SwitchType=switch/none                 
> MpiDefault=pmi2                        
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> SlurmdPidFile=/var/run/slurm/slurmd.pid      
> ProctrackType=proctrack/cgroup               
> # PrivateData=accounts,jobs,nodes,reservations,usage,users
> SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu_bind=no --mpi=none $SHELL"
> ReturnToService=1                                                                                       
> 
> PropagateResourceLimits=ALL
> PropagateResourceLimitsExcept=MEMLOCK
> 
> JobSubmitPlugins=lua
> 
> #
> # Prologs and Epilogs
> #                    
> Prolog=/etc/slurm/scripts/Prolog
> Epilog=/etc/slurm/scripts/Epilog
> SrunProlog=/etc/slurm/scripts/SrunProlog
> SrunEpilog=/etc/slurm/scripts/SrunEpilog
> TaskProlog=/etc/slurm/scripts/TaskProlog
> TaskEpilog=/etc/slurm/scripts/TaskEpilog
> PrologFlags=Alloc,Contain               
> 
> #
> # Node Health Check
> #                  
> HealthCheckProgram=/usr/sbin/nhc
> HealthCheckInterval=300         
> HealthCheckNodeState=ANY,CYCLE  
> 
> TaskPlugin=task/cgroup,task/affinity
> 
> #
> # Timers
> #       
> SlurmctldTimeout=420
> SlurmdTimeout=420   
> ResumeTimeout=420   
> BatchStartTimeout=20
> CompleteWait=15     
> PrologEpilogTimeout=500
> MessageTimeout=60      
> MinJobAge=5            
> InactiveLimit=0        
> KillWait=2             
> Waittime=0             
> TCPTimeout=15          
> KillOnBadExit=1        
> UnkillableStepTimeout=500
> UnkillableStepProgram=/etc/slurm/UnkillableStepProgram.sh
> 
> #
> # Scheduling
> #           
> MaxJobCount=15000
> SchedulerType=sched/backfill
> FastSchedule=1              
> SelectType=select/cons_res  
> SelectTypeParameters=CR_CPU_Memory
> PriorityType=priority/multifactor 
> PriorityWeightAge=1000000         
> PriorityWeightJobSize=500000      
> PriorityWeightPartition=500000    
> PriorityMaxAge=14-0               
> SchedulerParameters=max_switch_wait=999999,bf_window=10080,default_queue_depth=10000,bf_interval=30,bf_resolution=1800,bf_max_job_test=3000,bf_max_job_user=800,bf_continue,max_rpc_cnt=150,kill_invalid_depend
> 
> #
> # Launching
> #          
> LaunchParameters=send_gids
> 
> #
> # Logging
> #        
> # DebugFlags=NO_CONF_HASH stops the constant warnings about slurm.conf not being the
> # same everywhere. This is because we include /etc/slurm.specific.conf, which is    
> # different on every node                                                           
> #                                                                                   
> DebugFlags=NO_CONF_HASH                                                             
> #SlurmctldDebug=debug                                                               
> #SlurmdDebug=debug                                                                  
> include /etc/slurm.specific.conf                                                    
> 
> #
> # Accounting
> #           
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30             
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=sl01                       
> AccountingStorageBackupHost=sl02                 
> AccountingStorageEnforce=associations,qos,limits 
> AcctGatherEnergyType=acct_gather_energy/xcc      
> #                                                
> # Email FMoll 22/03/2019                         
> # Comment out the line below                     
> #                                                
> #AcctGatherNodeFreq=30                           
> 
> EnforcePartLimits=ALL
> 
> #
> # Topology
> #         
> TopologyPlugin=topology/tree
> TreeWidth=22                
> 
> #
> # Compute Nodes
> #              
> 
> PartitionName=DEFAULT Default=NO OverSubscribe=EXCLUSIVE State=UP
> 
> PartitionName=test        Nodes=i01r[01-11]c[01-06]s[01-12]                                  AllowQOS=test    MinNodes=1   MaxNodes=16   MaxTime=00:30:00 Default=YES
> PartitionName=fat         Nodes=f01r[01-02]c[01-06]s[01-12]                                  AllowQOS=fat     MinNodes=1   MaxNodes=128  MaxTime=48:00:00 PriorityJobFactor=0
> PartitionName=micro       Nodes=i01r[02-11]c[01-06]s[01-12],i02r[01-11]c[01-06]s[01-12]      AllowQOS=micro   MinNodes=1   MaxNodes=16   MaxTime=48:00:00 PriorityJobFactor=0
> PartitionName=general     Nodes=i01r[07-11]c[01-06]s[01-12],i[02-08]r[01-11]c[01-06]s[01-12] AllowQOS=general MinNodes=17  MaxNodes=768  MaxTime=48:00:00 PriorityJobFactor=70
> PartitionName=large       Nodes=i[03-08]r[01-11]c[01-06]s[01-12]                             AllowQOS=large   MinNodes=769 MaxNodes=3168 MaxTime=24:00:00 PriorityJobFactor=100
> 
> PartitionName=tmp1        Nodes=f01r[01-02]c[01-06]s[01-12],i[01-08]r[01-11]c[01-06]s[01-12] AllowQOS=nolimit AllowAccounts=pr28fa,pr27cu,pr27ca PriorityJobFactor=200
> 
> #PartitionName=tmp2        Nodes=i[01-08]r[01-11]c[01-06]s[01-12],f01r[01-02]c[01-06]s[01-12] AllowQOS=nolimit AllowAccounts=pr83te,pr86fe,pr45fi
> 
> #
> NodeName=DEFAULT CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=81920 # 80GB
> #                                                                                            
> # Thin Node Island 1                                                                         
> #                                                                                            
> NodeName=DEFAULT Features=i01,hot,thin,work,nowork,scratch,noscratch                         
> NodeName=i01r01c[01-06]s[01-12] NodeAddr=172.16.192.[1-72]                                   
> NodeName=i01r02c[01-06]s[01-12] NodeAddr=172.16.192.[81-152]                                 
> NodeName=i01r03c[01-06]s[01-12] NodeAddr=172.16.192.[161-232]                                
> NodeName=i01r04c[01-06]s[01-12] NodeAddr=172.16.193.[1-72]                                   
> NodeName=i01r05c[01-06]s[01-12] NodeAddr=172.16.193.[81-152]                                 
> NodeName=i01r06c[01-06]s[01-12] NodeAddr=172.16.193.[161-232]                                
> NodeName=i01r07c[01-06]s[01-12] NodeAddr=172.16.194.[1-72]                                   
> NodeName=i01r08c[01-06]s[01-12] NodeAddr=172.16.194.[81-152]                                 
> NodeName=i01r09c[01-06]s[01-12] NodeAddr=172.16.194.[161-232]                                
> NodeName=i01r10c[01-06]s[01-12] NodeAddr=172.16.195.[1-72]                                   
> NodeName=i01r11c[01-06]s[01-12] NodeAddr=172.16.195.[81-152]                                 
> #                                                                                            
> # Thin Node Island 2                                                                         
> #                                                                                            
> NodeName=DEFAULT Features=i02,hot,thin,work,nowork,scratch,noscratch                         
> NodeName=i02r01c[01-06]s[01-12] NodeAddr=172.16.196.[1-72]                                   
> NodeName=i02r02c[01-06]s[01-12] NodeAddr=172.16.196.[81-152]                                 
> NodeName=i02r03c[01-06]s[01-12] NodeAddr=172.16.196.[161-232]                                
> NodeName=i02r04c[01-06]s[01-12] NodeAddr=172.16.197.[1-72]                                   
> NodeName=i02r05c[01-06]s[01-12] NodeAddr=172.16.197.[81-152]                                 
> NodeName=i02r06c[01-06]s[01-12] NodeAddr=172.16.197.[161-232]                                
> NodeName=i02r07c[01-06]s[01-12] NodeAddr=172.16.198.[1-72]                                   
> NodeName=i02r08c[01-06]s[01-12] NodeAddr=172.16.198.[81-152]                                 
> NodeName=i02r09c[01-06]s[01-12] NodeAddr=172.16.198.[161-232]                                
> NodeName=i02r10c[01-06]s[01-12] NodeAddr=172.16.199.[1-72]                                   
> NodeName=i02r11c[01-06]s[01-12] NodeAddr=172.16.199.[81-152]                                 
> #                                                                                            
> # Thin Node Island 3                                                                         
> #                                                                                            
> NodeName=DEFAULT Features=i03,hot,thin,work,nowork,scratch,noscratch                         
> NodeName=i03r01c[01-06]s[01-12] NodeAddr=172.16.200.[1-72]                                   
> NodeName=i03r02c[01-06]s[01-12] NodeAddr=172.16.200.[81-152]                                 
> NodeName=i03r03c[01-06]s[01-12] NodeAddr=172.16.200.[161-232]                                
> NodeName=i03r04c[01-06]s[01-12] NodeAddr=172.16.201.[1-72]                                   
> NodeName=i03r05c[01-06]s[01-12] NodeAddr=172.16.201.[81-152]                                 
> NodeName=i03r06c[01-06]s[01-12] NodeAddr=172.16.201.[161-232]                                
> NodeName=i03r07c[01-06]s[01-12] NodeAddr=172.16.202.[1-72]                                   
> NodeName=i03r08c[01-06]s[01-12] NodeAddr=172.16.202.[81-152]                                 
> NodeName=i03r09c[01-06]s[01-12] NodeAddr=172.16.202.[161-232]                                
> NodeName=i03r10c[01-06]s[01-12] NodeAddr=172.16.203.[1-72]                                   
> NodeName=i03r11c[01-06]s[01-12] NodeAddr=172.16.203.[81-152]                                 
> #                                                                                            
> # Thin Node Island 4                                                                         
> #                                                                                            
> NodeName=DEFAULT Features=i04,hot,thin,work,nowork,scratch,noscratch                         
> NodeName=i04r01c[01-06]s[01-12] NodeAddr=172.16.204.[1-72]                                   
> NodeName=i04r02c[01-06]s[01-12] NodeAddr=172.16.204.[81-152]                                 
> NodeName=i04r03c[01-06]s[01-12] NodeAddr=172.16.204.[161-232]                                
> NodeName=i04r04c[01-06]s[01-12] NodeAddr=172.16.205.[1-72]                                   
> NodeName=i04r05c[01-06]s[01-12] NodeAddr=172.16.205.[81-152]                                 
> NodeName=i04r06c[01-06]s[01-12] NodeAddr=172.16.205.[161-232]                                
> NodeName=i04r07c[01-06]s[01-12] NodeAddr=172.16.206.[1-72]                                   
> NodeName=i04r08c[01-06]s[01-12] NodeAddr=172.16.206.[81-152]                                 
> NodeName=i04r09c[01-06]s[01-12] NodeAddr=172.16.206.[161-232]                                
> NodeName=i04r10c[01-06]s[01-12] NodeAddr=172.16.207.[1-72]                                   
> NodeName=i04r11c[01-06]s[01-12] NodeAddr=172.16.207.[81-152]                                 
> 
> #
> # Thin Node Island 5
> #                   
> NodeName=DEFAULT Features=i05,cold,thin,work,nowork,scratch,noscratch
> NodeName=i05r01c[01-06]s[01-12] NodeAddr=172.16.208.[1-72]           
> NodeName=i05r02c[01-06]s[01-12] NodeAddr=172.16.208.[81-152]         
> NodeName=i05r03c[01-06]s[01-12] NodeAddr=172.16.208.[161-232]        
> NodeName=i05r04c[01-06]s[01-12] NodeAddr=172.16.209.[1-72]           
> NodeName=i05r05c[01-06]s[01-12] NodeAddr=172.16.209.[81-152]
> NodeName=i05r06c[01-06]s[01-12] NodeAddr=172.16.209.[161-232]
> NodeName=i05r07c[01-06]s[01-12] NodeAddr=172.16.210.[1-72]
> NodeName=i05r08c[01-06]s[01-12] NodeAddr=172.16.210.[81-152]
> NodeName=i05r09c[01-06]s[01-12] NodeAddr=172.16.210.[161-232]
> NodeName=i05r10c[01-06]s[01-12] NodeAddr=172.16.211.[1-72]
> NodeName=i05r11c[01-06]s[01-12] NodeAddr=172.16.211.[81-152]
> #
> # Thin Node Island 6
> #
> NodeName=DEFAULT Features=i06,cold,thin,work,nowork,scratch,noscratch
> NodeName=i06r01c[01-06]s[01-12] NodeAddr=172.16.212.[1-72]
> NodeName=i06r02c[01-06]s[01-12] NodeAddr=172.16.212.[81-152]
> NodeName=i06r03c[01-06]s[01-12] NodeAddr=172.16.212.[161-232]
> NodeName=i06r04c[01-06]s[01-12] NodeAddr=172.16.213.[1-72]
> NodeName=i06r05c[01-06]s[01-12] NodeAddr=172.16.213.[81-152]
> NodeName=i06r06c[01-06]s[01-12] NodeAddr=172.16.213.[161-232]
> NodeName=i06r07c[01-06]s[01-12] NodeAddr=172.16.214.[1-72]
> NodeName=i06r08c[01-06]s[01-12] NodeAddr=172.16.214.[81-152]
> NodeName=i06r09c[01-06]s[01-12] NodeAddr=172.16.214.[161-232]
> NodeName=i06r10c[01-06]s[01-12] NodeAddr=172.16.215.[1-72]
> NodeName=i06r11c[01-06]s[01-12] NodeAddr=172.16.215.[81-152]
> #
> # Thin Node Island 7
> #
> NodeName=DEFAULT Features=i07,cold,thin,work,nowork,scratch,noscratch
> NodeName=i07r01c[01-06]s[01-12] NodeAddr=172.16.216.[1-72]
> NodeName=i07r02c[01-06]s[01-12] NodeAddr=172.16.216.[81-152]
> NodeName=i07r03c[01-06]s[01-12] NodeAddr=172.16.216.[161-232]
> NodeName=i07r04c[01-06]s[01-12] NodeAddr=172.16.217.[1-72]
> NodeName=i07r05c[01-06]s[01-12] NodeAddr=172.16.217.[81-152]
> NodeName=i07r06c[01-06]s[01-12] NodeAddr=172.16.217.[161-232]
> NodeName=i07r07c[01-06]s[01-12] NodeAddr=172.16.218.[1-72]
> NodeName=i07r08c[01-06]s[01-12] NodeAddr=172.16.218.[81-152]
> NodeName=i07r09c[01-06]s[01-12] NodeAddr=172.16.218.[161-232]
> NodeName=i07r10c[01-06]s[01-12] NodeAddr=172.16.219.[1-72]
> NodeName=i07r11c[01-06]s[01-12] NodeAddr=172.16.219.[81-152]
> #
> # Thin Node Island 8
> #
> NodeName=DEFAULT Features=i08,cold,thin,work,nowork,scratch,noscratch
> NodeName=i08r01c[01-06]s[01-12] NodeAddr=172.16.220.[1-72]
> NodeName=i08r02c[01-06]s[01-12] NodeAddr=172.16.220.[81-152]
> NodeName=i08r03c[01-06]s[01-12] NodeAddr=172.16.220.[161-232]
> NodeName=i08r04c[01-06]s[01-12] NodeAddr=172.16.221.[1-72]
> NodeName=i08r05c[01-06]s[01-12] NodeAddr=172.16.221.[81-152]
> NodeName=i08r06c[01-06]s[01-12] NodeAddr=172.16.221.[161-232]
> NodeName=i08r07c[01-06]s[01-12] NodeAddr=172.16.222.[1-72]
> NodeName=i08r08c[01-06]s[01-12] NodeAddr=172.16.222.[81-152]
> NodeName=i08r09c[01-06]s[01-12] NodeAddr=172.16.222.[161-232]
> NodeName=i08r10c[01-06]s[01-12] NodeAddr=172.16.223.[1-72]
> NodeName=i08r11c[01-06]s[01-12] NodeAddr=172.16.223.[81-152]
> #
> # Fat Node Island
> #
> NodeName=DEFAULT RealMemory=757760 # 740GB
> NodeName=DEFAULT Features=f01,fat,work,nowork,scratch,noscratch
> NodeName=f01r01c[01-06]s[01-12] NodeAddr=172.16.224.[1-72]
> NodeName=f01r02c[01-06]s[01-12] NodeAddr=172.16.224.[81-152]

$ cat /etc/slurm/cgroup.conf
> ###
> #
> # Slurm cgroup support configuration file
> #
> # See man slurm.conf and man cgroup.conf for further
> # information on cgroup configuration parameters
> #--
> CgroupAutomount=yes
> 
> ConstrainCores=yes
> #TaskAffinity=yes
> 
> ConstrainSwapSpace=yes # ???
> AllowedSwapSpace=0     # ???
> 
> ConstrainRAMSpace=yes
> MaxRAMPercent=100
Comment 4 Felip Moll 2019-09-04 07:26:58 MDT
> As I understand it, after calling srun we get just 48 CPUs for hydra_pmi_proxy
> and its child processes. Is there any way to call srun and keep the
> correct affinity mask for the MPI processes?

The srun call is made by Intel's mpiexec/mpirun, which translates the parameters seen in the allocation's environment into what you're seeing.
See also bug 7097.


> 
> Another issue is related to --cpus-per-task and srun.
> For example, I set --cpus-per-task=10, --nodes=1, --ntasks-per-node=2 and run
> `srun numactl --show`
> I expected something like:
> > ...
> > physcpubind: 0 1 2 3 4 5 6 7 8 9
> > ...
> > physcpubind: 10 11 12 13 14 15 16 17 18 19
> > ...

Can you try to add --hint=multithread to srun? I am doing some tests. There's bug 5290 that can help to understand.
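
For example, something like this (a sketch of the same two-task layout, to be run inside your allocation):

srun --ntasks-per-node=2 --cpus-per-task=10 --hint=multithread --cpu-bind=verbose numactl --show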
Comment 5 Rozanov, Anatoliy 2019-09-04 07:50:06 MDT
> The srun call is made by Intel's mpiexec/mpirun, which translates the parameters seen in the allocation's environment into what you're seeing.

Maybe there is some option we can pass to the srun call to keep the correct affinity mask?

> Can you try to add --hint=multithread to srun? 
It seems it helped:

physcpubind: 0 1 2 3 4 48 49 50 51 52
physcpubind: 24 25 26 27 28 72 73 74 75 76

> There's bug 5290 that can help to understand.
I do not have permissions to see this bug: "You are not authorized to access bug #5290."
Comment 6 Felip Moll 2019-09-04 10:05:44 MDT
(In reply to Rozanov, Anatoliy from comment #5)
> > The srun call is made by Intel's mpiexec/mpirun, which translates the parameters seen in the allocation's environment into what you're seeing.
> 
> Maybe there is some option we can pass to the srun call to keep the
> correct affinity mask?
> 

I think that here it is mpiexec/mpirun that must pass the correct options to srun. I don't know if you have the possibility to modify it or to open a bug with Intel. Maybe you can also try to set up the environment before executing mpirun, like:

export SLURM_HINT=multithread
mpirun ...

We could also analyze what your users' real needs are.
Do you really want to schedule by thread? Would it be possible to bind to core instead?

> > Can you try to add --hint=multithread to srun? 
> It seems it helped:
> 
> physcpubind: 0 1 2 3 4 48 49 50 51 52
> physcpubind: 24 25 26 27 28 72 73 74 75 76

Ok, let me explain a bit. 

CR_CORE and CR_CPU both schedule/bind tasks within a job onto hyper-threads. However, if you have defined cores, threads, sockets, etc. in a node definition, neither CR_CPU nor CR_CORE will allow two different jobs to be scheduled on different hyper-threads of the same core. You will still be able to allocate different steps of the same job to the same cores and different threads, though.

]$ srun -c3 -n1  --cpu-bind=verbose --hint=multithread numactl --show
cpu-bind=MASK - gamba1, task  0  0 [32318]: mask 0x7 set
policy: default
preferred node: current
physcpubind: 0 1 2 
cpubind: 0 
nodebind: 0 
membind: 0 

]$ sacct -j 20410
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
20410           numactl      debug       lipi          4  COMPLETED      0:0 
20410.extern     extern                  lipi          4  COMPLETED      0:0 
20410.0         numactl                  lipi          3  COMPLETED      0:0 


The difference is in the accounting. With CR_CPU, if a job requests a single task it will be charged for a single CPU (a single hyper-thread), but the task is still given both hyper-threads; the other hyper-thread is not available to any other job in the system. With CR_CORE, if a job requests a single task, Slurm will allocate it the entire core with both hyper-threads and charge it for two CPUs. In the example you can see the job's AllocCPUS being 4 while the task has 3 usable CPUs.
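
For reference, this behaviour comes from SelectTypeParameters in slurm.conf; a sketch of the two variants (your site currently has CR_CPU_Memory):

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory     # charge per hyper-thread (current setting)
#SelectTypeParameters=CR_Core_Memory   # allocate and charge whole cores instead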

Having said that, there are currently known limitations on affinity. We would like to be able to schedule by thread and treat a thread the way we currently treat cores. For example:

Currently you can only get one thread or all threads on a core, but you cannot request any count in between. This is problematic on nodes that have more than two threads per core.

You can still control thread placement with:

srun --hint=nomultithread
    uses only one thread per core

srun --hint=multithread or srun --cpu-bind=threads
    binds to all threads in the task's cores


> > There's bug 5290 that can help to understand.
> I do not have permissions to see this bug: "You are not authorized to access
> bug #5290."

Yep, sorry, this one is private. I just wanted to let you know that there is an enhancement in the works to overcome these limitations.
Comment 7 Rozanov, Anatoliy 2019-09-05 04:30:17 MDT
> I don't know if you have the possibility to modify it or open a bug with intel.

Yes, I can modify the srun call in mpiexec.

> We could also analyze what your users' real needs are.

Our main scenario:
The user allocates nodes with some Slurm options and then runs mpiexec.

Internally, mpiexec reads the SLURM environment variables and determines how many nodes we have, which nodes they are, and how many processes per node.

Then mpiexec launches hydra_pmi_proxy (or hydra_bstrap_proxy in the case of IMPI 2019) on the remote nodes. The default tool for launching the proxies is ssh, but when running under SLURM we use srun; in particular, ssh is not available on SuperMUC-NG.

Then hydra_pmi_proxy launches the user's application ntasks-per-node times.

So, we use srun to launch hydra_pmi_proxy on the remote nodes, and we have to run only one proxy per node. Therefore, the srun call looks like `srun -n 1 -N 1 --nodelist <remote_hostname> hydra_pmi_proxy`.

But if the user sets some pinning option (--cpus-per-task, for example), then our srun call is defined as one task and we only get the CPUs for one task. We would like to get all CPUs and somehow set the right affinity masks for the user's application processes.

> I think that here it is mpiexec/mpirun that must pass the correct options to srun

Could you clarify which options we should pass to srun to run one process per node and keep the right pinning for the user's application? Or can we get the affinity masks for all processes and then set them manually?
Comment 8 Felip Moll 2019-09-05 05:15:35 MDT
(In reply to Rozanov, Anatoliy from comment #7)
> > I don't know if you have the possibility to modify it or open a bug with intel.
> 
> Yes, I can modify srun call in mpiexec
> 
> > We could also analyze what your users' real needs are.
> 
> Our main scenario: 
> The user allocates nodes with some Slurm options and then runs mpiexec.
> 
> Internally, mpiexec reads the SLURM environment variables and determines how many
> nodes we have, which nodes they are, and how many processes per node.
> 
> Then mpiexec launches hydra_pmi_proxy (or hydra_bstrap_proxy in the case of IMPI
> 2019) on the remote nodes. The default tool for launching the proxies is ssh,
> but when running under SLURM we use srun; in particular, ssh is not available
> on SuperMUC-NG.

This is the point where I am suggesting (bug 7097) to modify mpiexec to call srun with --ntasks-per-node=1 to avoid the 'honor' warning.

> Then hydra_pmi_proxy launches the user's application ntasks-per-node times.

I guess it reads the environment. My suggestion from bug 7097 (maybe from inside mpiexec, if you're able to modify it) is to read SLURM_NTASKS_PER_NODE, store it in a temporary variable and unset it from the environment, then start the PMI proxies, and then launch the user's application using the stored value. This would remove the 'honor' warning.
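
A minimal shell sketch of that idea (hypothetical wrapper logic, not the actual hydra code):

SAVED_NTASKS_PER_NODE=$SLURM_NTASKS_PER_NODE    # remember the user's request
unset SLURM_NTASKS_PER_NODE                     # so the proxy launch does not inherit it
srun -N "$SLURM_JOB_NUM_NODES" -n "$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 hydra_pmi_proxy &
# each proxy later forks $SAVED_NTASKS_PER_NODE application processes per node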

> So, we use srun to launch hydra_pmi_proxy on the remote nodes, and we have to
> run only one proxy per node. Therefore, the srun call looks like `srun -n 1
> -N 1 --nodelist <remote_hostname> hydra_pmi_proxy`.

Correct. It is just missing --ntasks-per-node=1 to avoid the warning.

> 
> But if the user sets some pinning option (--cpus-per-task, for example), then
> our srun call is defined as one task and we only get the CPUs for one task. We
> would like to get all CPUs and somehow set the right affinity masks for the
> user's application processes.

This is where mpiexec must take care of the options, since it is no more than a wrapper that calls srun in its own way.

> > I think that here it is mpiexec/mpirun that must pass the correct options to srun
> 
> Could you clarify which options we should pass to srun to run one process per
> node and keep the right pinning for the user's application?

The question here is: once you have the pmi_proxy set up on all the nodes, how does mpiexec launch the apps? Does it use srun again, or does it just communicate with the pmi proxy? If the latter, we cannot do anything and the launcher is the one that has to set the pinning.

Our pinning works by creating a mask for the forked tasks, but if another application launches the tasks partly outside Slurm's control, we cannot force those masks. You will still be constrained to your cores by the cgroup anyway.

> Or can we get affinity masks for all processes and then set it manually?

I think you will have to do it manually if the launcher is outside Slurm's control. As I see it, once you have one proxy on each node, an independent launcher is launching N processes, so it would be its responsibility to set the affinity for each one.


Hence the reason I asked about your specific use case. Do your users need to bind to thread, or would binding to core be enough? Have you considered 'srun' instead of 'mpiexec'? With srun you will get affinity working correctly and avoid these problems; I don't know whether srun gives you the same performance as mpiexec. Also, maybe you can play with some mpiexec options like --map-by to instruct it to bind tasks to threads or cores.

I'm sorry if I am mistaken, but this is how I understand your issue right now.
Comment 9 Felip Moll 2019-09-05 05:43:10 MDT
Forgot to mention:

> But if the user sets some pinning option (--cpus-per-task, for example), then our srun call is defined as one task and we only get the CPUs for one task. We would like to get all CPUs and somehow set the right affinity masks for the user's application processes.

Are you sure about that? In theory SLURM_CPUS_PER_TASK is set in the batch script's environment, so further calls to srun should inherit this setting.
That means that the hydra_pmi_proxy launched by srun would be granted $SLURM_CPUS_PER_TASK hyper-threads, and further forks from it to launch user applications could use those threads.
 
You can see that by:

]$ cat run-parallel.sh 
#!/bin/bash
#SBATCH --job-name=parallel
#SBATCH --output=sl_%j.out
#SBATCH --error=sl_%j.err
#SBATCH --mem=200M
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2

srun -N4 -n4 env

You will see SLURM_CPUS_PER_TASK=2 is set.
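
For example, a quick check (sketch):

srun -N4 -n4 env | grep '^SLURM_CPUS_PER_TASK'
# expected from each of the 4 tasks: SLURM_CPUS_PER_TASK=2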


Since the launcher of the final tasks is not Slurm but the first process (hydra_pmi_proxy) started by the first srun call, hydra_pmi_proxy will be constrained to SLURM_CPUS_PER_TASK cores/threads:

/usr/bin/srun -N 32 -n 32 --nodelist i01r01.....s08 ... /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_pmi_proxy

So, since you are using cgroups and constraining cores (and you are), hydra_pmi_proxy will be constrained to these cores. This implies that further forks, when they receive the instruction to launch the user processes, will also be constrained to these cores, and that the affinity is then the responsibility of the launcher (not Slurm), which forks the user's application and should set the affinity correctly.
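
A hedged way to verify the cgroup constraint of a step from inside the job (cgroup v1 layout; the exact path and step id can differ per system):

cat /sys/fs/cgroup/cpuset/slurm/uid_$UID/job_${SLURM_JOB_ID}/step_0/cpuset.cpus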

Does it make sense?
Comment 10 Rozanov, Anatoliy 2019-09-05 05:55:35 MDT
> The question here is, once you have the pmi_proxy set in all the nodes, how does mpiexec launches the apps?

hydra_pmi_proxy launches the apps. It simply calls `fork` ntasks-per-node times.

> I think you should do it manually if the launcher is outside slurm control.

I guess you are right. But we don't know which mask each process should have. Users can set different pinning options in Slurm and we don't know how Slurm will compute them. Can we get the precomputed masks from Slurm?

> Do your users need to bind to thread, or would binding to core be enough?

I don't have such information. Our goal is to have the same affinity mask as srun.

> Have you considered 'srun' instead of 'mpiexec'?

Yes, in some cases we recommend using srun instead of mpiexec. But there are some features that are available only when using mpiexec.
Comment 11 Rozanov, Anatoliy 2019-09-05 06:02:52 MDT
> Are you sure about that? In theory the SLURM_CPUS_PER_TASK is set in the batch script so further calls to srun should inherit this setting.

When I set --ntasks-per-node=2 and --cpus-per-task=48 (48 is half of all CPUs), I see the masks below:

physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
physcpubind: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

So I assume that hydra_pmi_proxy gets half of all CPUs:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

and that they are then somehow distributed between the two child processes.

I think so because I don't see any CPUs from the second half:
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
Comment 12 Felip Moll 2019-09-05 06:03:40 MDT
(In reply to Rozanov, Anatoliy from comment #10)
> > The question here is, once you have the pmi_proxy set in all the nodes, how does mpiexec launches the apps?
> 
> hydra_pmi_proxy launches the apps. It calls simple `fork` ntask-per-node
> times.

Yep, then it's entirely hydra_pmi_proxy's responsibility to bind the processes to cores.


> > I think you should do it manually if the launcher is outside slurm control.
> 
> I guess you are right. But we don't know which mask each process should have.
> Users can set different pinning options in Slurm and we don't know how Slurm
> will compute them. Can we get the precomputed masks from
> Slurm?

That's something hydra must determine. We calculate the masks in the affinity plugin (slurm/src/plugins/task/affinity), which is part of the allocation process in srun and, as you can imagine, takes into account NUMA topology, already-allocated threads, and so on. We cannot directly provide masks to third-party applications. The constraining is still provided by cgroups, though, so maybe by playing with mpiexec options you can get what you need.

mpirun (openmpi)

--map-by <foo>
    Map to the specified object, defaults to socket. Supported options
    include slot, hwthread, core, L1cache, L2cache, L3cache, socket, 
    numa, board, node, sequential, distance, and ppr. Any object can 
    include modifiers by adding a : and any combination of PE=n (bind n
    processing elements to each proc), SPAN (load balance the processes 
    across the allocation), OVERSUBSCRIBE (allow more processes on a node
    than processing elements), and NOOVERSUBSCRIBE. This includes PPR,
    where the pattern would be terminated by another colon to separate 
    it from the modifiers.

Otherwise you can try to trust the Linux kernel's own placement; it should pin things reasonably well.
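
Since you are on Intel MPI rather than Open MPI, the corresponding knobs would be the I_MPI_PIN* family (a hedged pointer; check the IMPI reference for your release), e.g.:

export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core                  # one core-sized domain per rank
#export I_MPI_PIN_PROCESSOR_LIST=0-23,48-71   # or an explicit list
mpirun numactl --show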

> > Do your users need to bind to thread, or would binding to core be enough?
> 
> I don't have such information. Our goal is to have the same affinity mask as
> srun.

Then you have to copy slurm's affinity plugin logic :)

> > Have you considered 'srun' instead of 'mpiexec'?
> 
> Yes, in some cases we recommend using srun instead of mpiexec. But there are
> some features that are available only when using mpiexec.

In the past it has been recommended to use srun over mpirun/mpiexec precisely because of this affinity issue; it is one of the advantages srun provides. What are the features only available in mpiexec? I am interested.



In parallel to all of this, and a bit off topic: you said you're using Intel MPI 2019.
Just note that Intel MPI doesn't have support for PMI2 or PMIx;
see bug 6727 comment 37.

We're waiting for a release note from Intel explaining their intentions regarding libpmi2.
Comment 13 Felip Moll 2019-09-05 06:08:43 MDT
(In reply to Rozanov, Anatoliy from comment #11)
> > Are you sure about that? In theory the SLURM_CPUS_PER_TASK is set in the batch script so further calls to srun should inherit this setting.
> 
> When I set --ntasks-per-node=2 and --cpus-per-task=48 (48 is half of all
> CPUs), I see the masks below:
> 
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> physcpubind: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> 
> So I assume that hydra_pmi_proxy gets half of all CPUs:
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51
> 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

Can you confirm this by logging in to the node that hosts the hydra proxy and running:

taskset -cp $(pidof hydra_pmi_proxy)

If you see 48 CPUs there, then the binding done by Slurm is correct, and what is happening is that the numactl forked by pmi_proxy is assigned a different set, which, as I said, is the responsibility of hydra.
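
If logging in to the compute node is not possible, a hedged, untested alternative is to let each rank report its own mask and that of its parent (the hydra proxy) via /proc:

mpirun bash -c 'echo proxy: $(grep Cpus_allowed_list /proc/$PPID/status); echo rank: $(grep Cpus_allowed_list /proc/self/status)'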
Comment 14 Rozanov, Anatoliy 2019-09-05 06:30:58 MDT
> The constraining is still provided by cgroups though, so maybe playing with mpiexec options you can get what you need.

Yes, we can set the pinning as we want via mpiexec options, but it would be more correct to use the pinning from Slurm.

OK, I understand. Unfortunately there is no way to get the pinning from Slurm and apply it to the forked processes.

> What are the features only available in mpiexec? 

For example gtool: you can launch such tools for specific processes (for example gdb, valgrind, VTune and so on). I don't know, maybe srun can also do this.

And you can run mpiexec under different job managers, not only Slurm, so the command line stays the same.

> Can you confirm this by entering into the node which allocates hydra proxy

Unfortunately, it is not possible to log in to the remote nodes on SuperMUC.
Comment 15 Felip Moll 2019-09-05 08:32:21 MDT
(In reply to Rozanov, Anatoliy from comment #14)
> > The constraining is still provided by cgroups though, so maybe playing with mpiexec options you can get what you need.
> 
> Yes, we can set the pinning as we want via mpiexec options, but it would be
> more correct to use the pinning from Slurm.
> 
> OK, I understand. Unfortunately there is no way to get the pinning from
> Slurm and apply it to the forked processes.

Correct. We don't have such an API.

> > Can you confirm this by logging in to the node that hosts the hydra proxy
> 
> Unfortunately, it is not possible to log in to the remote nodes on SuperMUC.

Well, I guess you could talk to the sysadmins if this is an issue that really concerns LRZ.


After this analysis, do you think everything is clear now? Do you need anything from me?
Comment 16 Rozanov, Anatoliy 2019-09-06 00:28:21 MDT
> After this analysis, do you think everything is clear now? Do you need anything from me?

I think everything is clear now. We will think about how to set the correct affinity mask for the user's application. Thank you for your help.
Comment 17 Felip Moll 2019-09-06 03:01:35 MDT
Thank you for your time.

I am now marking the bug as infogiven.

Don't hesitate to contact us if Intel eventually extends mpiexec with improved Slurm support and affinity; we will then document it on our website.