Bug 10473 - MPI jobs only run on the batch host - little CPU usage on other hosts
Summary: MPI jobs only run on the batch host - little CPU usage on other hosts
Status: RESOLVED DUPLICATE of bug 10383
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.1
Hardware: Linux
Priority: ---
Severity: 1 - System not usable
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-12-17 04:26 MST by Greg Wickham
Modified: 2020-12-17 08:14 MST
CC: 1 user

See Also:
Site: KAUST
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm configuration. (14.14 KB, application/x-bzip2)
2020-12-17 04:28 MST, Greg Wickham

Description Greg Wickham 2020-12-17 04:26:06 MST
While running multi-node MPI jobs under Slurm, CPUs are fully utilized only on the batch host, while utilization on the other nodes is very low.
It looks like all the processes are sharing a single core on each of the other nodes; they only use independent cores on the batch host.
As a result, multi-node jobs effectively run forever.

On the batch host, utilization is almost 100% for each process:
 55136 mazatyae  20   0 9213248   8.1g  13180 R  99.7  1.6   8:23.24 xhpl                                                                             
 55143 mazatyae  20   0 9228744   8.2g  12684 R  99.7  1.6   8:24.41 xhpl                                                                             
 55144 mazatyae  20   0 9138108   8.1g  12676 R  99.7  1.6   8:23.72 xhpl                                                                             
 55146 mazatyae  20   0 9142304   8.1g  12804 R  99.7  1.6   8:23.63 xhpl                                                                             
 55132 mazatyae  20   0 9279780   8.1g  13288 R  99.3  1.6   8:23.92 xhpl                                                                             
 55133 mazatyae  20   0 9300568   8.2g  13292 R  99.3  1.6   8:24.21 xhpl                                                                             
 55134 mazatyae  20   0 9299688   8.2g  13380 R  99.3  1.6   8:24.23 xhpl  

On another node running the same job, utilization is very low; only one core is utilized, and all processes seem to be sharing it:
 34772 mazatyae  20   0 8734076   7.4g  10220 R   3.6  1.5   0:16.15 xhpl                                                                             
 34759 mazatyae  20   0 8819632   7.4g  10212 R   3.3  1.5   0:16.14 xhpl                                                                             
 34760 mazatyae  20   0 8819632   7.4g  10228 R   3.3  1.5   0:16.15 xhpl                                                                             
 34761 mazatyae  20   0 8819632   7.4g  10212 R   3.3  1.5   0:16.14 xhpl
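
One way to confirm this (an illustrative check, not part of the original report) is to print the CPU affinity of each xhpl process on an under-utilised node; if every process reports the same single CPU, they are indeed sharing one core:

# run on the affected node: show the CPU affinity mask of every xhpl process owned by the current user
for pid in $(pgrep -u "$USER" xhpl); do
    taskset -cp "$pid"
done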
Comment 1 Greg Wickham 2020-12-17 04:28:29 MST
Created attachment 17195 [details]
Slurm configuration.
Comment 2 Greg Wickham 2020-12-17 04:29:29 MST
Slurm 20.11.1
PMIX 3.2.2
CentOS 7.9
Comment 3 Greg Wickham 2020-12-17 04:57:43 MST
sbatch file:

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=4
#SBATCH --partition=batch
#SBATCH -J hpl
#SBATCH -o hpl-NPS4-32threads.%N.%J.out
#SBATCH -e hpl-NPS4-32threads.%N.%J.err
#SBATCH --time=04:10:00
#SBATCH --mem=0
#SBATCH --reservation=IBEX_CS
#run the application:
module load intelstack-default
module load openmpi/4.0.1/.gnu-6.4.0
mpirun -np 64 --mca btl self,vader --report-bindings --map-by l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores  ./xhpl
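
For reference, the requested geometry is internally consistent: 2 nodes × 32 tasks per node = 64 tasks, each with 4 CPUs, i.e. 256 CPUs in total, matching the 64 MPI ranks × 4 OpenMP threads launched by mpirun.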


$ squeue -j 13329509
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13329509     batch      hpl wickhagj  R       0:08      2 cn506-02-l,cn506-03-l


cn506-02-l:

top - 14:57:00 up 22:32,  1 user,  load average: 25.54, 9.20, 3.39
Tasks: 1463 total,  33 running, 1430 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.0 us,  0.0 sy,  0.0 ni, 75.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 52820464+total, 41650988+free, 10474582+used,  6948924 buff/cache
KiB Swap: 31457276 total, 31447756 free,     9520 used. 41612816+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                  
 80613 wickhagj  20   0 3125448   2.6g  13544 R 100.3  0.5   1:45.12 xhpl                                     
 80600 wickhagj  20   0 3064572   2.5g  13604 R 100.0  0.5   1:45.04 xhpl                                     
 80601 wickhagj  20   0 3085320   2.5g  13556 R 100.0  0.5   1:45.03 xhpl                                     
 80602 wickhagj  20   0 3098100   2.5g  13868 R 100.0  0.5   1:45.04 xhpl                                     
 80603 wickhagj  20   0 3064572   2.5g  13484 R 100.0  0.5   1:45.04 xhpl                                     
 80604 wickhagj  20   0 3133088   2.6g  13524 R 100.0  0.5   1:45.08 xhpl                                     
 80605 wickhagj  20   0 3166268   2.6g  13628 R 100.0  0.5   1:45.09 xhpl                                     
 80606 wickhagj  20   0 3154176   2.6g  13832 R 100.0  0.5   1:45.08 xhpl                                     

cn506-03-l

top - 14:57:24 up 22:33,  1 user,  load average: 2.05, 0.89, 0.43
Tasks: 1409 total,   6 running, 1403 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.5 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 52820464+total, 50044947+free, 20958684 used,  6796484 buff/cache
KiB Swap: 31457276 total, 31457020 free,      256 used. 49999168+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND   
 80416 wickhagj  20   0  474080  68496  11316 S   4.6  0.0   0:05.88 xhpl      
 80419 wickhagj  20   0  474080  70472  11316 S   4.6  0.0   0:05.92 xhpl      
 80420 wickhagj  20   0  474080  68508  11324 S   4.6  0.0   0:05.88 xhpl      
 80424 wickhagj  20   0  474080  68488  11312 S   4.6  0.0   0:05.90 xhpl      
 80425 wickhagj  20   0  474080  68504  11324 S   4.6  0.0   0:05.91 xhpl      
 80427 wickhagj  20   0  474080  68492  11312 S   4.6  0.0   0:05.88 xhpl
Comment 4 Greg Wickham 2020-12-17 06:15:45 MST
Bump.
Comment 5 Marcin Stolarek 2020-12-17 07:16:35 MST
Greg,

Sorry for the delay.

Could you please try:
>export SLURM_WHOLE=1
before the mpirun call?
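
For example (a sketch based on the batch script in comment 3; adjust to your own file), the end of the sbatch file would become:

# export before mpirun so the launched step can use all CPUs allocated on each node
export SLURM_WHOLE=1
mpirun -np 64 --mca btl self,vader --report-bindings --map-by l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl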

This is very likely a duplicate of Bug 10383, where you can find more details.

cheers,
Marcin
Comment 6 Greg Wickham 2020-12-17 07:57:58 MST
Dear Marcin,

Thanks for the workaround.

Confirming it works for us.

Please resolve this ticket.

With thanks,

   -Greg
Comment 7 Marcin Stolarek 2020-12-17 08:14:25 MST
Resolving as a duplicate.

*** This bug has been marked as a duplicate of bug 10383 ***