Summary: | MPI jobs only run on the batch host - little CPU usage on other hosts | ||
---|---|---|---|
Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 1 - System not usable | ||
Priority: | --- | CC: | cinek |
Version: | 20.11.1 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | KAUST | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | Slurm configuration. |
Description
Greg Wickham
2020-12-17 04:26:06 MST
Created attachment 17195
Slurm configuration.
Slurm 20.11.1
PMIX 3.2.2
CentOS 7.9

sbatch file:

    #!/bin/bash
    #SBATCH -N 2
    #SBATCH -n 64
    #SBATCH --tasks-per-node=32
    #SBATCH --cpus-per-task=4
    #SBATCH --partition=batch
    #SBATCH -J hpl
    #SBATCH -o hpl-NPS4-32threads.%N.%J.out
    #SBATCH -e hpl-NPS4-32threads.%N.%J.err
    #SBATCH --time=04:10:00
    #SBATCH --mem=0
    #SBATCH --reservation=IBEX_CS

    #run the application:
    module load intelstack-default
    module load openmpi/4.0.1/.gnu-6.4.0
    mpirun -np 64 --mca btl self,vader --report-bindings --map-by l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl

    $ squeue -j 13329509
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    13329509 batch hpl wickhagj R 0:08 2 cn506-02-l,cn506-03-l

cn506-02-l:

    top - 14:57:00 up 22:32, 1 user, load average: 25.54, 9.20, 3.39
    Tasks: 1463 total, 33 running, 1430 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 25.0 us, 0.0 sy, 0.0 ni, 75.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 52820464+total, 41650988+free, 10474582+used, 6948924 buff/cache
    KiB Swap: 31457276 total, 31447756 free, 9520 used. 41612816+avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    80613 wickhagj 20 0 3125448 2.6g 13544 R 100.3 0.5 1:45.12 xhpl
    80600 wickhagj 20 0 3064572 2.5g 13604 R 100.0 0.5 1:45.04 xhpl
    80601 wickhagj 20 0 3085320 2.5g 13556 R 100.0 0.5 1:45.03 xhpl
    80602 wickhagj 20 0 3098100 2.5g 13868 R 100.0 0.5 1:45.04 xhpl
    80603 wickhagj 20 0 3064572 2.5g 13484 R 100.0 0.5 1:45.04 xhpl
    80604 wickhagj 20 0 3133088 2.6g 13524 R 100.0 0.5 1:45.08 xhpl
    80605 wickhagj 20 0 3166268 2.6g 13628 R 100.0 0.5 1:45.09 xhpl
    80606 wickhagj 20 0 3154176 2.6g 13832 R 100.0 0.5 1:45.08 xhpl

cn506-03-l

    top - 14:57:24 up 22:33, 1 user, load average: 2.05, 0.89, 0.43
    Tasks: 1409 total, 6 running, 1403 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.4 us, 0.5 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 52820464+total, 50044947+free, 20958684 used, 6796484 buff/cache
    KiB Swap: 31457276 total, 31457020 free, 256 used. 49999168+avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    80416 wickhagj 20 0 474080 68496 11316 S 4.6 0.0 0:05.88 xhpl
    80419 wickhagj 20 0 474080 70472 11316 S 4.6 0.0 0:05.92 xhpl
    80420 wickhagj 20 0 474080 68508 11324 S 4.6 0.0 0:05.88 xhpl
    80424 wickhagj 20 0 474080 68488 11312 S 4.6 0.0 0:05.90 xhpl
    80425 wickhagj 20 0 474080 68504 11324 S 4.6 0.0 0:05.91 xhpl
    80427 wickhagj 20 0 474080 68492 11312 S 4.6 0.0 0:05.88 xhpl

Bump.

Greg,

Sorry for delay. Could you please try:
>export SLURM_WHOLE=1
before mpirun call?

This is very likely a duplicate of Bug 10383, where you can find more details.

cheers,
Marcin

Dear Marcin,

Thanks for the work around. Confirming it works for us.

Please resolve this ticket.

With thanks,
-Greg

Resolving as duplicate

*** This ticket has been marked as a duplicate of ticket 10383 ***
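
For readers landing on this ticket, the workaround confirmed above can be sketched as a small change to the batch script: export `SLURM_WHOLE=1` before the `mpirun` call. This is only an illustration of where the line goes, not a definitive fix; Bug 10383 has the details. As far as the `srun` documentation describes it, `SLURM_WHOLE` is the environment counterpart of `srun --whole`, but treat that as an assumption and check it against your Slurm version. The remaining `#SBATCH` directives and the `module load` lines from the original script are unchanged and omitted here for brevity. The `echo` line and the `if` guard are not part of the original script; they are added so the sketch logs the setting and exits cleanly on a machine without `mpirun` or `./xhpl`.

```shell
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=4

# Workaround from this ticket: set SLURM_WHOLE before mpirun is called.
export SLURM_WHOLE=1

# Illustrative addition: record the setting in the job's output file.
echo "SLURM_WHOLE=${SLURM_WHOLE}"

# Illustrative guard (not in the original script): only launch when both
# mpirun and the xhpl binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./xhpl ]; then
    mpirun -np 64 --mca btl self,vader --report-bindings --map-by l3cache \
        -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl
else
    echo "mpirun or ./xhpl not found; skipping launch" >&2
fi
```

After resubmitting, running `top` on each node in the job's node list, as in the original report, is a simple way to confirm that `xhpl` is busy on both hosts.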