Ticket 10548 - Jobs not utilizing all CPUs and memory
Summary: Jobs not utilizing all CPUs and memory
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.11.0
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-01-04 12:14 MST by Christian Caruthers
Modified: 2021-01-18 05:04 MST

See Also:
Site: University of Chicago
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Current slurm config. (6.83 KB, text/plain) - 2021-01-04 12:14 MST, Christian Caruthers
Config.log from build (266.48 KB, text/plain) - 2021-01-04 14:44 MST, Christian Caruthers
Current config (7.81 KB, text/plain) - 2021-01-06 14:54 MST, Christian Caruthers
Existing cluster config (8.19 KB, text/plain) - 2021-01-06 14:54 MST, Christian Caruthers

Description Christian Caruthers 2021-01-04 12:14:48 MST
Created attachment 17331 [details]
Current slurm config.

User-submitted MPI xhpl jobs (Intel MPI 2019.7.217) with 2 tasks per node see one task use 24/24 cores on a socket while the second task barely uses 1 core. Also, the second task does not allocate the necessary memory.
Comment 1 Ben Roberts 2021-01-04 12:35:59 MST
Hi Christian,

I see that you've marked the version for this ticket as 20.11.  Did this start happening after upgrading to 20.11?  I'm not sure based on the information in your initial description, but this sounds like it may be related to an issue we've seen with MPI jobs and a change where job steps default to being exclusive in 20.11.  You can read more about this issue in bug 10383.  If that is the issue, you can work around the problem by setting the SLURM_WHOLE environment variable to '1' before launching the MPI job.  Please let me know if this isn't the issue in your case.
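
For reference, a minimal sketch of that workaround in an sbatch script (the resource values and the final srun line are placeholders, not taken from your job):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24

# SLURM_WHOLE=1 asks 20.11 job steps to use the whole allocation again,
# restoring the pre-20.11 non-exclusive step behavior. It is read by srun
# when the step is launched.
export SLURM_WHOLE=1

srun ./xhpl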

Thanks,
Ben
Comment 2 mengxing cheng 2021-01-04 13:52:18 MST
Thank you for contacting. I am no longer employed by the University. Please direct future email to systems@rcc.uchicago.edu. This is an automated reply.

Mengxing
Comment 3 Christian Caruthers 2021-01-04 14:37:04 MST
Thanks for the note. This is not an upgrade, rather a fresh install. I will ask the user to try and resubmit with this setting in place.
Comment 4 Christian Caruthers 2021-01-04 14:44:06 MST
Created attachment 17334 [details]
Config.log from build

Including in case there's something else I may have missed.
Comment 6 Christian Caruthers 2021-01-05 12:06:48 MST
I have attempted to add SLURM_WHOLE=1 to a test sbatch script (pasted below), and I am seeing the same behavior. From top on the compute node:


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 369354 ccaruth+  20   0  160.4g 156.7g 365572 R  2384  83.1  44:16.20 xhpl_intel64_dy
 369357 ccaruth+  20   0 4156072 464964 365552 R  99.7   0.2   2:40.11 xhpl_intel64_dy


sbatch script:

#!/bin/sh

#SBATCH --account=lenovo
#SBATCH --partition=debug
#SBATCH --qos=test
#SBATCH --output=cluster.log
#SBATCH --nodes=1
#SBATCH --nodelist=midway3-[0009]
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=24

module load intelmpi mkl

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export I_MPI_HYDRA_BOOTSTRAP=slurm

export HPL_HWPREFETCH=1
export KMP_AFFINITY=verbose,granularity=fine,compact
export I_MPI_PIN_CELL=unit
export HPL_HWPREFETCH=1
export HPL_SWAPWIDTH=384

export SLURM_WHOLE=1

mpirun -np 2 ./run_hpl.sh
Comment 7 Jason Booth 2021-01-05 16:19:51 MST
Hi Christian - I have Marcin assigned to this issue. Would you let us know what your MPI application does? Also, could you retest with srun instead of mpirun and report back the results?
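
For illustration, a sketch of the srun-based retest, reusing the run_hpl.sh wrapper from the script above; the I_MPI_PMI_LIBRARY path is an assumption and may differ on this system:

# Inside the same sbatch script, replace the mpirun line with srun.
# Intel MPI launched through srun typically needs to be pointed at Slurm's
# PMI library; adjust the path to wherever libpmi2.so lives on this cluster.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun -n 2 ./run_hpl.sh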
Comment 8 Christian Caruthers 2021-01-06 10:58:15 MST
Seems like this was not a Slurm issue, but rather a submission issue. I have successfully run single- and multi-node HPL using Slurm. Submission script below:

#!/bin/bash
##SBATCH --constraint="6248R&EDR&192GB&2933MHz"
#SBATCH --account=lenovo
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --nodelist=midway3-[0009]
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --job-name=hpl
#SBATCH --time=10:00:00
#SBATCH --output hpl.%J.out
#SBATCH --error hpl.%J.err

###Inputs###
export HPL_ROOT='/home/ccaruthers/kevin/hpl'
export I_MPI_ROOT=$HPL_ROOT
export HPL_EXEC='xhpl_icx_static'
export HPL_WRAPPER='run_TLP'
export nodes=${SLURM_JOB_NUM_NODES}
export ppn=${SLURM_NTASKS_PER_NODE}
export N=141312
export Nb=384
export P=2                      #PxQ must equal NodesxPPN
export Q=1

###Setting variables###
export ntasks=$(($nodes * $ppn))

###Directory Setup###
export datestamp=$(date '+%mm%dd%yy')
export timestamp=$(date '+%Hh%Mm%Ss')
export jobname=${SLURM_JOB_NAME}
export odir=${nodes}n_${ppn}ppn_${datestamp}_${timestamp}
MYPWD=${PWD}
mkdir -p slurm_runs
mkdir slurm_runs/${odir}
cd slurm_runs/${odir}
cp ${MYPWD}/slurm_intel.sh .
cp ${MYPWD}/${HPL_WRAPPER} .
cp ${MYPWD}/${HPL_EXEC} .

###Creating/Checking hostfile###
echo $(echo $SLURM_JOB_NODELIST|scontrol show hostnames) > NODELIST
printf "%s\n" $( <NODELIST ) > hostfile

###Library Setup###
PATH=$PATH:$HPL_ROOT:$HPL_ROOT/intel64/bin
. ${I_MPI_ROOT}/intel64/bin/mpivars.sh

###Env Variables###
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export HPL_HWPREFETCH=1
export HPL_SWAPWIDTH=768
export KMP_AFFINITY=verbose,granularity=fine,compact
export I_MPI_PIN_CELL=unit

###App Execution###
echo "-------------------Running App--------------------" >> ${jobname}.out
date >> ${jobname}.out
echo "-------------------------------------------------------" >> ${jobname}.out
###Cleanup###
mv ${MYPWD}/${SLURM_JOB_NAME}.${SLURM_JOB_ID}.out ${MYPWD}/${SLURM_JOB_NAME}.${SLURM_JOB_ID}.err .
Comment 9 Marcin Stolarek 2021-01-06 11:25:51 MST
Christian,

>Seems like this was not a Slurm issue, but rather a submission issue.

That's what I thought, but I was double-checking before replying. The reason this couldn't be Slurm limiting the resources available to the compute process is that you don't have any TaskPlugin configured.

This is generally not a recommended setup, since it may lead to jobs utilizing resources outside of their allocation. Please take a look at the TaskPlugin option documentation[1].
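
For illustration only, a sketch of the kind of settings the documentation describes (example values, not a site-specific recommendation):

# slurm.conf - enforce task binding and cgroup containment
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf - keep tasks inside their allocated cores and memory
ConstrainCores=yes
ConstrainRAMSpace=yes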

Is there anything else I can help you with here?

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_TaskPlugin
Comment 10 Christian Caruthers 2021-01-06 14:53:43 MST
Interesting that you mention "jobs utilizing resources outside of their allocation". I've tried to implement a configuration that matches the original cluster's setup with an older Slurm version, and I've re-enabled the TaskPlugins they are using on that cluster. I've attached new config dumps from both the new cluster (slurm_config-20210106.txt) and the old one (midway2-slurm-config).
Comment 11 Christian Caruthers 2021-01-06 14:54:01 MST
Created attachment 17367 [details]
Current config
Comment 12 Christian Caruthers 2021-01-06 14:54:18 MST
Created attachment 17368 [details]
Existing cluster config
Comment 13 Christian Caruthers 2021-01-07 08:34:34 MST
Adding on to my previous comment: more specifically, they're asking about cgroups, i.e. why the "cpuset" and "memory" subsystems are not shown under /cgroup, so the resource limits don't take effect.
Comment 14 Marcin Stolarek 2021-01-08 06:04:25 MST
Christian,

> why "cpuset" and "memory" subsystems are not shown under /cgroup so the resource limit doesn't take effect?

The reason is exactly what I mentioned in comment 9. When you compare the two configs, the existing cluster has:
>TaskPlugin              = task/cgroup,task/affinity
while in the "Current slurm config" attachment you have:
>TaskPlugin              = task/none
task/none is the default "phantom" plugin that doesn't actually do anything (it only prints debug-level messages).
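
A quick way to check both sides of this, as a rough sketch (cgroup v1 paths assumed; they vary by distro):

# Which task plugin the running daemons actually loaded:
scontrol show config | grep TaskPlugin

# Whether the cpuset and memory cgroup controllers are mounted:
grep -E 'cpuset|memory' /proc/mounts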

cheers,
Marcin
Comment 15 Christian Caruthers 2021-01-08 06:44:36 MST
Sorry about that, I thought I had addressed that. I guess I may have done a reload instead of a restart on slurmctld? When I ran scontrol show config | grep Plugin, I saw it was still set to none. I've restarted the daemon this morning, and I now see the plugin loaded. We'll see how that works.
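
(For the record, a sketch of the restart-and-verify sequence, assuming standard systemd units for the daemons:)

# Changing TaskPlugin is not picked up by "scontrol reconfigure"; restart the daemons.
systemctl restart slurmctld        # on the controller
systemctl restart slurmd           # on each compute node
scontrol show config | grep TaskPlugin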
Comment 16 Marcin Stolarek 2021-01-11 03:39:14 MST
Let me know if it works fine for you now.

cheers,
Marcin
Comment 17 Marcin Stolarek 2021-01-18 05:04:06 MST
I'm closing the ticket as "information given" now.

Should you have any questions, please reopen the case.

cheers,
Marcin