Created attachment 17331 [details]
Current slurm config

User-submitted MPI xhpl jobs (Intel MPI 2019.7.217) with 2 tasks per node see one task use all 24 cores on its socket, while the second task barely uses 1 core. The second task also does not allocate the necessary memory.
Hi Christian,

I see that you've marked the version for this ticket as 20.11. Did this start happening after upgrading to 20.11? I'm not sure based on the information in your initial description, but this sounds like it may be related to an issue we've seen with MPI jobs and a change in 20.11 where job steps default to being exclusive. You can read more about this issue in bug 10383.

If that is the issue, you can work around the problem by setting the SLURM_WHOLE environment variable to '1' before launching the MPI job.

Please let me know if this isn't the issue in your case.

Thanks,
Ben
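For reference, a minimal sketch of where that workaround would go in an sbatch script; the resource requests and the application/launcher lines here are placeholders, not taken from this ticket:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24

# Work around the 20.11 exclusive-step default (see bug 10383):
# must be set before the MPI launcher starts any job steps
export SLURM_WHOLE=1

# placeholder launch line - use your normal MPI invocation here
mpirun -np 2 ./my_mpi_app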
Thanks for the note. This is not an upgrade, rather a fresh install. I will ask the user to try and resubmit with this setting in place.
Created attachment 17334 [details]
Config.log from build

Including in case there's something else I may have missed.
I have attempted to add SLURM_WHOLE=1 to a test sbatch script (pasted below), and I am seeing the same behavior.

From top on the compute node:

PID     USER      PR  NI  VIRT     RES     SHR     S  %CPU   %MEM  TIME+     COMMAND
369354  ccaruth+  20  0   160.4g   156.7g  365572  R  2384   83.1  44:16.20  xhpl_intel64_dy
369357  ccaruth+  20  0   4156072  464964  365552  R  99.7   0.2   2:40.11   xhpl_intel64_dy

sbatch script:

#!/bin/sh
#SBATCH --account=lenovo
#SBATCH --partition=debug
#SBATCH --qos=test
#SBATCH --output=cluster.log
#SBATCH --nodes=1
#SBATCH --nodelist=midway3-[0009]
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=24

module load intelmpi mkl

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export I_MPI_HYDRA_BOOTSTRAP=slurm
export HPL_HWPREFETCH=1
export KMP_AFFINITY=verbose,granularity=fine,compact
export I_MPI_PIN_CELL=unit
export HPL_SWAPWIDTH=384
export SLURM_WHOLE=1

mpirun -np 2 ./run_hpl.sh
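As a side note not from the original ticket, one way to confirm whether the second rank really is confined to a single core is to inspect its CPU affinity on the compute node; the PIDs below are simply the ones shown in the top output above:

# show the CPU affinity mask of each xhpl rank
taskset -cp 369354
taskset -cp 369357

# or, for all ranks at once
for pid in $(pgrep xhpl); do taskset -cp $pid; done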
Hi Christian - I have assigned Marcin to this issue. Would you let us know what your MPI application does? Also, could you retest with srun instead of mpirun and report back the results?
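A minimal sketch of the srun-based launch being requested, assuming the same test script as above; the PMI library path is an assumption and may differ depending on how Slurm and Intel MPI were installed:

# inside the same sbatch allocation, replace the mpirun line with srun
export I_MPI_PMI_LIBRARY=/usr/lib64/slurm/libpmi.so   # assumed path; adjust for your install
srun -n 2 --cpus-per-task=24 ./run_hpl.sh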
Seems like this was not a Slurm issue, but rather a submission issue. I have successfully run single- and multi-node HPL using Slurm. Submission script below:

#!/bin/bash
##SBATCH --constraint="6248R&EDR&192GB&2933MHz"
#SBATCH --account=lenovo
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --nodelist=midway3-[0009]
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --job-name=hpl
#SBATCH --time=10:00:00
#SBATCH --output hpl.%J.out
#SBATCH --error hpl.%J.err

###Inputs###
export HPL_ROOT='/home/ccaruthers/kevin/hpl'
export I_MPI_ROOT=$HPL_ROOT
export HPL_EXEC='xhpl_icx_static'
export HPL_WRAPPER='run_TLP'
export nodes=${SLURM_JOB_NUM_NODES}
export ppn=${SLURM_NTASKS_PER_NODE}
export N=141312
export Nb=384
export P=2 #PxQ must equal NodesxPPN
export Q=1

###Setting variables###
export ntasks=$(($nodes * $ppn))

###Directory Setup###
export datestamp=$(date '+%mm%dd%yy')
export timestamp=$(date '+%Hh%Mm%Ss')
export jobname=${SLURM_JOB_NAME}
export odir=${nodes}n_${ppn}ppn_${datestamp}_${timestamp}
MYPWD=${PWD}
mkdir -p slurm_runs
mkdir slurm_runs/${odir}
cd slurm_runs/${odir}
cp ${MYPWD}/slurm_intel.sh .
cp ${MYPWD}/${HPL_WRAPPER} .
cp ${MYPWD}/${HPL_EXEC} .

###Creating/Checking hostfile###
echo $(echo $SLURM_JOB_NODELIST|scontrol show hostnames) > NODELIST
printf "%s\n" $( <NODELIST ) > hostfile

###Library Setup###
PATH=$PATH:$HPL_ROOT:$HPL_ROOT/intel64/bin
. ${I_MPI_ROOT}/intel64/bin/mpivars.sh

###Env Variables###
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export HPL_HWPREFETCH=1
export HPL_SWAPWIDTH=768
export KMP_AFFINITY=verbose,granularity=fine,compact
export I_MPI_PIN_CELL=unit

###App Execution###
echo "-------------------Running App--------------------" >> ${jobname}.out
date >> ${jobname}.out
echo "-------------------------------------------------------" >> ${jobname}.out

###Cleanup###
mv ${MYPWD}/${SLURM_JOB_NAME}.${SLURM_JOB_ID}.out ${MYPWD}/${SLURM_JOB_NAME}.${SLURM_JOB_ID}.err .
Christian,

> Seems like this was not a slurm issue, but rather a submission issue.

That's what I thought but was double-checking before the reply. The reason this couldn't be Slurm limiting the resources available to the computing process is that you don't have any TaskPlugin configured. This is generally not a recommended setup and may lead to jobs utilizing resources outside of their allocation. Please take a look at the TaskPlugin option documentation[1].

Is there anything else I can help you with here?

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_TaskPlugin
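As an illustration only (not taken from the attached configs), the relevant slurm.conf line for cgroup-based task containment plus CPU affinity would look something like:

# slurm.conf (excerpt) - bind tasks to their allocated CPUs and contain them with cgroups
TaskPlugin=task/cgroup,task/affinity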
Interesting that you mention "jobs utilizing resources outside of an allocation". I've tried to implement a configuration that matches the original cluster's setup with an older Slurm version, and I've re-enabled the TaskPlugins they are using on that cluster. I've attached new configs from both the new cluster (slurm_config-20210106.txt) and the old one (midway2-slurm-config).
Created attachment 17367 [details]
Current config
Created attachment 17368 [details]
Existing cluster config
Adding on to my previous comment, more specifically, the user is asking about cgroups: why are the "cpuset" and "memory" subsystems not shown under /cgroup, so that the resource limits don't take effect?
Christian,

> why "cpuset" and "memory" subsystems are not shown under /cgroup so the resource limit doesn't take effect?

The reason is exactly what I mentioned in comment 9. When you compare the existing cluster config you have:

> TaskPlugin = task/cgroup,task/affinity

while in the "Current slurm config" attachment you have:

> TaskPlugin = task/none

none is a default "phantom" plugin that doesn't actually do anything (it only prints debug-level messages).

cheers,
Marcin
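For reference, a minimal sketch of a cgroup.conf that makes the cpuset and memory controllers take effect once task/cgroup is loaded; these particular settings are assumptions for illustration, not values from the attached configs:

# cgroup.conf (excerpt)
ConstrainCores=yes        # use the cpuset controller to confine tasks to their allocated cores
ConstrainRAMSpace=yes     # use the memory controller to enforce the job's memory limit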
Sorry about that, I thought I had addressed that. I suspect I did a reload instead of a restart of slurmctld; when I ran scontrol show config | grep Plugin, I saw it was still set to none. I restarted the daemon this morning, and I now see the plugin loaded. We'll see how that works.
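For anyone following along, a sketch of the verification steps being described; note that TaskPlugin is also consumed by slurmd, so the compute-node daemons typically need restarting as well, and the commands below assume systemd-managed daemons:

# on the controller
systemctl restart slurmctld

# on each compute node (the task plugin runs under slurmd/slurmstepd)
systemctl restart slurmd

# confirm the running configuration picked up the change
scontrol show config | grep TaskPlugin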
Let me know if it works fine for you now. cheers, Marcin
I'm closing the incident as information given for now. Should you have any questions, please reopen the case. cheers, Marcin