Ticket 15632 - Option --cpus-per-gpu not working with srun
Summary: Option --cpus-per-gpu not working with srun
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 22.05.6
Hardware: Linux
OS: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Marshall Garey
 
Reported: 2022-12-15 09:55 MST by Marc Caubet Serrabou
Modified: 2023-07-31 08:32 MDT
CC: 3 users

Site: Paul Scherrer
Version Fixed: 23.11.0rc1


Attachments
Slurm.conf (6.93 KB, text/plain)
2022-12-15 09:58 MST, Marc Caubet Serrabou

Description Marc Caubet Serrabou 2022-12-15 09:55:24 MST
Hi,

After upgrading to v22.05.6, we are experiencing problems with the option --cpus-per-gpu and srun, which seems to be affected by a clear bug.

Within the allocation or batch script, --cpus-per-gpu works:

❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# salloc --clusters=gmerlin6 --partition=gpu-maint -n 1  --gpus=2 --cpus-per-gpu=6
salloc: Pending job allocation 70905
salloc: job 70905 queued and waiting for resources
salloc: job 70905 has been allocated resources
salloc: Granted job allocation 70905
salloc: Waiting for resource configuration
salloc: Nodes merlin-g-100 are ready for job

(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# python -c "import os; print(os.sched_getaffinity(0))"
{48, 49, 50, 51, 52, 53, 176, 177, 178, 179, 180, 181}

However, as soon as srun is used, it does not work:

(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun  python -c "import os; print(os.sched_getaffinity(0))"
{48, 176}

This is also true when explicitly specifying the options:

(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun --gpus=2 --cpus-per-gpu=6 python -c "import os; print(os.sched_getaffinity(0))"
{48, 176}

Notice that when setting --cpus-per-task it works:

(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU * $SLURM_NTASKS)) python -c "import os; print(os.sched_getaffinity(0))"
{48, 49, 50, 51, 52, 53, 176, 177, 178, 179, 180, 181}

However, this is puzzling, since the man pages clearly state that --cpus-per-gpu and --cpus-per-task are not compatible.

Is this bug already known by SchedMD? Can a fix be provided for Slurm v22.05?

On the other hand, also related to srun: --cpus-per-task no longer works the way it did in the past. This change is stated in the Highlights section of the Release Notes for Slurm v22.05. However, this change of behavior is really confusing for users, especially since other variables are properly propagated. It would be great to restore the previous behavior or to find a way to keep the different environment variables coherent.

Thanks a lot,
Marc
Comment 1 Marc Caubet Serrabou 2022-12-15 09:58:17 MST
Created attachment 28200 [details]
Slurm.conf
Comment 2 Spencer Bliven 2022-12-15 13:38:45 MST
I find it quite confusing that srun job steps no longer inherit all resources from the parent job. We commonly submit scripts to sbatch with a single srun step for MPI or OpenMP software. Is a different paradigm recommended now? I'm confused about why this default was changed, and why some options are inherited (e.g. --gpus) but others are not (--cpus-per-task, --cpus-per-gpu). It would be nice to get best practices for ensuring all resources are available to a job step, given that some options may be overridden at submission time on the command line.

I use the following sbatch script for testing:

$ cat gpu.sh
#!/bin/bash
#SBATCH -J slurmgpu
#SBATCH -n 1
#SBATCH --gpus=2
#SBATCH --cluster=gmerlin6
#SBATCH --cpus-per-gpu=6
#SBATCH -p gpu-maint
#SBATCH --time=00:01:00
#SBATCH --output=gpu.log

run () {
    # Print the CPU affinity of a task launched with the given command prefix (e.g. "srun ...")
    echo "### $* ###"
    $* python -c 'import os; print(os.sched_getaffinity(0))'
    #$* env
}

hostname
run
run srun
run srun --cpus-per-gpu=$SLURM_CPUS_PER_GPU
run srun --exact --cpus-per-gpu=$SLURM_CPUS_PER_GPU
run srun --cpus-per-task=$SLURM_CPUS_PER_GPU
run srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU))
run srun --cpus-per-gpu=$SLURM_CPUS_PER_GPU --cpus-per-task=$SLURM_CPUS_PER_GPU
run srun --gpus=$SLURM_GPUS --cpus-per-gpu=$SLURM_CPUS_PER_GPU
run srun --gpus=$SLURM_GPUS --cpus-per-task=$SLURM_CPUS_PER_GPU
run srun --gpus=$SLURM_GPUS --cpus-per-gpu=$SLURM_CPUS_PER_GPU --cpus-per-task=$SLURM_CPUS_PER_GPU

$ sbatch gpu.sh
$ cat gpu.log
merlin-g-008.psi.ch
###  ###
{0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15}
### srun ###
{0}
### srun --cpus-per-gpu=6 ###
{0}
### srun --exact --cpus-per-gpu=6 ###
{0}
### srun --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
### srun --cpus-per-task=12 ###
{0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15}
### srun --cpus-per-gpu=6 --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
### srun --gpus=2 --cpus-per-gpu=6 ###
{0}
### srun --gpus=2 --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
### srun --gpus=2 --cpus-per-gpu=6 --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}

As Marc mentions, --cpus-per-gpu is ignored and only by explicitly computing the expected number of cpus am I able to use the full reservation within the task.
Comment 3 Marshall Garey 2022-12-19 13:58:37 MST
> Is this bug already known by SchedMD? Can a fix be provided for Slurm v22.05?
No, thanks for reporting the bug. I'll work on fixing --cpus-per-gpu hopefully for 22.05.

Do you set --exact in a cli_filter plugin? The behavior I see does not exactly match the behavior that you see, but I do see the problem with --cpus-per-gpu.

My node hardware configuration:
1 socket, 8 cores, 2 threads per core.

$ salloc --gpus=2 --cpus-per-gpu=6
salloc: Granted job allocation 4284
$ python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
$ srun python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}


Because I did not request --exact, srun gets the whole allocation. This is different from what you see. Do you set SLURM_EXACT, SLURM_EXCLUSIVE, or SRUN_CPUS_PER_TASK anywhere in the environment, maybe with a bashrc or with a cli_filter plugin? All of those imply --exact. If so, that will explain why your srun only gets one core rather than the whole allocation.
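
One quick way to check from inside the allocation (just a sketch; grep for the variables mentioned above):

$ env | grep -E 'SLURM_EXACT|SLURM_EXCLUSIVE|SRUN_CPUS_PER_TASK'

If nothing is printed, none of them are set in the job's environment.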



Now, I show how --cpus-per-gpu doesn't affect the step allocation's cpus:

$ srun --gpus=1 --cpus-per-gpu=4 python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
$ srun --exact --gpus=1 --cpus-per-gpu=4 python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 8}




$ sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
JobID|AllocTRES|
4284|billing=12,cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
4284.interactive|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
4284.0|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
4284.1|cpu=12,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|
4284.2|cpu=1,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|


Notes:

* --cpus-per-gpu does *not* imply --exact
* --cpus-per-task implies --exact
* --exact does not look at the number of cpus requested with --cpus-per-gpu; therefore, --cpus-per-gpu is kind of useless for step allocations right now. This is a problem.



> Notice that when setting --cpus-per-task it works:
>
> (base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU * $SLURM_NTASKS)) python -c "import os; print(os.sched_getaffinity(0))"
> {48, 49, 50, 51, 52, 53, 176, 177, 178, 179, 180, 181}
>
> However, this is puzzling, since in the man pages it's clearly stated that
> --cpus-per-gpu and --cpus-per-task  are not compatible.

In Slurm, the command line overrides environment variables. In this case, you have a SLURM_CPUS_PER_GPU environment variable that is read by srun, and you pass --cpus-per-task on the srun command line; since it is on the command line, it overrides the SLURM_CPUS_PER_GPU environment variable.
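
To illustrate with the values from your allocation (a sketch; salloc exported SLURM_GPUS=2 and SLURM_CPUS_PER_GPU=6, so the arithmetic below expands to 12):

srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU)) python -c "import os; print(os.sched_getaffinity(0))"

srun still reads SLURM_CPUS_PER_GPU from the environment, but the explicit --cpus-per-task=12 on the command line takes precedence, which is why that step got all 12 CPUs.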

However, Spencer showed srun being used with --cpus-per-task and --cpus-per-gpu at the same time. We only enforce the mutual exclusion on the job, not on the step. This is another bug.

Enforced on the job:

$ salloc --gpus=2 --cpus-per-gpu=6 --cpus-per-task=1
salloc: error: --cpus-per-gpu is mutually exclusive with --cpus-per-task
salloc: error: Invalid generic resource (gres) specification


Not enforced on the step:

$ srun --cpus-per-task=4 --cpus-per-gpu=2 --gpus=1 -n1 hostname
voyager


> On the other hand, also related to srun, --cpus-per-task in srun does not
> work in the same way it was working in the past. This change is stated in
> the Highlights section, in the Release Notes document for Slurm v22.05.
> However, this change of behavior is really confusing for users, mainly when
> considering that other variables are properly propagated. It would be great
> to restore the previous behavior or to find a way to keep coherency for all
> the different environment variables.


* In the past, --cpus-per-task did not imply --exact. But users were confused why steps still got all the cpus in the job allocation even though they requested --cpus-per-task such that the step should get fewer cpus. See bug 11275. This is why we made --cpus-per-task imply --exact.
* Bug 13351: --exact breaks mpirun, so we cannot have --cpus-per-task be inherited by srun if --cpus-per-task implies --exact.
* Slurm 22.05: --cpus-per-task implies --exact, but srun no longer reads SLURM_CPUS_PER_TASK so it does not automatically inherit the job's --cpus-per-task value. This is a compromise between the problem in bug 11275 (--cpus-per-task didn't change step allocations without --exact) and bug 13351 (--exact breaks mpirun). See https://bugs.schedmd.com/show_bug.cgi?id=13351#c76

Do you have any questions about that?


Arguably, --cpus-per-gpu should behave the same way as --cpus-per-task. If we make --cpus-per-gpu imply --exact, then it would have the same problem as --cpus-per-task in that it would break mpirun. So then we'd have to make --cpus-per-gpu also not get inherited by srun. I'll discuss this internally, but this would not be a change we could make in 22.05.



My TODO's:
* Fix --cpus-per-gpu and --exact to work together properly.
* Make --cpus-per-gpu and --cpus-per-task mutually exclusive for steps as well as jobs.
* Discuss internally whether we should make --cpus-per-gpu behave the same way as --cpus-per-task in 23.02.
Comment 5 Marc Caubet Serrabou 2022-12-20 02:18:39 MST
Hi,

> Do you set --exact in a cli_filter plugin? The behavior I see does not exactly
> match the behavior that you see, but I do see the problem with --cpus-per-gpu.

We do not set --exact; we have a pretty simple configuration without 
extra plugins (nor in the prolog). Therefore, --exact is not enforced:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# salloc --clusters=gmerlin6 --partition=gpu-maint -n 1 --gpus=2 --cpus-per-gpu=6
salloc: Granted job allocation 71749
salloc: Waiting for resource configuration
salloc: Nodes merlin-g-014 are ready for job
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXACT
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXCLUSIVE
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SRUN_CPUS_PER_TASK
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# python -c "import os; print(os.sched_getaffinity(0))"
{32, 33, 34, 35, 24, 25, 26, 27, 28, 29, 30, 31}
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun python -c "import os; print(os.sched_getaffinity(0))"
{24}
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --gpus=1 --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
{24}
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --exact --gpus=1 --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
{24}

> My node hardware configuration:
> 1 socket, 8 cores, 2 threads per core.
>
> $ salloc --gpus=2 --cpus-per-gpu=6
> salloc: Granted job allocation 4284
> $ python3 -c "import os; print(os.sched_getaffinity(0))"
> {0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
> $ srun python3 -c "import os; print(os.sched_getaffinity(0))"
> {0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
>
> Because I did not request --exact, srun gets the whole allocation. This is
> different from what you see. Do you set SLURM_EXACT, SLURM_EXCLUSIVE, or
> SRUN_CPUS_PER_TASK anywhere in the environment, maybe with a bashrc or with a
> cli_filter plugin? All of those imply --exact. If so, that will explain why
> your srun only gets one core rather than the whole allocation.

That's clearly different from what we see. However, as mentioned above, 
no --exact is set.

> $ sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
> JobID|AllocTRES|
> 4284|billing=12,cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
> 4284.interactive|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
> 4284.0|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
> 4284.1|cpu=12,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|
> 4284.2|cpu=1,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|

Here is my output:

(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
JobID|AllocTRES|
71749|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.interactive|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.extern|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.0|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.1|cpu=12,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|
71749.2|cpu=1,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|

Notice that we set DefMemPerCPU=5120 by default, which is then overridden 
in some partitions. That's why I get higher allocated memory. I wonder 
whether this setting can be problematic now with this release (does 
DefMemPerCPU imply --exact now?).

> * In the past, --cpus-per-task did not imply --exact. But users were confused
> why steps still got all the cpus in the job allocation even though they
> requested --cpus-per-task such that the step should get fewer cpus. See bug
> 11275. This is why we made --cpus-per-task imply --exact.
> * Bug 13351: --exact breaks mpirun, so we cannot have --cpus-per-task be
> inherited by srun if --cpus-per-task implies --exact.
> * Slurm 22.05: --cpus-per-task implies --exact, but srun no longer reads
> SLURM_CPUS_PER_TASK so it does not automatically inherit the job's
> --cpus-per-task value. This is a compromise between the problem in bug 11275
> (--cpus-per-task didn't change step allocations without --exact) and bug 13351
> (--exact breaks mpirun). See https://bugs.schedmd.com/show_bug.cgi?id=13351#c76
>
> Do you have any questions about that?

Yes, this is clear, thanks a lot. However, having some variables inherited 
and others not is also confusing for users. There is probably no ideal 
solution for it. In our case, we have many OpenMP and hybrid (MPI+OpenMP) 
jobs, so many users had to change their job scripts to make them run 
efficiently (otherwise srun was using 1 core per task), and I am trying to 
find the users who are now running inefficiently.

> Arguably, --cpus-per-gpu should behave the same way as --cpus-per-task. If we
> make --cpus-per-gpu imply --exact, then it would have the same problem as
> --cpus-per-task in that it would break mpirun. So then we'd have to make
> --cpus-per-gpu also not get inherited by srun. I'll discuss this internally,
> but this would not be a change we could make in 22.05.

Perfect, at least this would be coherent with the current behavior of 
--cpus-per-task.

However, I know of some users running jobs with mpirun, though we usually 
instruct users to use srun instead. Wasn't that the general policy in Slurm 
(using srun over mpirun)? I understand that supporting mpirun may involve 
extra code development and support effort.

> My TODO's:
> * Fix --cpus-per-gpu and --exact to work together properly.
> * Make --cpus-per-gpu and --cpus-per-task mutually exclusive for steps as well
> as jobs.
> * Discuss internally whether we should make --cpus-per-gpu behave the same way
> as --cpus-per-task in 23.02.

Perfect. Thanks a lot,

Marc


Comment 6 Spencer Bliven 2022-12-20 02:18:53 MST
I am out of the office through 9 January, 2023. For merlin support, please mail merlin-admins@lists.psi.ch.


-Spencer Bliven
Comment 7 Marshall Garey 2022-12-20 09:00:54 MST
> We do not set --exact, we have a pretty simple configuration without
> extra plugins (nor in prolog). Therefore, --exact is not enforced:
>
> (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# salloc
> --clusters=gmerlin6 --partition=gpu-maint -n 1  --gpus=2 --cpus-per-gpu=6
> salloc: Granted job allocation 71749
> salloc: Waiting for resource configuration
> salloc: Nodes merlin-g-014 are ready for job
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXACT
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep
> SLURM_EXCLUSIVE
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep
> SRUN_CPUS_PER_TASK
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# python -c "import
> os; print(os.sched_getaffinity(0))"
> {32, 33, 34, 35, 24, 25, 26, 27, 28, 29, 30, 31}
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun python -c
> "import os; print(os.sched_getaffinity(0))"
> {24}
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --gpus=1
> --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
> {24}
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --exact
> --gpus=1 --cpus-per-gpu=4 python -c "import os;
> print(os.sched_getaffinity(0))"
> {24}
> Here is my output:
>
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# sacct -o
> jobid,alloctres -p -j $SLURM_JOB_ID
> JobID|AllocTRES|
> 71749|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,
> node=1|
> 71749.interactive|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,
> node=1|
> 71749.extern|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,
> mem=60G,node=1|
> 71749.0|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
> 71749.1|cpu=12,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|
> 71749.2|cpu=1,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|

I know what happened. It is the CPU binding. Although the step was allocated 12 CPUs, slurmd bound the task to a single core because TaskPluginParam=autobind=cores is set. Without that, slurmd defaults to autobinding to sockets.

marshall@curiosity:~/slurm/22.05/install/c1$ salloc --cpus-per-gpu=6 --gpus=2
salloc: Granted job allocation 2288
salloc: Waiting for resource configuration
salloc: Nodes n1-2 are ready for job
marshall@curiosity:~/slurm/22.05/install/c1$ srun whereami
0000 n1-2 - Cpus_allowed:   	00000101    	Cpus_allowed_list:  	0,8
marshall@curiosity:~/slurm/22.05/install/c1$ srun --cpu-bind=socket whereami
0000 n1-2 - Cpus_allowed:   	00003f3f    	Cpus_allowed_list:  	0-5,8-13


This also happens in 21.08 with --cpus-per-gpu. Here is an example in 21.08:

$ salloc -V
slurm 21.08.2
$ srun -V
slurm 21.08.2
$ slurmctld -V
slurm 21.08.2

$ salloc --cpus-per-gpu=6 --gpus=2
salloc: Granted job allocation 5
salloc: Waiting for resource configuration
salloc: Nodes n1-2 are ready for job
$ srun whereami
0000 n1-2 - Cpus_allowed:   	00000101    	Cpus_allowed_list:  	0,8
marshall@curiosity:~/slurm/22.05/install/c1$ srun --cpu-bind=socket whereami
0000 n1-2 - Cpus_allowed:   	00003f3f    	Cpus_allowed_list:  	0-5,8-13

This is because --cpus-per-gpu isn't affecting cpus per task, which is what tells slurmd to allocate more cpus per task. --cpus-per-task gets inherited by the step in 21.08, so that tells slurmd to allocate more cpus per task even though autobind=cores.

$ salloc --cpus-per-task=6 -n1
salloc: Granted job allocation 6
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
$ srun whereami
0000 n1-1 - Cpus_allowed:   	00000707    	Cpus_allowed_list:  	0-2,8-10

But in 22.05, --cpus-per-task is not inherited by the step, so slurmd doesn't know to allocate more cpus per task and only allocates 1 CPU for the task; then slurmd autobinds the task to a core.

Example in 22.05:

$ salloc --cpus-per-task=6 -n1
salloc: Granted job allocation 2289
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
$ srun whereami
0000 n1-1 - Cpus_allowed:   	00000101    	Cpus_allowed_list:  	0,8


The workaround we have been suggesting is for admins to set the SRUN_CPUS_PER_TASK environment variable in the job's environment (like in a shell rc).

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

$ export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
$ srun whereami
0000 n1-1 - Cpus_allowed:   	00000707    	Cpus_allowed_list:  	0-2,8-10

Or, you can use a cli_filter plugin to set SRUN_CPUS_PER_TASK or --cpus-per-task. I'm using lua-posix to access the getenv/setenv functions.

cli_filter.lua:

function slurm_cli_pre_submit(options, pack_offset)
    slurm.log_info("Function: %s", "pre_submit")
    p = require("posix")
    cpus_per_task = p.getenv("SLURM_CPUS_PER_TASK")
    if (cpus_per_task ~= nil) then
        -- Set the environment variable:
        --p.setenv("SRUN_CPUS_PER_TASK", cpus_per_task)
        --cpus_per_task = p.getenv("SRUN_CPUS_PER_TASK")
        --slurm.log_info("SRUN_CPUS_PER_TASK=%u", cpus_per_task)

        -- Or set the command-line option:
        options['cpus-per-task'] = cpus_per_task
        slurm.log_info("SRUN_CPUS_PER_TASK=%u", options['cpus-per-task'])
    end

    return slurm.SUCCESS
end

Or you can set the environment variable just in the job after it has been submitted:

function slurm_cli_post_submit(offset, job_id, step_id)
    slurm.log_info("Function: %s", "post_submit")
    p = require("posix")
    cpus_per_task = p.getenv("SLURM_CPUS_PER_TASK")
    if (cpus_per_task ~= nil) then
        p.setenv("SRUN_CPUS_PER_TASK", cpus_per_task)
        cpus_per_task = p.getenv("SRUN_CPUS_PER_TASK")
        slurm.log_info("SRUN_CPUS_PER_TASK=%u", cpus_per_task)

        --options['cpus-per-task'] = cpus_per_task
        --slurm.log_info("SRUN_CPUS_PER_TASK=%u", options['cpus-per-task'])
    end
    return slurm.SUCCESS
end

Whichever method you use, --cpus-per-task will automatically be set on any srun within a job:

$ salloc --cpus-per-task=6 -n1
salloc: lua: Function: pre_submit
salloc: Granted job allocation 2298
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
srun: lua: Function: pre_submit
srun: lua: SRUN_CPUS_PER_TASK=6
$ srun whereami
srun: lua: Function: pre_submit
srun: lua: SRUN_CPUS_PER_TASK=6
0000 n1-1 - Cpus_allowed:   	00000707    	Cpus_allowed_list:  	0-2,8-10



Since --cpus-per-gpu is broken right now, you can also try setting --cpus-per-task automatically from SLURM_CPUS_PER_GPU and other options. Let me know if you have any questions about that.
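
For example, something along these lines in a shell rc or at the top of a batch script might work (a rough sketch, not something we ship; it assumes a single task per job, as in the examples above, and that SLURM_GPUS and SLURM_CPUS_PER_GPU are set because those options were requested):

# Derive a per-task CPU count for srun from the GPU-based request,
# but only when --cpus-per-task was not already used for the job.
if [ -n "$SLURM_CPUS_PER_GPU" ] && [ -z "$SLURM_CPUS_PER_TASK" ]; then
    export SRUN_CPUS_PER_TASK=$(( SLURM_CPUS_PER_GPU * SLURM_GPUS ))
fi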



> Notice that we set DefMemPerCPU=5120 by default, which is overwritten
> then in some partitions. That's why I get higher allocated memory. I
> wonder whether this setting can be problematic now with this release
> (does DefMemPerCPU imply --exact now?).

No, DefMemPerCPU doesn't imply --exact. The allocated memory doesn't matter in this case. I have DefMemPerCPU=100, which is why I got 1200 MB of allocated memory. I was really just looking for allocated CPUs.


> Yes, this is clear, thanks a lot. However, having some variables being
> inherited and some other not, it's also confusing users. Probably there
> is not an ideal solution for it. In our case, we have many OpenMP and
> Hybrid (MPI+OpenMPI) based jobs, so many users had to change the job
> scripts to make it run efficiently (otherwise srun was using 1 core per
> task instead), and I am trying to find users which are now running in a
> non-efficient way.

I suggest setting SRUN_CPUS_PER_TASK in the environment or using a cli_filter plugin to automatically set --cpus-per-task.


> However, I know some users running jobs with mpirun, but we usually
> instruct users to use srun instead. Wasn't that the general policy in
> Slurm? (using srun over mpirun). I understand that supporting mpirun may
> trigger some code development and support problems

Yes, we do suggest using srun rather than mpirun (obviously we are biased toward srun). And you are right that if we didn't care about mpirun, we could just make --cpus-per-task both imply --exact and get inherited by steps from the job. However, the reality is that *many* people use mpirun and will continue to use mpirun no matter how much we suggest using srun, so we cannot ignore mpirun.



Let me know if you have any questions about that. I'll work on fixing --cpus-per-gpu.
Comment 9 Marshall Garey 2023-01-13 11:59:29 MST
Hi Marc,

First, we've discussed this internally and we will make --cpus-per-gpu behave like --cpus-per-task: --cpus-per-gpu will imply --exact, and steps will not inherit this option from the job. We will also properly ensure that these options are mutually exclusive, and that a command-line option from one will override the environment variable from the other. For example, --cpus-per-gpu will override SRUN_CPUS_PER_TASK, and --cpus-per-task will override SRUN_CPUS_PER_GPU.

I've gotten quite far in this. However, our merge window for 23.02 is rapidly closing. Unfortunately, it won't be ready in time for 23.02, so we are targeting this for the 23.11 release. In the meantime, I recommend using the workarounds that we have discussed.

- Marshall
Comment 10 Marc Caubet Serrabou 2023-01-16 07:22:49 MST
Hi,

Thanks a lot for the update. From that I understand that this fix cannot be backported to v22.05 due to its complexity, is that correct? We usually wait a few months before we upgrade our production system to a new major release.

In the meantime, we are offering users a recipe of how to deal with these problems.

Best regards,
Marc
Comment 11 Marshall Garey 2023-01-16 09:14:06 MST
The complexity is certainly part of it, but the primary reason it won't be part of a maintenance version of Slurm is that it is not only a bug fix but also introduces a change in behavior to a user command. We try to include only bug fixes in maintenance releases. We also highlight behavior changes in the RELEASE_NOTES file for the next Slurm version.

In addition, previous changes to step allocation behavior (including the change to --cpus-per-task implying --exact) have caused some problems. So, we are being more cautious with behavioral changes to step allocations.
Comment 43 Marshall Garey 2023-05-23 15:22:38 MDT
Hi,

We fixed --cpus-per-gpu for steps in the following range of commits:

159a0cab09..200a47802c

They will be part of 23.11.0rc1.
You can check out the master branch on GitHub to test the new behavior.

There is one important thing that is different from what I told you:
--cpus-per-gpu *will* be inherited from the job to steps. In a separate internal ticket, we are making some changes to allow this to happen while also not breaking mpirun. Because of this, we can also allow --cpus-per-task to be inherited from the job to the steps again in 23.11.

Closing as resolved/fixed.

Let me know if you have any questions about this.
Comment 44 Marc Caubet Serrabou 2023-07-27 20:16:32 MDT
I will be out of the office until August 2.

For all urgent matters that need immediate assistance, then please contact:

  *   merlin-admins@lists.psi.ch for any matters related to the Merlin cluster
  *   meg-admins@lists.psi.ch for any matters related to the MEG cluster
  *   psi-hpc-at-cscs-admin@lists.psi.ch for any matters related to PSI Projects at CSCS

Thanks a lot,
Marc
Comment 45 Spencer Bliven 2023-07-31 07:14:29 MDT
Thanks Marshall. Good to hear that both --cpus-per-* options will be inherited. I look forward to testing this.
Comment 46 Marshall Garey 2023-07-31 08:32:05 MDT
You're welcome. Just so you know, if you do some testing on the master branch: we did find that my work on this caused a couple of regressions, but we have fixes in progress. (The master branch is always considered unstable.)