Hi, after upgrading to v22.05.6 we are experiencing problems with the option --cpus-per-gpu and srun, which seems to be a clear bug.

Within the allocation or batch script, --cpus-per-gpu works:

```
❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# salloc --clusters=gmerlin6 --partition=gpu-maint -n 1 --gpus=2 --cpus-per-gpu=6
salloc: Pending job allocation 70905
salloc: job 70905 queued and waiting for resources
salloc: job 70905 has been allocated resources
salloc: Granted job allocation 70905
salloc: Waiting for resource configuration
salloc: Nodes merlin-g-100 are ready for job
(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# python -c "import os; print(os.sched_getaffinity(0))"
{48, 49, 50, 51, 52, 53, 176, 177, 178, 179, 180, 181}
```

However, as soon as srun is used, it does not work:

```
(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun python -c "import os; print(os.sched_getaffinity(0))"
{48, 176}
```

This is also true when explicitly defining the setting:

```
(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun --gpus=2 --cpus-per-gpu=6 python -c "import os; print(os.sched_getaffinity(0))"
{48, 176}
```

Notice that when setting --cpus-per-task it works:

```
(base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU * $SLURM_NTASKS)) python -c "import os; print(os.sched_getaffinity(0))"
{48, 49, 50, 51, 52, 53, 176, 177, 178, 179, 180, 181}
```

However, this is puzzling, since the man pages clearly state that --cpus-per-gpu and --cpus-per-task are not compatible.

Is this bug already known to SchedMD? Can a fix be provided for Slurm v22.05?

On the other hand, also related to srun, --cpus-per-task in srun no longer works the way it did in the past. This change is stated in the Highlights section of the Release Notes document for Slurm v22.05. However, this change of behavior is really confusing for users, mainly when considering that other variables are properly propagated.
It would be great to restore the previous behavior or to find a way to keep the behavior coherent across all the different environment variables.

Thanks a lot,
Marc
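For reference, the arithmetic workaround above can be wrapped in a small guard so it degrades gracefully when a job was not submitted with --cpus-per-gpu. This is a sketch, not an official recipe; the `want_cpus` helper name is ours, and it assumes the standard Slurm output environment variables (SLURM_GPUS, SLURM_CPUS_PER_GPU, SLURM_NTASKS) are set by the allocation:

```shell
# Sketch of the workaround: derive a --cpus-per-task value from the
# job's GPU request, falling back to 1 CPU per task when the job was
# not submitted with --cpus-per-gpu. The helper name want_cpus is
# illustrative, not part of Slurm.
want_cpus() {
    local gpus="${SLURM_GPUS:-0}"
    local cpg="${SLURM_CPUS_PER_GPU:-0}"
    local ntasks="${SLURM_NTASKS:-1}"
    if [ "$gpus" -gt 0 ] && [ "$cpg" -gt 0 ]; then
        echo $((gpus * cpg * ntasks))
    else
        echo 1
    fi
}

# Usage inside the allocation would then be:
#   srun --cpus-per-task=$(want_cpus) python my_script.py
```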
Created attachment 28200 [details] Slurm.conf
I find it quite confusing that srun job steps no longer inherit all resources from the parent job. We commonly submit scripts to sbatch with a single srun step for MPI or OpenMP software. Is a different paradigm recommended now? I'm confused why this default was changed, and why some options are inherited (e.g. --gpus) but others are not (--cpus-per-task, --cpus-per-gpu). It would be nice to get best practices for ensuring all resources are available to a job step, given that some options may be overridden at submission time on the command line.

I use the following sbatch script for testing:

```
$ cat gpu.sh
#!/bin/bash
#SBATCH -J slurmgpu
#SBATCH -n 1
#SBATCH --gpus=2
#SBATCH --cluster=gmerlin6
#SBATCH --cpus-per-gpu=6
#SBATCH -p gpu-maint
#SBATCH --time=00:01:00
#SBATCH --output=gpu.log

run () {
    echo "### $* ###"
    $* python -c 'import os; print(os.sched_getaffinity(0))'
    #$* env
}

hostname
run
run srun
run srun --cpus-per-gpu=$SLURM_CPUS_PER_GPU
run srun --exact --cpus-per-gpu=$SLURM_CPUS_PER_GPU
run srun --cpus-per-task=$SLURM_CPUS_PER_GPU
run srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU))
run srun --cpus-per-gpu=$SLURM_CPUS_PER_GPU --cpus-per-task=$SLURM_CPUS_PER_GPU
run srun --gpus=$SLURM_GPUS --cpus-per-gpu=$SLURM_CPUS_PER_GPU
run srun --gpus=$SLURM_GPUS --cpus-per-task=$SLURM_CPUS_PER_GPU
run srun --gpus=$SLURM_GPUS --cpus-per-gpu=$SLURM_CPUS_PER_GPU --cpus-per-task=$SLURM_CPUS_PER_GPU

$ sbatch gpu.sh
$ cat gpu.log
merlin-g-008.psi.ch
### ###
{0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15}
### srun ###
{0}
### srun --cpus-per-gpu=6 ###
{0}
### srun --exact --cpus-per-gpu=6 ###
{0}
### srun --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
### srun --cpus-per-task=12 ###
{0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15}
### srun --cpus-per-gpu=6 --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
### srun --gpus=2 --cpus-per-gpu=6 ###
{0}
### srun --gpus=2 --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
### srun --gpus=2 --cpus-per-gpu=6 --cpus-per-task=6 ###
{0, 1, 2, 3, 4, 5}
```

As Marc mentions, --cpus-per-gpu is ignored, and only by explicitly computing the expected number of CPUs am I able to use the full reservation within the task.
> Is this bug already known by SchedMD? Can a fix be provided for Slurm v22.05?

No, thanks for reporting the bug. I'll work on fixing --cpus-per-gpu, hopefully for 22.05.

Do you set --exact in a cli_filter plugin? The behavior I see does not exactly match the behavior that you see, but I do see the problem with --cpus-per-gpu.

My node hardware configuration: 1 socket, 8 cores, 2 threads per core.

```
$ salloc --gpus=2 --cpus-per-gpu=6
salloc: Granted job allocation 4284
$ python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
$ srun python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
```

Because I did not request --exact, srun gets the whole allocation. This is different from what you see. Do you set SLURM_EXACT, SLURM_EXCLUSIVE, or SRUN_CPUS_PER_TASK anywhere in the environment, maybe with a bashrc or with a cli_filter plugin? All of those imply --exact. If so, that would explain why your srun only gets one core rather than the whole allocation.

Now, I show how --cpus-per-gpu doesn't affect the step allocation's CPUs:

```
$ srun --gpus=1 --cpus-per-gpu=4 python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
$ srun --exact --gpus=1 --cpus-per-gpu=4 python3 -c "import os; print(os.sched_getaffinity(0))"
{0, 8}
$ sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
JobID|AllocTRES|
4284|billing=12,cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
4284.interactive|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
4284.0|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
4284.1|cpu=12,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|
4284.2|cpu=1,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|
```

Notes:
* --cpus-per-gpu does *not* imply --exact
* --cpus-per-task implies --exact
* --exact does not look at the number of CPUs requested with --cpus-per-gpu; therefore, --cpus-per-gpu is kind of useless for step allocations right now. This is a problem.
> Notice that when setting --cpus-per-task it works:
>
> (base) ❄ [caubet_m@merlin-g-100:/data/user/caubet_m]# srun --cpus-per-task=$(($SLURM_GPUS * $SLURM_CPUS_PER_GPU * $SLURM_NTASKS)) python -c "import os; print(os.sched_getaffinity(0))"
> {48, 49, 50, 51, 52, 53, 176, 177, 178, 179, 180, 181}
>
> However, this is puzzling, since in the man pages it's clearly stated that
> --cpus-per-gpu and --cpus-per-task are not compatible.

In Slurm, the command line overrides environment variables. In this case, you have a SLURM_CPUS_PER_GPU environment variable that is read by srun, and you also have --cpus-per-task on the srun command line; since the latter is on the command line, it overrides the SLURM_CPUS_PER_GPU environment variable.

However, Spencer showed using srun with --cpus-per-task and --cpus-per-gpu at the same time. We only enforce this mutual exclusivity on the job, not on the step. This is another bug.

Enforced on the job:

```
$ salloc --gpus=2 --cpus-per-gpu=6 --cpus-per-task=1
salloc: error: --cpus-per-gpu is mutually exclusive with --cpus-per-task
salloc: error: Invalid generic resource (gres) specification
```

Not enforced on the step:

```
$ srun --cpus-per-task=4 --cpus-per-gpu=2 --gpus=1 -n1 hostname
voyager
```

> On the other hand, also related to srun, --cpus-per-task in srun does not
> work in the same way it was working in the past. This change is stated in
> the Highlights section, in the Release Notes document for Slurm v22.05.
> However, this change of behavior is really confusing for users, mainly when
> considering that other variables are properly propagated. It would be great
> to restore the previous behavior or to find a way to keep coherency for all
> the different environment variables.

* In the past, --cpus-per-task did not imply --exact. But users were confused why steps still got all the CPUs in the job allocation even though they requested --cpus-per-task such that the step should get fewer CPUs. See bug 11275. This is why we made --cpus-per-task imply --exact.
* Bug 13351: --exact breaks mpirun, so we cannot have --cpus-per-task be inherited by srun if --cpus-per-task implies --exact.
* Slurm 22.05: --cpus-per-task implies --exact, but srun no longer reads SLURM_CPUS_PER_TASK, so it does not automatically inherit the job's --cpus-per-task value. This is a compromise between the problem in bug 11275 (--cpus-per-task didn't change step allocations without --exact) and bug 13351 (--exact breaks mpirun). See https://bugs.schedmd.com/show_bug.cgi?id=13351#c76

Do you have any questions about that?

Arguably, --cpus-per-gpu should behave the same way as --cpus-per-task. If we make --cpus-per-gpu imply --exact, then it would have the same problem as --cpus-per-task in that it would break mpirun. So then we'd have to make --cpus-per-gpu also not get inherited by srun. I'll discuss this internally, but this would not be a change we could make in 22.05.

My TODOs:
* Fix --cpus-per-gpu and --exact to work together properly.
* Make --cpus-per-gpu and --cpus-per-task mutually exclusive for steps as well as jobs.
* Discuss internally whether we should make --cpus-per-gpu behave the same way as --cpus-per-task in 23.02.
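To make the notes above concrete, here is a small diagnostic in the same style as the Python one-liners used in this thread: it compares the CPUs a step can actually run on against what --cpus-per-gpu should have granted. This is our own sketch, not a Slurm tool; the `check_step_cpus` name is ours, it assumes a single node, and it relies only on the standard SLURM_* output variables:

```python
# Sketch: detect the --cpus-per-gpu step-allocation problem from inside
# a step. Assumes one node and the standard Slurm output environment
# variables; check_step_cpus is an illustrative helper, not a Slurm API.
import os

def check_step_cpus():
    got = len(os.sched_getaffinity(0))  # CPUs this process may run on
    gpus = int(os.environ.get("SLURM_GPUS", 0))
    cpg = int(os.environ.get("SLURM_CPUS_PER_GPU", 0))
    expected = gpus * cpg
    if expected and got < expected:
        print(f"step has {got} CPUs but --cpus-per-gpu implies {expected}")
    return got, expected
```

Run under `srun` inside an affected allocation, this would print the warning whenever the step was given fewer CPUs than the job's --cpus-per-gpu request implies.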
Hi,

> Do you set --exact in a cli_filter plugin? The behavior I see does not exactly
> match the behavior that you see, but I do see the problem with --cpus-per-gpu.

We do not set --exact; we have a pretty simple configuration without extra plugins (nor in the prolog). Therefore, --exact is not enforced:

```
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# salloc --clusters=gmerlin6 --partition=gpu-maint -n 1 --gpus=2 --cpus-per-gpu=6
salloc: Granted job allocation 71749
salloc: Waiting for resource configuration
salloc: Nodes merlin-g-014 are ready for job
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXACT
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXCLUSIVE
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SRUN_CPUS_PER_TASK
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# python -c "import os; print(os.sched_getaffinity(0))"
{32, 33, 34, 35, 24, 25, 26, 27, 28, 29, 30, 31}
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun python -c "import os; print(os.sched_getaffinity(0))"
{24}
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --gpus=1 --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
{24}
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --exact --gpus=1 --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
{24}
```

> My node hardware configuration:
> 1 socket, 8 cores, 2 threads per core.
>
> $ salloc --gpus=2 --cpus-per-gpu=6
> salloc: Granted job allocation 4284
> $ python3 -c "import os; print(os.sched_getaffinity(0))"
> {0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
> $ srun python3 -c "import os; print(os.sched_getaffinity(0))"
> {0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13}
>
> Because I did not request --exact, srun gets the whole allocation. This is
> different from what you see. Do you set SLURM_EXACT, SLURM_EXCLUSIVE, or
> SRUN_CPUS_PER_TASK anywhere in the environment, maybe with a bashrc or with a
> cli_filter plugin?
> All of those imply --exact. If so, that will explain why
> your srun only gets one core rather than the whole allocation.

That's clearly different from what we see. However, as mentioned above, no --exact.

> $ sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
> JobID|AllocTRES|
> 4284|billing=12,cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
> 4284.interactive|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
> 4284.0|cpu=12,gres/gpu:tty1=2,gres/gpu=2,mem=1200M,node=1|
> 4284.1|cpu=12,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|
> 4284.2|cpu=1,gres/gpu:tty1=1,gres/gpu=1,mem=1200M,node=1|

Here is my output:

```
(base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
JobID|AllocTRES|
71749|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.interactive|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.extern|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.0|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
71749.1|cpu=12,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|
71749.2|cpu=1,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|
```

Notice that we set DefMemPerCPU=5120 by default, which is then overridden in some partitions. That's why I get higher allocated memory. I wonder whether this setting can be problematic now with this release (does DefMemPerCPU imply --exact now?).

> * In the past, --cpus-per-task did not imply --exact. But users were confused
> why steps still got all the cpus in the job allocation even though they
> requested --cpus-per-task such that the step should get fewer cpus. See bug
> 11275. This is why we made --cpus-per-task imply --exact.
> * Bug 13351: --exact breaks mpirun, so we cannot have --cpus-per-task be
> inherited by srun if --cpus-per-task implies --exact.
> * Slurm 22.05: --cpus-per-task implies --exact, but srun no longer reads
> SLURM_CPUS_PER_TASK so it does not automatically inherit the job's
> --cpus-per-task value. This is a compromise between the problem in bug 11275
> (--cpus-per-task didn't change step allocations without --exact) and bug 13351
> (--exact breaks mpirun). See https://bugs.schedmd.com/show_bug.cgi?id=13351#c76
>
> Do you have any questions about that?

Yes, this is clear, thanks a lot. However, having some variables inherited and some others not is also confusing for users. Probably there is no ideal solution for it. In our case, we have many OpenMP and hybrid (MPI+OpenMP) jobs, so many users had to change their job scripts to make them run efficiently (otherwise srun was using 1 core per task instead), and I am trying to find users who are now running in a non-efficient way.

> Arguably, --cpus-per-gpu should behave the same way as --cpus-per-task. If we
> make --cpus-per-gpu imply --exact, then it would have the same problem as
> --cpus-per-task in that it would break mpirun. So then we'd have to make
> --cpus-per-gpu also not get inherited by srun. I'll discuss this internally,
> but this would not be a change we could make in 22.05.

Perfect, at least this would be coherent with the current behavior of --cpus-per-task. However, I know some users run jobs with mpirun, but we usually instruct users to use srun instead. Wasn't that the general policy in Slurm (using srun over mpirun)? I understand that supporting mpirun may trigger some code development and support problems.

> My TODO's:
> * Fix --cpus-per-gpu and --exact to work together properly.
> * Make --cpus-per-gpu and --cpus-per-task mutually exclusive for steps as well
> as jobs.
> * Discuss internally whether we should make --cpus-per-gpu behave the same way
> as --cpus-per-task in 23.02.

Perfect.
Thanks a lot,
Marc
> We do not set --exact, we have a pretty simple configuration without
> extra plugins (nor in prolog). Therefore, --exact is not enforced:
>
> (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# salloc --clusters=gmerlin6 --partition=gpu-maint -n 1 --gpus=2 --cpus-per-gpu=6
> salloc: Granted job allocation 71749
> salloc: Waiting for resource configuration
> salloc: Nodes merlin-g-014 are ready for job
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXACT
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SLURM_EXCLUSIVE
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# env | grep SRUN_CPUS_PER_TASK
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# python -c "import os; print(os.sched_getaffinity(0))"
> {32, 33, 34, 35, 24, 25, 26, 27, 28, 29, 30, 31}
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun python -c "import os; print(os.sched_getaffinity(0))"
> {24}
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --gpus=1 --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
> {24}
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# srun --exact --gpus=1 --cpus-per-gpu=4 python -c "import os; print(os.sched_getaffinity(0))"
> {24}
>
> Here is my output:
>
> (base) ❄ [caubet_m@merlin-g-014:/data/user/caubet_m]# sacct -o jobid,alloctres -p -j $SLURM_JOB_ID
> JobID|AllocTRES|
> 71749|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
> 71749.interactive|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
> 71749.extern|billing=12,cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
> 71749.0|cpu=12,gres/gpu:geforce_rtx_2080_ti=2,gres/gpu=2,mem=60G,node=1|
> 71749.1|cpu=12,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|
> 71749.2|cpu=1,gres/gpu:geforce_rtx_2080_ti=1,gres/gpu=1,mem=60G,node=1|

I know what happened. It is the CPU binding.
Although the step was allocated 12 CPUs, in slurmd it was bound to a single core because TaskPluginParam=autobind=core is set. Without that, slurmd defaults to autobinding to sockets.

```
marshall@curiosity:~/slurm/22.05/install/c1$ salloc --cpus-per-gpu=6 --gpus=2
salloc: Granted job allocation 2288
salloc: Waiting for resource configuration
salloc: Nodes n1-2 are ready for job
marshall@curiosity:~/slurm/22.05/install/c1$ srun whereami
0000 n1-2 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8
marshall@curiosity:~/slurm/22.05/install/c1$ srun --cpu-bind=socket whereami
0000 n1-2 - Cpus_allowed: 00003f3f Cpus_allowed_list: 0-5,8-13
```

This also happens in 21.08 with --cpus-per-gpu. Here is an example in 21.08:

```
$ salloc -V
slurm 21.08.2
$ srun -V
slurm 21.08.2
$ slurmctld -V
slurm 21.08.2
$ salloc --cpus-per-gpu=6 --gpus=2
salloc: Granted job allocation 5
salloc: Waiting for resource configuration
salloc: Nodes n1-2 are ready for job
$ srun whereami
0000 n1-2 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8
marshall@curiosity:~/slurm/22.05/install/c1$ srun --cpu-bind=socket whereami
0000 n1-2 - Cpus_allowed: 00003f3f Cpus_allowed_list: 0-5,8-13
```

This is because --cpus-per-gpu isn't affecting cpus per task, which is what tells slurmd to allocate more CPUs per task. --cpus-per-task gets inherited by the step in 21.08, so that tells slurmd to allocate more CPUs per task even though autobind=cores.

```
$ salloc --cpus-per-task=6 -n1
salloc: Granted job allocation 6
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
$ srun whereami
0000 n1-1 - Cpus_allowed: 00000707 Cpus_allowed_list: 0-2,8-10
```

But in 22.05, --cpus-per-task is not inherited by the step, so slurmd doesn't know to allocate more CPUs per task and only allocates 1 CPU for the task; then slurmd autobinds the task to a core.
Example in 22.05:

```
$ salloc --cpus-per-task=6 -n1
salloc: Granted job allocation 2289
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
$ srun whereami
0000 n1-1 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8
```

The workaround we have been suggesting is for admins to set the SRUN_CPUS_PER_TASK environment variable in the job's environment (like in a shell rc):

```
$ export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
$ srun whereami
0000 n1-1 - Cpus_allowed: 00000707 Cpus_allowed_list: 0-2,8-10
```

Or, you can use a cli_filter plugin to set SRUN_CPUS_PER_TASK or --cpus-per-task. I'm using lua-posix to access the getenv/setenv functions.

cli_filter.lua:

```
function slurm_cli_pre_submit(options, pack_offset)
    slurm.log_info("Function: %s", "pre_submit")
    p = require("posix")
    cpus_per_task = p.getenv("SLURM_CPUS_PER_TASK")
    if (cpus_per_task ~= nil) then
        --Set the environment variable:
        --p.setenv("SRUN_CPUS_PER_TASK", cpus_per_task)
        --cpus_per_task = p.getenv("SRUN_CPUS_PER_TASK")
        --slurm.log_info("SRUN_CPUS_PER_TASK=%u", cpus_per_task)

        --Or set the command-line option:
        options['cpus-per-task'] = cpus_per_task
        slurm.log_info("SRUN_CPUS_PER_TASK=%u", options['cpus-per-task'])
    end
    return slurm.SUCCESS
end
```

Or you can set the environment variable just in the job after it has been submitted:

```
function slurm_cli_post_submit(offset, job_id, step_id)
    slurm.log_info("Function: %s", "post_submit")
    p = require("posix")
    cpus_per_task = p.getenv("SLURM_CPUS_PER_TASK")
    if (cpus_per_task ~= nil) then
        p.setenv("SRUN_CPUS_PER_TASK", cpus_per_task)
        cpus_per_task = p.getenv("SRUN_CPUS_PER_TASK")
        slurm.log_info("SRUN_CPUS_PER_TASK=%u", cpus_per_task)
    end
    return slurm.SUCCESS
end
```

Whichever method you use, --cpus-per-task will automatically be set on any srun within a job:

```
$ salloc --cpus-per-task=6 -n1
salloc: lua: Function: pre_submit
salloc: Granted job allocation 2298
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
srun: lua: Function: pre_submit
srun: lua: SRUN_CPUS_PER_TASK=6
$ srun whereami
srun: lua: Function: pre_submit
srun: lua: SRUN_CPUS_PER_TASK=6
0000 n1-1 - Cpus_allowed: 00000707 Cpus_allowed_list: 0-2,8-10
```

Since --cpus-per-gpu is broken right now, you can also try setting --cpus-per-task automatically from SLURM_CPUS_PER_GPU and other options. If you have any questions, let me know.

> Notice that we set DefMemPerCPU=5120 by default, which is overwritten
> then in some partitions. That's why I get higher allocated memory. I
> wonder whether this setting can be problematic now with this release
> (does DefMemPerCPU imply --exact now?).

No, DefMemPerCPU doesn't imply --exact. The allocated memory doesn't matter in this case. I have DefMemPerCPU=100, which is why I got 1200 MB of allocated memory. I was really just looking at allocated CPUs.

> Yes, this is clear, thanks a lot. However, having some variables being
> inherited and some other not, it's also confusing users. Probably there
> is not an ideal solution for it. In our case, we have many OpenMP and
> Hybrid (MPI+OpenMP) based jobs, so many users had to change the job
> scripts to make it run efficiently (otherwise srun was using 1 core per
> task instead), and I am trying to find users which are now running in a
> non-efficient way.

I suggest setting SRUN_CPUS_PER_TASK in the environment or using a cli_filter plugin to automatically set --cpus-per-task.

> However, I know some users running jobs with mpirun, but we usually
> instruct users to use srun instead. Wasn't that the general policy in
> Slurm? (using srun over mpirun). I understand that supporting mpirun may
> trigger some code development and support problems

Yes, we do suggest using srun rather than mpirun (obviously we are biased toward srun).
And you are right that if we didn't care about mpirun, we could just make --cpus-per-task both imply --exact and get inherited by steps from the job. However, the reality is that *many* people use mpirun and will continue to use mpirun no matter how much we suggest using srun, so we cannot ignore mpirun.

Let me know if you have any questions about that. I'll work on fixing --cpus-per-gpu.
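For sites that prefer the environment-variable route over a cli_filter plugin, the export discussed above can be guarded so that it is a no-op outside Slurm jobs. This is a sketch for a system-wide shell rc fragment (assuming bash; the guard conditions are our own addition, not an official SchedMD recipe):

```shell
# Sketch for a shell rc fragment: propagate the job's --cpus-per-task
# to every srun in the job by setting SRUN_CPUS_PER_TASK, but only
# when we are actually inside a Slurm job that set SLURM_CPUS_PER_TASK.
if [ -n "${SLURM_JOB_ID:-}" ] && [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
    export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
fi
```

On login nodes (no SLURM_JOB_ID) the fragment does nothing, so interactive srun invocations outside an allocation are unaffected.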
Hi Marc,

First, we've discussed this internally, and we will make --cpus-per-gpu behave like --cpus-per-task: --cpus-per-gpu will imply --exact, and steps will not inherit this option from the job. We will also properly ensure that these options are mutually exclusive, and that a command-line option for one will override the environment variable for the other. For example, --cpus-per-gpu will override SRUN_CPUS_PER_TASK, and --cpus-per-task will override SRUN_CPUS_PER_GPU.

I've gotten quite far in this. However, our merge window for 23.02 is rapidly closing. Unfortunately, it won't be ready in time for 23.02, so we are targeting this for the 23.11 release. In the meantime, I recommend using the workarounds that we have discussed.

- Marshall
Hi,

Thanks a lot for the update. From that I understand that this fix cannot be backported to v22.05 due to its complexity, am I correct?

We usually wait a few months until we upgrade our production system to a new major release. In the meantime, we are offering users a recipe for how to deal with these problems.

Best regards,
Marc
The complexity is certainly part of it, but the primary reason it won't be part of a maintenance version of Slurm is that it is not just a bug fix; it also introduces a change in behavior in a user command. We try to include only bug fixes in maintenance releases, and we highlight behavior changes in the RELEASE_NOTES file for the next Slurm version.

In addition, previous changes to step allocation behavior (including the change to --cpus-per-task implying --exact) have caused some problems, so we are being more cautious with behavioral changes to step allocations.
Hi,

We fixed --cpus-per-gpu for steps in the following range of commits: 159a0cab09..200a47802c. They will be part of 23.11.0rc1. You can check out the master branch on GitHub to test the new behavior.

There is one important thing that is different from what I told you: --cpus-per-gpu *will* be inherited from the job to steps. In a separate internal ticket, we are making some changes to allow this to happen while also not breaking mpirun. Because of this, we can also allow --cpus-per-task to be inherited from the job to the steps again in 23.11.

Closing as resolved/fixed. Let me know if you have any questions about this.
Thanks Marshall. Good to hear that both --cpus-per options will be inherited. I look forward to testing this.
You're welcome. Just so you know, if you do some testing on the master branch: we did find that my work on this caused a couple of regressions, but fixes are in progress. (The master branch is always considered unstable.)