Created attachment 25459 [details]
slurm.conf

Having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue while on a compute resource. I can't reproduce the issue with every job, but I've got a few that will; here's one case that consistently errors when trying to launch. I've not been able to reproduce the issue when submitting jobs from the login node.

##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel

srun ./hpcc
mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$
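(For triage, it may help to compare the submission environment on the login node with the one inside the srun session, since the problem only appears when submitting from the latter. The commands below are only an illustrative sketch: the env.login/env.compute filenames are made up, and a shared home filesystem between the two nodes is assumed.)

##############################
# compare submission environments (illustrative)
##############################
[crutledge@largemem-5-1 gpu-6]$ env | grep '^SLURM_' | sort > ~/env.compute
[crutledge@ht1 ~]$ env | grep '^SLURM_' | sort > ~/env.login
[crutledge@ht1 ~]$ diff ~/env.login ~/env.compute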
Chris: I've also seen this recently under 22.05. I _think_ the issue is SLURM_CPU_BIND being inherited when sbatch is invoked, so that there is _sometimes_ a mismatch between the value of SLURM_CPU_BIND in the batch job and the taskset of the batch job: if you 'unset SLURM_CPU_BIND' before running sbatch, the issue doesn't seem to occur.

This seems to be a change in behaviour in 22.05, but I'm not sure what's caused it. Possibly a side effect of one of the following changes:

 -- Fail srun when using invalid --cpu-bind options (e.g. --cpu-bind=map_cpu:99 when only 10 cpus are allocated).
 -- srun --overlap now allows the step to share all resources (CPUs, memory, and GRES), where previously --overlap only allowed the step to share CPUs with other steps.
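A minimal sketch of the workaround described above, assuming the same setup as in the report (env -u is the GNU coreutils one-shot equivalent of unset; whether any other inherited SLURM_* variables also need clearing is untested):

##############################
# workaround: clear inherited binding before submitting
##############################
[crutledge@largemem-5-1 gpu-6]$ unset SLURM_CPU_BIND
[crutledge@largemem-5-1 gpu-6]$ sbatch job

# or, without touching the current shell:
[crutledge@largemem-5-1 gpu-6]$ env -u SLURM_CPU_BIND sbatch job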