Ticket 14298

Summary: CPU binding outside of job step allocation
Product: Slurm
Reporter: Chris Rutledge <rutledge.chris>
Component: slurmd
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract
Priority: ---
CC: will
Version: 22.05.0   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=15856
Site: -Other-
Attachments: slurm.conf

Description Chris Rutledge 2022-06-10 11:44:53 MDT
Created attachment 25459
slurm.conf

I'm having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue from a compute node. Not every job reproduces the issue, but I have a few that do. Here's one case that consistently fails at launch. I have not been able to reproduce the issue when submitting jobs from the login node.


##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel

srun ./hpcc

mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$
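
For reference, the hexadecimal mask in the srun error can be decoded into CPU IDs by hand. The following is a minimal sketch, assuming the usual taskset-style convention that bit n of the mask (counting from the least significant bit) stands for CPU n. The helper name mask_to_cpus is hypothetical and is not part of Slurm.

```sh
# Hypothetical helper: print the CPU IDs set in a hexadecimal CPU mask.
# Assumes bit n (least significant bit = 0) corresponds to CPU n.
# The function body runs in a subshell so its variables stay local.
mask_to_cpus() (
    hex=${1#0x}      # strip an optional 0x prefix
    cpus=""
    i=0              # index of the current hex digit, counted from the right
    while [ -n "$hex" ]; do
        rest=${hex%?}           # everything except the last digit
        digit=${hex#"$rest"}    # the last digit
        hex=$rest
        value=$(printf '%d' "0x$digit")
        bit=0
        # Each hex digit covers CPUs 4*i through 4*i+3.
        while [ "$bit" -lt 4 ]; do
            if [ $(( ($value >> $bit) & 1 )) -eq 1 ]; then
                cpus="$cpus $(( 4 * $i + $bit ))"
            fi
            bit=$(( $bit + 1 ))
        done
        i=$(( $i + 1 ))
    done
    echo "${cpus# }"
)

mask_to_cpus 0x000000000001000000000001   # prints: 0 48
```

Under that assumption, the allocation reported on gpu-5-2 consists of CPUs 0 and 48 only, so any bind request naming other CPUs would fall outside it.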

Comment 1 Will F 2022-11-07 06:17:30 MST
Chris: I've also seen this recently under 22.05. I _think_ the issue is that SLURM_CPU_BIND is inherited when sbatch is invoked, so there is _sometimes_ a mismatch between the value of SLURM_CPU_BIND in the batch job and the taskset of the batch job. If you 'unset SLURM_CPU_BIND' before running sbatch, the issue doesn't seem to occur.
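
The workaround can be scoped to a single command so that the interactive session keeps its own binding. This is only a sketch of the 'unset SLURM_CPU_BIND' approach described above. The wrapper name without_cpu_bind is hypothetical, and it clears only SLURM_CPU_BIND, since that is the one variable identified here.

```sh
# Hypothetical wrapper: run a command with SLURM_CPU_BIND removed from its
# environment. The unset happens in a subshell, so the calling shell keeps
# whatever value it inherited from the interactive srun session.
without_cpu_bind() (
    unset SLURM_CPU_BIND
    "$@"
)

# Example use from inside the interactive session (not executed here):
#   without_cpu_bind sbatch job
```

Running 'echo "$SLURM_CPU_BIND"' and 'taskset -p $$' inside the interactive session before submitting is a quick way to see what the batch job would otherwise inherit.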

It seems like this is a change in behaviour in 22.05, but I'm not sure what's caused it. Possibly a side effect of one of the following changes:

 -- Fail srun when using invalid --cpu-bind options (e.g. --cpu-bind=map_cpu:99
    when only 10 cpus are allocated).
 -- srun --overlap now allows the step to share all resources (CPUs, memory, and
    GRES), where previously --overlap only allowed the step to share CPUs with
    other steps.