Bug 14298 - CPU binding outside of job step allocation
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.0
Hardware: Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2022-06-10 11:44 MDT by Chris Rutledge
Modified: 2023-01-24 14:58 MST

Site: -Other-


Attachments
slurm.conf (6.16 KB, text/plain)
2022-06-10 11:44 MDT, Chris Rutledge

Description Chris Rutledge 2022-06-10 11:44:53 MDT
Created attachment 25459 [details]
slurm.conf

Having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue from within a compute resource. Not every job reproduces the issue, but I've got a few that trigger it reliably. Here's one case that consistently errors when trying to launch. I've not been able to reproduce the issue when submitting jobs from the login node.


##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel

srun ./hpcc

mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$
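For reference, the 96-bit hex mask in the srun error above has bits 0 and 48 set, i.e. the step was allocated CPUs 0 and 48. A quick sketch that decodes it (hypothetical helper in plain bash, not part of Slurm; bash integers are 64-bit, so it walks the hex digits rather than converting the whole mask at once):

```shell
# Decode the allocated-CPU hex mask from the srun error into CPU indices.
mask="000000000001000000000001"
bit=0
cpus=""
for (( i=${#mask}-1; i>=0; i-- )); do        # right-to-left, 4 bits per hex digit
  d=$(( 16#${mask:i:1} ))
  for b in 0 1 2 3; do
    if (( (d >> b) & 1 )); then cpus="$cpus $bit"; fi
    bit=$(( bit + 1 ))
  done
done
echo "allocated CPUs:$cpus"                  # -> allocated CPUs: 0 48
```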
Comment 1 Will F 2022-11-07 06:17:30 MST
Chris: I've also seen this recently under 22.05. I _think_ the issue is that SLURM_CPU_BIND is inherited when sbatch is invoked, so there is _sometimes_ a mismatch between the value of SLURM_CPU_BIND in the batch job and the taskset of the batch job: if you 'unset SLURM_CPU_BIND' before running sbatch, the issue doesn't seem to occur.
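A sketch of the suspected path and the workaround (the SLURM_CPU_BIND value below is hypothetical; the environment part needs no cluster). sbatch propagates the submission environment by default, so a batch job submitted from inside an interactive step inherits that step's binding even though the batch job's own allocation is generally different:

```shell
# Inside the interactive step, Slurm has exported a binding matching
# *that* step's allocation (hypothetical value for illustration):
export SLURM_CPU_BIND=quiet,mask_cpu:0x000000000001000000000001
# Workaround from the comment above: drop the variable (and the companion
# SLURM_CPU_BIND_* variables Slurm exports alongside it) before submitting.
unset SLURM_CPU_BIND SLURM_CPU_BIND_LIST SLURM_CPU_BIND_TYPE SLURM_CPU_BIND_VERBOSE
echo "SLURM_CPU_BIND is now: '${SLURM_CPU_BIND:-}'"
# sbatch job    # the batch job now sets up its own binding instead of inheriting one
```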

It seems like this is a change in behaviour in 22.05, but I'm not sure what's caused it. Possibly a side effect of one of the following changes:

 -- Fail srun when using invalid --cpu-bind options (e.g. --cpu-bind=map_cpu:99
    when only 10 cpus are allocated).
 -- srun --overlap now allows the step to share all resources (CPUs, memory, and
    GRES), where previously --overlap only allowed the step to share CPUs with
    other steps.