Bug 14298 - CPU binding outside of job step allocation
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.0
Hardware: Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2022-06-10 11:44 MDT by Chris Rutledge
Modified: 2023-01-24 14:58 MST

Site: -Other-


Attachments
slurm.conf (6.16 KB, text/plain)
2022-06-10 11:44 MDT, Chris Rutledge

Description Chris Rutledge 2022-06-10 11:44:53 MDT
Created attachment 25459 [details]
slurm.conf

Having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue from within a compute resource. Not every job reproduces the issue, but I've got a few that trigger it reliably. Here's one case that consistently errors when trying to launch. I've not been able to reproduce the issue when submitting jobs from the login node.


##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel

srun ./hpcc

mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$
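For reference, the 96-bit hex mask in the srun error above has bits 0 and 48 set, i.e. the step was allocated CPUs 0 and 48. A quick sketch that decodes it (hypothetical helper in plain bash, not part of Slurm; bash integers are 64-bit, so it walks the hex digits rather than converting the whole mask at once):

```shell
# Decode the allocated-CPU hex mask from the srun error into CPU indices.
mask="000000000001000000000001"
bit=0
cpus=""
for (( i=${#mask}-1; i>=0; i-- )); do        # right-to-left, 4 bits per hex digit
  d=$(( 16#${mask:i:1} ))
  for b in 0 1 2 3; do
    if (( (d >> b) & 1 )); then cpus="$cpus $bit"; fi
    bit=$(( bit + 1 ))
  done
done
echo "allocated CPUs:$cpus"                  # -> allocated CPUs: 0 48
```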
Comment 1 Will F 2022-11-07 06:17:30 MST
Chris: I've also seen this recently under 22.05. I _think_ the issue is that SLURM_CPU_BIND is inherited when sbatch is invoked, so there is _sometimes_ a mismatch between the value of SLURM_CPU_BIND in the batch job and the taskset of the batch job: if you 'unset SLURM_CPU_BIND' before running sbatch, the issue doesn't seem to occur.
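A sketch of the suspected path and the workaround (the SLURM_CPU_BIND value below is hypothetical; the environment part needs no cluster). sbatch propagates the submission environment by default, so a batch job submitted from inside an interactive step inherits that step's binding even though the batch job's own allocation is generally different:

```shell
# Inside the interactive step, Slurm has exported a binding matching
# *that* step's allocation (hypothetical value for illustration):
export SLURM_CPU_BIND=quiet,mask_cpu:0x000000000001000000000001
# Workaround from the comment above: drop the variable (and the companion
# SLURM_CPU_BIND_* variables Slurm exports alongside it) before submitting.
unset SLURM_CPU_BIND SLURM_CPU_BIND_LIST SLURM_CPU_BIND_TYPE SLURM_CPU_BIND_VERBOSE
echo "SLURM_CPU_BIND is now: '${SLURM_CPU_BIND:-}'"
# sbatch job    # the batch job now sets up its own binding instead of inheriting one
```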

It seems like this is a change in behaviour in 22.05, but I'm not sure what's caused it. Possibly a side effect of one of the following changes:

 -- Fail srun when using invalid --cpu-bind options (e.g. --cpu-bind=map_cpu:99
    when only 10 cpus are allocated).
 -- srun --overlap now allows the step to share all resources (CPUs, memory, and
    GRES), where previously --overlap only allowed the step to share CPUs with
    other steps.