Bug 6763 - gres-flags=disable-binding disables counting of gres resources
Summary: gres-flags=disable-binding disables counting of gres resources
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.5
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2019-03-27 04:07 MDT by Peter Steinbach
Modified: 2019-04-05 07:27 MDT
CC List: 1 user



Description Peter Steinbach 2019-03-27 04:07:36 MDT
Dear all,

Using these config files,

https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/gres.conf

https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/slurm.conf

I observed some weird behavior of the '--gres-flags=disable-binding' option. With the above .conf files, I created a local Slurm cluster with 3 compute nodes (2 GPUs and 4 cores each).
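
For context, the GRES-relevant lines in those two files look roughly like the following. This is only a sketch reconstructed from the sinfo/scontrol output below, not a verbatim copy of the linked files; in particular the device paths in gres.conf are an assumption:

# slurm.conf (excerpt, assumed)
GresTypes=gpu
NodeName=g[1-3] CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=4000 Gres=gpu:titanxp:2
PartitionName=gpu Nodes=g[1-3] Default=YES State=UP

# gres.conf (excerpt, assumed)
NodeName=g[1-3] Name=gpu Type=titanxp File=/dev/nvidia[0-1]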

# sinfo -N -l
Mon Mar 25 09:20:59 2019
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
g1             1      gpu*        idle    4    1:4:1   4000        0  1   (null) none
g2             1      gpu*        idle    4    1:4:1   4000        0  1   (null) none
g3             1      gpu*        idle    4    1:4:1   4000        0  1   (null) none

I first submitted 3 jobs that consume all available GPUs:

# sbatch --gres=gpu:2 --wrap="env && sleep 600" -o block_2gpus_%A.out --mem=500
Submitted batch job 2
# sbatch --gres=gpu:2 --wrap="env && sleep 600" -o block_2gpus_%A.out --mem=500
Submitted batch job 3
# sbatch --gres=gpu:2 --wrap="env && sleep 600" -o block_2gpus_%A.out --mem=500
Submitted batch job 4
# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 5       gpu     wrap     root  R       0:04      1 g1
                 6       gpu     wrap     root  R       0:01      1 g2
                 7       gpu     wrap     root  R       0:01      1 g3
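
To double-check that all GPUs are now allocated, something like the following can be used (a sketch; the squeue format string %b prints the generic resources requested by each job, and scontrol -d shows the allocated GRES indices):

# squeue -o "%.6i %.9P %.8u %.2t %.10M %.8D %b"
# scontrol -d show job 5 | grep GRES_IDX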

Funnily enough, if I submit a job requesting only one GPU and add --gres-flags=disable-binding, it actually starts running.

# sbatch --gres=gpu:1 --wrap="env && sleep 30" -o use_1gpu_%A.out --mem=500 --gres-flags=disable-binding
Submitted batch job 9
[root@ernie /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 5       gpu     wrap     root  R       1:44      1 g1
                 6       gpu     wrap     root  R       1:41      1 g2
                 7       gpu     wrap     root  R       1:41      1 g3
                 9       gpu     wrap     root  R       0:02      1 g1
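
A variant of that submission which makes the effect visible in the job's own output could look like this (a sketch; nvidia-smi is only meaningful if the node actually has the NVIDIA tools installed):

# sbatch --gres=gpu:1 --gres-flags=disable-binding --mem=500 -o check_1gpu_%A.out \
    --wrap='echo CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}; nvidia-smi -L'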

I am not sure what to think of this. I consider this behavior far from ideal, as our users reported that their jobs die due to insufficient GPU memory being available. That is to be expected, since the already running GPU jobs are using the GPUs (as they should).

I am a bit lost here. Slurm is clever enough to NOT set CUDA_VISIBLE_DEVICES for the job that has '--gres-flags=disable-binding', but that doesn't help our users.
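
Since the blocking jobs run 'env', the difference should also be visible directly in their output files, e.g. (a sketch; going by the allocations shown further down, the first grep should report GPU indices 0,1 and the second should find nothing):

# grep CUDA_VISIBLE_DEVICES block_2gpus_5.out
# grep CUDA_VISIBLE_DEVICES use_1gpu_10.out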

Personally, I believe this is a bug, but I would love to get feedback from other Slurm users/developers.

Thanks in advance -
P

# scontrol show Nodes g1
NodeName=g1 CoresPerSocket=4
   CPUAlloc=1 CPUTot=4 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:titanxp:2
   NodeAddr=127.0.0.1 NodeHostName=localhost Port=0
   RealMemory=4000 AllocMem=500 FreeMem=N/A Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2019-03-18T10:14:18 SlurmdStartTime=2019-03-25T09:20:57
   CfgTRES=cpu=4,mem=4000M,billing=4
   AllocTRES=cpu=1,mem=500M
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
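
The detailed job records below were gathered with something along these lines (the -d/--details flag is what adds the per-node CPU_IDs/GRES_IDX lines):

# scontrol -d show job 5
# scontrol -d show job 10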

JobId=5 JobName=wrap
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901756 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:06:30 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2019-03-25T09:23:13 EligibleTime=2019-03-25T09:23:13
   AccrueTime=Unknown
   StartTime=2019-03-25T09:23:13 EndTime=2019-03-30T09:23:13 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-25T09:23:13
   Partition=gpu AllocNode:Sid=ernie:1
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=g1
   BatchHost=localhost
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=g1 CPU_IDs=0 Mem=500 GRES_IDX=gpu(IDX:0-1)
   MinCPUsNode=1 MinMemoryNode=500M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/
   StdErr=//block_2gpus_5.out
   StdIn=/dev/null
   StdOut=//block_2gpus_5.out
   Power=
   TresPerNode=gpu:2

JobId=10 JobName=wrap
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901751 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:07 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2019-03-25T09:29:12 EligibleTime=2019-03-25T09:29:12
   AccrueTime=Unknown
   StartTime=2019-03-25T09:29:12 EndTime=2019-03-30T09:29:12 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-25T09:29:12
   Partition=gpu AllocNode:Sid=ernie:1
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=g1
   BatchHost=localhost
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=g1 CPU_IDs=1 Mem=500 GRES_IDX=gpu(IDX:)
   MinCPUsNode=1 MinMemoryNode=500M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/
   StdErr=//use_1gpu_10.out
   StdIn=/dev/null
   StdOut=//use_1gpu_10.out
   Power=
   GresEnforceBind=No
   TresPerNode=gpu:1