The DefMemPerGPU option is not considered by the scheduler after submitting a job:

martijk ~ $ sbatch --reservation=restest --partition=gpu_titanrtx_shared --gres=gpu:1 job.sh
Submitted batch job 4493927
martijk ~ $ squeue -u martijk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           4493927 gpu_titan   job.sh  martijk PD       0:00      1 (Resources)
           4493926 gpu_titan   job.sh  martijk  R       0:07      1 r34n4
           4493925 gpu_titan   job.sh  martijk  R       1:18      1 r34n3
martijk ~ $ scontrol show job=4493927
JobId=4493927 JobName=job.sh
   UserId=martijk(46857) GroupId=martijk(46561) MCS_label=N/A
   Priority=382045 Nice=0 Account=martijnl QOS=default
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2020-02-17T11:24:45 EligibleTime=2020-02-17T11:24:45
   AccrueTime=2020-02-17T11:24:46
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-17T11:24:57
   Partition=gpu_titanrtx_shared AllocNode:Sid=
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=26,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=restest
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/martijk/job.sh
   WorkDir=/home/martijk
   StdErr=/home/martijk/slurm-4493927.out
   StdIn=/dev/null
   StdOut=/home/martijk/slurm-4493927.out
   Power=
   TresPerNode=gpu:titanrtx:1

Expected behavior: multiple jobs run concurrently on one node. However, the scheduler does not seem to consider the memory requirements.

root# scontrol show node r34n4
NodeName=r34n4 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=24 CPULoad=0.00
   AvailableFeatures=skylake,sse4,avx512
   ActiveFeatures=skylake,sse4,avx512
   Gres=gpu:titanrtx:4(S:0-1)
   NodeAddr= NodeHostName=
   OS=Linux 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11)
   RealMemory=191488 AllocMem=0 FreeMem=188003 Sockets=2 Boards=1
   MemSpecLimit=1024
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=gpu_titanrtx,gpu_titanrtx_short,gpu_titanrtx_shared,gpu_test
   BootTime=2020-01-04T12:45:40 SlurmdStartTime=2020-02-10T11:18:23
   CfgTRES=cpu=24,mem=187G,billing=104,gres/gpu=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

root# scontrol show partition gpu_titanrtx_shared
PartitionName=gpu_titanrtx_shared
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=p_gpu_titanrtx_shared
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=8 MaxTime=5-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=r33n6,r34n[1-7]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:4
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=192 TotalNodes=8 SelectTypeParameters=NONE
   JobDefaults=DefCpuPerGPU=6,DefMemPerGPU=46700
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=4.33,Mem=572.559T,GRES/gpu=26.0

The memory is shown correctly for running jobs:

JobId=4493925 JobName=job.sh
   UserId=martijk(46857) GroupId=martijk(46561) MCS_label=N/A
   Priority=382045 Nice=0 Account=martijnl QOS=default
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:14 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2020-02-17T11:23:29 EligibleTime=2020-02-17T11:23:29
   AccrueTime=2020-02-17T11:23:29
   StartTime=2020-02-17T11:23:29 EndTime=2020-02-17T11:28:30 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-17T11:23:29
   Partition=gpu_titanrtx_shared AllocNode:Sid=
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r34n3
   BatchHost=r34n3
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=46700M,node=1,billing=26,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=restest
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/martijk/job.sh
   WorkDir=/home/martijk
   StdErr=/home/martijk/slurm-4493925.out
   StdIn=/dev/null
   StdOut=/home/martijk/slurm-4493925.out
   Power=
   TresPerNode=gpu:titanrtx:1

It works as expected if you manually enter the memory requirements while submitting your job.
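For reference, a workaround sketch (the explicit flag and the 46700M value, which simply mirrors the partition's DefMemPerGPU shown above, are assumptions and not taken from the original submission):

    # Request the memory explicitly instead of relying on DefMemPerGPU.
    # 46700M mirrors the partition's DefMemPerGPU default; per the follow-up
    # below, --mem-per-gpu runs into the same scheduling issue, so --mem is
    # used here.
    sbatch --reservation=restest --partition=gpu_titanrtx_shared \
           --gres=gpu:1 --mem=46700M job.sh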
Still present in Slurm 20.11.4 (and most probably 20.11.5). In short: if a job relies on DefMemPerGPU or --mem-per-gpu for its memory request, it will never be considered for an oversubscribed node where another job is already running, even though it would fit. The scheduler acts as if 100% of the available node memory were requested. Only once the job has started is the allocated memory taken into account, permitting other jobs to run on that node (unless they also rely on DefMemPerGPU or --mem-per-gpu, of course).
When we tested this issue on our system (20.02.6), we noticed that the cgroup memory limits also appear to be set to the DefMemPerCPU value (times the number of CPUs) instead of being based on the --mem-per-gpu flag.
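One way to inspect the limit the cgroup plugin actually applied is sketched below; it assumes cgroup v1 and Slurm's usual memory cgroup layout, and reuses the uid and a job id from the examples above:

    # Show the memory limit enforced for a running job's cgroup on the
    # compute node. Assumes cgroup v1 and the default
    # /slurm/uid_<uid>/job_<jobid> hierarchy; adjust ids as needed.
    cat /sys/fs/cgroup/memory/slurm/uid_46857/job_4493925/memory.limit_in_bytes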
This seems to have been caused by our job_submit.lua script, which was still using the older job_desc.gres field instead of job_desc.tres_per_{job,task,socket,node}.
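For illustration, a minimal sketch of the kind of change involved; the field names job_desc.gres and job_desc.tres_per_{job,task,socket,node} are the ones mentioned above, while the surrounding logic is hypothetical and site-specific:

    -- Hypothetical excerpt from a job_submit.lua that needs to detect GPU requests.
    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- Older field consulted previously; on newer releases GPU requests
       -- are not found here, so GPU-specific handling is silently skipped:
       -- local gpu_request = job_desc.gres

       -- Newer fields: GPU requests show up in tres_per_* instead.
       local gpu_request = job_desc.tres_per_node or job_desc.tres_per_job
             or job_desc.tres_per_task or job_desc.tres_per_socket

       if gpu_request ~= nil and string.find(gpu_request, "gpu") then
          -- site-specific handling of GPU jobs (e.g. memory adjustments) here
       end

       return slurm.SUCCESS
    end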