Ticket 14013 - cpuset: No space left on device on heterogeneous GPU nodes
Summary: cpuset: No space left on device on heterogeneous GPU nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 21.08.6
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Brian Christiansen
Duplicates: 14701
 
Reported: 2022-05-05 14:23 MDT by Dylan Simon
Modified: 2023-03-16 13:08 MDT
CC List: 3 users

Site: Simons Foundation & Flatiron Institute
Version Fixed: 22.05.4, 23.02.1pre1


Attachments
slurm.conf (12.32 KB, text/plain), attached 2022-05-05 14:23 MDT by Dylan Simon
workergpu15 slurmd log (11.48 KB, text/plain), attached 2022-05-05 14:24 MDT by Dylan Simon

Description Dylan Simon 2022-05-05 14:23:01 MDT
Created attachment 24870: slurm.conf

In some cases, running a single task (srun -n1) inside a multi-node allocation results in an empty CPU set for the task.  When this happens, the slurmd logs include:

[2022-05-05T16:04:31.262] error: cons_res: zero processors allocated to step
[2022-05-05T16:04:31.316] [1375095.0] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0'
[2022-05-05T16:04:31.316] [1375095.0] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are ''
[2022-05-05T16:04:31.316] [1375095.0] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0'
[2022-05-05T16:04:31.316] [1375095.0] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are ''

Here's an example:

>salloc -N 2 --ntasks-per-node=1 --gpus-per-task=1 -p gpu -w workergpu15,workergpu052 -t 1:00:00 bash
salloc: Pending job allocation 1375095
salloc: job 1375095 queued and waiting for resources
salloc: job 1375095 has been allocated resources
salloc: Granted job allocation 1375095
salloc: Waiting for resource configuration
salloc: Nodes workergpu[15,052] are ready for job
>scontrol -d show job $SLURM_JOBID
JobId=1375095 JobName=interactive
   UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
   Priority=4294829213 Nice=0 Account=scc QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:08 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-05T16:03:22 EligibleTime=2022-05-05T16:03:22
   AccrueTime=2022-05-05T16:03:22
   StartTime=2022-05-05T16:04:15 EndTime=2022-05-05T17:04:15 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-05T16:04:15 Scheduler=Backfill
   Partition=gpu AllocNode:Sid=rustyamd1:111648
   ReqNodeList=workergpu[15,052] ExcNodeList=(null)
   NodeList=workergpu[15,052]
   BatchHost=workergpu15
   NumNodes=2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=32000M,node=2,billing=2,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=gpu:p100-16gb:1,gpu:v100-32gb:1
     Nodes=workergpu15 CPU_IDs=0 Mem=16000 GRES=gpu:p100-16gb:1(IDX:0)
     Nodes=workergpu052 CPU_IDs=0 Mem=16000 GRES=gpu:v100-32gb:1(IDX:0)
   MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/home/dylan/scc/disBatch
   Power=
   TresPerTask=gres:gpu:1


rustyamd1:~/scc/disBatch [0]>scontrol -d show node $SLURM_NODELIST
NodeName=workergpu15 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=1 CPUTot=36 CPULoad=0.05
   AvailableFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   ActiveFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   Gres=gpu:v100-32gb:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:v100-32gb:1(IDX:0),gdr:0
   NodeAddr=workergpu15 NodeHostName=workergpu15 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=768000 AllocMem=16000 FreeMem=754182 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=450000 Weight=55 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T15:46:17 SlurmdStartTime=2022-05-05T15:46:17
   LastBusyTime=2022-05-05T16:03:22
   CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=4
   AllocTRES=cpu=1,mem=16000M,gres/gpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=workergpu052 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=1 CPUTot=28 CPULoad=0.01
   AvailableFeatures=gpu,p100,ib,numai14,centos7
   ActiveFeatures=gpu,p100,ib,numai14,centos7
   Gres=gpu:p100-16gb:2(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:p100-16gb:1(IDX:0),gdr:0
   NodeAddr=workergpu052 NodeHostName=workergpu052 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=512000 AllocMem=16000 FreeMem=503742 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=950000 Weight=45 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T15:46:17 SlurmdStartTime=2022-05-05T15:46:18
   LastBusyTime=2022-05-05T16:03:22
   CfgTRES=cpu=28,mem=500G,billing=28,gres/gpu=2
   AllocTRES=cpu=1,mem=16000M,gres/gpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

>srun -N1 -n1 -w workergpu15 nproc
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: common_file_write_uint32s: write pid 361312 to /sys/fs/cgroup/cpuset/slurm/uid_1135/job_1375095/step_0/cgroup.procs failed: No space left on device
slurmstepd: error: unable to add pids to '/sys/fs/cgroup/cpuset/slurm/uid_1135/job_1375095/step_0'
slurmstepd: error: task_g_pre_set_affinity: No space left on device
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error


The slurmctld log shows nothing unusual, but I'll attach the "slurmd -d" log.  I tried turning on the CPU_Bind debug flag but didn't see much; I'm happy to try again if you have specific suggestions.

We've only seen this happen on GPU nodes, and it seems to happen mainly when the two nodes differ in CPU configuration in some way, either in their NUMA CPU maps or in their GRES GPU core bindings.  For example, the two nodes above:

workergpu052 (two GPUs, one on each socket):
NodeName=workergpu[048,049,051,052] Name=gpu Type=p100-16gb Count=1 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26
NodeName=workergpu[048,049,051,052] Name=gpu Type=p100-16gb Count=1 File=/dev/nvidia1 Cores=1,3,5,7,9,11,13,15,17,19,21,23,25,27
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27

workergpu15 (four GPUs, all on NUMA node 0, so we leave out the Cores= bindings):
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia0
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia1
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia2
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia3
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
Comment 1 Dylan Simon 2022-05-05 14:24:04 MDT
Created attachment 24871: workergpu15 slurmd log
Comment 2 Dylan Simon 2022-05-05 14:44:56 MDT
I just managed to reproduce this on these same nodes with --exclusive.  Note particularly the job GRES:
   JOB_GRES=gpu:p100-16gb:2,gpu:v100-32gb:4
     Nodes=workergpu15 CPU_IDs=0-27 Mem=512000 GRES=gpu:p100-16gb:2(IDX:0-1)
     Nodes=workergpu052 CPU_IDs=0-35 Mem=768000 GRES=gpu:v100-32gb:4(IDX:0-3)

These are reversed: workergpu15 actually has 36 CPUs and 4 GPUs, and workergpu052 has 28 CPUs and 2 GPUs!  The slurmd log also gets the physical CPU mapping wrong:

[2022-05-05T16:38:13.724] [1375114.1] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-27'
[2022-05-05T16:38:13.724] [1375114.1] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are ''
[2022-05-05T16:38:13.724] [1375114.1] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-20,22,24,26,28,30,32,34'
[2022-05-05T16:38:13.724] [1375114.1] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are ''
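
That physical CPU string is what you would get if workergpu052's 28-core mask were applied to workergpu15's topology.  A quick standalone sketch (not Slurm code), assuming abstract cores are numbered socket-major (0-17 on socket 0, 18-35 on socket 1) and that workergpu15's physical CPU IDs alternate between the two sockets as in the NUMA maps above:

/* illustration only: map the job record's abstract cores 0-27 onto
 * workergpu15's topology (2 sockets x 18 cores, no hyperthreading,
 * socket 0 = even CPU IDs, socket 1 = odd CPU IDs, as assumed above) */
#include <stdio.h>

int main(void)
{
    const int sockets = 2, cores_per_socket = 18;
    const int ncpus = sockets * cores_per_socket;   /* 36 on workergpu15 */
    int cpu_set[36] = {0};

    for (int a = 0; a < 28; a++) {                  /* CPU_IDs=0-27 from the job record */
        int socket = a / cores_per_socket;          /* 0-17 -> socket 0, 18-27 -> socket 1 */
        int core   = a % cores_per_socket;
        cpu_set[core * sockets + socket] = 1;       /* even IDs on socket 0, odd on socket 1 */
    }

    /* print the set as ranges, the way the slurmd log does */
    int first = 1;
    for (int i = 0; i < ncpus; i++) {
        if (!cpu_set[i])
            continue;
        int j = i;
        while (j + 1 < ncpus && cpu_set[j + 1])
            j++;
        printf("%s", first ? "" : ",");
        first = 0;
        if (j > i)
            printf("%d-%d", i, j);
        else
            printf("%d", i);
        i = j;
    }
    printf("\n");                                   /* prints: 0-20,22,24,26,28,30,32,34 */
    return 0;
}

Full scontrol output for the job and both nodes: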

JobId=1375114 JobName=interactive
   UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
   Priority=4294829194 Nice=0 Account=scc QOS=unlimit
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:47 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-05T16:35:49 EligibleTime=2022-05-05T16:35:49
   AccrueTime=2022-05-05T16:36:34
   StartTime=2022-05-05T16:37:41 EndTime=2022-05-05T17:37:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-05T16:37:16 Scheduler=Backfill
   Partition=request AllocNode:Sid=rustyamd1:111648
   ReqNodeList=workergpu[15,052] ExcNodeList=(null)
   NodeList=workergpu[15,052]
   BatchHost=workergpu15
   NumNodes=2 NumCPUs=64 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=1250G,node=2,billing=64,gres/gpu=6
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=gpu:p100-16gb:2,gpu:v100-32gb:4
     Nodes=workergpu15 CPU_IDs=0-27 Mem=512000 GRES=gpu:p100-16gb:2(IDX:0-1)
     Nodes=workergpu052 CPU_IDs=0-35 Mem=768000 GRES=gpu:v100-32gb:4(IDX:0-3)
   MinCPUsNode=1 MinMemoryNode=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/home/dylan
   Power=
   TresPerTask=gres:gpu:1


NodeName=workergpu15 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=36 CPUTot=36 CPULoad=0.03
   AvailableFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   ActiveFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   Gres=gpu:v100-32gb:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:v100-32gb:4(IDX:0-3),gdr:0
   NodeAddr=workergpu15 NodeHostName=workergpu15 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=768000 AllocMem=768000 FreeMem=754182 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=450000 Weight=55 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T16:37:17 SlurmdStartTime=2022-05-05T16:37:17
   LastBusyTime=2022-05-05T16:37:17
   CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=4
   AllocTRES=cpu=36,mem=750G,gres/gpu=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=workergpu052 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=28 CPUTot=28 CPULoad=0.00
   AvailableFeatures=gpu,p100,ib,numai14,centos7
   ActiveFeatures=gpu,p100,ib,numai14,centos7
   Gres=gpu:p100-16gb:2(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:p100-16gb:2(IDX:0-1),gdr:0
   NodeAddr=workergpu052 NodeHostName=workergpu052 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=512000 AllocMem=512000 FreeMem=503747 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=950000 Weight=45 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T16:37:17 SlurmdStartTime=2022-05-05T16:37:19
   LastBusyTime=2022-05-05T16:37:19
   CfgTRES=cpu=28,mem=500G,billing=28,gres/gpu=2
   AllocTRES=cpu=28,mem=500G,gres/gpu=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 3 Dylan Simon 2022-05-05 15:00:46 MDT
Maybe this only happens when we mix our (crazy) two- and three-digit hostnames (workergpuXX and workergpuXXX) and something sorts them in different orders?
Comment 4 Michael Hinton 2022-05-05 15:36:09 MDT
Hi Dylan,

(In reply to Dylan Simon from comment #3)
> Maybe this only happens when we mix our (crazy) 2 and 3 digit hostnames
> (workergpuXX and workergpuXXX) and something is sorting them in different
> orders?
That would be my first guess. There may be different sorting approaches at different points: perhaps Slurm builds the GRES bitmaps by sorting nodes with a strict alphanumeric sort, while scontrol prints out host lists according to a natural sort that ignores leading zeros, or vice versa.
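
As a standalone illustration (not Slurm code) of how the two orderings disagree on exactly these hostnames: a strict byte-wise comparison puts workergpu052 before workergpu15 (because '0' < '1'), while a numeric-aware comparison puts workergpu15 first (15 < 52). If per-node data (CPU_IDs, GRES bitmaps) is built in one order but read back in the other, the two nodes' entries end up swapped, which would match what comment 2 shows.

/* illustration only: strict vs. numeric-aware ordering of the two hostnames */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* like strcmp(), except runs of digits compare by numeric value,
 * so leading zeros are ignored ("natural" sort) */
static int natural_cmp(const char *a, const char *b)
{
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            char *ea, *eb;
            long na = strtol(a, &ea, 10);
            long nb = strtol(b, &eb, 10);
            if (na != nb)
                return (na < nb) ? -1 : 1;
            a = ea;
            b = eb;
        } else {
            if (*a != *b)
                return (unsigned char)*a - (unsigned char)*b;
            a++;
            b++;
        }
    }
    return (unsigned char)*a - (unsigned char)*b;
}

int main(void)
{
    const char *x = "workergpu15", *y = "workergpu052";

    printf("strcmp:      %s sorts first\n", strcmp(x, y) < 0 ? x : y);      /* workergpu052 */
    printf("natural_cmp: %s sorts first\n", natural_cmp(x, y) < 0 ? x : y); /* workergpu15  */
    return 0;
}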

Could you try making your GPU node names of consistent width and see if that fixes the issue?

Thanks,
-Michael
Comment 5 Michael Hinton 2022-05-05 15:44:32 MDT
As for the "No space left on device" error, I wonder if this is related to bug 5082; the user in that bug was also running CentOS 7. What exact version of CentOS is node workergpu15 running? Also, please double-check that you aren't actually running out of disk space.
Comment 7 Dylan Simon 2022-05-05 17:55:07 MDT
We're running CentOS 7.9.2009 with our own 5.4.163 kernel; however, I can reproduce this issue on Rocky 8.5 as well.  I'm quite sure the "No space" message is because the cpuset cgroup has no CPUs assigned to it.

I just tried renaming workergpu052 to workergpu52 and repeating the same test with workergpu15 and did not see the problem, so it does seem likely to be a sorting issue.
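
For what it's worth, the cgroup v1 cpuset controller refuses to attach a task to a cpuset whose cpuset.cpus (or cpuset.mems) is empty, and the error it returns is ENOSPC, which is exactly the write that slurmstepd reports failing.  A minimal sketch that reproduces the same errno (run as root, assuming the v1 cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset; the demo cgroup name is made up):

/* sketch only, run as root: attaching a task to a cgroup v1 cpuset with no
 * CPUs assigned fails with ENOSPC ("No space left on device") */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/sys/fs/cgroup/cpuset/enospc_demo";  /* hypothetical demo cgroup */
    char path[256], pid[32];

    /* a freshly created child cpuset starts with empty cpuset.cpus/cpuset.mems */
    if (mkdir(dir, 0755) && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    snprintf(path, sizeof(path), "%s/cgroup.procs", dir);
    snprintf(pid, sizeof(pid), "%d\n", (int)getpid());

    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, pid, strlen(pid)) < 0)                    /* rejected: no CPUs in this cpuset */
        printf("attach failed: %s\n", strerror(errno));     /* No space left on device */
    close(fd);

    rmdir(dir);                                             /* clean up the demo cgroup */
    return 0;
}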
Comment 8 Michael Hinton 2022-05-10 11:30:51 MDT
Hi Dylan,

I'm going to reduce this to severity 4, since there is a workaround for the issue. The fix won't make it into the initial 22.05 release, since that is coming out this month, but we'll look into it.

Thanks!
-Michael
Comment 34 Marshall Garey 2022-08-19 12:00:58 MDT
*** Ticket 14701 has been marked as a duplicate of this ticket. ***
Comment 35 Brian Christiansen 2022-08-22 10:14:28 MDT
Hi Dylan,

We've fixed this issue in the following commits ahead of 22.05.4:

| * 8bebdd1147 (origin/slurm-22.05) NEWS for the previous 5 commits
| * be3362824a Sort job and step nodelists in Slurm user commands
| * f9d976b8f4 Add slurm_sort_node_list_str() to sort a node_list string
| * bcda9684a9 Add hostset_[de]ranged_string_xmalloc functions
| * bd7641da57 Change all hostset_finds to hostlist_finds
| * 9129b8a211 Fix expected ordering for hostlists from controller and daemons

Let us know if you have any questions.

Thanks,
Brian