Ticket 12826 - Configuring MIG in a Slurm Cluster
Summary: Configuring MIG in a Slurm Cluster
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 21.08.3
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-11-07 18:38 MST by Damien
Modified: 2021-11-09 09:48 MST
CC List: 1 user

See Also:
Site: Monash University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm configurations (20.00 KB, application/x-tar)
2021-11-07 18:41 MST, Damien
Details
slurmd log with debug2 (38.96 KB, text/x-log)
2021-11-08 19:44 MST, chris.hines
Details

Description Damien 2021-11-07 18:38:48 MST
Hi SchedMD,

I'm having some difficulty configuring MIG on my new cluster.
Using Ubuntu 20.04 and Slurm 21.08.3, MIG devices seem to be picked up by autodetect, but they do not appear to be added to the cgroup devices allow list for any steps.

In particular, my slurmd.log looks like:

[2021-11-07T22:29:22.981] debug:  gres/gpu: init: loaded
[2021-11-07T22:29:22.987] [59.batch] debug:  CPUs:64 Boards:1 Sockets:64 CoresPerSocket:1 ThreadsPerCore:1
[2021-11-07T22:29:22.988] [59.batch] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2021-11-07T22:29:22.991] [59.batch] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2021-11-07T22:29:22.992] [59.batch] debug:  laying out the 64 tasks on 1 hosts mlerp-node05 dist 2
[2021-11-07T22:29:22.992] [59.batch] debug:  Message thread started pid = 17504
[2021-11-07T22:29:22.992] [59.batch] debug:  switch/none: init: switch NONE plugin loaded
[2021-11-07T22:29:22.995] [59.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2021-11-07T22:29:22.997] [59.batch] debug:  task/cgroup: init: core enforcement enabled
[2021-11-07T22:29:22.997] [59.batch] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:604311M allowed:100%(enforced), swap:0%(permissive), max:100%(604311M) max+swap:100%(1208622M) min:30M kmem:100%(604311M permissive) min:30M swappiness:0(unset)
[2021-11-07T22:29:22.997] [59.batch] debug:  task/cgroup: init: memory enforcement enabled
[2021-11-07T22:29:22.999] [59.batch] debug:  task/cgroup: init: device enforcement enabled
[2021-11-07T22:29:22.999] [59.batch] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2021-11-07T22:29:22.999] [59.batch] cred/munge: init: Munge credential signature plugin loaded
[2021-11-07T22:29:22.999] [59.batch] debug:  job_container/none: init: job_container none plugin loaded
[2021-11-07T22:29:22.999] [59.batch] debug:  spank: opening plugin stack /opt/slurm/var/spool/conf-cache/plugstack.conf
[2021-11-07T22:29:23.000] [59.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2021-11-07T22:29:23.000] [59.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2021-11-07T22:29:23.000] [59.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2021-11-07T22:29:23.000] [59.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2021-11-07T22:29:23.001] [59.batch] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=604311MB memsw.limit=unlimited
[2021-11-07T22:29:23.001] [59.batch] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=604311MB memsw.limit=unlimited
[2021-11-07T22:29:23.001] [59.batch] debug:  cgroup/v1: _oom_event_monitor: started.
[2021-11-07T22:29:23.001] [59.batch] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2021-11-07T22:29:23.001] [59.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1001/job_59/devices.allow: Invalid argument
[2021-11-07T22:29:23.052] [59.batch] debug levels are stderr='error', logfile='debug', syslog='quiet'
[2021-11-07T22:29:23.053] [59.batch] starting 1 tasks
[2021-11-07T22:29:23.053] [59.batch] task 0 (17513) started 2021-11-07T22:29:23
[2021-11-07T22:29:23.054] [59.batch] debug:  Setting slurmstepd oom_adj to -1000
[2021-11-07T22:29:23.083] [59.batch] debug:  task/affinity: task_p_pre_launch: affinity StepId=59.batch, task:0 bind:(null type)
[2021-11-07T22:29:42.834] debug:  attempting to run health_check [/opt/nhc-1.4.2/sbin/nhc]

I suspect the problem is the write to devices.allow failing with "Invalid argument".

I hacked an extra error statement into the cgroup plugin and found that the string being written is "0 rwm", which is a little odd, since I thought it was supposed to be strings like "c NNN:MMM rwm".
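For comparison, I'd expect the writes to match the devices' actual major:minor numbers. Purely as an illustration (the numbers below are the conventional NVIDIA ones, not taken from this node):

$ ls -l /dev/nvidia0 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 0   ... /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 ... /dev/nvidiactl

# so I'd expect devices.allow to receive lines along the lines of
c 195:0 rwm
c 195:255 rwm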

I've attached a tarball of my Slurm config. Any help would be greatly appreciated.
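For context, the GRES-related parts of the config boil down to something like the following (the attached tarball has the real files; this is just a sketch):

# slurm.conf (sketch)
GresTypes=gpu
NodeName=mlerp-node05 Gres=gpu:2g.10gb:3 ...

# gres.conf (sketch) - relying on NVML autodetection for the MIG devices
AutoDetect=nvml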
Comment 1 Damien 2021-11-07 18:39:34 MST
In addition, if I remove cgroup_allowed_devices_file.conf (which I realise hasn't been necessary since about Slurm 18), the "Invalid argument" message goes away; however, I still can't see the device.

Dr Chris Hines
Senior Research DevOps Engineer
Comment 2 Damien 2021-11-07 18:41:19 MST
Created attachment 22161 [details]
Slurm configurations
Comment 3 chris.hines 2021-11-07 20:28:42 MST
Hi,
Some more information: I added additional printf statements and realised that all of my devices are being added to devices.deny.

My node looks like
ubuntu@mlerp-login0:~$ scontrol show node mlerp-node05
NodeName=mlerp-node05 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=64 CPUTot=64 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:2g.10gb:3(S:0-63)
   NodeAddr=mlerp-node05 NodeHostName=mlerp-node05 Version=21.08.3
   OS=Linux 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021 
   RealMemory=604160 AllocMem=0 FreeMem=603434 Sockets=64 Boards=1
   MemSpecLimit=8192
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=3145728 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch 
   BootTime=2021-11-05T03:19:23 SlurmdStartTime=2021-11-08T02:49:45
   LastBusyTime=2021-11-08T02:50:00
   CfgTRES=cpu=64,mem=590G,billing=64
   AllocTRES=cpu=64,mem=590G,billing=64
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

and my jobscript looks like

#!/bin/bash
#SBATCH --mem=1G
#SBATCH --time=10:00:00
#SBATCH --gres=gpu:2g.10gb:1
#SBATCH -w mlerp-node05

hostname
nvidia-smi

but all the nvidia-caps devices are being added to devices.deny.
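For what it's worth, the cgroup state can also be inspected directly while the step is running (path pattern copied from the log in the description, so substitute the current job ID; devices.allow and devices.deny are write-only, but devices.list shows what is currently allowed):

cat /sys/fs/cgroup/devices/slurm/uid_1001/job_59/devices.list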

Regards,
--
Dr Chris Hines
Comment 4 Michael Hinton 2021-11-08 10:04:30 MST
(In reply to Damien from comment #0)
> I'm having some difficulty with configuring MIG on my new cluster.
> Using Ubuntu 20.04 Slurm 21.08.3 MIG devices seem to be picked up by
> autodetect but do not appear to be added to the cgroup allow list for any
> steps. 
> 
> In particular my slurmd.log looks like
Damien, could you set SlurmdDebug=debug2 and restart the slurmd? That should give us more information.
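Something like this on the node, then a restart so the new level definitely takes effect:

# slurm.conf
SlurmdDebug=debug2

# on the compute node
systemctl restart slurmd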

Thanks,
-Michael
Comment 5 Michael Hinton 2021-11-08 10:53:17 MST
(In reply to Damien from comment #0)
> I hacked in an extra error statement to the cgroup plugin to find out that
> the string being written is "0 rwm" ... which is a little odd since I though
> it was supposed to be strings like "c NNN:MMM rwm"
I think that could occur if the dev_path passed into gres_device_major() in gres.c doesn't point to a device file... But I'm not sure how this can happen.
Comment 6 chris.hines 2021-11-08 19:44:11 MST
Created attachment 22178 [details]
slurmd log with debug2

Hi Michael,
Thanks @Damien for submitting this bug for me.

Attached is a slurmd log covering 'systemctl start slurmd' through to the job being cancelled.

Note that the lines containing the word "wrote" are due to me hacking in

        error("%s: wrote %s to cgroup %s: %m",
              __func__, content, file_path);

at line 401 of cgroup_common.c, in the common_file_write_content() function.

This tells me that slurmd is denying access to all my MIG devices on purpose, but it's not clear why.

Is there a debug level that asks slurmd to report which GRES it thinks it's supposed to allocate to the job?
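(For reference, scontrol's detailed view, e.g.

scontrol show job -d <jobid>

prints a per-node line with the allocated GRES and their indices, but I'm after the equivalent from slurmd's point of view.)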
Comment 7 chris.hines 2021-11-08 19:48:50 MST
Hi Michael 
Re the entries for "0 rwm" in devices.allow: I think I understand this now. It occurred because I was using a quickly copied cgroup_allowed_devices_file that included entries for /dev/sd*, which don't exist on this node. It looks like Slurm tried to identify the correct major number and allow the device by default, so that's how you can force it to point to a device file that doesn't exist. Stupidity and quick copy-pasting when I should know better.

Anyway, I've since removed the cgroup_allowed_devices_file because it appears to be unnecessary.

Regards,
--
Chris Hines.
Comment 8 chris.hines 2021-11-08 21:50:52 MST
Hi Michael,
You can resolve this ticket now as I managed to figure out my mistake.
It seems that somewhere along the line SelectType="select/linear" crept into my slurm.conf, which was mangling the GRES. I ended up with jobs like:

JobId=80 JobName=test.sh
   UserId=chines(1001) GroupId=chines(1003) MCS_label=N/A
   Priority=13756 Nice=0 Account=chines QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:05 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2021-11-09T04:30:13 EligibleTime=2021-11-09T04:30:13
   AccrueTime=2021-11-09T04:30:13
   StartTime=2021-11-09T04:30:13 EndTime=2021-11-09T14:30:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-09T04:30:13 Scheduler=Backfill
   Partition=batch AllocNode:Sid=mlerp-login0:707018
   ReqNodeList=mlerp-node05 ExcNodeList=(null)
   NodeList=mlerp-node05
   BatchHost=mlerp-node05
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,node=1,billing=64
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:7g.40gb:0
     Nodes=mlerp-node05 CPU_IDs=0-63 Mem=0 GRES=gpu:7g.40gb:0(IDX:)
   MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/chines/test.sh
   WorkDir=/home/chines
   StdErr=/home/chines/slurm-80.out
   StdIn=/dev/null
   StdOut=/home/chines/slurm-80.out
   Power=
   TresPerNode=gres:gpu:7g.40gb:1

Notice that TresPerNode lists 1 device, but JOB_GRES lists 0 devices.

Having switched my SelectType to select/cons_tres, things behave much better.
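For anyone who finds this later, the change was essentially the following (the SelectTypeParameters value shown is just an example; pick whatever consumable resources suit your site):

# slurm.conf - before
SelectType=select/linear

# slurm.conf - after
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory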

Thanks for your help
Comment 9 Michael Hinton 2021-11-09 09:48:21 MST
(In reply to chris.hines from comment #7)
> Anyway I've since removed the cgroup_allowed_devices_file because it appears
> to be unnecessary.
Yes, it is now unnecessary, as all devices are blacklisted by default, and only devices listed with File= in gres.conf (or autodetected) are allowed to be used by Slurm.
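For example, an explicitly configured (non-autodetected, non-MIG) gres.conf would look something like this, and only the device files named there end up in the allow list (paths and Type are illustrative):

# gres.conf (illustrative)
Name=gpu Type=a100 File=/dev/nvidia0
Name=gpu Type=a100 File=/dev/nvidia1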

(In reply to chris.hines from comment #8)
> It seems that somewhere along the lines I got SelectType="select/linear" in
> my slurm.conf which was mangling the gres.
I saw you had SelectType="select/linear", but I didn't know that was by accident. We recommend using select/cons_tres, because it supports all the latest GPU features, and you can still get select/linear behavior with cons_tres by setting partitions to OverSubscribe=EXCLUSIVE.
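i.e. something along these lines in slurm.conf (node list elided):

PartitionName=batch Nodes=... OverSubscribe=EXCLUSIVE   # whole-node jobs, like select/linear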

Marking as resolved. Thanks!
-Michael