Hi SchedMD,

I'm having some difficulty with configuring MIG on my new cluster. Using Ubuntu 20.04 and Slurm 21.08.3, MIG devices seem to be picked up by autodetect but do not appear to be added to the cgroup allow list for any steps.

In particular, my slurmd.log looks like:

[2021-11-07T22:29:22.981] debug: gres/gpu: init: loaded
[2021-11-07T22:29:22.987] [59.batch] debug: CPUs:64 Boards:1 Sockets:64 CoresPerSocket:1 ThreadsPerCore:1
[2021-11-07T22:29:22.988] [59.batch] debug: cgroup/v1: init: Cgroup v1 plugin loaded
[2021-11-07T22:29:22.991] [59.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2021-11-07T22:29:22.992] [59.batch] debug: laying out the 64 tasks on 1 hosts mlerp-node05 dist 2
[2021-11-07T22:29:22.992] [59.batch] debug: Message thread started pid = 17504
[2021-11-07T22:29:22.992] [59.batch] debug: switch/none: init: switch NONE plugin loaded
[2021-11-07T22:29:22.995] [59.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2021-11-07T22:29:22.997] [59.batch] debug: task/cgroup: init: core enforcement enabled
[2021-11-07T22:29:22.997] [59.batch] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:604311M allowed:100%(enforced), swap:0%(permissive), max:100%(604311M) max+swap:100%(1208622M) min:30M kmem:100%(604311M permissive) min:30M swappiness:0(unset)
[2021-11-07T22:29:22.997] [59.batch] debug: task/cgroup: init: memory enforcement enabled
[2021-11-07T22:29:22.999] [59.batch] debug: task/cgroup: init: device enforcement enabled
[2021-11-07T22:29:22.999] [59.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2021-11-07T22:29:22.999] [59.batch] cred/munge: init: Munge credential signature plugin loaded
[2021-11-07T22:29:22.999] [59.batch] debug: job_container/none: init: job_container none plugin loaded
[2021-11-07T22:29:22.999] [59.batch] debug: spank: opening plugin stack /opt/slurm/var/spool/conf-cache/plugstack.conf
[2021-11-07T22:29:23.000] [59.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2021-11-07T22:29:23.000] [59.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2021-11-07T22:29:23.000] [59.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2021-11-07T22:29:23.000] [59.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2021-11-07T22:29:23.001] [59.batch] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=604311MB memsw.limit=unlimited
[2021-11-07T22:29:23.001] [59.batch] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=604311MB memsw.limit=unlimited
[2021-11-07T22:29:23.001] [59.batch] debug: cgroup/v1: _oom_event_monitor: started.
[2021-11-07T22:29:23.001] [59.batch] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2021-11-07T22:29:23.001] [59.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1001/job_59/devices.allow: Invalid argument
[2021-11-07T22:29:23.052] [59.batch] debug levels are stderr='error', logfile='debug', syslog='quiet'
[2021-11-07T22:29:23.053] [59.batch] starting 1 tasks
[2021-11-07T22:29:23.053] [59.batch] task 0 (17513) started 2021-11-07T22:29:23
[2021-11-07T22:29:23.054] [59.batch] debug: Setting slurmstepd oom_adj to -1000
[2021-11-07T22:29:23.083] [59.batch] debug: task/affinity: task_p_pre_launch: affinity StepId=59.batch, task:0 bind:(null type)
[2021-11-07T22:29:42.834] debug: attempting to run health_check [/opt/nhc-1.4.2/sbin/nhc]

I suspect the problem is around writing to devices.allow and receiving the "Invalid argument" message. I hacked an extra error statement into the cgroup plugin to find out that the string being written is "0 rwm", which is a little odd since I thought it was supposed to be strings like "c NNN:MMM rwm".

I've attached a tarball of my slurm config. Any help would be greatly appreciated.
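For reference, a valid cgroup v1 device rule is built from the device node's type and major:minor numbers. A minimal Python sketch of how such a rule is derived (using /dev/null as a stand-in for a real /dev/nvidia* node; the helper name is illustrative, not Slurm's code):

```python
import os
import stat

def device_rule(dev_path, perms="rwm"):
    """Build a cgroup v1 devices.allow rule ("c MAJ:MIN rwm") for dev_path."""
    st = os.stat(dev_path)
    if stat.S_ISCHR(st.st_mode):
        kind = "c"            # character device (GPUs are char devices)
    elif stat.S_ISBLK(st.st_mode):
        kind = "b"            # block device
    else:
        raise ValueError(f"{dev_path} is not a device node")
    return f"{kind} {os.major(st.st_rdev)}:{os.minor(st.st_rdev)} {perms}"

# /dev/null is character device 1:3 on Linux
print(device_rule("/dev/null"))  # -> "c 1:3 rwm"
```

The kernel rejects anything written to devices.allow that does not match this "type major:minor perms" shape with EINVAL, which is consistent with the "Invalid argument" errors above.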
In addition, if I remove cgroup_allowed_devices_file.conf (which I realise hasn't been necessary since around Slurm 18.08), the "Invalid argument" message goes away, but I still can't see the device.

Dr Chris Hines
Senior Research DevOps Engineer
Created attachment 22161 [details] Slurm configurations
Hi,

Some more information. I added additional printf statements and realised that all of my devices are being added to devices.deny.

My node looks like:

ubuntu@mlerp-login0:~$ scontrol show node mlerp-node05
NodeName=mlerp-node05 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=64 CPUTot=64 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:2g.10gb:3(S:0-63)
   NodeAddr=mlerp-node05 NodeHostName=mlerp-node05 Version=21.08.3
   OS=Linux 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021
   RealMemory=604160 AllocMem=0 FreeMem=603434 Sockets=64 Boards=1
   MemSpecLimit=8192
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=3145728 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch
   BootTime=2021-11-05T03:19:23 SlurmdStartTime=2021-11-08T02:49:45
   LastBusyTime=2021-11-08T02:50:00
   CfgTRES=cpu=64,mem=590G,billing=64
   AllocTRES=cpu=64,mem=590G,billing=64
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

and my jobscript looks like:

#!/bin/bash
#SBATCH --mem=1G
#SBATCH --time=10:00:00
#SBATCH --gres=gpu:2g.10gb:1
#SBATCH -w mlerp-node05
hostname
nvidia-smi

but all the nvidia-caps are being added to devices.deny.

Regards,
--
Dr Chris Hines
(In reply to Damien from comment #0)
> I'm having some difficulty with configuring MIG on my new cluster.
> Using Ubuntu 20.04 Slurm 21.08.3 MIG devices seem to be picked up by
> autodetect but do not appear to be added to the cgroup allow list for any
> steps.
>
> In particular my slurmd.log looks like

Damien, could you set SlurmdDebug=debug2 and restart the slurmd? That should give us more information.

Thanks,
-Michael
(In reply to Damien from comment #0)
> I hacked in an extra error statement to the cgroup plugin to find out that
> the string being written is "0 rwm" ... which is a little odd since I
> thought it was supposed to be strings like "c NNN:MMM rwm"

I think that could occur if the dev_path passed into gres_device_major() in gres.c doesn't point to a device file... but I'm not sure how this can happen.
Created attachment 22178 [details]
slurmd log with debug2

Hi Michael,

Thanks @Damien for submitting this bug for me. Attached is a slurmd log from 'systemctl start slurmd' through to job cancel.

Note that lines which include the word "wrote" are due to me hacking in

error("%s: wrote %s to cgroup %s: %m", __func__, content, file_path);

at line 401 of cgroup_common.c, in the function common_file_write_content.

This tells me that slurmd is denying access to all my MIG devices on purpose, but it's not clear why. Is there a debug level to ask slurmd to report what GRES it thinks it's supposed to allocate to the job?
Hi Michael,

Re the entries for "0 rwm" in devices.allow: I think I understand this now. It occurred when I was using a quickly copied cgroup_allowed_devices_file that included entries for /dev/sd* devices which didn't exist on this node. It looks like Slurm tried to identify the correct major number and allow the device by default, so that's how gres_device_major() can end up being handed a device file that doesn't exist. Stupidity and quick copy-pasting when I should know better.

Anyway, I've since removed the cgroup_allowed_devices_file because it appears to be unnecessary.

Regards,
--
Chris Hines.
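That explanation matches a quick experiment: for a path that isn't a device node, st_rdev is 0, so a rule built naively from its major number degenerates to "0 rwm". A hypothetical Python sketch of the failure mode (the helper name is illustrative, not Slurm's actual gres_device_major()):

```python
import os
import tempfile

def naive_major_string(path, perms="rwm"):
    """Illustrative only: build a device rule from st_rdev without first
    checking that the path is actually a device node."""
    st = os.stat(path)
    return f"{os.major(st.st_rdev)} {perms}"

# A regular file has st_rdev == 0, so the "rule" degenerates to "0 rwm",
# which the kernel rejects with EINVAL when written to devices.allow.
with tempfile.NamedTemporaryFile() as f:
    print(naive_major_string(f.name))  # -> "0 rwm"
```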
Hi Michael,

You can resolve this ticket now, as I managed to figure out my mistake. It seems that somewhere along the line I got SelectType="select/linear" in my slurm.conf, which was mangling the GRES. I ended up with jobs like:

JobId=80 JobName=test.sh
   UserId=chines(1001) GroupId=chines(1003) MCS_label=N/A
   Priority=13756 Nice=0 Account=chines QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:05 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2021-11-09T04:30:13 EligibleTime=2021-11-09T04:30:13
   AccrueTime=2021-11-09T04:30:13
   StartTime=2021-11-09T04:30:13 EndTime=2021-11-09T14:30:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-09T04:30:13 Scheduler=Backfill
   Partition=batch AllocNode:Sid=mlerp-login0:707018
   ReqNodeList=mlerp-node05 ExcNodeList=(null)
   NodeList=mlerp-node05
   BatchHost=mlerp-node05
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,node=1,billing=64
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:7g.40gb:0
     Nodes=mlerp-node05 CPU_IDs=0-63 Mem=0 GRES=gpu:7g.40gb:0(IDX:)
   MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/chines/test.sh
   WorkDir=/home/chines
   StdErr=/home/chines/slurm-80.out
   StdIn=/dev/null
   StdOut=/home/chines/slurm-80.out
   Power=
   TresPerNode=gres:gpu:7g.40gb:1

Notice that TresPerNode lists 1 device, but JOB_GRES lists 0 devices. Having switched my SelectType to cons_tres, things behave much better.

Thanks for your help
(In reply to chris.hines from comment #7)
> Anyway I've since removed the cgroup_allowed_devices_file because it appears
> to be unnecessary.

Yes, it is now unnecessary: all devices are denied by default, and only devices listed in File= in gres.conf or autodetected will be allowed to be used by Slurm.

(In reply to chris.hines from comment #8)
> It seems that somewhere along the lines I got SelectType="select/linear" in
> my slurm.conf which was mangling the gres.

I saw you had SelectType="select/linear", but I didn't know that was by accident. We recommend using select/cons_tres, because that supports all the latest GPU features, and you can still get select/linear behavior with cons_tres by setting partitions to OverSubscribe=EXCLUSIVE.

Marking as resolved. Thanks!

-Michael
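For anyone landing on this ticket later, a minimal sketch of the working combination for MIG with autodetect, based on this thread (the node name and MIG profile are from this report; everything else is illustrative and should be adjusted for your site, and AutoDetect=nvml assumes slurmd was built against NVIDIA's NVML library):

# slurm.conf (fragment)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
NodeName=mlerp-node05 Gres=gpu:2g.10gb:3 ...

# gres.conf (fragment) -- with autodetect, no explicit File= lines are needed
AutoDetect=nvml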