Bug 10273 - Unable to schedule more than one MPS per node
Summary: Unable to schedule more than one MPS per node
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.5
Hardware: Linux
OS: Linux
Importance: --- 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-11-23 16:14 MST by Trey Dockendorf
Modified: 2020-11-24 10:06 MST
CC List: 1 user

See Also:
Site: Ohio State OSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (31.96 KB, text/plain)
2020-11-23 16:14 MST, Trey Dockendorf
gres.conf (1.74 KB, text/plain)
2020-11-23 16:14 MST, Trey Dockendorf

Description Trey Dockendorf 2020-11-23 16:14:26 MST
Created attachment 16792 [details]
slurm.conf

I have configured 2 nodes in our cluster with the mps GRES at Count=28 per node.  I am submitting jobs with --gres=mps:1, but only one job starts per node.  Maybe this is a lack of understanding of how MPS works, but my reading of the documentation was that I could run multiple MPS jobs on a single node as long as the GRES Count was sufficiently high.

Attaching slurm.conf and gres.conf.
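In case it helps, the MPS-relevant pieces reduce to roughly the following (illustrative only; the File= path is a guess, see the attached files for the exact contents):

# slurm.conf node definition (one P100 per node, 28 MPS shares; other GRES omitted)
NodeName=o0805 Gres=gpu:p100:1,mps:p100:28 ...

# gres.conf on the node
Name=gpu Type=p100 File=/dev/nvidia0
Name=mps Count=28 File=/dev/nvidia0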

Job submission:

$ sbatch -n 1 --gres=mps:1 -p quick  --wrap 'nvidia-smi ; sleep 300'
Submitted batch job 6928
$ sbatch -n 1 --gres=mps:1 -p quick  --wrap 'nvidia-smi ; sleep 300'
Submitted batch job 6929
$ sbatch -n 1 --gres=mps:1 -p quick  --wrap 'nvidia-smi ; sleep 300'
Submitted batch job 6930
$ sbatch -n 1 --gres=mps:1 -p quick  --wrap 'nvidia-smi ; sleep 300'
Submitted batch job 6931

$ squeue -u tdockendorf -a
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
              6930     quick     wrap tdockend PD       0:00      1 (Resources) 
              6931     quick     wrap tdockend PD       0:00      1 (Priority) 
              6928     quick     wrap tdockend  R       0:15      1 o0805 
              6929     quick     wrap tdockend  R       0:15      1 o0806 

NODES:

[root@owens-rw01 ~]# scontrol show node=o0805
NodeName=o0805 Arch=x86_64 CoresPerSocket=14 
   CPUAlloc=1 CPUTot=28 CPULoad=0.19
   AvailableFeatures=r730,gpu,eth-owens-rack19h1,ib-i2l1s06,ib-i2,19,p100,quick
   ActiveFeatures=r730,gpu,eth-owens-rack19h1,ib-i2l1s06,ib-i2,19,p100,quick
   Gres=gpu:p100:1,mps:p100:28,pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1,vis:1
   NodeAddr=10.2.19.1 NodeHostName=o0805 Version=20.02.5
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 
   RealMemory=120832 AllocMem=4315 FreeMem=123145 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=2 Owner=N/A MCS_label=N/A
   Partitions=quick 
   BootTime=2020-11-23T17:07:40 SlurmdStartTime=2020-11-23T17:55:05
   CfgTRES=cpu=28,mem=118G,billing=28,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/gpu=1,gres/gpu:p100=1,gres/ime=1,gres/mps=28,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1,gres/vis=1
   AllocTRES=cpu=1,mem=4315M,gres/mps=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

[root@owens-rw01 ~]# scontrol show node=o0806
NodeName=o0806 Arch=x86_64 CoresPerSocket=14 
   CPUAlloc=1 CPUTot=28 CPULoad=0.14
   AvailableFeatures=r730,gpu,eth-owens-rack19h1,ib-i2l1s06,ib-i2,19,p100,quick
   ActiveFeatures=r730,gpu,eth-owens-rack19h1,ib-i2l1s06,ib-i2,19,p100,quick
   Gres=gpu:p100:1,mps:p100:28,pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1,vis:1
   NodeAddr=10.2.19.2 NodeHostName=o0806 Version=20.02.5
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 
   RealMemory=120832 AllocMem=4315 FreeMem=123069 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=2 Owner=N/A MCS_label=N/A
   Partitions=quick 
   BootTime=2020-11-23T17:01:37 SlurmdStartTime=2020-11-23T17:55:04
   CfgTRES=cpu=28,mem=118G,billing=28,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/gpu=1,gres/gpu:p100=1,gres/ime=1,gres/mps=28,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1,gres/vis=1
   AllocTRES=cpu=1,mem=4315M,gres/mps=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 1 Trey Dockendorf 2020-11-23 16:14:43 MST
Created attachment 16793 [details]
gres.conf
Comment 3 Trey Dockendorf 2020-11-24 09:44:39 MST
I'm going to close this. Our use case was to deploy a set of nodes in a hidden partition called "quick" which we set to only allow job submission from web nodes.  The goal was to let these jobs all access the GPUs on a node without requesting a GPU, so that, say, 28 single-core jobs could all access the GPU.  We can't modify cgroup.conf to disable constraining devices because we use a configless cgroup.conf from our scheduler and we only want this cgroup bypass for 2 of our 824 nodes.  So the solution I've come up with is to not define GRES for the GPUs on these nodes, which appears to keep Slurm from denying access to the GPUs via the cgroup.
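(For context, the cgroup.conf knob in question is the cluster-wide ConstrainDevices setting, which a single configless file can't scope to just those two nodes; illustrative snippet below.)

# cgroup.conf (shared cluster-wide; illustrative)
ConstrainDevices=yes   # jobs only see the devices allocated to them via GRES
# Setting this to "no" would lift the device restriction, but for all 824 nodes, not just the 2 web-facing ones.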

- Trey
Comment 4 Michael Hinton 2020-11-24 10:01:54 MST
Perfect, thanks for the update on that. Since you aren't defining GRES on those nodes anymore, does that mean you aren't using MPS now?

If you do decide to use MPS, I should mention a built-in limitation that you will probably run into sooner or later. According to the NVIDIA MPS docs, only one user can use the MPS server at a time. You might want to test your MPS setup with separate users just to confirm. This limitation might be a deal-breaker for your intended purposes. See bug 7834 comment 12.
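One way to check would be something like the following, run as two different users against the same node (user names and share counts here are placeholders):

# as user alice
$ sbatch -n 1 --gres=mps:14 -w o0805 --wrap 'nvidia-smi ; sleep 300'
# as user bob, targeting the same node
$ sbatch -n 1 --gres=mps:14 -w o0805 --wrap 'nvidia-smi ; sleep 300'
$ squeue -w o0805
# If the one-user-at-a-time limitation applies, the second user's job should stay pending until
# the first finishes, even though 14 + 14 shares fit within the node's 28.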

-Michael
Comment 5 Trey Dockendorf 2020-11-24 10:06:04 MST
Yes, now that those nodes have no GPU GRES defined, the MPS logic has also been removed; the MPS GRES was just an attempt at allowing shared access to the GPUs.

Thanks,
- Trey