| Summary: | Set QOS for GRES:GPU=6 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Damien <damien.leong> |
| Component: | Configuration | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | felip.moll |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3397 | | |
| Site: | Monash University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 17.11.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Damien 2018-02-12 06:28:05 MST
Make sure to have set AccountingStorageTRES=gres/gpu in slurm.conf, then restart slurmctld. (The restart is required to push the update to slurmdbd.) Running 'scontrol reconfigure' after that will prevent a lot of warning messages about the configuration file being out of sync with the compute nodes, although that's not a big deal here; they aren't affected by that change.

After that's done, you can set your limit with:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=4

Users running under that QOS will then be limited to 4 devices (requested either as --gres=gpu:4 or --gres=gpu:p40:4, or through a combination of jobs).

What is the exact misbehavior you are seeing? Which version are you running?

Hi.

Thanks for the replies.

Can we put these together?

AccountingStorageTRES=gres/gpu
AccountingStorageType=accounting_storage/slurmdbd

Any impact?

Cheers
Damien

(In reply to Damien from comment #2)
> Can we put these together?
>
> AccountingStorageTRES=gres/gpu
> AccountingStorageType=accounting_storage/slurmdbd

You must set both if you want to track TRES resources like gpu and if you want to use slurmdbd. These are independent; see man slurm.conf for extended info about these params:

AccountingStorageTRES: Comma-separated list of resources you wish to track on the cluster. ...
AccountingStorageType: The accounting storage mechanism type. ...

> Any impact?

If you don't track gres/gpu, you will not be able to count resources and therefore limits will not be applied.

I assumed you already have AccountingStorageEnforce=qos and GresTypes=gpu, with NodeName entries in slurm.conf such as:

NodeName=gamba3 ... Gres=gpu:tesla:2

and a matching gres.conf entry:

NodeName=gamba3 Name=gpu Type=tesla File=/dev/nvidia[10-11]

Tell me how it goes. I checked it and think the behavior is what you want.

Hi Felip

Many thanks for your replies.

We are using 16.05.04. Yes, we are planning to update this soon.
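The setup steps described above can be sketched as the following admin session (a sketch only: the QOS name `foo` and the 4-GPU limit come from the example in this ticket, and the systemctl invocation assumes a systemd-managed slurmctld; adapt both to your site):

```shell
# slurm.conf settings assumed by this ticket:
#   AccountingStorageTRES=gres/gpu
#   AccountingStorageType=accounting_storage/slurmdbd
#   AccountingStorageEnforce=qos
#   GresTypes=gpu

# Restart slurmctld so the TRES change is pushed to slurmdbd
systemctl restart slurmctld

# Optional: resync compute nodes to silence config-mismatch warnings
scontrol reconfigure

# Limit users under QOS 'foo' to 4 GPUs in total
sacctmgr modify qos foo set MaxTRESPerUser=gres/gpu=4

# Verify the limit took effect
sacctmgr show qos foo format=Name,MaxTRESPerUser
```

Since these commands require a live Slurm cluster with accounting configured, they are shown here as a reference fragment rather than a runnable script.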
For the part 2) question: We have two flavours of GPUs in this cluster, for example the P100s and the P40s.

Can we enforce a cluster-wide QOS limit on a particular model of GPU, rather than on GPUs in general?

Something like:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu:P40=4

Does this make sense? Is it logical?

Kindly advise. Thanks.

Cheers
Damien

(In reply to Felip Moll from comment #3)
> You must set both if you want to track TRES resources like gpu and if you
> want to use slurmdbd. These are independent; see man slurm.conf for
> extended info about these params.
>
> Tell me how it goes. I checked it and think the behavior is what you want.

(In reply to Damien from comment #4)
> Hi Felip
>
> Many thanks for your replies.
>
> We are using 16.05.04. Yes, we are planning to update this soon.

This is necessary; remember that our support model requires customers to stay within the last two major releases (currently 17.02 or 17.11).

> For the part 2) question: We have two flavours of GPUs in this cluster,
> for example the P100s and the P40s.
>
> Can we enforce a cluster-wide QOS limit on a particular model of GPU,
> rather than on GPUs in general?
>
> Something like:
>
> sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu:P40=4
>
> Does this make sense? Is it logical?

Although it makes sense and is logical, I am sorry to say that the current implementation has some limitations and this is one of them. Currently only gpu in general can be tracked. The same question was asked in bug 3397.

Hope it helped.

Damien,
This comment:
> Although it makes sense and is logical, I am sorry to say that the current
> implementation has some limitations and this is one of them. Currently only
> gpu in general can be tracked. The same question was asked in bug 3397.
>
> Hope it helped.
was for version <17.02.
As of version 17.02 we already support the feature you requested, i.e.:
slurm.conf:
AccountingStorageTRES=gres/gpu:p100,gres/gpu:tesla
GresTypes=gpu
NodeName=xx2 ... Gres=gpu:p100:1
NodeName=xx3 ... Gres=gpu:tesla:1
NodeName=xx4 ... Gres=gpu:tesla:3
and
]$ cat gres.conf
NodeName=gamba2 Name=gpu Type=p100 File=/dev/nv1
NodeName=gamba3 Name=gpu Type=tesla File=/dev/nv2
NodeName=gamba4 Name=gpu Type=tesla File=/dev/nv[3-5]
]$ sacctmgr show qos -pn
test|0|00:00:00||cluster|||1.000000|||||||||||gres/gpu:tesla=1|||||||
This should do what you are seeking.
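For reference, a per-type limit like the one shown in that qos dump could be set along these lines (a sketch: the QOS name `test` is taken from the output above, while the user name `someuser` is purely hypothetical):

```shell
# Create the QOS if it does not exist yet
sacctmgr add qos test

# Cap tesla-type GPUs at 1 per user; this requires
# AccountingStorageTRES=gres/gpu:tesla to be tracked in slurm.conf
sacctmgr modify qos test set MaxTRESPerUser=gres/gpu:tesla=1

# Attach the QOS to an association so the limit actually applies
sacctmgr modify user someuser set qos+=test

# Inspect the result (parsable, no header), as shown above
sacctmgr show qos -pn
```

These commands need a running slurmdbd with accounting enforcement enabled, so treat them as a reference fragment.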
The problem is that I just found a bug in 17.11 with this feature. I am currently working on a fix, but the functionality is there and should fit all your needs.
Hi Felip

Yes, this solution works for us; we just tested it.

The main key for us is:

-- AccountingStorageEnforce=qos --

Thanks.
Damien

Thanks for your help in this matter.

Please close this ticket.

Cheers
Damien

(In reply to Damien from comment #13)
> Thanks for your help in this matter.
>
> Please close this ticket.

Hi Damien, thanks for your answer. I would like to keep this open a little longer since I am working on a fix for this matter. Specifically: when we have a QoS limit like maxtrespu=gres/gpu:tesla=1 and we submit a job asking for --gres=gpu:3, the QoS limit is ignored.

I don't know if it is currently affecting you; if it is a concern, a possible workaround would be to use a job_submit lua plugin to reject jobs that don't specify the GPU model.

Hi,

Documentation has finally been added for this issue, so I am closing this bug now.

Commit: c2c06468, available in 17.11.6.
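To illustrate the limit-bypass bug discussed in this ticket (a hypothetical submission session; `job.sh` is a placeholder batch script, and the behavior shown is as described here for the affected 17.11 releases):

```shell
# QoS limit in effect: MaxTRESPerUser=gres/gpu:tesla=1

# Typed request: counted against the gres/gpu:tesla limit,
# so a second concurrent job like this is held pending
sbatch --gres=gpu:tesla:1 job.sh

# Untyped request: in the affected releases this was NOT counted
# against the typed limit and could run with 3 GPUs, bypassing it
sbatch --gres=gpu:3 job.sh
```

This is why the suggested interim workaround was a job_submit lua plugin that rejects jobs omitting the GPU type.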