| Summary: | Set QOS for GRES:GPU=6 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Damien <damien.leong> |
| Component: | Configuration | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | felip.moll |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3397 | | |
| Site: | Monash University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 17.11.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Damien 2018-02-12 06:28:05 MST
Make sure to have set AccountingStorageTRES=gres/gpu in slurm.conf, then restart slurmctld. (The restart is required to push the update to slurmdbd.) Running 'scontrol reconfigure' after that will prevent a lot of warning messages about the configuration file being out of sync with the compute nodes, although that's not a big deal here; they aren't affected by that change.

After that's done, you can set your limit with:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=4

Users running under that QOS will then be limited to 4 devices (requested either as --gres=gpu:4 or --gres=gpu:p40:4, or through a combination of jobs).

What is the exact misbehavior you are seeing? Which version are you running?

Hi.

Thanks for the replies.

Can we put these together?

AccountingStorageTRES=gres/gpu
AccountingStorageType=accounting_storage/slurmdbd

Any impact?

Cheers
Damien

(In reply to Damien from comment #2)
> Can we put these together?
>
> AccountingStorageTRES=gres/gpu
> AccountingStorageType=accounting_storage/slurmdbd

You must set both if you want to track TRES resources like gpu and if you want to use slurmdbd. These are independent; see man slurm.conf for extended info about these params:

AccountingStorageTRES: Comma-separated list of resources you wish to track on the cluster. ...
AccountingStorageType: The accounting storage mechanism type. ...

> Any impact?

If you don't track gres/gpu, you will not be able to count resources and therefore limits will not be applied.

I assumed you already have AccountingStorageEnforce=qos and GresTypes=gpu, with NodeName entries in slurm.conf such as:

NodeName=gamba3 ... Gres=gpu:tesla:2

and a matching gres.conf entry:

NodeName=gamba3 Name=gpu Type=tesla File=/dev/nvidia[10-11]

Tell me how it goes. I checked it and think the behavior is what you want.

Hi Felip

Many thanks for your replies.

We are using 16.05.04. Yes, we are planning to update this soon.
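The setup steps described above can be sketched as the following admin session (a sketch only: the QOS name `foo` and the 4-GPU limit come from the example in this ticket, and the systemctl invocation assumes a systemd-managed slurmctld; adapt both to your site):

```shell
# slurm.conf settings assumed by this ticket:
#   AccountingStorageTRES=gres/gpu
#   AccountingStorageType=accounting_storage/slurmdbd
#   AccountingStorageEnforce=qos
#   GresTypes=gpu

# Restart slurmctld so the TRES change is pushed to slurmdbd
systemctl restart slurmctld

# Optional: resync compute nodes to silence config-mismatch warnings
scontrol reconfigure

# Limit users under QOS 'foo' to 4 GPUs in total
sacctmgr modify qos foo set MaxTRESPerUser=gres/gpu=4

# Verify the limit took effect
sacctmgr show qos foo format=Name,MaxTRESPerUser
```

Since these commands require a live Slurm cluster with accounting configured, they are shown here as a reference fragment rather than a runnable script.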
For the part 2) question: We have two flavours of GPUs in this cluster, for example the P100s and the P40s.

Can we enforce a cluster-wide QOS limit on a particular model of GPU, rather than on GPUs in general?

Something like:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu:P40=4

Does this make sense? Is it logical?

Kindly advise. Thanks.

Cheers
Damien

(In reply to Felip Moll from comment #3)
> You must set both if you want to track TRES resources like gpu and if you
> want to use slurmdbd. These are independent; see man slurm.conf for
> extended info about these params.
>
> Tell me how it goes. I checked it and think the behavior is what you want.

(In reply to Damien from comment #4)
> Hi Felip
>
> Many thanks for your replies.
>
> We are using 16.05.04. Yes, we are planning to update this soon.

This is necessary; remember that our support model requires customers to stay within the last two major releases (currently 17.02 or 17.11).

> For the part 2) question: We have two flavours of GPUs in this cluster,
> for example the P100s and the P40s.
>
> Can we enforce a cluster-wide QOS limit on a particular model of GPU,
> rather than on GPUs in general?
>
> Something like:
>
> sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu:P40=4
>
> Does this make sense? Is it logical?

Although it makes sense and is logical, I am sorry to say that the current implementation has some limitations and this is one of them. Currently only gpu in general can be tracked. The same question was asked in bug 3397.

Hope it helped.

Damien,
This comment:
> Although it makes sense and is logical, I am sorry to say that the current
> implementation has some limitations and this is one of them. Currently only
> gpu in general can be tracked. The same question was asked in bug 3397.
>
> Hope it helped.
was for version <17.02.
As of version 17.02 we already support the feature you requested, i.e.:
slurm.conf:
AccountingStorageTRES=gres/gpu:p100,gres/gpu:tesla
GresTypes=gpu
NodeName=xx2 ... Gres=gpu:p100:1
NodeName=xx3 ... Gres=gpu:tesla:1
NodeName=xx4 ... Gres=gpu:tesla:3
and
]$ cat gres.conf
NodeName=gamba2 Name=gpu Type=p100 File=/dev/nv1
NodeName=gamba3 Name=gpu Type=tesla File=/dev/nv2
NodeName=gamba4 Name=gpu Type=tesla File=/dev/nv[3-5]
]$ sacctmgr show qos -pn
test|0|00:00:00||cluster|||1.000000|||||||||||gres/gpu:tesla=1|||||||
This should do what you are seeking.
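For reference, a per-type limit like the one shown in that qos dump could be set along these lines (a sketch: the QOS name `test` is taken from the output above, while the user name `someuser` is purely hypothetical):

```shell
# Create the QOS if it does not exist yet
sacctmgr add qos test

# Cap tesla-type GPUs at 1 per user; this requires
# AccountingStorageTRES=gres/gpu:tesla to be tracked in slurm.conf
sacctmgr modify qos test set MaxTRESPerUser=gres/gpu:tesla=1

# Attach the QOS to an association so the limit actually applies
sacctmgr modify user someuser set qos+=test

# Inspect the result (parsable, no header), as shown above
sacctmgr show qos -pn
```

These commands need a running slurmdbd with accounting enforcement enabled, so treat them as a reference fragment.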
The problem is that I just found a bug in 17.11 with this feature. I am currently working on a fix, but the functionality is there and should fit all your needs.
Hi Felip

Yes, this solution works for us; we just tested it.

The main key for us is:

-- AccountingStorageEnforce=qos --

Thanks.
Damien

Thanks for your help in this matter.

Please close this ticket.

Cheers
Damien

(In reply to Damien from comment #13)
> Thanks for your help in this matter.
>
> Please close this ticket.

Hi Damien, thanks for your answer. I would like to keep this open a little longer since I am working on a fix for this matter. Specifically: when we have a QoS limit like maxtrespu=gres/gpu:tesla=1 and we submit a job asking for --gres=gpu:3, the QoS limit is ignored.

I don't know if it is currently affecting you; if it is a concern, a possible workaround would be to use a job_submit lua plugin to reject jobs that don't specify the GPU model.

Hi,

Documentation has finally been added for this issue, so I am closing this bug now.

Commit: c2c06468, available in 17.11.6.
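To illustrate the limit-bypass bug discussed in this ticket (a hypothetical submission session; `job.sh` is a placeholder batch script, and the behavior shown is as described here for the affected 17.11 releases):

```shell
# QoS limit in effect: MaxTRESPerUser=gres/gpu:tesla=1

# Typed request: counted against the gres/gpu:tesla limit,
# so a second concurrent job like this is held pending
sbatch --gres=gpu:tesla:1 job.sh

# Untyped request: in the affected releases this was NOT counted
# against the typed limit and could run with 3 GPUs, bypassing it
sbatch --gres=gpu:3 job.sh
```

This is why the suggested interim workaround was a job_submit lua plugin that rejects jobs omitting the GPU type.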