Summary: | Configuration MaxNodesPerUser in QOS | | |
---|---|---|---|
Product: | Slurm | Reporter: | Damien <damien.leong> |
Component: | Configuration | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | | |
Priority: | --- | | |
Version: | 16.05.4 | | |
Hardware: | Linux | | |
OS: | Linux | | |
Site: | Monash University | | |
Description
Damien
2017-10-30 09:33:23 MDT
(In reply to Damien from comment #0)

> Hi Slurm Support
>
> We are trying to configure 'MaxNodesPerUser' in QOS, but have some strange
> results, see this:
>
> ----
> [root@m3-login2 ~]# sacctmgr modify QOS name=m3h set MaxNodesPU=4
>  Modified qos...
>   m3h
> Would you like to commit changes? (You have 30 seconds to decide)
> (N/y): y
> [root@m3-login2 ~]#
> [root@m3-login2 ~]# sacctmgr show qos m3h
>       Name   Priority  GraceTime PreemptMode UsageFactor  MaxTRESPU
> ---------- ---------- ---------- ----------- ----------- ----------
>        m3h          0   00:00:00     cluster    1.000000     node=4
> (all other QOS columns are empty and omitted here for readability)
>
> [smaruf@m3-login2 ~]$ squeue -u smaruf
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 1153281       m3h run_docm   smaruf  R 3-13:54:28      1 m3h001
> 1154011       m3h run_docm   smaruf  R 2-18:47:58      1 m3h006
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157417
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157418
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157419
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157420
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157421
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157422
>
> [smaruf@m3-login2 ~]$ squeue -u smaruf
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 1157419       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1157420       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1157421       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1157422       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1153281       m3h run_docm   smaruf  R 3-13:55:01      1 m3h001
> 1154011       m3h run_docm   smaruf  R 2-18:48:31      1 m3h006
> 1157418       m3h slurm-se   smaruf  R       0:15      1 m3h001
> 1157417       m3h slurm-se   smaruf  R       0:18      1 m3h001
> [smaruf@m3-login2 ~]$
> ----
>
> These are all 1-CPU-core jobs, and m3h001 has 24 CPU cores. In theory we
> should be able to squeeze more jobs onto one m3h node, but the remaining
> jobs are hit with 'QOSMaxNodePerUserLimit'. The intention of the 'sacctmgr'
> command above was to stop one user from utilising more than 4 nodes in this
> partition, but it seems to translate into allowing only 4 jobs per user in
> the m3h partition.
>
> We have an older cluster running Slurm 14.08; when we run the same sacctmgr
> command there, it does prevent one user from utilising more than 4 nodes in
> the selected partition.
>
> Is our command syntax correct, or is our interpretation of the Slurm
> documentation wrong?

A job running on a node always counts as a separate node; the way this is
designed, it's not able to group multiple jobs running on the same node
together into a single "node" resource. So as soon as you have jobs running
across four nodes, no further jobs will launch, even if they could fit in
alongside those other jobs.

Our usual recommendation is to structure the QOS limits around CPUs to avoid
this complication.

- Tim

Hi Tim,

Thanks for your reply.

1) If we implement this with CPU limits instead, for example 4 nodes (24
cores each) becoming a 96-CPU limit on the selected partition, a user could
still have single-core jobs that ask for high memory, which together with
other single-core jobs from the same user could still end up taking more
than 4 nodes. Is there a method to prevent this? We are trying to limit
users from taking up too many of these premium nodes in this partition.
2) On a related note, each of these nodes has 2x P100 GPU cards, see:

--
cat gres.conf
#slurm gres file for m3h001
#No Of Devices=2
Name=gpu Type=P100-PCIE-16GB File=/dev/nvidia0 CPUs=0-27
Name=gpu Type=P100-PCIE-16GB File=/dev/nvidia1 CPUs=0-27
--

Can we use a QOS to limit users to 4x GPUs each? If possible, how do we set
up this QOS with sacctmgr, and with which parameters or syntax?

Kindly advise. Thanks.

Cheers,
Damien

(In reply to Tim Wickberg from comment #1)

[earlier quoting of comments #0 and #1 trimmed]

> 1) If we implement this with CPU limits instead, for example 4 nodes (24
> cores each) becoming a 96-CPU limit on the selected partition, a user could
> still have single-core jobs that ask for high memory, which together with
> other single-core jobs from the same user could still end up taking more
> than 4 nodes. Is there a method to prevent this?
>
> We are trying to limit users from taking up too many of these premium
> nodes in this partition.

You'd want to look into limits around GrpTRES, and to set them on mem and/or
cpu values. So you could limit them to 40 CPUs and 300GB of memory total
with something like:

sacctmgr update user tim set grptres=cpu=40,mem=300gb

> 2) On a related note, each of these nodes has 2x P100 GPU cards.
>
> Can we use a QOS to limit users to 4x GPUs each? If possible, how do we
> set up this QOS with sacctmgr, and with which parameters or syntax?

You could either use a QOS, or set the limit on the user directly. You'd
want to do a few things:

1) Make sure that you have the gpu type defined in AccountingStorageTRES in
slurm.conf, e.g.:

AccountingStorageTRES=gres/gpu

(Restart slurmctld after making any change to that line.)

2) Use either MaxTRES (to limit what a single job can do) or GrpTRES (to
limit what the collection of jobs can do, either for an individual
user/account or in the QOS) to limit the gres/gpu type. Something like:

sacctmgr update qos normal set grptres=gres/gpu=4

on an appropriate QOS would handle what I believe you're after.

Thanks for these details. We will give this a try and do more testing to
achieve our objectives.

Cheers,
Damien

Hey Damien -

Marking resolved/infogiven now; please reopen if you have any further
questions.

- Tim
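Editor's note: pulling the pieces of this thread together, the setup Tim describes could be sketched as the sequence of admin commands below. The user name `tim`, the QOS names, and the numeric limits are the illustrative values from the ticket, not site-specific recommendations; the `maxtrespu` line is an addition not spelled out above, based on the distinction that GrpTRES on a QOS caps all jobs in the QOS combined, while MaxTRESPerUser caps each user separately.

```shell
# Prerequisite in slurm.conf (restart slurmctld after changing it),
# so that GPU usage is tracked as a TRES at all:
#
#   AccountingStorageTRES=gres/gpu

# Aggregate CPU + memory cap on a single user's running jobs
# (example values from the ticket):
sacctmgr update user tim set grptres=cpu=40,mem=300gb

# GPU cap across all jobs in the QOS combined (Tim's example):
sacctmgr update qos normal set grptres=gres/gpu=4

# Alternatively, a per-user GPU cap within the QOS, which matches
# the "4x GPUs per user" phrasing of the original question:
sacctmgr modify qos m3h set maxtrespu=gres/gpu=4

# Verify the limits took effect:
sacctmgr show qos m3h format=Name,MaxTRESPU,GrpTRES
sacctmgr show assoc user=tim format=User,GrpTRES
```

These commands only take effect for jobs submitted after the limits are in place; already-running jobs are not requeued.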