Bug 2580

Summary: Questions regarding resource limits
Product: Slurm Reporter: Jeff White <jeff.white>
Component: Accounting    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 15.08.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4245
Site: Washington State University
Attachments: It's called slurm.conf, what do you think it would be?

Description Jeff White 2016-03-23 08:20:33 MDT
I am trying to determine how to implement certain resource limits on our cluster and I'm looking for guidance on how we should do so.

My partitions are configured as follows:

#
# PARTITIONS
#
# The default memory per CPU is 128GB / 20 CPU cores ~= 6GB,
# as that is the minimum any node has as of 2016-03-14 - Jeff White
PartitionName=DEFAULT MaxTime=10080 State=UP Default=NO DefMemPerCPU=6144
PartitionName=cahnrs Nodes=cn[1-13] DefMemPerCPU=12288 AllowAccounts=cahnrs
PartitionName=cahnrs_bigmem Nodes=sn[4] DefMemPerCPU=1992294 AllowAccounts=cahnrs
PartitionName=cahnrs_gpu Nodes=sn[2] DefMemPerCPU=256000 AllowAccounts=cahnrs
PartitionName=cas Nodes=cn[14-25] DefMemPerCPU=12288 AllowAccounts=cas
PartitionName=vcea Nodes=cn[26-28] DefMemPerCPU=12288 AllowAccounts=vcea
PartitionName=popgenom Nodes=cn[29-30] DefMemPerCPU=12288 AllowAccounts=cahnrs
PartitionName=katz Nodes=cn[31] DefMemPerCPU=12288 AllowAccounts=katz
PartitionName=free Nodes=cn[32-35] Default=YES DefMemPerCPU=12288 AllowAccounts=all
PartitionName=free_gpu Nodes=sn[3] DefMemPerCPU=256000 AllowAccounts=all
PartitionName=free_phi Nodes=sn[1] DefMemPerCPU=256000 AllowAccounts=all
PartitionName=popgenom Nodes=cn[33-34] DefMemPerCPU=12288 AllowAccounts=popgenom
PartitionName=linpack Nodes=cn[21-28] DefMemPerCPU=12288  AllowAccounts=all

I have accounts configured so that each partition has a 1-to-1 mapping to an account.  Each account is then associated with the users who should be able to access the partition.  The parent of all of those accounts is an account called "all" which has no users directly associated with it.
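
For reference, the hierarchy was built with sacctmgr commands roughly along these lines (the account and user names here are just examples pulled from our config):

sacctmgr add account all
sacctmgr add account cahnrs parent=all
sacctmgr add user jeff.white account=cahnrs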

I have not configured anything with qos, mostly because I'm not clear on what "qos" is and if I absolutely need it or not.  From what I can tell there are different types of resource limitations, some for partitions, some for qos, some for "associations", etc.

The first limit I'm being asked for is to prevent a single user (regardless of number of jobs they submit) from using all nodes in a partition.  Can you recommend how we can do that in Slurm?  Do I need to use qos for something like that or is there a setting that can be applied directly to a partition?
Comment 1 Jeff White 2016-03-23 08:44:08 MDT
Created attachment 2906 [details]
It's called slurm.conf, what do you think it would be?
Comment 2 Tim Wickberg 2016-03-23 09:05:57 MDT
(In reply to Jeff White from comment #0)
> I am trying to determine how to implement certain resource limits on our
> cluster and I'm looking for guidance on how we should do so.


> I have accounts configured so that each partition has a 1-to-1 mapping to an
> account.  Each account is then associated with the users who should be able
> to access the partition.  The parent of all of those accounts is an account
> called "all" which has no users directly associated with it.

If I recall, you have a strict "condo" model where each group only has access to their own hardware?

If so then your partitions look reasonable.

> I have not configured anything with qos, mostly because I'm not clear on
> what "qos" is and if I absolutely need it or not.  From what I can tell
> there are different types of resource limitations, some for partitions, some
> for qos, some for "associations", etc.

We could document this a bit better. I assume you've looked at

http://slurm.schedmd.com/qos.html
http://slurm.schedmd.com/resource_limits.html

An association is a mapping of limits to a (cluster, account) pair, possibly further scoped by partition and user.

sacctmgr is the tool to view and modify all of these; I'll give some examples below.

Think of a QOS as a set of limits (and a bucket of usage that is compared against those limits) that isn't tied to the accounting hierarchy.
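
To see the two side by side, something like this should work (the exact format field names may vary a bit between versions, so adjust as needed):

sacctmgr show assoc format=cluster,account,user,partition,qos
sacctmgr show qos format=name,maxtresperuser

The first lists the associations in your hierarchy and which QOS they can use; the second lists the QOS definitions and the per-user TRES limits attached to them.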

> The first limit I'm being asked for is to prevent a single user (regardless
> of number of jobs they submit) from using all nodes in a partition.  Can you
> recommend how we can do that in Slurm?  Do I need to use qos for something
> like that or is there a setting that can be applied directly to a partition?

There are a few approaches to this, depending on your preference.

One way would be to, per user, set a maximum number of nodes they have access to:

sacctmgr update user tim set maxtres=node=2

Another approach would be to set a max GrpTRESRunMins:

sacctmgr update user tim set grptresrunmins=cpu=1000

This would limit me to a maximum of 1000 cpu-minutes of jobs running total on the cluster at any time, which may be simpler to explain than why nodes may otherwise be sitting idle.
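
To double-check what ended up applied to a given user, a query along these lines should do it (again, the format field names may need minor adjustment on your version):

sacctmgr show assoc where user=tim format=user,account,maxtres,grptresrunmin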

If you'd rather not set these limits individually, you could use a QOS per account to set a value for everyone in the account:

sacctmgr add qos acct_timlab maxtresperuser=node=2

(Note that "acct_timlab" is just my shorthand for a QOS tied to the timlab account - the name can be any text string you prefer.)

sacctmgr update account timlab set qos=acct_timlab

An easy way to verify that this is set correctly is:
sacctmgr show assoc tree format=account,user,qos

Alternatively, if you get to a point where multiple accounts have access to the same partition, you can build a QOS with the various limits and set it on the partition itself and have that apply to everyone running on that partition.

Let me know where I can elaborate on this; I'm hoping it at least points you in the right direction.

- Tim
Comment 3 Jeff White 2016-03-24 10:16:06 MDT
(In reply to Tim Wickberg from comment #2)
> If I recall, you have a strict "condo" model where each group only has
> access to their own hardware?
> 
> If so then your partitions look reasonable.

That's correct.  There's also a "free" partition anyone can access and at some point in the future there will be a backfill partition that will also be accessible by any user.

> One way would be to, per user, set a maximum number of nodes they have
> access to:
> 
> sacctmgr update user tim set maxtres=node=2
> 
> Another approach would be to set a max GrpTRESRunMins
> 
> sacctmgr update user tim set grptresrunmins=cpu=1000
> 
> This would limit me to a maximum of 1000 cpu-minutes of jobs running total
> on the cluster at any time, which may be simpler to explain than why nodes
> may otherwise be sitting idle.
> 
> If you'd rather not set these limits individually, you could use a QOS per
> account to set a value for everyone in the account:
> 
> sacctmgr add qos acct_timlab maxtresperuser=node=2
> 
> (Note that "acct_timlab" is just my shorthand for a QOS tied to the timlab
> account - the name can be any text string you prefer.)
> 
> sacctmgr update account timlab set qos=acct_timlab
> 
> An easy way to verify that this is set correctly is:
> sacctmgr show assoc tree format=account,user,qos
> 
> Alternatively, if you get to a point where multiple accounts have access to
> the same partition, you can build a QOS with the various limits and set it
> on the partition itself and have that apply to everyone running on that
> partition.

We have the opposite at the moment; for example, my user is in multiple accounts.  Most users are only in one account, though, and that account maps 1-to-1 to a partition.  The exception is an account named "all" which is the parent of all other accounts.  No idea if that's a good idea, but it works, so that's what I did.

What you describe above seems like it would be good for global limits, but what I'm looking for, I guess, is "max nodes per user per partition regardless of number of jobs".  In an ideal world it would simply be "PartitionName=blah MaxNodesPerUser=50%", so a single user can't take more than 50% of a partition (regardless of what they may be doing in other partitions).  How could we implement that?  Partitions have a "MaxNodes" parameter, but that's per job, so a user can simply submit multiple jobs to get around the limitation.
Comment 4 Tim Wickberg 2016-03-25 06:26:01 MDT
> We have the opposite at the moment, for example my user is in multiple
> accounts.  Most users are only in one account though and that account is a
> 1-to-1 mapping to a partition.  The exception is an account named "all"
> which is the parent of all other accounts.  No idea if that's a good idea
> but it works so that's what I did.
> 
> What you describe above seems like that would be good for global limits but
> what I'm looking for I guess is "max nodes per user per partition regardless
> of number of jobs".  In an ideal world it would simply be
> "PartitionName=blah MaxNodesPerUser=50%" so a single user can't take more
> than 50% of a partition (regardless of what they may be doing in other
> partitions).  How could we implement that?  Partitions have a "MaxNodes"
> parameter but that's per job so a user can simply submit multiple jobs to
> get around the limitation.

The easiest way to do this will be with a "Partition QOS" defined with a strict node limit (there is no "50%" setting available; it works off absolute counts only, so you'd need to adjust the numbers to suit).

The MaxTRESPerUser flag is designed to handle this exact situation. Briefly, you'd define a QOS to use on the partition as:

sacctmgr add qos part_example maxtresperuser=node=2

Once created, you can apply this QOS to a partition by setting QOS=part_example in the partition definition in slurm.conf.

For example, my line is now:
PartitionName=example Nodes=zoidberg[01-04] MaxTime=7-0 QOS=part_example

Running 'scontrol reconfigure' will apply that change to the partition definition.

'scontrol show part example' can be used to confirm the setting is applied. 'scontrol show assoc_mgr' can give you a detailed look into the internal status of the various QOS and association limits that are currently in use on the cluster.

You may also want to set SchedulerParameters=assoc_limit_continue to prevent the highest-priority job in a given partition from blocking other jobs in that partition from launching.
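
As a sketch, that's just one extra line in slurm.conf (if you already have a SchedulerParameters line, append assoc_limit_continue to its comma-separated list instead), followed by another 'scontrol reconfigure':

SchedulerParameters=assoc_limit_continue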
Comment 5 Jeff White 2016-03-25 09:08:20 MDT
I created a QOS as you described, and it doesn't seem to be working as expected.  I'm now doing this on a development system; I'll upload its config.  To keep things simple, I have a single partition and I applied a single QOS to it:

#
# PARTITIONS
#
PartitionName=DEFAULT MaxTime=10080 State=UP Default=NO DefMemPerCPU=100
PartitionName=whatever Nodes=dn[1-4] QOS=whatever

That QOS was created with `sacctmgr add qos whatever maxtresperuser=node=2`.  I restarted slurmctld, then submitted a few single-CPU jobs.  Two of them began running, taking up a single node which has 2 CPU cores.  The next two jobs went into PENDING with QOSMaxNodePerUserLimit.  Shouldn't the QOS have allowed these jobs to run, as my user only had running jobs on a single node?

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                33  whatever run_burn jeff.whi PD       0:00      1 (QOSMaxNodePerUserLimit)
                34  whatever run_burn jeff.whi PD       0:00      1 (QOSMaxNodePerUserLimit)
                31  whatever run_burn jeff.whi  R       4:52      1 dn1
                32  whatever run_burn jeff.whi  R       4:52      1 dn1

# sacctmgr show qos whatever
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU       MinTRES 
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- 
  whatever          0   00:00:00                cluster                                                        1.000000                                                                                                                                       node=2

# scontrol show part whatever
PartitionName=whatever
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=whatever
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dn[1-4]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=8 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerCPU=100 MaxMemPerNode=UNLIMITED
Comment 6 Tim Wickberg 2016-03-25 09:12:17 MDT
> That QOS was created with `sacctmgr add qos whatever maxtresperuser=node=2`.
> I restarted slurmctld then submitted a few single CPU jobs.  Two of them
> began running, taking up a single node which has 2 CPU cores.  The next two
> jobs went into PENDING with QOSMaxNodePerUserLimit.  Shouldn't the QOS have
> allowed these jobs to run as my user only had running jobs on a single node?

Each of the "nodes" from separate jobs counts independently against the limit, even though the jobs are packed onto a single node. So in this case one node isn't one node, but two.

I should have pointed out that caveat. I'd suggest limiting by CPU counts instead; that'll be a bit more intuitive.
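
As a rough sketch based on your dev system (2 cores per node, so cpu=4 corresponds to the two-node intent), you could switch the existing QOS over with something like:

sacctmgr modify qos whatever set maxtresperuser=cpu=4,node=-1

Setting node=-1 should clear the old per-user node limit; 'sacctmgr show qos whatever' will confirm what ended up in place.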
Comment 7 Jeff White 2016-03-29 08:21:19 MDT
Closing the issue; we were able to get the limits in place as described.