Bug 3600

Summary: How to implement user resource throttling using GrpTRES settings?
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Limits
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
CC: tshaw
Version: 16.05.10
Hardware: Linux
OS: Linux
Site: DTU Physics
Attachments: Output from the command: squeue -u murme

Description Ole.H.Nielsen@fysik.dtu.dk 2017-03-20 05:26:04 MDT
We're in the process of working out how to implement user resource throttling in Slurm.  The goal is to limit the maximum number of CPU cores a user's running jobs may occupy at any one time, and it would also be good to prevent too many long-running jobs from monopolizing the cluster.

Our old cluster uses the MAUI scheduler, which has detailed "throttling policies" (see http://docs.adaptivecomputing.com/maui/throttling306.php) that we would also like to adopt in our Slurm cluster.

I assume the relevant configuration would be based on https://slurm.schedmd.com/resource_limits.html, but I have a hard time translating the information in that manual into an actual configuration policy.

Question 1: If I want to limit user xxx to running on max 1000 CPU cores at a time, and also prohibit too many long-running jobs from starting, could you kindly confirm that the following would be the correct approach:

sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000

It's a little tricky to display these new Grp* settings, but this works for me:

sacctmgr show assoc where user=xxx

Question 2: If I would also like to implement throttling for a group of users under a given account, should I then apply GrpTRES and GrpTRESRunMin to the parent account, or should I use a QOS?  Examples would be appreciated.

Thanks,
Ole
Comment 2 Tim Shaw 2017-03-20 13:15:49 MDT
Ole,

Yes, using the command:

sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000

for user "xxx" is the correct way to setup these limits.  Some good commands to see these limits are:

sacctmgr show assoc tree

and

sacctmgr show assoc tree format=account,user,maxtres

Regarding question #2, you will want to apply GrpTRES & GrpTRESRunMin to the parent account rather than to a QOS.  That way, if you ever want to override these limits, you can use a QOS to do so.
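For illustration, setting these limits on a parent account could look like the following; the account name "science" and the values here are just examples, not a recommendation for your site:

sacctmgr modify account where name=science set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000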

Hope that helps.

Regards.

Tim
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2017-03-20 15:15:52 MDT
(In reply to Tim Shaw from comment #2)
> Yes, using the command:
> 
> sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000
> 
> for user "xxx" is the correct way to setup these limits.  Some good commands
> to see these limits are:
> 
> sacctmgr show assoc tree
> 
> and
> 
> show assoc tree format=account,user,maxtres

Thanks, this command gives the desired info:

sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

> Regarding question #2, you will want apply GrpTRES & GrpTRESRunMin to the
> parent account, but don't use a QOS.  That way, if you ever want to override
> these limits you can use a QOS to do so.

Why wouldn't I just reapply new GrpTRES and GrpTRESRunMin limits instead?  I guess I still need to understand what a QOS does for you in Slurm (with Maui we use a QOS to bump specific jobs or users up in priority).

Could you kindly give examples of how you might use QOS for some specific purposes?

Thanks,
Ole
Comment 4 Tim Shaw 2017-03-21 09:59:50 MDT
Ole,

I'll try to explain this better.  In Slurm, associations are meant to establish base limits on the defined partitions, accounts, and users.  Because limits propagate down through the association tree, you only need to define limits at a high level and they will apply to all partitions, accounts, and users below that point (parent to child).  You can also override those high-level (parent) limits by explicitly setting different limits at any lower level (on the child).  So the association tree is the best way to apply the base limits you want in most cases.

QOSs are meant to override those base limits for exceptional cases.  As with Maui, you can use a QOS to set a different priority; again, the QOS overrides the base priority that could be set in the associations.  In Slurm, QOSs are the most powerful enforcement tool.

For example, setting a priority on an Account and overriding it with a QOS:

First set up weights in the slurm.conf for fairshare and QOS:

PriorityWeightFairshare=1000
PriorityWeightQOS=10000

Add an account with a fairshare priority:

sacctmgr add account science Description="science accounts" Organization=science fairshare=5

Add a QOS meant to override the base priority:

sacctmgr add qos high priority=10

Submit a normal job to the account:

sbatch -A science job.sh

Note: every job has a QOS; since none was specified here, the job uses the default "normal" QOS, which has nothing configured.

Submit another job to the account but override the priority using the high QOS:

sbatch -A science --qos=high job.sh

This job's priority is higher because the high QOS now contributes to the job's priority.
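If you want to verify the difference, the per-factor priority breakdown (including the QOS factor) can be inspected with sprio; a quick sketch, with made-up job IDs:

sprio -l -j 1001,1002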

Hope that helps.

Tim
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2017-03-22 03:15:57 MDT
(In reply to Tim Shaw from comment #2)
> Yes, using the command:
> 
> sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000
> 
> for user "xxx" is the correct way to setup these limits.  Some good commands
> to see these limits are:
> 
> sacctmgr show assoc tree
> 
> and
> 
> show assoc tree format=account,user,maxtres

I had set the GrpTRES & GrpTRESRunMin for every user association, and this indeed resulted in the desired throttling of user jobs.

> Regarding question #2, you will want apply GrpTRES & GrpTRESRunMin to the
> parent account, but don't use a QOS.  That way, if you ever want to override
> these limits you can use a QOS to do so.

Using this approach, I've also applied GrpTRES & GrpTRESRunMin to the root account, expecting the values to be inherited down the account tree until it hits the actual user associations.  I verified that only the root account, and no other associations, had set the GrpTRES & GrpTRESRunMin by:

sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

Unfortunately, this setup doesn't give the expected result.  No user jobs are now held Idle (throttled) due to the limits.  I have waited several minutes, since slurmctld recalculates priorities every 5 minutes.  One user with queued jobs now runs on 1080 CPU cores despite the limit of 1000.

Question: Could you kindly explain to me exactly what is meant by your advice "apply GrpTRES & GrpTRESRunMin to the parent account"?  Wouldn't you expect the limits to be inherited from the root account down the association tree?

I then reapplied GrpTRES & GrpTRESRunMin to every user association, and cleared the root account's GrpTRES (cpu=-1) & GrpTRESRunMin (cpu=-1).  Now I should have the previous GrpTRES & GrpTRESRunMin setup again, as verified by:
sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

However, the limits are no longer working!  Here are my observations for one user "murme" with many jobs in the queue:

1. I attach a file with the output of "squeue -u murme" showing that limits are not applied.

2. sacctmgr shows that the limits are set:

# sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin | grep murme
     ecsvip               murme      cpu=1000   cpu=2000000 

3. Another command seemingly says that there are no limits (consistent with the output of squeue):

# sacctmgr show user name=murme format=account,user,grptres,GrpTRESRunMin
   Account       User       GrpTRES GrpTRESRunMin 
---------- ---------- ------------- ------------- 
                murme     

I wonder if we have found a bug, or if I've made some mistake or have a misunderstanding?
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2017-03-22 03:17:47 MDT
Created attachment 4238 [details]
Output from the command: squeue -u murme
Comment 7 Tim Shaw 2017-03-22 11:49:25 MDT
>Using this approach, I've also applied GrpTRES & GrpTRESRunMin to the root
>account, expecting the values to be inherited down the account tree until it
>hits the actual user associations.  I verified that only the root account,
>and no other associations, had set the GrpTRES & GrpTRESRunMin by:
>
>sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin
>
>Unfortunately, this setup doesn't give the expected result.  No user jobs are
>now held Idle (throttled) due to the limits.  I have waited a number of
>minutes for the slurmctld to recalculate priorities every 5 minutes.  One
>user with queued jobs now runs on 1080 CPU cores despite the limit of 1000.

Max limits are inherited down the association tree, but GrpTRES & GrpTRESRunMin (like all group limits) are for creating groups in the hierarchy and therefore are not inherited per se.  A group consists of all the associations below the point where the group limits are set, so these limits can't be inherited: they create a group and apply a combined total limit across all the associations in it.  In an attempt to educate you more on associations in general, I may have confused you with regard to GrpTRES & GrpTRESRunMin, so hopefully this clears things up.  Because you want to set a total CPU limit for each user, you were doing it correctly by setting GrpTRES and GrpTRESRunMin on each user association (which creates a little group containing only that one user).
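If you ever need to (re)apply such per-user limits across many users, a small shell loop over the account's user associations can save some typing.  This is only a rough sketch; the account name "ecsvip" and the limit values are used purely as examples:

# Set per-user group limits on every user association under the account.
for u in $(sacctmgr -nP show assoc where account=ecsvip format=user | sort -u); do
    [ -n "$u" ] || continue   # skip the account's own (userless) association line
    sacctmgr -i modify user where name="$u" account=ecsvip set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000
done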

Also, when printing associations with "sacctmgr", you will only see group limits at the level where they were set, not on every association in the group, which explains why, when you set them at the root level, they didn't appear anywhere else.


>I then reapplied GrpTRES & GrpTRESRunMin to every user association, and cleared the root
>account's GrpTRES (cpu=-1) & GrpTRESRunMin (cpu=-1).  Now I should have the previous
>GrpTRES & GrpTRESRunMin setup, and this is verified by:
>sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

>However, the limits are no longer working!

First, just make sure you still have AccountingStorageEnforce enabled in the slurm.conf.  For example:

AccountingStorageEnforce=associations,limits,qos,safe

If not, set it and restart the controller.
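For example, on a systemd-based system (the unit name may differ in your installation):

systemctl restart slurmctld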

Also, check to make sure the slurmctld also sees GrpTRES & GrpTRESRunMin set on the associations with:

scontrol show assoc_mgr

You should see the user's association record look like this, with GrpTRES & GrpTRESRunMins set:

ClusterName=chiron Account=science UserName=tshaw(1010) Partition= ID=4
    SharesRaw/Norm/Level/Factor=1/1.00/1/1.00
    UsageRaw/Norm/Efctv=0.00/0.00/0.00
    ParentAccount= Lft=3 DefAssoc=Yes
    GrpJobs=N(0)
    GrpSubmitJobs=N(0) GrpWall=N(0.00)
--> GrpTRES=cpu=1000(0),mem=N(0),energy=N(0),node=N(0)
    GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
--> GrpTRESRunMins=cpu=2000000(0),mem=N(0),energy=N(0),node=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=

If not, try restarting the controller (if you haven't already).

Let me know if you're still seeing the problem.

Tim
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2017-03-23 06:16:09 MDT
(In reply to Tim Shaw from comment #7)
> >I then reapplied GrpTRES & GrpTRESRunMin to every user association, and cleared the root
> >account's GrpTRES (cpu=-1) & GrpTRESRunMin (cpu=-1).  Now I should have the previous
> >GrpTRES & GrpTRESRunMin setup, and this is verified by:
> >sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin
> 
> >However, the limits are no longer working!
> 
> First, just make sure you still have AccountingStorageEnforce enabled in the
> slurm.conf.  For example:
> 
> AccountingStorageEnforce=associations,limits,qos,safe
> 
> If not, set it and restart the controller.

I already had associations,limits,qos, but I have now added safe as well and restarted slurmctld.

> Also, check to make sure the slurmctld also sees GrpTRES & GrpTRESRunMin set
> on the associations with:
> 
> scontrol show assoc_mgr
> 
> You should see user field that looks like this with GrpTRES & GrpTRESRunMins
> set:
> 
> ClusterName=chiron Account=science UserName=tshaw(1010) Partition= ID=4
>     SharesRaw/Norm/Level/Factor=1/1.00/1/1.00
>     UsageRaw/Norm/Efctv=0.00/0.00/0.00
>     ParentAccount= Lft=3 DefAssoc=Yes
>     GrpJobs=N(0)
>     GrpSubmitJobs=N(0) GrpWall=N(0.00)
> --> GrpTRES=cpu=1000(0),mem=N(0),energy=N(0),node=N(0)
>     GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
> --> GrpTRESRunMins=cpu=2000000(0),mem=N(0),energy=N(0),node=N(0)
>     MaxJobs= MaxSubmitJobs= MaxWallPJ=
>     MaxTRESPJ=
>     MaxTRESPN=
>     MaxTRESMinsPJ=
> 
> If not, try restarting the controller (if you haven't already).
> 
> Let me know if you're still seeing the problem.

The problem is seen on and off over time.  When I came in this morning, all of user murme's queued jobs had the reason AssocGrpCpuLimit, whereas last night they showed Priority.  Something changed overnight.

At this time the user's jobs have mixed status, a selection as shown by squeue:

25123    xeon24 9-OCH3_S    murme PD       0:00      1 (Priority)
25124    xeon24 9-OCH3_S    murme PD       0:00      1 (Priority)
25145    xeon24 9-OCH3_S    murme PD       0:00      1 (Priority)
25232     xeon8    8-13T    murme PD       0:00      1 (AssocGrpCpuLimit)
25233     xeon8    8-13T    murme PD       0:00      1 (AssocGrpCpuLimit)
25234     xeon8    8-13T    murme PD       0:00      1 (AssocGrpCpuLimit)

despite the user having reached the limit GrpTRES=cpu=1000 as seen by "scontrol show assoc_mgr":

ClusterName=niflheim Account=ecsvip UserName=murme(213660) Partition= ID=204
    SharesRaw/Norm/Level/Factor=3/0.01/439/0.06
    UsageRaw/Norm/Efctv=196334254.70/0.03/0.03
    ParentAccount= Lft=336 DefAssoc=Yes
    GrpJobs=N(55)
    GrpSubmitJobs=N(124) GrpWall=N(153342.24)
    GrpTRES=cpu=1000(1000),mem=N(7168000),energy=N(0),node=N(55)
    GrpTRESMins=cpu=N(3272237),mem=N(25003603883),energy=N(0),node=N(153342)
    GrpTRESRunMins=cpu=2000000(1359464),mem=N(9189229813),energy=N(0),node=N(83671)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=

This doesn't make sense to me: either the user has hit the GrpTRES limit or he hasn't, so I would expect all of his jobs to have the reason AssocGrpCpuLimit at this time.  I did restart slurmctld this morning, but that shouldn't cause the problem.
Comment 9 Tim Shaw 2017-03-23 11:38:23 MDT
>At this time the user's jobs have mixed status, a selection as shown by squeue:

>25123    xeon24 9-OCH3_S    murme PD       0:00      1 (Priority)
>25124    xeon24 9-OCH3_S    murme PD       0:00      1 (Priority)
>25145    xeon24 9-OCH3_S    murme PD       0:00      1 (Priority)
>25232     xeon8    8-13T    murme PD       0:00      1 (AssocGrpCpuLimit)
>25233     xeon8    8-13T    murme PD       0:00      1 (AssocGrpCpuLimit)
>25234     xeon8    8-13T    murme PD       0:00      1 (AssocGrpCpuLimit)

The reason not all of user murme's jobs show the "AssocGrpCpuLimit" reason is that job limits are not checked until a job is actually being evaluated to run.  Because jobs 25123, 25124, and 25145 cannot run right now due to their priority, their limits are not even considered; since these jobs cannot run right now, Slurm doesn't waste time processing their limits.  Once their priority increases enough for a job to be considered for starting, its limits are evaluated and enforced, and the reason will change to AssocGrpCpuLimit.
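For reference, the pending reason for each of the user's jobs can also be listed explicitly with squeue's format option; the format string below is just one possibility (%r prints the reason column):

squeue -u murme -o "%.10i %.10P %.12j %.8u %.2t %.25r"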

The important thing here is that you're not seeing jobs run after the user's limits (i.e. GrpTRES=cpu=1000) have been reached.

Can you confirm that this is the case?
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2017-03-28 08:14:50 MDT
(In reply to Tim Shaw from comment #9)
> The reason not all of user murme's jobs show the "AssocGrpCpuLimit" reason
> is that job limits are not checked until a job is actually being evaluated
> to run.  Because jobs 25123, 25124, and 25145 cannot run right now due to
> their priority, their limits are not even considered; since these jobs
> cannot run right now, Slurm doesn't waste time processing their limits.
> Once their priority increases enough for a job to be considered for
> starting, its limits are evaluated and enforced, and the reason will change
> to AssocGrpCpuLimit.

Thanks for the explanation.  This is certainly confusing for users (and new Slurm administrators), but I can see how it makes sense performance-wise to delay the evaluation of limits.

> The important thing here is that you're not seeing jobs run after the
> user's limits (i.e. GrpTRES=cpu=1000) have been reached.
> 
> Can you confirm that this is the case?

Yes, it seems that our configured limits are now being enforced correctly.

This case may be closed now.