Summary: | How to implement user resource throttling using GrpTRES settings? | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | Limits | Assignee: | Director of Support <support> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | tshaw |
Version: | 16.05.10 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | ||
Attachments: | Output from the command: squeue -u murme |
Description
Ole.H.Nielsen@fysik.dtu.dk
2017-03-20 05:26:04 MDT
Ole,

Yes, using the command:

    sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000

for user "xxx" is the correct way to set up these limits. Some good commands to see these limits are:

    sacctmgr show assoc tree

and

    sacctmgr show assoc tree format=account,user,maxtres

Regarding question #2, you will want to apply GrpTRES & GrpTRESRunMin to the parent account, but don't use a QOS. That way, if you ever want to override these limits you can use a QOS to do so.

Hope that helps.

Regards,
Tim

(In reply to Tim Shaw from comment #2)
> Yes, using the command:
>
>     sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000
>
> for user "xxx" is the correct way to set up these limits. Some good commands
> to see these limits are:
>
>     sacctmgr show assoc tree
>
> and
>
>     sacctmgr show assoc tree format=account,user,maxtres

Thanks, this command gives the desired info:

    sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

> Regarding question #2, you will want to apply GrpTRES & GrpTRESRunMin to the
> parent account, but don't use a QOS. That way, if you ever want to override
> these limits you can use a QOS to do so.

Why wouldn't I just reapply new GrpTRES and GrpTRESRunMin limits? I guess I still need to understand what a QOS does for you with Slurm (with Maui we use QOS to bump specific jobs or users up in priority). Could you kindly give examples of how you might use QOS for some specific purposes?

Thanks,
Ole

Ole,

I'll try to explain this better. With Slurm, the associations are meant to establish base limits on the defined partitions, accounts, & users. Because limits propagate down through the association tree, you only need to define limits at a high level and those limits will be applied to all partitions, accounts, & users that are below it (parent to child). You can also override those high-level (parent) limits by explicitly setting different limits at any lower level (on the child).
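As a side note, a quick sanity check of what the two values above mean together. This is a rough sketch under the interpretation that GrpTRES=cpu caps the CPUs a user may occupy at once, while GrpTRESRunMin=cpu caps the sum of (CPUs x remaining minutes) over all of that user's running jobs:

```shell
# The limits used in this ticket
grptres_cpu=1000          # max CPUs in use at once
grptresrunmin_cpu=2000000 # cap on sum of (CPUs x remaining minutes) of running jobs

# At the full 1000-CPU usage, the aggregate remaining-runtime budget is:
minutes=$((grptresrunmin_cpu / grptres_cpu))
hours=$((minutes / 60))
echo "budget at full CPU usage: ${minutes} minutes (~${hours} hours)"
```

So a user saturating the CPU limit can have at most about 33 hours of aggregate remaining wall time queued up in running jobs.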
So using the association tree is the best way to get some base limits applied that you want for most cases. QOS's are meant to override any of those base limits for exceptional cases. Like Maui, you can use QOS's to set a different priority. Again, the QOS would be overriding the base priority that could be set in the associations. In Slurm, QOS's are the most powerful enforcing tool.

For example, setting a priority on an account and overriding it with a QOS:

First set up weights in the slurm.conf for fairshare and QOS:

    PriorityWeightFairshare=1000
    PriorityWeightQOS=10000

Add an account with a fairshare priority:

    sacctmgr add account science Description="science accounts" Organization=science fairshare=5

Add a QOS meant to override the base priority:

    sacctmgr add qos high priority=10

Submit a normal job to the account:

    sbatch -A science job.sh

Note: because every job has a QOS set, this job uses the "normal" QOS, which has nothing configured. The "normal" QOS is the default QOS.

Submit another job to the account but override the priority using the high QOS:

    sbatch -A science --qos=high job.sh

This job's priority is higher because the high QOS is now a factor in the job priority.

Hope that helps.

Tim

(In reply to Tim Shaw from comment #2)
> Yes, using the command:
>
>     sacctmgr modify user xxx set GrpTRES=cpu=1000 GrpTRESRunMin=cpu=2000000
>
> for user "xxx" is the correct way to set up these limits. Some good commands
> to see these limits are:
>
>     sacctmgr show assoc tree
>
> and
>
>     sacctmgr show assoc tree format=account,user,maxtres

I had set the GrpTRES & GrpTRESRunMin for every user association, and this indeed resulted in the desired throttling of user jobs.

> Regarding question #2, you will want to apply GrpTRES & GrpTRESRunMin to the
> parent account, but don't use a QOS. That way, if you ever want to override
> these limits you can use a QOS to do so.
Using this approach, I've also applied GrpTRES & GrpTRESRunMin to the root account, expecting the values to be inherited down the account tree until it hits the actual user associations. I verified that only the root account, and no other associations, had GrpTRES & GrpTRESRunMin set, by:

    sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

Unfortunately, this setup doesn't give the expected result. No user jobs are now held Idle (throttled) due to the limits. I have waited a number of minutes for the slurmctld to recalculate priorities (every 5 minutes). One user with queued jobs now runs on 1080 CPU cores despite the limit of 1000.

Question: Could you kindly explain exactly what is meant by your advice to "apply GrpTRES & GrpTRESRunMin to the parent account"? Wouldn't you expect the limits to be inherited from the root account down the association tree?

I then reapplied GrpTRES & GrpTRESRunMin to every user association, and cleared the root account's GrpTRES (cpu=-1) & GrpTRESRunMin (cpu=-1). Now I should have the previous GrpTRES & GrpTRESRunMin setup, and this is verified by:

    sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin

However, the limits are no longer working! Here are my observations for one user "murme" with many jobs in the queue:

1. I attach a file with the output of "squeue -u murme" showing that limits are not applied.

2. sacctmgr says that the limits should be set:

    # sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin | grep murme
        ecsvip          murme     cpu=1000   cpu=2000000

3. Another command seemingly says that there are no limits (consistent with the output of squeue):

    # sacctmgr show user name=murme format=account,user,grptres,GrpTRESRunMin
       Account       User       GrpTRES GrpTRESRunMin
    ---------- ---------- ------------- -------------
                    murme

I wonder if we have found a bug, or if I've made some mistake or have a misunderstanding?

Created attachment 4238 [details]
Output from the command: squeue -u murme
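As an aside, squeue output like the attached file can be summarized by pending reason with a short pipeline. This is a sketch over a hypothetical canned sample; on a live cluster the reason column would come from `squeue -h -u murme -t PD -o %r`:

```shell
# Hypothetical sample of pending-job reasons; on a live cluster:
#   reasons=$(squeue -h -u murme -t PD -o %r)
reasons='Priority
Priority
Priority
AssocGrpCpuLimit
AssocGrpCpuLimit
AssocGrpCpuLimit'

# Tally pending jobs per reason, most common first
printf '%s\n' "$reasons" | sort | uniq -c | sort -rn
```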
> Using this approach, I've also applied GrpTRES & GrpTRESRunMin to the root account,
> expecting the values to be inherited down the account tree until it hits the actual
> user associations. I verified that only the root account, and no other associations,
> had GrpTRES & GrpTRESRunMin set, by:
>
>     sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin
>
> Unfortunately, this setup doesn't give the expected result. No user jobs are now
> held Idle (throttled) due to the limits. I have waited a number of minutes for the
> slurmctld to recalculate priorities every 5 minutes. One user with queued jobs now
> runs on 1080 CPU cores despite the limit of 1000.

Max limits are inherited down the association tree, but GrpTRES & GrpTRESRunMin (as well as all group limits) are for creating groups in the hierarchy tree and therefore don't get inherited per se. A group consists of all the associations below the level where the group limits are set. So these limits can't be inherited, because they create a group and apply a combined total limit for all the associations in it.

In an attempt to educate you more on associations in general, I may have confused you with regard to GrpTRES & GrpTRESRunMin, so hopefully this clears things up. Because you want to set total CPUs for each user, you were doing it correctly by setting GrpTRES and GrpTRESRunMin on each user (which creates a little group containing only that one user with limits).

Also, when printing associations with "sacctmgr", you will only see group limits at the level where they were set, not on every association in the group. This explains why, when you set them on the root level, they didn't appear anywhere else.

> I then reapplied GrpTRES & GrpTRESRunMin to every user association, and cleared the root
> account's GrpTRES (cpu=-1) & GrpTRESRunMin (cpu=-1).
> Now I should have the previous GrpTRES & GrpTRESRunMin setup, and this is verified by:
>
>     sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin
>
> However, the limits are no longer working!

First, just make sure you still have AccountingStorageEnforce enabled in the slurm.conf. For example:

    AccountingStorageEnforce=associations,limits,qos,safe

If not, set it and restart the controller.

Also, check to make sure the slurmctld sees GrpTRES & GrpTRESRunMin set on the associations with:

    scontrol show assoc_mgr

You should see a user record that looks like this, with GrpTRES & GrpTRESRunMins set:

    ClusterName=chiron Account=science UserName=tshaw(1010) Partition= ID=4
        SharesRaw/Norm/Level/Factor=1/1.00/1/1.00
        UsageRaw/Norm/Efctv=0.00/0.00/0.00
        ParentAccount= Lft=3 DefAssoc=Yes
        GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
    --> GrpTRES=cpu=1000(0),mem=N(0),energy=N(0),node=N(0)
        GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
    --> GrpTRESRunMins=cpu=2000000(0),mem=N(0),energy=N(0),node=N(0)
        MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ=

If not, try restarting the controller (if you haven't already).

Let me know if you're still seeing the problem.

Tim

(In reply to Tim Shaw from comment #7)
> > I then reapplied GrpTRES & GrpTRESRunMin to every user association, and cleared the root
> > account's GrpTRES (cpu=-1) & GrpTRESRunMin (cpu=-1). Now I should have the previous
> > GrpTRES & GrpTRESRunMin setup, and this is verified by:
> >
> >     sacctmgr show assoc tree format=account,user,grptres,GrpTRESRunMin
> >
> > However, the limits are no longer working!
>
> First, just make sure you still have AccountingStorageEnforce enabled in the
> slurm.conf. For example:
>
>     AccountingStorageEnforce=associations,limits,qos,safe
>
> If not, set it and restart the controller.

I already had associations,limits,qos, but now I added safe also and restarted slurmctld.
> Also, check to make sure the slurmctld sees GrpTRES & GrpTRESRunMin set
> on the associations with:
>
>     scontrol show assoc_mgr
>
> You should see a user record that looks like this, with GrpTRES & GrpTRESRunMins
> set:
>
>     ClusterName=chiron Account=science UserName=tshaw(1010) Partition= ID=4
>         SharesRaw/Norm/Level/Factor=1/1.00/1/1.00
>         UsageRaw/Norm/Efctv=0.00/0.00/0.00
>         ParentAccount= Lft=3 DefAssoc=Yes
>         GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
>     --> GrpTRES=cpu=1000(0),mem=N(0),energy=N(0),node=N(0)
>         GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
>     --> GrpTRESRunMins=cpu=2000000(0),mem=N(0),energy=N(0),node=N(0)
>         MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ=
>
> If not, try restarting the controller (if you haven't already).
>
> Let me know if you're still seeing the problem.

The problem is seen on and off over time. When I came in this morning, all queued jobs of user murme had the status AssocGrpCpuLimit rather than the Priority status they showed last night. Something changed overnight.
At this time the user's jobs have mixed status; a selection as shown by squeue:

    25123 xeon24 9-OCH3_S murme PD 0:00 1 (Priority)
    25124 xeon24 9-OCH3_S murme PD 0:00 1 (Priority)
    25145 xeon24 9-OCH3_S murme PD 0:00 1 (Priority)
    25232  xeon8    8-13T murme PD 0:00 1 (AssocGrpCpuLimit)
    25233  xeon8    8-13T murme PD 0:00 1 (AssocGrpCpuLimit)
    25234  xeon8    8-13T murme PD 0:00 1 (AssocGrpCpuLimit)

despite the user having reached the limit GrpTRES=cpu=1000, as seen by "scontrol show assoc_mgr":

    ClusterName=niflheim Account=ecsvip UserName=murme(213660) Partition= ID=204
        SharesRaw/Norm/Level/Factor=3/0.01/439/0.06
        UsageRaw/Norm/Efctv=196334254.70/0.03/0.03
        ParentAccount= Lft=336 DefAssoc=Yes
        GrpJobs=N(55) GrpSubmitJobs=N(124) GrpWall=N(153342.24)
        GrpTRES=cpu=1000(1000),mem=N(7168000),energy=N(0),node=N(55)
        GrpTRESMins=cpu=N(3272237),mem=N(25003603883),energy=N(0),node=N(153342)
        GrpTRESRunMins=cpu=2000000(1359464),mem=N(9189229813),energy=N(0),node=N(83671)
        MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ=

This doesn't make sense to me: either the user has hit the GrpTRES limit or he hasn't, so I think that all jobs should have the status AssocGrpCpuLimit at this time. I did restart slurmctld this morning, so that shouldn't cause the problem.

> At this time the user's jobs have mixed status; a selection as shown by squeue:
>
>     25123 xeon24 9-OCH3_S murme PD 0:00 1 (Priority)
>     25124 xeon24 9-OCH3_S murme PD 0:00 1 (Priority)
>     25145 xeon24 9-OCH3_S murme PD 0:00 1 (Priority)
>     25232  xeon8    8-13T murme PD 0:00 1 (AssocGrpCpuLimit)
>     25233  xeon8    8-13T murme PD 0:00 1 (AssocGrpCpuLimit)
>     25234  xeon8    8-13T murme PD 0:00 1 (AssocGrpCpuLimit)

The reason why not all of user murme's jobs have the "AssocGrpCpuLimit" reason is that job limits are not looked at until the job is being evaluated to run. Because jobs 25123, 25124, & 25145 cannot run right now due to their priority, their limits are not even considered.
Because these jobs cannot run right now, Slurm doesn't waste time processing their limits. As a job's priority increases high enough for it to start running, the job's limits are evaluated and enforced, and the reason will change to AssocGrpCpuLimit.

The important thing here is that you're not seeing jobs run after the user's limits (i.e. GrpTRES=cpu=1000) have been reached.

Can you confirm that is the case?

(In reply to Tim Shaw from comment #9)
> The reason why not all of user murme's jobs have the "AssocGrpCpuLimit" reason
> is that job limits are not looked at until the job is being evaluated to
> run. Because jobs 25123, 25124, & 25145 cannot run right now due to their
> priority, their limits are not even considered. Because these jobs cannot
> run right now, Slurm doesn't waste time processing their limits. As a job's
> priority increases high enough for it to start running, the job's limits
> are evaluated and enforced, and the reason will change to AssocGrpCpuLimit.

Thanks for the explanation. This is certainly confusing for users (and new Slurm administrators), but I can see how it makes sense performance-wise to delay the evaluation of limits.

> The important thing here is that you're not seeing jobs run after the
> user's limits (i.e. GrpTRES=cpu=1000) have been reached.
>
> Can you confirm that is the case?

Yes, it seems that our configured limits are now being enforced correctly. This case may be closed now.
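Since the resolution here is that these group limits must be set on each user association individually (they do not inherit down the tree), a periodic audit helps catch users that were missed. A minimal sketch, assuming pipe-delimited output in the shape produced by `sacctmgr -nP show assoc format=account,user,grptres`; the canned sample below is hypothetical so the pipeline is self-contained:

```shell
# Hypothetical canned dump; on a live cluster it would come from:
#   assoc_dump=$(sacctmgr -nP show assoc format=account,user,grptres)
assoc_dump='root||
ecsvip|murme|cpu=1000
ecsvip|alice|'

# Print user associations (non-empty user column) lacking a GrpTRES limit
printf '%s\n' "$assoc_dump" | awk -F'|' '$2 != "" && $3 == "" { print $1 "/" $2 }'
```

Account-only rows (empty user column) are skipped, so limits set at account level don't hide a missing per-user limit.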