Ticket 7668 - Improve sacctmgr to print all user limits as well as current usage
Summary: Improve sacctmgr to print all user limits as well as current usage
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 18.08.8
Hardware: Linux Linux
Importance: --- C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-08-30 06:19 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2019-08-30 06:19 MDT

See Also:
Site: DTU Physics


Description Ole.H.Nielsen@fysik.dtu.dk 2019-08-30 06:19:20 MDT
As discussed in Bug 6790, users need to be able to query Slurm for any kind of user or account limit, and furthermore for the current usage counted against that limit.  This would enable users to understand why their jobs are Pending due to Slurm limits.

Unfortunately, the current Slurm commands sacctmgr and sshare are capable of printing only a subset of all possible limits, and they can't print the current usage numbers. 

Therefore I have written a small "showuserlimits" tool available from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits.  It can display all types of Slurm limits, including for example:

showuserlimits -l MaxJobsAccrue
showuserlimits -l GrpJobsAccrue

The current limits and usage numbers are actually printed by the Slurm command:
$ scontrol -o show assoc_mgr users=xxx account=camdvip flags=assoc
which displays limits in a format like "GrpTRES=cpu=1500(80)" (i.e., Limit(Value) pairs) for every limit and association of user xxx.
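For reference, here is a rough Python sketch (my own, not part of Slurm) of how those Limit(Value) pairs could be parsed out of the scontrol output; the regular expression and field handling are assumptions based on the format shown above:

#!/usr/bin/env python3
# Sketch only: parse the Limit(Value) pairs from "scontrol -o show assoc_mgr"
# output into a dict, based on the "GrpTRES=cpu=1500(80)" format shown above.
# The regular expression and field handling are my assumptions, not Slurm code.
import re
import subprocess

def assoc_records(user, account):
    cmd = ["scontrol", "-o", "show", "assoc_mgr",
           "users=" + user, "account=" + account, "flags=assoc"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return out.splitlines()

def parse_tres(value):
    # "cpu=1500(80),mem=N(0)" -> {"cpu": ("1500", "80"), "mem": ("N", "0")}
    # a limit of "N" means that no limit is set
    return {m.group(1): (m.group(2), m.group(3))
            for m in re.finditer(r"([\w/]+)=([^(,]+)\((\d+)\)", value)}

for record in assoc_records("xxx", "camdvip"):
    for field in record.split():
        if field.startswith("GrpTRES="):
            print(parse_tres(field.split("=", 1)[1]))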

Let me roughly sketch how one might add new options to the command "sacctmgr show association cluster=xxx account=yyy user=zzz partition=www".

Currently one can select only an incomplete set of limits using the "format=aaa" option.  Please note that "sacctmgr format=..." is actually undocumented on the https://slurm.schedmd.com/sacctmgr.html page (except for a couple of usage examples).  See also my comments in https://bugs.schedmd.com/show_bug.cgi?id=6790#c3.  Could you kindly document the format= option properly in the man page?

Contribution: I propose that the "format=aaa" option be extended to allow the specification of *any* available limit, PLUS the user's current usage for each limit.  Example:

sacctmgr show association user=zzz format=GrpTRESRunMins/sublimit

where "sublimit" may be any available item such as cpu, mem, energy, node, billing, gres/gpu etc.

To also print the current usage, I suggest adding a new specifier to the format option, for example:

format=name%U

in which case the Limit as well as the Usage will be printed.  How should the output look?  I don't think the Limit(Usage) format printed by scontrol is easy to parse, so I suggest one of these variants:  Limit/Usage or Limit,Usage.
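A trivial sketch of how the proposed "%U" specifier could render such a pair in the Limit/Usage style (again only an illustration of the proposal, not existing behaviour):

# Render a (limit, usage) pair as "Limit/Usage"; "N" from scontrol means no limit
def render(limit, usage, with_usage=True):
    limit = "None" if limit in (None, "N") else str(limit)
    return "%s/%s" % (limit, usage) if with_usage else limit

print(render("1500", 80))    # 1500/80
print(render("N", 3152))     # None/3152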

Actually, user xxx will also be subject to the limits of the parent account.  The parent limits can be printed by omitting the "users=xxx" option and looking only at the first line with an empty user field.  It would be good to print the user limits together with the parent limits, which is what the "showuserlimits" script does.  Could sacctmgr be made to print both the user's limits and the limits of the parent account?  Users will definitely need to know which limits have hit them, be they user and/or account limits!
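As a sketch of that idea, one could select both records from the same "scontrol -o show assoc_mgr" output, taking the line with an empty user field as the parent (account) association. The helper below is hypothetical and assumes the one-record-per-line -o output format:

# Hypothetical helper: split "scontrol -o show assoc_mgr account=yyy flags=assoc"
# output into the user's own association and the parent (account) association.
# The exact UserName= field layout is an assumption based on the 18.08 output.
import subprocess

def account_assoc_records(account):
    cmd = ["scontrol", "-o", "show", "assoc_mgr",
           "account=" + account, "flags=assoc"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # association records are assumed to start with "ClusterName="
    return [line for line in out.splitlines() if line.startswith("ClusterName=")]

def user_and_parent(account, user):
    user_assoc = parent_assoc = None
    for record in account_assoc_records(account):
        if "UserName=" + user + "(" in record:         # the user's association
            user_assoc = record
        elif parent_assoc is None and "UserName= " in record:
            parent_assoc = record                       # empty user field = account
    return user_assoc, parent_assoc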


As an illustration, the "showuserlimits" script reveals the full set of limits in our 18.08 installation:

$ showuserlimits 
Association (Parent account):
	   ClusterName = 	niflheim
	       Account = 	camdvip
	      UserName = 	None, current value = Parent account
	     Partition = 	None, current value = Any partition
	            ID = 	25
	SharesRaw/Norm/Level/Factor = 	2147483647/0.00/565/0.00
	UsageRaw/Norm/Efctv = 	3752412418.72/0.11/0.11
	 ParentAccount = 	camd, current value = 16
	           Lft = 	967
	      DefAssoc = 	No
	       GrpJobs = 	None, current value = 29
	 GrpJobsAccrue = 	None, current value = 36
	 GrpSubmitJobs = 	None, current value = 72
	       GrpWall = 	None, current value = 1683530.91
	       GrpTRES = 
		     cpu:	Limit = None, current value = 3152
		     mem:	Limit = None, current value = 29122000
		  energy:	Limit = None, current value = 0
		    node:	Limit = None, current value = 127
		 billing:	Limit = None, current value = 3843
		 fs/disk:	Limit = None, current value = 0
		    vmem:	Limit = None, current value = 0
		   pages:	Limit = None, current value = 0

	   GrpTRESMins = 
		     cpu:	Limit = None, current value = 58267403
		     mem:	Limit = None, current value = 506526257641
		  energy:	Limit = None, current value = 0
		    node:	Limit = None, current value = 3105343
		 billing:	Limit = None, current value = 62429586
		 fs/disk:	Limit = None, current value = 0
		    vmem:	Limit = None, current value = 0
		   pages:	Limit = None, current value = 0

	GrpTRESRunMins = 
		     cpu:	Limit = None, current value = 5724829
		     mem:	Limit = None, current value = 47109030100
		  energy:	Limit = None, current value = 0
		    node:	Limit = None, current value = 250151
		 billing:	Limit = None, current value = 7120542
		 fs/disk:	Limit = None, current value = 0
		    vmem:	Limit = None, current value = 0
		   pages:	Limit = None, current value = 0

	       MaxJobs = 	
	 MaxJobsAccrue = 	
	 MaxSubmitJobs = 	
	     MaxWallPJ = 	
	     MaxTRESPJ = 	
	     MaxTRESPN = 	
	 MaxTRESMinsPJ = 	
	 MinPrioThresh = 	
Association (User):
	   ClusterName = 	niflheim
	       Account = 	camdvip
	      UserName = 	xxx, UID=123456
	     Partition = 	None, current value = Any partition
	            ID = 	58
	SharesRaw/Norm/Level/Factor = 	3/0.01/565/0.50
	UsageRaw/Norm/Efctv = 	975630.49/0.00/0.00
	 ParentAccount = 	
	           Lft = 	1018
	      DefAssoc = 	Yes
	       GrpJobs = 	None, current value = 0
	 GrpJobsAccrue = 	None, current value = 0
	 GrpSubmitJobs = 	None, current value = 0
	       GrpWall = 	None, current value = 241.08
	       GrpTRES = 
		     cpu:	Limit = 2000, current value = 0
		     mem:	Limit = None, current value = 0
		  energy:	Limit = None, current value = 0
		    node:	Limit = None, current value = 0
		 billing:	Limit = None, current value = 0
		 fs/disk:	Limit = None, current value = 0
		    vmem:	Limit = None, current value = 0
		   pages:	Limit = None, current value = 0

	   GrpTRESMins = 
		     cpu:	Limit = None, current value = 9737
		     mem:	Limit = None, current value = 87251849
		  energy:	Limit = None, current value = 0
		    node:	Limit = None, current value = 243
		 billing:	Limit = None, current value = 16068
		 fs/disk:	Limit = None, current value = 0
		    vmem:	Limit = None, current value = 0
		   pages:	Limit = None, current value = 0

	GrpTRESRunMins = 
		     cpu:	Limit = 1000000, current value = 0
		     mem:	Limit = None, current value = 0
		  energy:	Limit = None, current value = 0
		    node:	Limit = None, current value = 0
		 billing:	Limit = None, current value = 0
		 fs/disk:	Limit = None, current value = 0
		    vmem:	Limit = None, current value = 0
		   pages:	Limit = None, current value = 0

	       MaxJobs = 	200, current value = 0
	 MaxJobsAccrue = 	30, current value = 0
	 MaxSubmitJobs = 	200, current value = 0
	     MaxWallPJ = 	
	     MaxTRESPJ = 	
	     MaxTRESPN = 	
	 MaxTRESMinsPJ = 	
	 MinPrioThresh =