Bug 12031

Summary: sacct cannot print GPU usage information
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Accounting
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2021-07-14 05:40:24 MDT
Management has asked me to provide cluster usage reports ASAP, with specific data for each job such as that provided by

$ sacct -o JobID,User,AllocNodes,AllocCPUS

I also need the job walltime, which I can calculate as End - Start.
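
Alternatively, I believe sacct's Elapsed field should report the walltime directly, so something like this ought to cover everything except the GPU count:

$ sacct -o JobID,User,AllocNodes,AllocCPUS,Elapsed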

However, the number of GPUs used per job seems to be unavailable to sacct in the database.  Jobs that are currently running can be queried like this:

$ squeue -hO tres-per-node -j 3827335
gpu:RTX3090:1    

It would seem that the tres-per-node information is not recorded in the database.

Question: Is there some way to read the tres-per-node information from the database?

Question: Is there some way to configure the recording of tres-per-node, if it isn't there already?

Thanks,
Ole
Comment 1 Ben Roberts 2021-07-14 08:33:30 MDT
Hi Ole,

You can have the TRES information recorded in the database.  It shows up in the AllocTRES field, but the controller does have to be configured to record that information.  You use the AccountingStorageTRES parameter to define which TRES you would like to track in the database.

Here is an example where I have the following in my slurm.conf:
AccountingStorageTRES=billing,cpu,energy,mem,node,gres/gpu:tesla1



I submit a job that requests one 'tesla1' GPU, and you can see that sacct reports the GPU as allocated:

$ sbatch -N1 --gres=gpu:tesla1:1 --wrap='srun sleep 5'
Submitted batch job 29745

$ sacct -j 29745 --format=jobid,account,alloctres%45
       JobID    Account                                     AllocTRES 
------------ ---------- --------------------------------------------- 
29745              sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
29745.batch        sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
29745.extern       sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
29745.0            sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
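
If the AllocTRES field is truncated (the trailing '+' above), you can increase the field width further or use parsable output, which should print the full string, e.g.:

$ sacct -j 29745 -p --format=jobid,account,alloctres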


Let me know if you have any questions about this or if you're seeing different results.

Thanks,
Ben
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2021-07-14 09:26:05 MDT
Hi Ben,

(In reply to Ben Roberts from comment #1)
> You can have the TRES information recorded in the database.  It shows up in
> the AllocTRES field, but the controller does have to be configured to record
> that information.  You use the AccountingStorageTRES parameter to define
> which TRES you would like to track in the database.
> [...]

Thanks for the info! I configured this in slurm.conf:

AccountingStorageTRES=gres/gpu,gres/gpu:K20Xm,gres/gpu:RTX3090

and restarted slurmctld.  Now we do get GPU accounting as desired:

$ sacct -j 3829403 -p --format=jobid,account,alloctres
JobID|Account|AllocTRES|
3829403|ecsstud|billing=28,cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
3829403.batch|ecsstud|cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
3829403.extern|ecsstud|billing=28,cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
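
For the management report, I guess something along these lines should then extract the GPU count per job going forward (an untested sketch: the gres/gpu=N entry is parsed out of AllocTRES, and the date range is only an example):

$ sacct -a -X -P -S 2021-07-14 -E 2021-07-31 \
    --format=jobid,user,allocnodes,alloccpus,elapsed,alloctres |
  awk -F'|' '{ n=0; if (match($6, /gres\/gpu=[0-9]+/)) n=substr($6, RSTART+9, RLENGTH-9); print $1 "|" $2 "|" $3 "|" $4 "|" $5 "|" n }'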

I guess you may close this case now.

Thanks,
Ole
Comment 3 Ben Roberts 2021-07-14 10:18:20 MDT
I'm glad that will get you the information you need going forward, but I'm sorry that you don't have the historical data you were looking for.  I'll close this ticket but let us know if there is anything else we can do to help.
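
As a side note, once the GPU TRES are being recorded, sreport should also be able to give aggregate GPU usage for management reports, for example (the dates and time unit here are just placeholders, and I haven't verified this against your setup):

$ sreport cluster AccountUtilizationByUser start=2021-08-01 end=2021-09-01 --tres=gres/gpu -t Hours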

Thanks,
Ben