Bug 12031

Summary: sacct cannot print GPU usage information
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Accounting
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2021-07-14 05:40:24 MDT
Management has asked me to provide cluster usage reports ASAP, with specific data for each job such as that provided by

$ sacct -o JobID,User,AllocNodes,AllocCPUS

I also need the job walltime, which I can calculate as End - Start.
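
Alternatively, I believe sacct's Elapsed field should report the walltime directly, so something like this ought to cover everything except the GPU count:

$ sacct -o JobID,User,AllocNodes,AllocCPUS,Elapsed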

However, the number of GPUs used per job seems to be unavailable to sacct in the database.  Jobs that are currently running can be queried like this:

$ squeue -hO tres-per-node -j 3827335
gpu:RTX3090:1    

It would seem that the tres-per-node information is not recorded in the database.

Question: Is there some way to read the tres-per-node information from the database?

Question: Is there some way to configure the recording of tres-per-node, if it isn't there already?

Thanks,
Ole
Comment 1 Ben Roberts 2021-07-14 08:33:30 MDT
Hi Ole,

You can have the TRES information recorded in the database.  It shows up in the AllocTRES field, but the controller does have to be configured to record that information.  You use the AccountingStorageTRES parameter to define which TRES you would like to track in the database.

Here is an example where I have the following in my slurm.conf:
AccountingStorageTRES=billing,cpu,energy,mem,node,gres/gpu:tesla1



I submit a job that requests one 'tesla1' GPU, and you can see that sacct reports the GPU as allocated:

$ sbatch -N1 --gres=gpu:tesla1:1 --wrap='srun sleep 5'
Submitted batch job 29745

$ sacct -j 29745 --format=jobid,account,alloctres%45
       JobID    Account                                     AllocTRES 
------------ ---------- --------------------------------------------- 
29745              sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
29745.batch        sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
29745.extern       sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
29745.0            sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
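
If the AllocTRES field is truncated (the trailing '+' above), you can increase the field width further or use parsable output, which should print the full string, e.g.:

$ sacct -j 29745 -p --format=jobid,account,alloctres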


Let me know if you have any questions about this or if you're seeing different results.

Thanks,
Ben
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2021-07-14 09:26:05 MDT
Hi Ben,

(In reply to Ben Roberts from comment #1)
> You can have the TRES information recorded in the database.  It shows up in
> the AllocTRES field, but the controller does have to be configured to record
> that information.  You use the AccountingStorageTRES parameter to define
> which TRES you would like to track in the database.
> [...]

Thanks for the info! I configured this in slurm.conf:

AccountingStorageTRES=gres/gpu,gres/gpu:K20Xm,gres/gpu:RTX3090

and restarted slurmctld.  Now we do get GPU accounting as desired:

$ sacct -j 3829403 -p --format=jobid,account,alloctres
JobID|Account|AllocTRES|
3829403|ecsstud|billing=28,cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
3829403.batch|ecsstud|cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
3829403.extern|ecsstud|billing=28,cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
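
For the management report, I guess something along these lines should then extract the GPU count per job going forward (an untested sketch: the gres/gpu=N entry is parsed out of AllocTRES, and the date range is only an example):

$ sacct -a -X -P -S 2021-07-14 -E 2021-07-31 \
    --format=jobid,user,allocnodes,alloccpus,elapsed,alloctres |
  awk -F'|' '{ n=0; if (match($6, /gres\/gpu=[0-9]+/)) n=substr($6, RSTART+9, RLENGTH-9); print $1 "|" $2 "|" $3 "|" $4 "|" $5 "|" n }'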

I guess you may close this case now.

Thanks,
Ole
Comment 3 Ben Roberts 2021-07-14 10:18:20 MDT
I'm glad that will get you the information you need going forward, but I'm sorry that you don't have the historical data you were looking for.  I'll close this ticket but let us know if there is anything else we can do to help.
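
As a side note, once the GPU TRES are being recorded, sreport should also be able to give aggregate GPU usage for management reports, for example (the dates and time unit here are just placeholders, and I haven't verified this against your setup):

$ sreport cluster AccountUtilizationByUser start=2021-08-01 end=2021-09-01 --tres=gres/gpu -t Hours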

Thanks,
Ben