Bug 12031 - sacct cannot print GPU usage information
Summary: sacct cannot print GPU usage information
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 20.11.8
Hardware: Linux / OS: Linux
Importance: --- 3 - Medium Impact
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-07-14 05:40 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2021-07-14 10:18 MDT
CC List: 0 users

See Also:
Site: DTU Physics
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Ole.H.Nielsen@fysik.dtu.dk 2021-07-14 05:40:24 MDT
Management has asked me to provide cluster usage reports ASAP, with specific data for each job, such as the output of:

$ sacct -o JobID,User,AllocNodes,AllocCPUS

I also need the job walltime, which I can calculate as End - Start.
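
Presumably something along these lines would pull the columns I need (field names as listed in the sacct man page), and the Elapsed field should give the walltime directly:

$ sacct -X -p -o JobID,User,AllocNodes,AllocCPUS,Start,End,Elapsed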

However, the number of GPUs used per job seems to be unavailable to sacct from the database.  Jobs that are currently running can be queried like this:

$ squeue -hO tres-per-node -j 3827335
gpu:RTX3090:1    

It would seem that the tres-per-node information is not recorded in the database.

Question: Is there some way to read the tres-per-node information from the database?

Question: Is there some way to configure the recording of tres-per-node, if it isn't there already?

Thanks,
Ole
Comment 1 Ben Roberts 2021-07-14 08:33:30 MDT
Hi Ole,

You can have the TRES information recorded in the database.  It shows up in the AllocTRES field, but the controller does have to be configured to record that information.  You use the AccountingStorageTRES parameter to define which TRES you would like to track in the database.

Here is an example where I have the following in my slurm.conf:
AccountingStorageTRES=billing,cpu,energy,mem,node,gres/gpu:tesla1
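
One thing to keep in mind (please double-check me on this): as far as I recall, a change to AccountingStorageTRES only takes effect after slurmctld has been restarted, e.g. on a systemd-managed host:

$ systemctl restart slurmctld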



I submit a job that requests one 'tesla1' GPU, and you can see that sacct reports that the GPU was allocated:

$ sbatch -N1 --gres=gpu:tesla1:1 --wrap='srun sleep 5'
Submitted batch job 29745

$ sacct -j 29745 --format=jobid,account,alloctres%45
       JobID    Account                                     AllocTRES 
------------ ---------- --------------------------------------------- 
29745              sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
29745.batch        sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
29745.extern       sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
29745.0            sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
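
To fold this into the per-job report you described, something like the following should work (untested on my side, and adjust the -S start time as needed), with Elapsed giving the walltime:

$ sacct -a -X -p -S 2021-07-01 -o JobID,User,AllocNodes,AllocCPUS,Elapsed,AllocTRES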


Let me know if you have any questions about this or if you're seeing different results.

Thanks,
Ben
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2021-07-14 09:26:05 MDT
Hi Ben,

(In reply to Ben Roberts from comment #1)
> You can have the TRES information recorded in the database.  It shows up in
> the AllocTRES field, but the controller does have to be configured to record
> that information.  You use the AccountingStorageTRES parameter to define
> which TRES you would like to track in the database.
> 
> Here is an example where I have the following in my slurm.conf:
> AccountingStorageTRES=billing,cpu,energy,mem,node,gres/gpu:tesla1
> 
> 
> 
> I submit a job that requests one 'tesla1' GPU, and you can see that sacct
> reports that the GPU was allocated:
> 
> $ sbatch -N1 --gres=gpu:tesla1:1 --wrap='srun sleep 5'
> Submitted batch job 29745
> 
> $ sacct -j 29745 --format=jobid,account,alloctres%45
>        JobID    Account                                     AllocTRES 
> ------------ ---------- --------------------------------------------- 
> 29745              sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
> 29745.batch        sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
> 29745.extern       sub1 billing=2,cpu=1,gres/gpu:tesla1=1,mem=100M,n+ 
> 29745.0            sub1       cpu=1,gres/gpu:tesla1=1,mem=100M,node=1 
> 
> 
> Let me know if you have any questions about this or if you're seeing
> different results.

Thanks for the info! I configured this in slurm.conf:

AccountingStorageTRES=gres/gpu,gres/gpu:K20Xm,gres/gpu:RTX3090

and restarted slurmctld.  Now we do get GPU accounting as desired:

$ sacct -j 3829403 -p --format=jobid,account,alloctres
JobID|Account|AllocTRES|
3829403|ecsstud|billing=28,cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
3829403.batch|ecsstud|cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
3829403.extern|ecsstud|billing=28,cpu=14,gres/gpu:rtx3090=1,gres/gpu=1,mem=30G,node=1|
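
For the management report I'll probably just pull the total GPU count out of AllocTRES, along these lines (untested):

$ sacct -a -X -n -p -o JobID,User,AllocNodes,AllocCPUS,Elapsed,AllocTRES \
    | awk -F'|' '{ n=0; if (match($6, "gres/gpu=[0-9]+")) n=substr($6, RSTART+9, RLENGTH-9); print $1, $2, $3, $4, $5, n }'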

I guess you may close this case now.

Thanks,
Ole
Comment 3 Ben Roberts 2021-07-14 10:18:20 MDT
I'm glad that will get you the information you need going forward, but I'm sorry that you don't have the historical data you were looking for.  I'll close this ticket, but let us know if there is anything else we can do to help.

Thanks,
Ben