Ticket 3397

Summary: MaxTRESPU for GRES
Product: Slurm
Reporter: Jeff White <jeff.white>
Component: Accounting
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 16.05.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4767
Site: Washington State University
Attachments: slurm.conf
gres.conf

Description Jeff White 2017-01-12 17:30:56 MST
Created attachment 3933 [details]
slurm.conf

This relates to bug ID 1225.  In it I see:

sacctmgr modify QOS foo tres=gpu/tesla set MaxTRESPerUser=10

... and a mention of Slurm version 15.08.  I'm running 16.05.5-1 and that command fails:

# sacctmgr modify QOS kamiak_test tres=gpu/tesla set MaxTRESPerUser=1
 Unknown condition: tres=gpu/tesla
 Use keyword 'set' to modify SLURM_PRINT_VALUE
sacctmgr: error: slurmdb_format_tres_str: no value found

I'm already using MaxTRESPU to limit CPU and memory.  I'm now trying to limit GPUs as well.  Is there a way to do so in 16.05?
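
For illustration, a per-user QOS limit on CPU and memory of that kind is typically set with a command along these lines (the values here are hypothetical, not taken from this site's configuration; mem is specified in megabytes):

sacctmgr modify QOS kamiak_test set MaxTRESPerUser=cpu=128,mem=512000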
Comment 1 Jeff White 2017-01-12 17:31:50 MST
Created attachment 3934 [details]
gres.conf
Comment 2 Tim Wickberg 2017-01-12 17:53:44 MST
You'll need to modify your slurm.conf file to inform the slurmdbd that you have those TRES types. You'd want to add something like

AccountingStorageTRES=gres/gpu,gres/gpu/tesla

to slurm.conf before you can set that QOS limit. You'd need to restart slurmctld after this, and then run 'scontrol reconfigure' to update all the nodes as well.

Once done, you can set a limit like so:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu/tesla=10
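
For context, the relevant accounting lines in slurm.conf would then look something like the sketch below; the first two options are illustrative (they should already exist at any site running slurmdbd) and "dbd-host" is a placeholder hostname:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host
AccountingStorageTRES=gres/gpu,gres/gpu/tesla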
Comment 3 Tim Wickberg 2017-01-12 18:42:21 MST
I misspoke in part; the syntax is subtly different, and there are some limitations in how the current implementation works. Below I'm outlining a simplified version that avoids some unexpected side effects.

I'd recommend setting:

AccountingStorageTRES=gres/gpu

in slurm.conf, then restarting slurmctld. (The restart is required to push the update to slurmdbd.)

Running 'scontrol reconfigure' after that will prevent a lot of warning messages about the configuration file being out of sync with the compute nodes, although that's not a big deal here since the nodes aren't affected by this change.
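
As a rough sketch of that sequence on a systemd-based install (the use of systemctl is an assumption; adjust for however slurmctld is managed in your environment):

systemctl restart slurmctld
scontrol reconfigure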

After that's done, you can set your limit with:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=10

Users running under that QOS will then be limited to 10 devices (either requested as --gres=gpu:10 or --gres=gpu:tesla:10, or through a combination of jobs).
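
To confirm the limit is in place, a query along these lines should show it on the QOS (the %40 just widens the output column; "foo" is the example QOS name used above):

sacctmgr show qos foo format=Name,MaxTRESPU%40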

- Tim
Comment 4 Jeff White 2017-01-13 10:52:19 MST
The settings mentioned in comment 3 appear to do what I need: I can now add GPUs to MaxTRESPerUser.