Bug 3397 - MaxTRESPU for GRES
Summary: MaxTRESPU for GRES
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 16.05.5
Hardware: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-01-12 17:30 MST by Jeff White
Modified: 2018-02-13 05:26 MST

See Also:
Site: Washington State University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (13.49 KB, text/plain)
2017-01-12 17:30 MST, Jeff White
gres.conf (152 bytes, text/plain)
2017-01-12 17:31 MST, Jeff White

Description Jeff White 2017-01-12 17:30:56 MST
Created attachment 3933 [details]
slurm.conf

This relates to bug ID 1225.  In it I see:

sacctmgr modify QOS foo tres=gpu/tesla set MaxTRESPerUser=10

... and a mention of Slurm version 15.08.  I'm running 16.05.5-1 and that command fails:

# sacctmgr modify QOS kamiak_test tres=gpu/tesla set MaxTRESPerUser=1
 Unknown condition: tres=gpu/tesla
 Use keyword 'set' to modify SLURM_PRINT_VALUE
sacctmgr: error: slurmdb_format_tres_str: no value found

I'm already using MaxTRESPU to limit CPU and memory.  I'm trying to now limit GPUs as well.  Is there a way to do so in 16.05?
Comment 1 Jeff White 2017-01-12 17:31:50 MST
Created attachment 3934 [details]
gres.conf
Comment 2 Tim Wickberg 2017-01-12 17:53:44 MST
You'll need to modify your slurm.conf file to inform the slurmdbd that you have those TRES types. You'd want to add something like

AccountingStorageTRES=gres/gpu,gres/gpu/tesla

to slurm.conf before you can set that QOS limit. You'd need to restart slurmctld after this, and then run 'scontrol reconfigure' to update all the nodes as well.

Once done, you can set a limit like so:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu/tesla=10
Comment 3 Tim Wickberg 2017-01-12 18:42:21 MST
I misspoke in part; the syntax is subtly different, and there are some limitations in how the current implementation works. Below I'm outlining a simplified version that avoids some unexpected side effects.

I'd recommend setting:

AccountingStorageTRES=gres/gpu

in slurm.conf, then restarting slurmctld. (The restart is required to push the update to slurmdbd.)

Running 'scontrol reconfigure' after that will prevent a lot of warning messages about the configuration file being out of sync with the compute nodes, although that's not a big deal here; they aren't affected by that change.

After that's done, you can set your limit with:

sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=10

Users running under that QOS will then be limited to 10 devices (either requested as --gres=gpu:10 or --gres=gpu:tesla:10, or through a combination of jobs).
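Putting those steps together, a minimal sketch of the full sequence (the QOS name "foo" and the limit of 10 are just the example values from above; adapt them to your site, and note the restart/reconfigure commands assume root on the controller host):

```
# slurm.conf -- add the GRES type to the TRES that accounting tracks:
AccountingStorageTRES=gres/gpu

# Restart slurmctld so the new TRES definition reaches slurmdbd:
systemctl restart slurmctld

# Push the updated config out to the compute nodes (avoids sync warnings):
scontrol reconfigure

# Set the per-user GPU limit on the QOS:
sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=10

# Verify the limit took effect:
sacctmgr show QOS foo format=Name,MaxTRESPU
```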

- Tim
Comment 4 Jeff White 2017-01-13 10:52:19 MST
The settings mentioned in comment 3 appear to do what I need to be able to add GPUs to MaxTRESPerUser.