Created attachment 3933 [details]
slurm.conf

This relates to bug 1225. In it I see:

    sacctmgr modify QOS foo tres=gpu/tesla set MaxTRESPerUser=10

and a mention of Slurm version 15.08. I'm running 16.05.5-1, and that command fails:

    # sacctmgr modify QOS kamiak_test tres=gpu/tesla set MaxTRESPerUser=1
     Unknown condition: tres=gpu/tesla
     Use keyword 'set' to modify SLURM_PRINT_VALUE
    sacctmgr: error: slurmdb_format_tres_str: no value found

I'm already using MaxTRESPU to limit CPU and memory, and I'd now like to limit GPUs as well. Is there a way to do so in 16.05?
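For context, the value of a limit like MaxTRESPU/MaxTRESPerUser is a comma-separated list of name=value pairs. As a rough illustration only (parse_tres is a hypothetical helper, not part of Slurm), such a string breaks down like this:

```python
# Illustrative sketch: split a TRES limit string such as the one
# used with MaxTRESPerUser into its name=value parts.
# parse_tres is a hypothetical helper, not Slurm code.
def parse_tres(tres_str):
    """Split e.g. 'cpu=16,mem=64000,gres/gpu=2' into a dict."""
    limits = {}
    for pair in tres_str.split(","):
        name, _, value = pair.partition("=")
        limits[name.strip()] = value.strip()
    return limits

print(parse_tres("cpu=16,mem=64000,gres/gpu/tesla=1"))
# → {'cpu': '16', 'mem': '64000', 'gres/gpu/tesla': '1'}
```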
Created attachment 3934 [details]
gres.conf
You'll need to modify your slurm.conf file to inform slurmdbd that you have those TRES types. You'd want to add something like:

    AccountingStorageTRES=gres/gpu,gres/gpu/tesla

to slurm.conf before you can set that QOS limit. You'd need to restart slurmctld after this, and then run 'scontrol reconfigure' to update all the nodes as well. Once done, you can set a limit like so:

    sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu/tesla=10
I misspoke in part: the syntax is subtly different, and there are some limitations in how the current implementation works. Below is a simplified version that avoids some unexpected side effects.

I'd recommend setting:

    AccountingStorageTRES=gres/gpu

in slurm.conf, then restarting slurmctld. (The restart is required to push the update to slurmdbd.) Running 'scontrol reconfigure' after that will prevent a lot of warning messages about the configuration file being out of sync with the compute nodes, although that's not a big deal here; they aren't affected by this change.

After that's done, you can set your limit with:

    sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=10

Users running under that QOS will then be limited to 10 devices, whether requested as --gres=gpu:10, as --gres=gpu:tesla:10, or through a combination of jobs.

- Tim
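Collected in one place, the steps above look roughly like this (the QOS name 'foo', the limit of 10, and the systemctl restart command are placeholders/assumptions for your site):

```
# slurm.conf — declare the TRES type so slurmdbd tracks it:
AccountingStorageTRES=gres/gpu

# Then, on the controller (restart method depends on your init system):
#   systemctl restart slurmctld      # pushes the update to slurmdbd
#   scontrol reconfigure             # quiets out-of-sync warnings on nodes

# Finally, set the per-user limit on the QOS:
#   sacctmgr modify QOS foo set MaxTRESPerUser=gres/gpu=10
```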
The settings mentioned in comment 3 appear to do what I need: I can now add GPUs to MaxTRESPerUser.