Summary: | Configuration MaxNodesPerUser in QOS | | |
---|---|---|---|
Product: | Slurm | Reporter: | Damien <damien.leong> |
Component: | Configuration | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | | |
Priority: | --- | | |
Version: | 16.05.4 | | |
Hardware: | Linux | | |
OS: | Linux | | |
Site: | Monash University | | |
Description
Damien
2017-10-30 09:33:23 MDT
(In reply to Damien from comment #0)

> Hi Slurm Support
>
> We are trying to configure 'MaxNodesPerUser' in QOS, but have some strange
> results, see this:
>
> ----
> [root@m3-login2 ~]# sacctmgr modify QOS name=m3h set MaxNodesPU=4
>  Modified qos...
>   m3h
> Would you like to commit changes? (You have 30 seconds to decide)
> (N/y): y
> [root@m3-login2 ~]#
> [root@m3-login2 ~]# sacctmgr show qos m3h
>       Name   Priority  GraceTime PreemptMode UsageFactor  MaxTRESPU
> ---------- ---------- ---------- ----------- ----------- ----------
>        m3h          0   00:00:00     cluster    1.000000     node=4
> (all other QOS columns are empty and omitted here for readability)
>
> [smaruf@m3-login2 ~]$ squeue -u smaruf
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 1153281       m3h run_docm   smaruf  R 3-13:54:28      1 m3h001
> 1154011       m3h run_docm   smaruf  R 2-18:47:58      1 m3h006
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157417
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157418
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157419
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157420
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157421
> [smaruf@m3-login2 ~]$ sbatch slurm-serial-job-script
> Submitted batch job 1157422
>
> [smaruf@m3-login2 ~]$ squeue -u smaruf
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 1157419       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1157420       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1157421       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1157422       m3h slurm-se   smaruf PD       0:00      1 (QOSMaxNodePerUserLimit)
> 1153281       m3h run_docm   smaruf  R 3-13:55:01      1 m3h001
> 1154011       m3h run_docm   smaruf  R 2-18:48:31      1 m3h006
> 1157418       m3h slurm-se   smaruf  R       0:15      1 m3h001
> 1157417       m3h slurm-se   smaruf  R       0:18      1 m3h001
> [smaruf@m3-login2 ~]$
> ----
>
> These are all 1-CPU-core jobs, and m3h001 has 24 CPU cores. In theory we
> should be able to squeeze more jobs onto one m3h node, but the remaining
> jobs are hit with 'QOSMaxNodePerUserLimit'. The intention of the 'sacctmgr'
> command above was to stop one user from utilising more than 4 nodes in this
> partition, but it seems to translate into allowing only 4 jobs per user in
> the m3h partition.
>
> We have an older cluster running Slurm 14.08; when we run the same sacctmgr
> command there, it does prevent one user from utilising more than 4 nodes in
> the selected partition.
>
> Is our command syntax correct, or is our interpretation of the Slurm
> documentation wrong?

A job running on a node always counts as a separate node; the way this is
designed, it's not able to group multiple jobs running on the same node
together into a single "node" resource. So as soon as you have jobs running
across four nodes, no further jobs will launch, even if they could fit in
alongside those other jobs.

Our usual recommendation is to structure the QOS limits around CPUs to avoid
this complication.

- Tim

Hi Tim,

Thanks for your reply.

1) If we implement this with CPU limits instead, for example 4 nodes (24
cores each) becoming a 96-CPU limit on the selected partition, a user could
still have single-core jobs that ask for high memory, which together with
other single-core jobs from the same user could still end up taking more
than 4 nodes. Is there a method to prevent this? We are trying to limit
users from taking up too many of these premium nodes in this partition.
2) On a related note, each of these nodes has 2x P100 GPU cards, see:

--
cat gres.conf
#slurm gres file for m3h001
#No Of Devices=2
Name=gpu Type=P100-PCIE-16GB File=/dev/nvidia0 CPUs=0-27
Name=gpu Type=P100-PCIE-16GB File=/dev/nvidia1 CPUs=0-27
--

Can we use a QOS to limit users to 4x GPUs each? If possible, how do we set
up this QOS with sacctmgr, and with which parameters or syntax?

Kindly advise. Thanks.

Cheers,
Damien

(In reply to Tim Wickberg from comment #1)

[earlier quoting of comments #0 and #1 trimmed]

> 1) If we implement this with CPU limits instead, for example 4 nodes (24
> cores each) becoming a 96-CPU limit on the selected partition, a user could
> still have single-core jobs that ask for high memory, which together with
> other single-core jobs from the same user could still end up taking more
> than 4 nodes. Is there a method to prevent this?
>
> We are trying to limit users from taking up too many of these premium
> nodes in this partition.

You'd want to look into limits around GrpTRES, and to set them on mem and/or
cpu values. So you could limit them to 40 CPUs and 300GB of memory total
with something like:

sacctmgr update user tim set grptres=cpu=40,mem=300gb

> 2) On a related note, each of these nodes has 2x P100 GPU cards.
>
> Can we use a QOS to limit users to 4x GPUs each? If possible, how do we
> set up this QOS with sacctmgr, and with which parameters or syntax?

You could either use a QOS, or set the limit on the user directly. You'd
want to do a few things:

1) Make sure that you have the gpu type defined in AccountingStorageTRES in
slurm.conf, e.g.:

AccountingStorageTRES=gres/gpu

(Restart slurmctld after making any change to that line.)

2) Use either MaxTRES (to limit what a single job can do) or GrpTRES (to
limit what the collection of jobs can do, either for an individual
user/account or in the QOS) to limit the gres/gpu type. Something like:

sacctmgr update qos normal set grptres=gres/gpu=4

on an appropriate QOS would handle what I believe you're after.

Thanks for these details. We will give this a try and do more testing to
achieve our objectives.

Cheers,
Damien

Hey Damien -

Marking resolved/infogiven now; please reopen if you have any further
questions.

- Tim
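Editor's note: pulling the pieces of this thread together, the setup Tim describes could be sketched as the sequence of admin commands below. The user name `tim`, the QOS names, and the numeric limits are the illustrative values from the ticket, not site-specific recommendations; the `maxtrespu` line is an addition not spelled out above, based on the distinction that GrpTRES on a QOS caps all jobs in the QOS combined, while MaxTRESPerUser caps each user separately.

```shell
# Prerequisite in slurm.conf (restart slurmctld after changing it),
# so that GPU usage is tracked as a TRES at all:
#
#   AccountingStorageTRES=gres/gpu

# Aggregate CPU + memory cap on a single user's running jobs
# (example values from the ticket):
sacctmgr update user tim set grptres=cpu=40,mem=300gb

# GPU cap across all jobs in the QOS combined (Tim's example):
sacctmgr update qos normal set grptres=gres/gpu=4

# Alternatively, a per-user GPU cap within the QOS, which matches
# the "4x GPUs per user" phrasing of the original question:
sacctmgr modify qos m3h set maxtrespu=gres/gpu=4

# Verify the limits took effect:
sacctmgr show qos m3h format=Name,MaxTRESPU,GrpTRES
sacctmgr show assoc user=tim format=User,GrpTRES
```

These commands only take effect for jobs submitted after the limits are in place; already-running jobs are not requeued.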