Bug 5819 - CPU allocation is not fully cyclic when MaxCPUsPerNode is set on partition
Summary: CPU allocation is not fully cyclic when MaxCPUsPerNode is set on partition
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.0
Hardware: Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-10-06 21:22 MDT by CUI Hao
Modified: 2018-10-08 09:07 MDT
CC List: 0 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description CUI Hao 2018-10-06 21:22:46 MDT
According to https://slurm.schedmd.com/cpu_management.html:

  The default allocation method within a node is cyclic allocation (allocate available CPUs in a round-robin fashion across the sockets within a node).

I think that cyclic allocation is useful when a cluster wants to reserve cores for GPU devices. For example, if a node has two GPUs bound to two different sockets and I set MaxCPUsPerNode on the CPU-only partition to reserve at least two cores for the GPU partition, cyclic CPU allocation should ensure that CPU jobs take cores alternately from each socket, leaving cores on both sockets for the GPU devices (a sketch of the expected placement follows below).
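
A minimal sketch of that expectation (not Slurm source; it just models round-robin placement of a 20-CPU job on a 2-socket, 12-cores-per-socket node like the one in my config):

  /*
   * Sketch only: cyclic allocation is expected to take cores
   * alternately from each socket, so both sockets keep free cores
   * for the GPU partition.
   */
  #include <stdio.h>

  int main(void)
  {
      const int sockets = 2, cores_per_socket = 12, ncpus = 20;
      int used[2] = {0, 0};

      for (int i = 0, s = 0; i < ncpus; i++, s = (s + 1) % sockets) {
          while (used[s] >= cores_per_socket)   /* skip a full socket */
              s = (s + 1) % sockets;
          used[s]++;
      }

      /* Expected: socket0=10 socket1=10, i.e. 2 cores free per socket */
      printf("socket0=%d socket1=%d\n", used[0], used[1]);
      return 0;
  }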

However, I didn't get the expected behaviour described above. Our cluster has a partition/node config like this:

  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  NodeName=wmc-slave-g3 CPUs=24 RealMemory=63500 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:2
  PartitionName=cpu1 Nodes=wmc-slave-g3 MaxCPUsPerNode=20
  PartitionName=gpu2 Nodes=wmc-slave-g3 MaxCPUsPerNode=2

If I run `srun -n 20 -p cpu1`, I notice that Slurm allocates 8 cores from the first socket and 12 cores from the second socket. This is not optimal, as the second socket has no cores left for the GPU queue!

I dug into the select/cons_res source code and found that the part starting from the following lines probably causes the problem (the code also exists before 18.08.1):
  /*
   * Ignore resources that would push a job allocation over the
   * partition CPU limit (if any). Remove cores from consideration by
   * taking them from the sockets with the lowest free_cores count.
   * This will tend to satisfy a job's --cores-per-socket specification.
   */

So select/cons_res removes free cores (those in excess of the MaxCPUsPerNode limit) from "the sockets with the lowest free_cores count". With my config, node wmc-slave-g3 has 24 cores, 4 of which are in excess for partition cpu1. If I submit a job when the node is free, select/cons_res marks 4 cores of the first socket as unusable. This breaks the default cyclic allocation behaviour whenever the job requires more than 16 cores.
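
To illustrate, here is a minimal sketch (again not Slurm source; it only reproduces the arithmetic described above, and assumes ties go to the first socket) of how removing the 4 excess cores from the socket with the lowest free_cores count produces the 8/12 split I observed:

  /*
   * Sketch only: on an idle 2x12 node with MaxCPUsPerNode=20, the 4
   * excess cores are all taken from socket 0, leaving 8 usable cores
   * there and 12 on socket 1, so a 20-CPU job fills 8 + 12.
   */
  #include <stdio.h>

  int main(void)
  {
      int free_cores[2] = {12, 12};        /* idle node: all cores free */
      const int max_cpus_per_node = 20;
      int excess = free_cores[0] + free_cores[1] - max_cpus_per_node;   /* 4 */

      while (excess-- > 0) {
          /* Remove a core from the socket with the fewest free cores;
           * on a tie the first socket is picked. */
          int s = (free_cores[1] < free_cores[0]) ? 1 : 0;
          free_cores[s]--;
      }

      /* Prints: usable socket0=8 socket1=12 */
      printf("usable socket0=%d socket1=%d\n", free_cores[0], free_cores[1]);
      return 0;
  }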
Comment 1 Jacob Jenson 2018-10-08 09:07:24 MDT
CUI Hao,

The support system was not able to associate your email address with a support contract. We need to know which site you work for before we can route this ticket. 

Can you please tell me which site you work for so I can appropriately route this ticket? 

Jacob