Ticket 3505 - Jobs get scheduled on too many nodes, whereas they should be run on as few nodes as possible
Summary: Jobs get scheduled on too many nodes, whereas they should be run on as few nodes as possible
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 16.05.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-02-27 05:35 MST by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2018-06-07 02:49 MDT
CC List: 1 user

See Also:
Site: DTU Physics
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
our slurm.conf file (2.14 KB, text/plain)
2017-02-27 05:35 MST, Ole.H.Nielsen@fysik.dtu.dk

Description Ole.H.Nielsen@fysik.dtu.dk 2017-02-27 05:35:54 MST
Created attachment 4107
our slurm.conf file

We have 179 24-core nodes, and it seems that Slurm schedules multi-CPU jobs on too many nodes instead of keeping the jobs compact on as few nodes as possible.

Examples are:
sbatch -t 03:00:00 -n 48 --partition=xeon24 submit_sylg.py mBEEF_Ama2_NS.py 
  Gets scheduled on three 24-core nodes instead of two nodes.
sbatch -n2 -J DCDFTv2b -p xeon24 -t 720 script.sh 
  Gets scheduled on two 24-core nodes instead of one node.

Question: How can we make sure that jobs are scheduled to the minimum number of nodes required by the -n parameter?

FYI: We use PriorityType=priority/multifactor in slurm.conf (file is attached).
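
For reference (not part of the original report), one way to constrain a job to a given node count is to add -N/--nodes alongside -n; a single -N value sets both the minimum and maximum node count. For example, a variant of the first submission above that forces exactly two 24-core nodes:

$ sbatch -t 03:00:00 -n 48 -N 2 --partition=xeon24 submit_sylg.py mBEEF_Ama2_NS.py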
Comment 3 Alejandro Sanchez 2017-02-28 08:45:04 MST
Hi,

I think appending 'CR_LLN' to SelectTypeParameters or setting 'LLN' on a specific partition will give you what you want. In short, this will schedule resources to jobs on the least loaded nodes, but it can increase fragmentation. Please let me know if this fits your node-minimization goal.
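
For illustration, a minimal sketch of the two ways to enable this (the CR_Core_Memory value below is an assumption; the actual consumable-resource settings are in the attached slurm.conf, and <nodelist> is a placeholder):

SelectTypeParameters=CR_Core_Memory,CR_LLN

or, to limit it to a single partition:

PartitionName=xeon24 Nodes=<nodelist> LLN=YES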
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2017-02-28 09:10:29 MST
(In reply to Alejandro Sanchez from comment #3)
> I think appending 'CR_LLN' to SelectTypeParameters or setting 'LLN' to a
> specific partition will give you what you want. In short, this will schedule
> resources to jobs on the least loaded nodes but increase fragmentation.
> Please, let me know if this fits your node minimization goal.

We're trying to achieve the opposite of CR_LLN, which would apparently increase fragmentation.  CR_LLN is defined in https://slurm.schedmd.com/slurm.conf.html, and it doesn't appear to pack jobs compactly onto as few nodes as possible, but instead spreads jobs onto more nodes than the necessary minimum.

I think what we're looking for is similar to the Moab/Maui scheduler's 
  NODEALLOCATIONPOLICY MINRESOURCE
defined as: "This algorithm prioritizes nodes according to the configured resources on each node. Those nodes with the fewest configured resources which still meet the job's resource constraints are selected." (see http://docs.adaptivecomputing.com/maui/5.2nodeallocation.php#MINRESOURCE)

This sounds to me like a common goal suitable for most HPC sites.
Comment 5 Alejandro Sanchez 2017-02-28 09:21:54 MST
Let's see if I understand properly what you want. You have this partition with 179 24-core nodes. Let's imagine that 2 of these nodes are half-allocated, so that each of them has 12 CPUs allocated, and the rest of the nodes are idle. Now you submit a new job to this partition with -n24. What you want is for these 24 tasks to fill up those first two nodes, so that the overall number of nodes in use across the cluster is minimized, right? Not that the number of nodes used for the job is minimized (that would be LLN). Am I correct about your goal now?
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2017-02-28 09:28:21 MST
(In reply to Alejandro Sanchez from comment #5)
> Let's see if I understand what you want properly. You have this partition
> with 179 24-core nodes. Let's imagine that 2 of these nodes are
> half-allocated, so that 1 node has 12 CPUs allocated and another node has 12
> CPUs allocated too, and the rest of the nodes are idle. Now you submit a new
> job to this partition with -n24, what you want is that these 24 tasks fill
> up these first two nodes so that the overall node utilization in the cluster
> is minimized, right? Not that the number of nodes using for the job is
> minimized (that would be LLN). Am I correct with your goal now?

I don't think so.  What we want is for jobs to be scheduled to the minimum number of nodes possible, so that we avoid fragmentation of the cluster.  So a new job with -n24 should be scheduled to exactly 1 node, not 2 or 3 (assuming there exists at least 1 node with all 24 cores free).  This is not what we have experienced in the two examples given above, where jobs got scheduled to multiple nodes unnecessarily.
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2017-02-28 09:34:05 MST
(In reply to Alejandro Sanchez from comment #5)
> Let's see if I understand what you want properly. You have this partition
> with 179 24-core nodes. Let's imagine that 2 of these nodes are
> half-allocated, so that 1 node has 12 CPUs allocated and another node has 12
> CPUs allocated too, and the rest of the nodes are idle. Now you submit a new
> job to this partition with -n24, what you want is that these 24 tasks fill
> up these first two nodes so that the overall node utilization in the cluster
> is minimized, right? Not that the number of nodes using for the job is
> minimized (that would be LLN). Am I correct with your goal now?

Also, LLN seems to cause fragmentation as described in the slurm.conf page:

CR_LLN
    Schedule resources to jobs on the least loaded nodes (based upon the number of idle CPUs). This is generally only recommended for an environment with serial jobs as idle resources will tend to be highly fragmented, resulting in parallel jobs being distributed across many nodes.
Comment 8 Alejandro Sanchez 2017-02-28 10:59:09 MST
OK, I'm going to ask you for some information so I can better analyze what's going on, for example with this submission:

"sbatch -n2 -J DCDFTv2b -p xeon24 -t 720 script.sh 
  Gets scheduled on 2 24-core nodes in stead of 1 node."

$ scontrol setdebugflags +steps
$ sinfo -p xeon24 -N -O "nodelist,statecompact,cpusstate,memory,allocmem,weight"
$ squeue -p xeon24
$ sbatch -n2 -J DCDFTv2b -p xeon24 -t 720 script.sh
$ sinfo -p xeon24 -N -O "nodelist,statecompact,cpusstate,memory,allocmem,weight"
$ squeue -p xeon24
$ scontrol show job <jobid>
$ scontrol setdebugflags -steps

Could you also send script.sh and the part of slurmctld.log related to the job submission (the Steps debug-flag messages)?

I want to see the initial node state right before the job runs, and the state once the job is allocated. If you think the node state is going to change between submission and job start, let me know. The goal is to analyze how the job is allocated resources and to see whether this is the expected behavior with your config.

Independently of that, the "right" request for jobs that need a specific node count is to ask for that node count explicitly; otherwise the scheduler may spread them out.
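
For example (illustrative only), adding -N1 to the submission quoted above pins the two tasks to a single node:

$ sbatch -n2 -N1 -J DCDFTv2b -p xeon24 -t 720 script.sh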
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2017-03-06 07:51:14 MST
(In reply to Alejandro Sanchez from comment #8)
> Ok, I'm gonna ask you some info to better analyze what's going on, for
> example with this submission:
> 
> "sbatch -n2 -J DCDFTv2b -p xeon24 -t 720 script.sh 
>   Gets scheduled on 2 24-core nodes in stead of 1 node."
> 
> $ scontrol setdebugflags +steps
> $ sinfo -p xeon24 -N -O
> "nodelist,statecompact,cpusstate,memory,allocmem,weight"
> $ squeue -p xeon24
> $ sbatch -n2 -J DCDFTv2b -p xeon24 -t 720 script.sh
> $ sinfo -p xeon24 -N -O
> "nodelist,statecompact,cpusstate,memory,allocmem,weight"
> $ squeue -p xeon24
> $ scontrol show job <jobid>
> $ scontrol setdebugflags -steps
> 
> And also the script.sh and the slurmctld.log part related to the job
> submission Step debug flag messages?
> 
> I want to see the initial nodes state right before the job runs, and what's
> the state once it is allocated. If you think the node state is gonna change
> between the submission and the job start, let me know. The thing is to
> analyze how the job is allocated resources and see if this is expected
> behavior with your config.
> 
> Independently of that, the "right" request for jobs that want a specific
> node count is to ask for that node count, otherwise the scheduler may spread
> them out.

We're not currently seeing the job fragmentation issue on our cluster. This may be due to a changing load pattern in the queue. I suggest that we close this case for now. If the problem recurs, I'll follow your procedure for obtaining debugging data.
Comment 16 Alejandro Sanchez 2017-03-07 10:16:15 MST
Ole,

The logic that is picking the resources for the job is in the _eval_nodes() function in src/plugins/select/cons_res/job_test.c.

After the available nodes and cores are identified, the _eval_nodes() function does not prefer idle nodes; it sequentially walks through the lowest-weight nodes and accumulates cores. Here are some possible solutions:

1. The node count can be specified at job submit time or using a job_submit plugin. This requires more work on the part of the user and could prevent the job from running if idle nodes are not available (i.e. depending upon the node count, idle nodes could be required rather than preferred).

2. The job can be submitted with the --exclusive option to get dedicated nodes. This is probably simpler than specifying a node count, but also could prevent the job from running if idle nodes are not available.

3. Add a new job option that does the opposite of "--spread-job". If set, it would use the same node-selection logic as CR_LLN, but only for jobs that specifically request that option. If we were to add that option, it could be set using a job_submit plugin for jobs requesting larger task counts, and the user would not need to change behavior.

If we add logic for option 3, that would probably only go into Slurm version 17.11 (a patch might be provided for version 17.02) and it would probably not work with the topology/tree plugin.

Please let me know if either option 1 or option 2 suffices for you. Otherwise, if we go for option 3, I'll go ahead and reclassify this as a sev-5.
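
As a sketch of what options 1 and 2 could look like for the 48-task example from the original report (the -N value assumes 24-core nodes; script.sh stands in for the user's own job script, and whether the job starts still depends on idle nodes being available, as noted above):

$ sbatch -p xeon24 -n 48 -N 2 script.sh         # option 1: explicit node count
$ sbatch -p xeon24 -n 48 --exclusive script.sh  # option 2: whole nodes dedicated to the job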
Comment 19 Christopher Samuel 2018-06-07 02:26:46 MDT
Hi Alejandro,

This is a situation we're currently struggling with at Swinburne too; we would like nodes for jobs to be picked by the least available resource that still fits the job's requirements. That way single-core jobs will fill up spare cores on otherwise fragmented nodes, (hopefully) leaving entire nodes free for larger parallel jobs.

In the interim we are looking to corral single-node jobs into a partition that only has access to a subset of nodes, and to require non-GPU jobs that use more than one 32-core node to request a multiple of the permitted 32 cores per node (all done via the submit filter).

I'm not sure of the best way to pursue that with you folks, though: would you prefer to reopen this ticket to keep the context, or for me to create a new one?

All the best,
Chris
Comment 20 Alejandro Sanchez 2018-06-07 02:49:53 MDT
I'd create a new one.