Ticket 1827 - Scheduler Slowness
Summary: Scheduler Slowness
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.7
Hardware: Linux
Importance: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-07-25 02:49 MDT by Paul Edmon
Modified: 2015-07-28 08:19 MDT
CC: 2 users

See Also:
Site: Harvard University


Description Paul Edmon 2015-07-25 02:49:41 MDT
Yesterday afternoon we noticed our scheduler grinding to a crawl, and it is spitting out messages like:

messages:Jul 25 10:35:05 holy-slurm01 slurmctld[26696]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29127 jobs tested
messages:Jul 25 10:35:59 holy-slurm01 slurmctld[26696]: build_job_queue has been running for 2000000 usec, exiting with 17100 of 29102 jobs tested
messages:Jul 25 10:36:01 holy-slurm01 slurmctld[26696]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29102 jobs tested
messages:Jul 25 10:36:49 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17600 of 29086 jobs tested
messages:Jul 25 10:36:58 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29088 jobs tested
messages:Jul 25 10:37:30 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29082 jobs tested
messages:Jul 25 10:37:58 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29091 jobs tested
messages:Jul 25 10:38:00 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29091 jobs tested
messages:Jul 25 10:38:02 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29091 jobs tested
messages:Jul 25 10:38:04 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17300 of 29091 jobs tested
messages:Jul 25 10:38:06 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29091 jobs tested
messages:Jul 25 10:38:28 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17600 of 29089 jobs tested
messages:Jul 25 10:38:39 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29089 jobs tested
messages:Jul 25 10:38:46 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29089 jobs tested
messages:Jul 25 10:38:48 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29089 jobs tested
messages:Jul 25 10:38:53 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17300 of 29089 jobs tested
messages:Jul 25 10:39:10 holy-slurm01 slurmctld[32993]: build_job_queue has been running for 2000000 usec, exiting with 17200 of 29090 jobs tested
messages:Jul 25 10:39:53 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17200 of 29098 jobs tested
messages:Jul 25 10:39:55 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17300 of 29098 jobs tested
messages:Jul 25 10:39:58 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17200 of 29098 jobs tested
messages:Jul 25 10:40:00 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29099 jobs tested
messages:Jul 25 10:40:05 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29099 jobs tested
messages:Jul 25 10:40:32 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29102 jobs tested
messages:Jul 25 10:40:37 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29098 jobs tested
messages:Jul 25 10:40:46 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29103 jobs tested
messages:Jul 25 10:40:49 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29103 jobs tested
messages:Jul 25 10:40:51 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29103 jobs tested
messages:Jul 25 10:40:54 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29103 jobs tested
messages:Jul 25 10:40:58 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29103 jobs tested
messages:Jul 25 10:41:00 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17400 of 29103 jobs tested
messages:Jul 25 10:41:02 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29103 jobs tested
messages:Jul 25 10:41:04 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17600 of 29103 jobs tested
messages:Jul 25 10:41:07 holy-slurm01 slurmctld[37318]: build_job_queue has been running for 2000000 usec, exiting with 17500 of 29103 jobs tested

(This isn't the full log; I've just grepped out the build_job_queue messages.)

I've never seen this before in our environment. I thought that maybe we had too many jobs in the queue, but that doesn't appear to be the case, as it seems to happen regardless of how many jobs there are. So clearly it isn't the job count; something changed, or something is causing build_job_queue to take forever. I added max_sched_time=50 to our conf, which seems to have partially restored scheduling. Without it the main scheduling loop doesn't even run, only the backfill; with it, the main loop does run, albeit slowly.
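
For reference, that change amounts to a slurm.conf line along these lines (any other SchedulerParameters options we already set would be comma-separated on the same line; I've omitted them here):

    SchedulerParameters=max_sched_time=50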

I'm baffled as to what could have caused this change in behavior, as we have made no changes on our end to the scheduler version or parameters. The scheduler was happily scheduling jobs at full pace until the last day or so. At this point we have a bunch of jobs that could run but are stuck pending because the primary loop is taking too long.

Any thoughts or suggestions about why the main loop is taking so long? You can see that up until recently we were hovering around 1 second for the main loop:

http://portal.rc.fas.harvard.edu/slurmmon/?r=week&z=default

and we have suddenly jumped to pushing 2 seconds, if not longer. It seems to be around midday on the 24th that the scheduler suddenly started slowing down.

Can a user job cause this? I have a hard time thinking that a user job would make the main loop take so long, though.
Comment 1 Moe Jette 2015-07-25 03:02:25 MDT
Let me start by explaining what the message indicates.

At the start of Slurm's scheduling logic, it scans the list of runnable jobs and, for each job, makes an entry in the list for each partition the job is eligible to run in. Then quicksort is run on that list, and an attempt is made to schedule those jobs. The critical factor here is the total number of job-partition pairs.
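(For a sense of scale: roughly 29,000 runnable jobs averaging, say, four eligible partitions each would mean on the order of 116,000 job-partition entries to build and sort in every scheduling cycle.)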

What is the typical count of eligible partitions on your ~30k jobs?

If you can't lower that count, you can increase the timeout, but the overhead you have would be rather crippling. This is an excerpt from the slurm.conf man page:

SchedulerParameters
    The interpretation of this parameter varies by SchedulerType. Multiple options may be comma separated.

build_queue_timeout=#
    Defines the maximum time that can be devoted to building a queue of jobs to be tested for scheduling. If the system has a huge number of jobs with dependencies, just building the job queue can take so much time as to adversely impact overall system performance and this parameter can be adjusted as needed. The default value is 2,000,000 microseconds (2 seconds).
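
For example, raising that limit would look something like the following in slurm.conf (the value is only illustrative, and it would be combined with any other SchedulerParameters options you already set):

    SchedulerParameters=build_queue_timeout=4000000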
Comment 2 Paul Edmon 2015-07-25 03:21:03 MDT
Ok, that makes sense. So just to clarify: if a user does:

#SBATCH -p partition1,partition2,partition3,partition4

That will be significantly more of a burden than just:

#SBATCH -p partition1

We have many people who submit jobs to multiple partitions in order to maximize their throughput. However, from the above it appears that this also causes the scheduler to slow down significantly. We can have people who submit to up to 3-5 partitions at once, and thousands of jobs at that.

So if I'm understanding this properly, the only solutions are either:

A. Reduce total job count.
B. Stop submitting to multiple partitions.

I guess I had never considered the overhead of the number of partitions being submitted to before, but now that you point it out it makes sense. If you can confirm my thoughts above, I can look into what can be done, as I will have to talk with my group here about what the best option is. As of right now I'm not sure for our environment.

I did try the build_queue_timeout, which did allow it to get through the full queue, but as you noted all that does is exacerbate the slowness by giving it more time rather than actually solving the real issue.

-Paul Edmon-

Comment 3 Moe Jette 2015-07-25 03:28:10 MDT
(In reply to Paul Edmon from comment #2)
> Ok, that makes sense.  So just to clarify.  If a user does:
> 
> #SBATCH -p partition1,partition2,partition3,partition4
> 
> That will be significantly more burden than just:
> 
> #SBATCH -p partition1

Correct, and not just 4x worse, because sorting takes order n log(n).
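(For a rough sense of scale: sorting n entries costs on the order of n log n, so quadrupling the entries costs about 4n log(4n), which is somewhat more than four times as much.)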


> We have many people who will submit jobs to multiple partitions in order 
> to maximize their throughput.  However, from the above it appears that 
> this actually causes the scheduler to slow down significantly as well.  
> We can have people who submit to up to 3-5 partitions at once, and 
> thousands of jobs at that.
> 
> So if I'm understanding this properly the only solution is either:
> 
> A. Reduce total job count.
> B. Stop submitting to multiple partitions.

Or
C. At least reduce the number of partitions each job submits to.


> I guess I had never considered the overhead of the number of partitions 
> being submitted to before, but now that you point it out it makes 
> sense.  If you can confirm my thoughts above I can look into what can be 
> done as I will have to talk with my group here as to what the best 
> option is.  As of right now I'm not sure for our environment.
> 
> I did try the build_queue_timeout, which did allow it to get through the 
> full queue but as you noted all it does is just exaserbate the problem 
> of slowness by giving it more time rather than actually solving the real 
> issue.

Exactly.
Comment 4 Paul Edmon 2015-07-25 04:08:06 MDT
Thanks.  This confirms my analysis.

I will discuss it with my team here as to what we are going to do about it.

-Paul Edmon-

Comment 5 Moe Jette 2015-07-27 04:41:36 MDT
I've looked at the code some more and have some ideas.

1. I've added logic to log the number of job-partition pairs being added to the queue for scheduling purposes. This may help us better see what is happening:
https://github.com/SchedMD/slurm/commit/4f40d604b7d8360cc2cc541e6f66d21f3e3b0ff2

2. Rather than logging the timeout every time a queue is built, this reduces the frequency to once every 10 minutes; change the frequency if you care:
https://github.com/SchedMD/slurm/commit/aee694eb79fd0ef62d5dc9b1f632315609f47ffe

3. Something that could consume a fair bit of time is validating job dependencies. Do your jobs have large numbers of dependent jobs?

4. You could make your system more responsive, at a cost of checking fewer jobs, by reducing the build_queue_timeout configuration parameter.
Comment 6 Moe Jette 2015-07-27 05:26:01 MDT
Also see bug 1794 ("slurmctld performance") for what looks like a related problem. I've closed that older bug and will continue pursuing the issue based upon this newer bug.
Comment 7 Paul Edmon 2015-07-27 08:57:30 MDT
I think the cause of the slowness this time was both multiple partitions and job dependencies. We had a user with 3000 jobs, all with dependencies, as well as another user who was submitting roughly 10,000 jobs to 4 queues. The combined effect crushed our scheduler. Now that we know what to look for, we can watch for jobs of this type in the future.

Is there a way to specifically limit the number of dependent jobs a user is allowed to submit?

-Paul Edmon-

Comment 8 Moe Jette 2015-07-27 09:02:19 MDT
(In reply to Paul Edmon from comment #7)
> Is there a way to limit specifically the number of dependent jobs you 
> are allowed to submit?

No, we don't have a good mechanism to do that.

About your only option today would be to limit the number of dependencies each job has by using a job_submit plugin: look at the "dependency" field and return an error if the string length is over some limit, or, if you want to be fancy, parse it and count the number of job IDs in the string.
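
As a rough sketch of that idea (untested, with an arbitrary limit of 50 and a simple ID-counting heuristic), a job_submit.lua script could look roughly like this:

-- Sketch only: reject jobs whose dependency specification names more
-- than MAX_DEPS job IDs. The limit and the message are illustrative.
local MAX_DEPS = 50

function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.dependency ~= nil then
      -- Count job IDs by counting numeric tokens, e.g.
      -- "afterok:123:456,afterany:789" contains three IDs.
      local n_deps = 0
      for _ in string.gmatch(job_desc.dependency, "%d+") do
         n_deps = n_deps + 1
      end
      if n_deps > MAX_DEPS then
         slurm.log_user(string.format(
            "Too many job dependencies (%d > %d); consider job arrays",
            n_deps, MAX_DEPS))
         return slurm.ERROR
      end
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
   return slurm.SUCCESS
end

The script would live alongside slurm.conf with JobSubmitPlugins=lua set; adjust the threshold and message to taste.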
Comment 9 Moe Jette 2015-07-27 09:48:10 MDT
Depending upon your workflow, job arrays and/or job steps may help to reduce the scheduling overhead. Some sites have huge job arrays (order 1 million tasks in a single job array) and/or many job steps (order 20,000 steps in a single job).
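
For example, a batch of similar independent tasks that would otherwise be thousands of separate submissions can be collapsed into one array job. The script below is only a sketch (my_analysis and its input naming are placeholders), and array ranges beyond the default MaxArraySize of 1001 require raising that limit in slurm.conf:

#!/bin/bash
#SBATCH -p partition1
#SBATCH --array=1-1000
# Pending tasks of an array are tracked together rather than as
# thousands of separate job records, which keeps the scheduler's
# queue small. Each task selects its own input via the array index.
./my_analysis input.${SLURM_ARRAY_TASK_ID}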
Comment 10 Paul Edmon 2015-07-27 13:41:06 MDT
Yeah, we are working toward that. Most people here are still used to LSF and aren't really aware of many of the advanced features in Slurm. We are working on it, though.

-Paul Edmon-

Comment 11 Moe Jette 2015-07-28 08:00:15 MDT
Paul, how are things working on this front?
Comment 12 Paul Edmon 2015-07-28 08:08:28 MDT
Great, much better now. The culprit of our slowness was the huge dependency tree created by 3000 jobs that were all interdependent, plus another 7000 jobs asking for 4 partitions. All of that combined caused the scheduler to grind to a halt.

Now that they are clear, everything is back to normal, so you can close this ticket. Thanks for the help. I never knew that dependencies would cause such a slowdown, but now that I think about it, it is obvious. We will keep our eyes on it in the future.

-Paul Edmon-

Comment 13 Moe Jette 2015-07-28 08:19:44 MDT
Problem identified and being addressed.