| Summary: | slurmctld performance | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | Priority: | --- |
| CC: | brian, da | | |
| Version: | 14.11.7 | | |
| Hardware: | Linux | OS: | Linux |
| Site: | Harvard University | | |
| Version Fixed: | 15.08.0-pre7 | | |
| Attachments: | slurm.conf, graph.php | | |
Description
Paul Edmon
2015-07-09 04:04:48 MDT
In addition, we have this for our sysctl.conf on the master:

```
[root@holy-slurm01 etc]# cat sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the default maximum size of a message queue
kernel.msgmnb = 65536

# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

# Network settings
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 16384

# Original settings for TCP buffers
#net.core.rmem_max = 124928
#net.core.wmem_max = 124928
#net.ipv4.tcp_rmem = 4096 87380 4194304
#net.ipv4.tcp_wmem = 4096 16384 4194304
#net.core.netdev_max_backlog = 1000
#net.ipv4.tcp_no_metrics_save = 0
#net.ipv4.tcp_moderate_rcvbuf = 1
#txqueuelen 1000
#MTU:1500
#MTU of 9000 is only for 10Gbs

# CERN suggested settings http://monalisa.cern.ch/FDT/documentation_syssettings.html
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
```

For additional context,
we had a 4 hour stretch on the 16th of June where we saturated the thread limit and had thousands of "server_thread_count over limit (256), waiting" messages in the logs. It seems to occur when we have large numbers of jobs being processed, either for exit or start or submit.

Comment 3, Moe Jette:

Nice graphics!

I see that you already have a lot of scheduling options, and there has been a change in behaviour for bf_interval in the latest release of Slurm (14.11.8). In addition to being the time after one backfill scheduling cycle ends before a new one is started, it is also a limit on how long any backfill scheduling cycle can run. Given how heavyweight the backfill scheduling logic is, I hardly expected to see anyone with bf_interval=1 (the default value is 30 seconds). I would strongly recommend increasing that to a larger value, at least 5. That will leave the slurmctld daemon with more time to handle other operations, which should result in less RPC backlog.

> Seems to occur when we have large numbers of jobs being processed,
> either for exit or start or submit.

There is new functionality in Slurm version 15.08, which we call message aggregation, that should help a great deal in this respect. Similar message types will be combined into a single RPC by the slurmd daemons on compute (or "gateway") nodes. The slurmctld will get far fewer RPCs, but with more work in each one. We are testing this now on a large Cray system at KAUST which is being used for high-throughput computing.

I hope that you can come to the Slurm User Group Meeting in September for more information:
http://slurm.schedmd.com/slurm_ug_agenda.html

Paul Edmon:

Yeah, we found graphing the sdiag stats to be helpful in diagnosing problems. Thanks for the suggestions. We are definitely looking forward to 15.08. As of right now we are at 14.11.7, and we plan to upgrade to .8 at our next downtime at the start of August. The 15.08 upgrade will probably wait for the release of 15.08.1. We will see about the Slurm user group meeting.
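For reference, the bf_interval recommendation is a SchedulerParameters option in slurm.conf. A sketch of what the change looks like (bf_max_job_test=500 is a placeholder standing in for whatever options a site already sets, not a value taken from this report):

```
# slurm.conf fragment (sketch): raise bf_interval from 1 to at least 5
# seconds so each backfill cycle leaves slurmctld time for other RPCs.
SchedulerParameters=bf_interval=5,bf_max_job_test=500
```

After editing, `scontrol reconfigure` should pick up the change without a full slurmctld restart.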
Odds are one of us (or even multiple of us) will show up.

-Paul Edmon-
Comment 5, Moe Jette:

Based upon your information plus our experience at KAUST, the patch below should fix the problem:

https://github.com/SchedMD/slurm/commit/ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch

Here is the commit description:

> The slurmctld logic throttles some RPCs so that only one of them
> can execute at a time in order to reduce contention for the job,
> partition and node locks (only one of the affected RPCs can execute
> at any time anyway and this lets other RPC types run). While an
> RPC is stuck in the throttle function, do not count that thread
> against the slurmctld thread limit.

This patch should work fine with version 14.11.7 or .8. The new logic will be released in 14.11.9, likely some time in August.

Don't forget to change your bf_interval too.

Paul Edmon:

Cool. We will try it out. I just changed bf_interval to 5 as you suggested. Thanks.
-Paul Edmon-

Appears to be a bug in the patch when applied to 14.11.8. I haven't tried 14.11.7.

```
+ cd slurm-14.11.8
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ echo 'Patch #0 (ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch):'
Patch #0 (ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch):
+ /bin/cat /n/home_rc/pedmon/rpmbuild/SOURCES/ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file NEWS
Hunk #1 FAILED at 3.
1 out of 1 hunk FAILED -- saving rejects to file NEWS.rej
patching file src/slurmctld/controller.c
patching file src/slurmctld/proc_req.c
patching file src/slurmctld/slurmctld.h
error: Bad exit status from /var/tmp/rpm-tmp.0Eg9iT (%prep)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.0Eg9iT (%prep)
```

Taking out the NEWS parts seems to have fixed it. I will let you know if there are any other problems.
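"Taking out the NEWS parts" amounts to dropping the NEWS section from the unified diff before handing it to patch. A minimal Python sketch of that filtering (a hypothetical helper; patchutils' `filterdiff`, e.g. `filterdiff -x '*/NEWS' foo.patch`, is the robust tool for this, and the sample diff below is illustrative, not the real patch):

```python
def strip_file_from_diff(diff_text, filename):
    """Drop the section for `filename` from a git-format patch.
    Sketch only: relies on each file section starting with a
    'diff --git a/... b/...' header line."""
    out, skipping = [], False
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Skip everything until the next file header if this
            # section belongs to the unwanted file.
            skipping = f"a/{filename} " in line
        if not skipping:
            out.append(line)
    return "".join(out)

# Illustrative two-file diff, shaped like the failing patch.
sample = """diff --git a/NEWS b/NEWS
--- a/NEWS
+++ b/NEWS
@@ -3 +3,2 @@
 old line
+new line
diff --git a/src/slurmctld/proc_req.c b/src/slurmctld/proc_req.c
--- a/src/slurmctld/proc_req.c
+++ b/src/slurmctld/proc_req.c
@@ -10 +10 @@
-foo
+bar
"""

filtered = strip_file_from_diff(sample, "NEWS")
```

The filtered text keeps the source-code hunks and omits the NEWS hunk that failed against the 14.11.8 tarball.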
-Paul Edmon-

Okay, the patch is in place on our production master. Will let you know how it goes.
-Paul Edmon-

Created attachment 2031 [details]: graph.php

This patch caused an issue where, if a bunch of jobs completed at once, it would generate a huge agent queue, which then bogged down the whole scheduler such that it was unresponsive. Restarting the scheduler took 10 minutes as it flushed all the completing jobs. We had over 2000 jobs stuck in the completing state, which didn't get fully flushed until we reverted to the unpatched version of 14.11.7. Here is a picture of the agent queue size, which was huge. Clearly the patch didn't quite act as intended; it seems it just deferred the completions to deal with later.
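Figures like the agent queue size come straight out of sdiag, which is what the graphs in this ticket were built from. A minimal sketch of scraping sdiag's text output into numbers for graphing (the counter names match sdiag's plain-text format, but the sample values here are illustrative, not real output):

```python
import re

def parse_sdiag(text):
    """Extract numeric 'Name: value' counters from sdiag-style output
    into a dict, suitable for feeding a time-series graphing tool."""
    stats = {}
    for line in text.splitlines():
        # Match lines like "Agent queue size:    2048"
        m = re.match(r"\s*([A-Za-z ()/]+?):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats

# Illustrative snippet shaped like sdiag output.
sample = """\
Server thread count: 12
Agent queue size:    2048
Jobs submitted: 3514
Jobs started:   2890
"""

stats = parse_sdiag(sample)
```

Polling this once a minute and plotting "Agent queue size" and "Server thread count" is enough to spot both the thread-limit saturation and the agent backlog described above.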
In our environment there is stuff going on all the time, so the scheduler still needs to complete jobs in a timely fashion, even if they are punted till later. I think we will wait on this feature until it arrives fully baked in 14.11.9 and 15.08. Clearly it isn't there yet and needs more testing. Let me know if you want any more info.

-Paul Edmon-

Moe Jette:

(In reply to Paul Edmon from comment #10)
> This patch caused an issue where if a bunch of jobs completed at once it
> would generate a huge agent size which then bogged down the whole
> scheduler such that it was unresponsive.

My apologies.
We continue to work on high-throughput issues with KAUST. In fact, I've spent pretty much all day on changes to the agent in order to improve its parallelism, especially for serial jobs. The changes are significant, so I want to put them only into version 15.08. In fact, the patch that I sent you should probably also go only into version 15.08.

If you care to see the agent patch, here it is:
https://github.com/SchedMD/slurm/commit/53534f4907c0333696d2a04046c52a92a5e39c40

(In reply to Moe Jette from comment #11)
> We continue to work on high-throughput issues with KAUST.

FYI: KAUST is working with v15.08.

Paul Edmon:

Yeah, I figured that was the case. No worries. Just wanted to inform you of the result. We are really excited for 15.08 and look forward to its release.

-Paul Edmon-

Moe Jette:

We've made some changes in Slurm version 15.08 for KAUST that should help you, but I believe the root cause at your site is probably better expressed in bug 1827 ("Scheduler Slowness"). I'm going to close this ticket and continue pursuing the issue under that newer bug.