| Summary: | slurmctld performance | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | Priority: | --- |
| CC: | brian, da | | |
| Version: | 14.11.7 | | |
| Hardware: | Linux | OS: | Linux |
| Site: | Harvard University | | |
| Version Fixed: | 15.08.0-pre7 | | |
| Attachments: | slurm.conf, graph.php | | |
Description
Paul Edmon
2015-07-09 04:04:48 MDT
In addition, we have this for our sysctl.conf on the master:

```
[root@holy-slurm01 etc]# cat sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the default maximum size of a message queue
kernel.msgmnb = 65536

# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

# Network settings
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 16384

# Original settings for TCP buffers
#net.core.rmem_max = 124928
#net.core.wmem_max = 124928
#net.ipv4.tcp_rmem = 4096 87380 4194304
#net.ipv4.tcp_wmem = 4096 16384 4194304
#net.core.netdev_max_backlog = 1000
#net.ipv4.tcp_no_metrics_save = 0
#net.ipv4.tcp_moderate_rcvbuf = 1
#txqueuelen 1000
#MTU:1500
#MTU of 9000 is only for 10Gbs

# CERN suggested settings http://monalisa.cern.ch/FDT/documentation_syssettings.html
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
```

For additional context,
we had a 4 hour stretch on the 16th of June where we saturated the thread limit and had thousands of "server_thread_count over limit (256), waiting" messages in the logs. It seems to occur when we have large numbers of jobs being processed, either for exit or start or submit.

Comment 3, Moe Jette:

Nice graphics!

I see that you already have a lot of scheduling options, and there has been a change in behaviour for bf_interval in the latest release of Slurm (14.11.8). In addition to being the time after one backfill scheduling cycle ends before a new one is started, it is also a limit on how long any backfill scheduling cycle can run. Given how heavyweight the backfill scheduling logic is, I hardly expected to see anyone with bf_interval=1 (the default value is 30 seconds). I would strongly recommend increasing that to a larger value, at least 5. That will leave the slurmctld daemon with more time to handle other operations, which should result in less RPC backlog.

> Seems to occur when we have large numbers of jobs being processed,
> either for exit or start or submit.

There is new functionality in Slurm version 15.08, which we call message aggregation, that should help a great deal in this respect. Similar message types will be combined into a single RPC by the slurmd daemons on compute (or "gateway") nodes. The slurmctld will get far fewer RPCs, but with more work in each one. We are testing this now on a large Cray system at KAUST which is being used for high-throughput computing.

I hope that you can come to the Slurm User Group Meeting in September for more information:
http://slurm.schedmd.com/slurm_ug_agenda.html

Paul Edmon:

Yeah, we found graphing the sdiag stats to be helpful in diagnosing problems. Thanks for the suggestions. We are definitely looking forward to 15.08. As of right now we are at 14.11.7, and we plan to upgrade to .8 at our next downtime at the start of August. The 15.08 upgrade will probably wait for the release of 15.08.1. We will see about the Slurm user group meeting.
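For reference, the bf_interval recommendation is a SchedulerParameters option in slurm.conf. A sketch of what the change looks like (bf_max_job_test=500 is a placeholder standing in for whatever options a site already sets, not a value taken from this report):

```
# slurm.conf fragment (sketch): raise bf_interval from 1 to at least 5
# seconds so each backfill cycle leaves slurmctld time for other RPCs.
SchedulerParameters=bf_interval=5,bf_max_job_test=500
```

After editing, `scontrol reconfigure` should pick up the change without a full slurmctld restart.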
Odds are one of us (or even multiple of us) will show up.

-Paul Edmon-
Comment 5, Moe Jette:

Based upon your information plus our experience at KAUST, the patch below should fix the problem:

https://github.com/SchedMD/slurm/commit/ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch

Here is the commit description:

> The slurmctld logic throttles some RPCs so that only one of them
> can execute at a time in order to reduce contention for the job,
> partition and node locks (only one of the affected RPCs can execute
> at any time anyway and this lets other RPC types run). While an
> RPC is stuck in the throttle function, do not count that thread
> against the slurmctld thread limit.

This patch should work fine with version 14.11.7 or .8. The new logic will be released in 14.11.9, likely some time in August.

Don't forget to change your bf_interval too.

Paul Edmon:

Cool. We will try it out. I just changed bf_interval to 5 as you suggested. Thanks.
-Paul Edmon-

Appears to be a bug in the patch when applied to 14.11.8. I haven't tried 14.11.7.

```
+ cd slurm-14.11.8
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ echo 'Patch #0 (ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch):'
Patch #0 (ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch):
+ /bin/cat /n/home_rc/pedmon/rpmbuild/SOURCES/ad9c2413a735a6b15a0133ac5068d546c52de9a1.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file NEWS
Hunk #1 FAILED at 3.
1 out of 1 hunk FAILED -- saving rejects to file NEWS.rej
patching file src/slurmctld/controller.c
patching file src/slurmctld/proc_req.c
patching file src/slurmctld/slurmctld.h
error: Bad exit status from /var/tmp/rpm-tmp.0Eg9iT (%prep)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.0Eg9iT (%prep)
```

Taking out the NEWS parts seems to have fixed it. I will let you know if there are any other problems.
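"Taking out the NEWS parts" amounts to dropping the NEWS section from the unified diff before handing it to patch. A minimal Python sketch of that filtering (a hypothetical helper; patchutils' `filterdiff`, e.g. `filterdiff -x '*/NEWS' foo.patch`, is the robust tool for this, and the sample diff below is illustrative, not the real patch):

```python
def strip_file_from_diff(diff_text, filename):
    """Drop the section for `filename` from a git-format patch.
    Sketch only: relies on each file section starting with a
    'diff --git a/... b/...' header line."""
    out, skipping = [], False
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Skip everything until the next file header if this
            # section belongs to the unwanted file.
            skipping = f"a/{filename} " in line
        if not skipping:
            out.append(line)
    return "".join(out)

# Illustrative two-file diff, shaped like the failing patch.
sample = """diff --git a/NEWS b/NEWS
--- a/NEWS
+++ b/NEWS
@@ -3 +3,2 @@
 old line
+new line
diff --git a/src/slurmctld/proc_req.c b/src/slurmctld/proc_req.c
--- a/src/slurmctld/proc_req.c
+++ b/src/slurmctld/proc_req.c
@@ -10 +10 @@
-foo
+bar
"""

filtered = strip_file_from_diff(sample, "NEWS")
```

The filtered text keeps the source-code hunks and omits the NEWS hunk that failed against the 14.11.8 tarball.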
-Paul Edmon-

Okay, the patch is in place on our production master. Will let you know how it goes.
-Paul Edmon-

Created attachment 2031 [details]: graph.php

This patch caused an issue where, if a bunch of jobs completed at once, it would generate a huge agent queue, which then bogged down the whole scheduler such that it was unresponsive. Restarting the scheduler took 10 minutes as it flushed all the completing jobs. We had over 2000 jobs stuck in the completing state, which didn't get fully flushed until we reverted to the unpatched version of 14.11.7. Here is a picture of the agent queue size, which was huge. Clearly the patch didn't quite act as intended; it seems it just deferred the completions to deal with later.
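Figures like the agent queue size come straight out of sdiag, which is what the graphs in this ticket were built from. A minimal sketch of scraping sdiag's text output into numbers for graphing (the counter names match sdiag's plain-text format, but the sample values here are illustrative, not real output):

```python
import re

def parse_sdiag(text):
    """Extract numeric 'Name: value' counters from sdiag-style output
    into a dict, suitable for feeding a time-series graphing tool."""
    stats = {}
    for line in text.splitlines():
        # Match lines like "Agent queue size:    2048"
        m = re.match(r"\s*([A-Za-z ()/]+?):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats

# Illustrative snippet shaped like sdiag output.
sample = """\
Server thread count: 12
Agent queue size:    2048
Jobs submitted: 3514
Jobs started:   2890
"""

stats = parse_sdiag(sample)
```

Polling this once a minute and plotting "Agent queue size" and "Server thread count" is enough to spot both the thread-limit saturation and the agent backlog described above.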
In our environment there is stuff going on all the time, so the scheduler still needs to complete jobs in a timely fashion, even if they are punted till later. I think we will wait on this feature until it arrives fully baked in 14.11.9 and 15.08. Clearly it isn't there yet and needs more testing. Let me know if you want any more info.

-Paul Edmon-

Moe Jette:

(In reply to Paul Edmon from comment #10)
> This patch caused an issue where if a bunch of jobs completed at once it
> would generate a huge agent size which then bogged down the whole
> scheduler such that it was unresponsive.

My apologies.
We continue to work on high-throughput issues with KAUST. In fact, I've spent pretty much all day on changes to the agent in order to improve its parallelism, especially for serial jobs. The changes are significant, so I want to put them only into version 15.08. In fact, the patch that I sent you should probably also go only into version 15.08.

If you care to see the agent patch, here it is:
https://github.com/SchedMD/slurm/commit/53534f4907c0333696d2a04046c52a92a5e39c40

(In reply to Moe Jette from comment #11)
> We continue to work on high-throughput issues with KAUST.

FYI: KAUST is working with v15.08.

Paul Edmon:

Yeah, I figured that was the case. No worries. Just wanted to inform you of the result. We are really excited for 15.08 and look forward to its release.

-Paul Edmon-

Moe Jette:

We've made some changes in Slurm version 15.08 for KAUST that should help you, but I believe the root cause at your site is probably better expressed in bug 1827 ("Scheduler Slowness"). I'm going to close this ticket and continue pursuing the issue under that newer bug.