Bug 7946 - Slurm: Socket timed out on send/recv operation
Summary: Slurm: Socket timed out on send/recv operation
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.7
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Marcin Stolarek
QA Contact: Ben Roberts
URL:
Depends on:
Blocks:
 
Reported: 2019-10-17 04:36 MDT by Adel Aly
Modified: 2020-05-26 10:42 MDT
CC List: 6 users

See Also:
Site: Zenuity Compute
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: SUSE
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log file (18.76 MB, text/plain)
2020-01-21 07:06 MST, Adel Aly
Details
submit examples (3.05 KB, text/plain)
2020-01-21 07:07 MST, Adel Aly
Details
slurmctld log file (20200305) (4.07 KB, text/plain)
2020-03-05 09:53 MST, Adel Aly
Details
slurm.conf (20200305) (4.07 KB, text/plain)
2020-03-05 09:54 MST, Adel Aly
Details
slurmctld log file (20203005) (35.81 KB, text/plain)
2020-03-05 10:11 MST, Adel Aly
Details
slurmctld.log (2.39 MB, application/gzip)
2020-03-09 09:02 MDT, Adel Aly
Details
slurmctld log (5.01 MB, application/gzip)
2020-03-10 03:27 MDT, Adel Aly
Details
slurmctld logs debug5 (3.77 MB, application/gzip)
2020-03-11 09:49 MDT, Adel Aly
Details
uptime+slurmctld logs (20200313) (15.36 MB, application/gzip)
2020-03-13 05:50 MDT, Adel Aly
Details
journal_dmesg (9.76 MB, application/gzip)
2020-03-13 10:19 MDT, Adel Aly
Details
slurmctld log file (20200318) (21.42 MB, application/x-bzip)
2020-03-18 14:31 MDT, Adel Aly
Details
slurmctld log file (20200318) (21.42 MB, application/x-bzip)
2020-03-18 14:42 MDT, Adel Aly
Details
perf data (18.11 MB, application/x-bzip)
2020-03-26 02:14 MDT, Adel Aly
Details
slurmctld log file (20200325) (14.72 MB, application/x-bzip)
2020-03-26 03:27 MDT, Adel Aly
Details
slurm.conf latest (4.21 KB, text/plain)
2020-03-26 04:35 MDT, Adel Aly
Details
perf archive 100 exit0 (6.39 MB, application/x-bzip)
2020-03-27 02:38 MDT, Adel Aly
Details
perf archive 500 exit0 (6.32 MB, application/x-bzip)
2020-03-27 02:43 MDT, Adel Aly
Details
FlameGraph - perf.100 case (951.85 KB, image/svg+xml)
2020-03-30 08:07 MDT, Marcin Stolarek
Details
Docs - add note about Prolog/EpilogSlurmctld to high_throughput.shtml (1.69 KB, patch)
2020-05-06 02:54 MDT, Marcin Stolarek
Details | Diff

Description Adel Aly 2019-10-17 04:36:45 MDT
Created attachment 11994 [details]
logfiles and config files

The slurm commands, sinfo and squeue are randomly producing this "Socket timed out on send/recv operation" when a large number of array jobs are being submitted to the cluster. 

We aren't able to reproduce this anomaly consistently: sometimes it happens very frequently, sometimes only rarely, and sometimes not at all. We have tried several suggestions from similar issues, but unfortunately that didn't help.

We suspect that a misconfiguration may be causing this issue. Could you help shed some light on it?

We have attached the configuration files along with the log files and sdiag output.

We are using Slurm version 18.08.7 on a SLES 12 SP3 cluster.

Thank you in advance,
adel
Comment 2 Nate Rini 2019-10-17 11:24:45 MDT
(In reply to Adel Aly from comment #0)
> Created attachment 11994 [details]
> logfiles and config files

Please make sure to reset your StoragePass since it was included in the attachments.
Comment 3 Nate Rini 2019-10-17 11:37:41 MDT
(In reply to Adel Aly from comment #0)
> Created attachment 11994 [details]
> logfiles and config files

In the slurm.conf:
> CacheGroups=0
This config parameter has been deprecated.

Looking at the logs:
> [2019-10-11T17:33:06.588] error: _shutdown_bu_thread:send/recv xyz02: Connection refused
Is the backup controller having issues?

> [2019-10-03T19:24:33.324] error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout
Please apply the suggested MySQL/MariaDB settings per https://slurm.schedmd.com/accounting.html under "Slurm Accounting Configuration Before Build".

> [2019-10-11T11:03:51.623] error: Problem getting jobs for cluster
Can you please open a new ticket for this issue?

(In reply to Adel Aly from comment #0)
> The slurm commands, sinfo and squeue are randomly producing this "Socket
> timed out on send/recv operation" when a large number of array jobs are
> being submitted to the cluster. 
The logs also show a fairly busy Slurm server with high agent counts on occasion. I suspect that may be the cause of these timeouts.

Can you please take sdiag output every 5 minutes for an ~hour and then attach that output please.
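
For reference, a minimal collection loop along these lines does the job (a sketch only; the output path and exact cadence are arbitrary choices):

# Capture sdiag every 5 minutes for roughly an hour (13 samples).
for i in $(seq 0 12); do
    date >> /tmp/sdiag.out         # timestamp each sample
    sdiag >> /tmp/sdiag.out        # scheduler/RPC statistics from slurmctld
    echo '----' >> /tmp/sdiag.out  # separator between samples
    [ "$i" -lt 12 ] && sleep 300   # wait 5 minutes between samples
done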
Comment 4 Nate Rini 2019-10-24 19:00:33 MDT
Adel,

I'm going to time this ticket out (no response in a week). Once you can provide the requested data, please attach or reply with it; the ticket will automatically be reopened and we can continue from there.

Thanks,
--Nate
Comment 5 Adel Aly 2019-10-30 07:13:15 MDT
Created attachment 12155 [details]
sdiag output

Thank you for the reply, Nate.

> Is the backup controller having issues?

It's up and running. The service slurmctld is not running on it. We have started it temporarily but that didn't resolve the issue.

> Can you please take sdiag output every 5 minutes for an ~hour and then attach that output please.

Please find it attached.
Comment 6 Nate Rini 2019-10-30 12:45:00 MDT
(In reply to Adel Aly from comment #5)
> > Is the backup controller having issues?
> It's up and running. The service slurmctld is not running on it. We have
> started it temporarily but that didn't resolve the issue.
If you're not using it, I suggest removing it from slurm.conf. It can always be added back later.

> > Can you please take sdiag output every 5 minutes for an ~hour and then attach that output please.
> Please find it attached.

This cluster looks like it's almost idle:
>$ grep 'Jobs pending' sdiag.out  |sort|uniq -c
>     12 Jobs pending:   10
>      1 Jobs pending:   11

The backfill scheduler looks like it's not running very often and is a little slow:
> Max cycle:  186701222 -> 3.11 minutes

Let's change this in slurm.conf:
> SchedulerParameters=sched_min_interval=1000000
to 
> SchedulerParameters=sched_min_interval=1000000,bf_continue,max_rpc_cnt=100

(In reply to Adel Aly from comment #0)
> The slurm commands, sinfo and squeue are randomly producing this "Socket
> timed out on send/recv operation" when a large number of array jobs are
> being submitted to the cluster. 

Is it possible to submit a large batch of array jobs with a higher debug level activated?
> scontrol setdebugflags +Protocol
> scontrol setdebug debug5
-- submit and run array jobs --
> scontrol setdebugflags -Protocol
> scontrol setdebug info

Please attach the slurmctld log.
Comment 7 Adel Aly 2019-11-12 08:40:06 MST
Created attachment 12303 [details]
slurmctld

Please find attached slurmctld after applying the suggested change and enabling debug mode.
Comment 10 Nate Rini 2019-11-13 18:00:18 MST
(In reply to Adel Aly from comment #7)
> Created attachment 12303 [details]
> slurmctld
> 
> Please find attached slurmctld after applying the suggested change and
> enabling debug mode.

I'm seeing some munge errors:
> Munge decode failed: Expired credential

Can you please provide the slurmd log for the following nodes:
> 10.60.20.114
> 10.60.20.12
> 10.60.20.163

Can you also please verify that all the clocks are in sync on the cluster.

How was slurmctld shut down around 2019-11-11T16:30:08.385? From the logs, it looks like it went down really hard.
Comment 11 Adel Aly 2019-11-14 04:02:37 MST
Created attachment 12320 [details]
slurmd log files

Please find the slurmd logs attached.

All the clocks are in sync across all the nodes.

The slurmctld has not been shut down during this time. It was a reconfigure:

###########
[2019-11-11T16:30:08.354] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2019-11-11T16:30:08.358] No memory enforcing mechanism configured.
[2019-11-11T16:30:08.358] layouts: no layout to initialize
[2019-11-11T16:30:08.385] restoring original state of nodes
[2019-11-11T16:30:08.390] cons_res: select_p_node_init
[2019-11-11T16:30:08.390] cons_res: preparing for 3 partitions
[2019-11-11T16:30:08.419] cons_res: select_p_reconfigure
[2019-11-11T16:30:08.419] cons_res: select_p_node_init
[2019-11-11T16:30:08.419] cons_res: preparing for 3 partitions
[2019-11-11T16:30:08.421] No parameter for mcs plugin, default values set
[2019-11-11T16:30:08.421] mcs: MCSParameters = (null). ondemand set.
###########

We can send older logs if necessary.
Comment 12 Nate Rini 2019-11-14 10:47:37 MST
> [2019-09-16T17:30:09.854] Node configuration differs from hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=64:2(hw) CoresPerSocket=1:16(hw) ThreadsPerCore=1:2(hw)

Please attach your slurm.conf (if changed since comment #0) and gres.conf for the nodes in comment #11.
Comment 15 Nate Rini 2019-11-14 10:59:02 MST
Please call the following on the same nodes from comment #10:
> systemctl show munge #please correct if munge unit name is different
> remunge
> ps auxf
Comment 18 Nate Rini 2019-11-14 19:45:17 MST
Please also call the following on your controller (xyz01):
> slurmctld -V
> ldd $(which slurmctld)
> objdump -t $(which slurmctld)
Comment 19 Adel Aly 2019-11-22 04:58:30 MST
Created attachment 12376 [details]
output from active head node and compute nodes

Please find the required output attached.
Comment 24 Nate Rini 2019-11-25 17:26:21 MST
This error appears frequently in your logs, but it was removed in Slurm 17.11, while your installed Slurm is 18.08:
> error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured

I suspect older plugins were not cleanly updated on this cluster. Having an older library linked into Slurm could cause a huge number of generally inexplicable errors (and/or memory corruption).

Is it possible to do a clean install of Slurm (please update to latest 18.08 too) and verify that the loaded gres_mic plugin is the one compiled for 18.08 and not 17.02?
Comment 27 Nate Rini 2019-11-26 09:52:42 MST
(In reply to Nate Rini from comment #24)
> verify that the loaded gres_mic plugin is the one compiled for 18.08 and not 17.02?

Please call the following on your slurm controller node:
> pmap -p $(pgrep slurmctld)
> pmap -p $(pgrep slurmctld) | awk '$4 ~ /.so$/ {print $4}'|sort|uniq |while read i; do ldd $i | while read l; do echo "$i: $l"; done; done 
> pmap -p $(pgrep slurmctld) | awk '$4 ~ /.so$/ {print $4}'|sort|uniq |while read i; do objdump -g $i | while read l; do echo "$i: $l"; done; done

We also saw this error in the logs
> Warning: slurmstepd startup took 21 sec, possible file system problem or full memory
Please configure slurmd on the nodes to use the memlock option: "-M" and then please add this to the slurm.conf "LaunchParameters=slurmstepd_memlock". This will cause the slurmd and slurmstepd daemons to lock their memory on the nodes to avoid delays caused by slow filesystems. This will not help if the user application is slow to start/load. Is it possible to monitor the logs for this error message in the future? This is a filesystem issue that is causing Slurm to slow down.
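
For reference, the change looks roughly like this (the drop-in path and slurmd binary location are assumptions about this installation, not taken from the logs). In slurm.conf on the controller:
> LaunchParameters=slurmstepd_memlock
On each compute node, one way to pass -M to slurmd is a systemd drop-in, e.g. /etc/systemd/system/slurmd.service.d/memlock.conf:
> [Service]
> ExecStart=
> ExecStart=/usr/sbin/slurmd -M
followed by "systemctl daemon-reload && systemctl restart slurmd". Copy the ExecStart line from the distribution's own unit file and append -M rather than using the path above verbatim.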
Comment 28 Nate Rini 2019-11-26 10:00:20 MST
> [2019-11-10T16:30:24.730] error: You are using cons_res or gang scheduling with Fastschedule=0 and node configuration differs from hardware.  The node configuration used will be what is in the slurm.conf because of the bitmaps the slurmctld must create before the slurmd registers.

Is there any reason Fastschedule=0 has been set? This is a deprecated configuration option that can cause surprising issues. Please unset it and restart slurm across the board. This may be causing Slurm to think the node has more memory than physically available.
Comment 29 Adel Aly 2019-11-27 03:31:39 MST
Thank you for your reply. We will gather the required information and upload them shortly.

What is not clear to us is the note about FastSchedule=0.


> Is there any reason Fastschedule=0 has been set? This is a deprecated configuration option that can cause surprising issues. 

In the man pages, we don't see any note that FastSchedule is deprecated (version 18.08). Where do we get this information from? What kind of surprising issues are you referring to? Is it related to "timeout" issues or something else? We have always used FastSchedule=0 without issues. 

> Please unset it and restart slurm across the board. 

I assume you mean remove the "FastSchedule=0" from the slurm.conf and then restart slurmctld on the controller node and slurmd on the compute nodes.

> This may be causing Slurm to think the node has more memory than physically available.

Do you mean that this is a bug that has been fixed by introducing another configuration parameter?
Comment 30 Nate Rini 2019-11-27 10:14:41 MST
(In reply to Adel Aly from comment #29)
> > Is there any reason Fastschedule=0 has been set? This is a deprecated configuration option that can cause surprising issues. 
> 
> In the man pages, we don't see any note that FastSchedule is deprecated
> (version 18.08). 

> Where do we get this information from? 

Fastschedule=0 was deprecated in 19.05.4 here: https://github.com/SchedMD/slurm/commit/2c44fcf69cd52ada1b3e56ccc56eed0f4534d3d5

> What kind of surprising issues are you referring to?
Fastschedule only existed to allow developers to test node configurations that their machine did not have. It overrides Slurm's built-in checks that make sure the configuration matches the node. With Fastschedule=0 Slurm will ignore the hardware configuration and trust the config as is. If memory on a node is missing, then Slurm will still allow a job to run expecting the full memory.

> Is it related to "timeout" issues or something else?
It should be unrelated.

> We have always used FastSchedule=0 without issues. 
> > Please unset it and restart slurm across the board. 
> 
> I assume you mean remove the "FastSchedule=0" from the slurm.conf and then
> restart slurmctld on the controller node and slurmd on the compute nodes.
Yes.
 
> > This may be causing Slurm to think the node has more memory than physically available.
> Do you mean that this is a bug that has been fixed by introducing another
> configuration parameter?

It is not a bug but existed for development per the slurm.conf man page:
> Base scheduling decisions upon the actual configuration of each individual node except that the node's processor count in Slurm's configuration must match the actual hardware configuration  if  PreemptMode=suspend,gang  or SelectType=select/cons_res are configured (both of those plugins maintain resource allocation information using bitmaps for the cores in the system and must remain static, while the node's memory and disk space can be established later).
Comment 31 Adel Aly 2019-12-02 03:03:24 MST
Created attachment 12447 [details]
pmap output

Please find attached the required pmap output.
Comment 32 Nate Rini 2019-12-02 09:01:43 MST
(In reply to Adel Aly from comment #31)
> Created attachment 12447 [details]
> pmap output
> 
> Please find attached the required pmap output.

Please also call this:
> ldd /cm/shared/apps/slurm/18.08.7/lib64/slurm/gres_mic.so
> objdump -g /cm/shared/apps/slurm/18.08.7/lib64/slurm/gres_mic.so
Comment 33 Adel Aly 2019-12-02 09:41:10 MST
Created attachment 12448 [details]
ldd and objdump

Please find the output attached.
Comment 35 Nate Rini 2019-12-02 10:15:48 MST
(In reply to Adel Aly from comment #33)
> Created attachment 12448 [details]
> ldd and objdump
> 
> Please find the output attached.

Based on this output, the plugin is currently at the correct level. Looking back at the slurmd logs from comment #11, it looks like something changed on 2019-09-20 to correct the issue in comment #24.
Comment 36 Nate Rini 2019-12-02 10:18:38 MST
(In reply to Nate Rini from comment #27)
> (In reply to Nate Rini from comment #24)
> We also saw this error in the logs
> > Warning: slurmstepd startup took 21 sec, possible file system problem or full memory
> Please configure slurmd on the nodes to use the memlock option: "-M" and
> then please add this to the slurm.conf
> "LaunchParameters=slurmstepd_memlock". This will cause the slurmd and
> slurmstepd daemons to lock their memory on the nodes to avoid delays caused
> by slow filesystems. This will not help if the user application is slow to
> start/load. Is it possible to monitor the logs for this error message in the
> future? This is a filesystem issue that is causing Slurm to slow down.

Can you please verify that memlock has been activated on the nodes? If so, is the customer still seeing the timeout errors?
Comment 37 Adel Aly 2019-12-03 06:57:12 MST
> Based on this output, the plugin is currently at the correct level. Looking back at the slurmd logs from comment #11, it looks like something changed on 2019-09-20 to correct the issue in comment #24.

No, we haven't changed anything.

> Can you please verify that memlock has been activated on the nodes? If so, is the customer still seeing the timeout errors?

Yes, slurmd is running on all the nodes with the "-M" flag.
Comment 38 Adel Aly 2019-12-03 08:53:54 MST
Do you have a recommendation on the expected read/write speed to the Isilon shared storage which can contribute to this timeout issue?
Comment 39 Nate Rini 2019-12-03 11:25:59 MST
(In reply to Adel Aly from comment #37)
> > Can you please verify that memlock has been activated on the nodes? If so, is the customer still seeing the timeout errors?
> Yes, slurmd is running on all the nodes with the "-M" flag.
That should make Slurm more responsive even if the filesystem is locked/slower. It won't fix any kernel driver locked states though.

(In reply to Adel Aly from comment #38)
> Do you have a recommendation on the expected read/write speed to the Isilon
> shared storage which can contribute to this timeout issue?
I believe this is covered in detail in our site training. In general, all of Slurm should be accessible on a filesystem with latency and bandwidth similar to a dedicated local disk (preferably a NVRAM or SSD device). The suggestion to use memlock should help if there is a shared file system with latency issues on the compute nodes but not the controller. If the controller is suffering latency issues from reading/writing to StateSaveLocation then all cluster operations will likely slow down (possibly dramatically).
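
As a rough sanity check of StateSaveLocation latency, a synchronous small-write test along these lines can help (a sketch; the path is a placeholder and this is not a substitute for a proper benchmark):
> cd /path/to/StateSaveLocation
> time dd if=/dev/zero of=./latency_probe bs=4k count=1000 oflag=dsync
> rm -f ./latency_probe
A result that takes many seconds points at exactly the kind of write latency described above, since slurmctld makes frequent small writes there.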
Comment 40 Nate Rini 2019-12-10 12:11:47 MST
Adel

I'm going to close this ticket since there haven't been any more questions. Please respond to have the ticket reopened.

Thanks,
--Nate
Comment 41 Adel Aly 2019-12-13 03:09:29 MST
Hi Nate,

The customer noticed an improvement when working with 1000 job arrays but still sees the timeout issue when submitting 3000+ job arrays.
Comment 42 Nate Rini 2019-12-13 09:29:00 MST
(In reply to Adel Aly from comment #41)
> The customer noticed an improvement when working with 1000 job arrays but
> still sees the timeout issue when submitting 3000+ job arrays.

Please provide logs around the incident and the job id.
Comment 43 Nate Rini 2019-12-26 10:44:42 MST
Adel,

Please reply to this ticket with the logs when convenient and it will reopen automatically.

Thanks,
--Nate
Comment 44 Adel Aly 2020-01-21 07:06:29 MST
Created attachment 12782 [details]
slurmctld log file

Hi Nate,

We have just verified that the issue still exists with an array of 5000 steps.

There are 3 scenarios:

1. the users submitted one job array with 3000+ steps
2. the users submitted 3000+ job arrays of different sizes
3. the users submitted different job arrays of different sizes which resulted in 3000+ steps

We have attached submit examples and the slurmctld log file.
Comment 45 Adel Aly 2020-01-21 07:07:03 MST
Created attachment 12783 [details]
submit examples
Comment 46 Nate Rini 2020-01-21 10:49:08 MST
(In reply to Adel Aly from comment #44)
> Created attachment 12782 [details]
> slurmctld log file
> cleanup_completing: JobId=4558420_2631(4561052) completion process took 160 seconds

Can you please verify what the job prolog script is doing? It appears to be taking quite a long time to run.
Comment 47 Nate Rini 2020-01-21 11:11:26 MST
(In reply to Adel Aly from comment #45)
> Created attachment 12783 [details]
> submit examples

> myuser@xyz01:/dms/app/Test/Slurm/Samples/basic-examples$ time sbatch --array=0-1999 reproduce_slow_user_commands.sh 
Can you please rerun this command, but in verbose mode?
> time sbatch -vvvvvv --array=0-1999 reproduce_slow_user_commands.sh
Comment 48 Adel Aly 2020-02-27 04:15:53 MST
Hi Nate,

We have found that the issue is caused by the amount of time taken by the prolog configured as PrologSlurmctld in slurm.conf. The issue manifests itself with job arrays with a large number of steps (3000+) because the prolog run time is around 0.25 seconds. From our observation, it seems that the prolog runs sequentially for each job in the array, which means that for a 5000-element job array the total prolog run time adds up to roughly 0.25*5000 = 1250 seconds (~20 minutes), which seems to be causing the "timed out on send/recv operation" issue. 

We would like to know if it is possible to force the PrologSlurmctld prolog to run in parallel when submitting a job array.

Thank you,
Comment 49 Jason Booth 2020-02-27 15:38:26 MST
Hi Adel - We have not been able to duplicate the behavior you observed. Can you elaborate on how you did this testing to observe the 0.25-second delay between each PrologSlurmctld? 

Slurm will launch the prolog as quickly as it can, and do so in parallel.

src/slurmctld/job_scheduler.c  

In "prolog_slurmctld" we create the thread for this prolog and move on immediately.


 If you are seeing a delay then the delay may be happening somewhere else. 


Can you tell me what the prolog is doing?
/cm/local/apps/cmd/scripts/prolog-prejob
Comment 50 Adel Aly 2020-03-05 09:51:28 MST
Hi Jason,

Nice to hear back from you.

> Slurm will launch the prolog as quickly as it can, and do so in parallel.

You're right indeed. According to our observations, the PrologSlurmctld is run in parallel for the jobs being scheduled to run, but we suspect that the overhead of starting (spawning) the threads adds up with large job arrays.

I will try to break down our observation in the following:

1. We will use the following basic test to run job arrays for consistent results with different configuration changes:

time sbatch --array=0-3999 --wait reproduce_slow_user_commands.sh

cat reproduce_slow_user_commands.sh
#!/bin/bash

#SBATCH --output=/dev/shm/reproduce.output
#SBATCH --error=/dev/shm/reproduce.output


echo "Start: ${SLURM_ARRAY_TASK_ID}"
timeout 180 yes FOO > /dev/null
echo "Finished: ${SLURM_ARRAY_TASK_ID}"


2. We will run the test with and without PrologSlurmctld in slurm.conf

TestA: without PrologSlurmctld

grep PrologSlurm /etc/slurm/slurm.conf
#PrologSlurmctld=

We have noticed that the issue doesn't appear at all without PrologSlurmctld and as a result, all the Slurm commands are very responsive:

$ time sbatch --array=0-3999 --wait reproduce_slow_user_commands.sh
Submitted batch job 9835560

real    5m33.706s
user    0m0.443s
sys     0m0.378s


TestB: with PrologSlurmctld using a script that does nothing except exit(0)

grep PrologSlurm /etc/slurm/slurm.conf
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob

cat prolog-prejob
#!/usr/bin/perl

# This script invokes cluster health checker

exit(0);
#

We have noticed that the issue always appears with the same test; however, the number of "send/recv timeout" messages can vary between runs:

$ time sbatch --array=0-3999 --wait reproduce_slow_user_commands.sh
Submitted batch job 9839565

sbatch: error: Currently unable to load job state information, retrying: Socket timed out on send/recv operation

sbatch: error: Currently unable to load job state information, retrying: Socket timed out on send/recv operation

real    8m11.142s
user    0m0.426s
sys     0m0.250s



We will attach the current slurm.conf and the latest slurmctld logs, and we will do another round of tests to make sure that the issue is consistently reproducible.

Please let us know if we need to run any other extensive tests on the environment or collect more info that will make it easier for you to debug this issue.
Comment 51 Adel Aly 2020-03-05 09:53:21 MST
Created attachment 13278 [details]
slurmctld log file (20200305)
Comment 52 Adel Aly 2020-03-05 09:54:25 MST
Created attachment 13279 [details]
slurm.conf (20200305)
Comment 53 Nate Rini 2020-03-05 10:01:27 MST
Comment on attachment 13278 [details]
slurmctld log file (20200305)

Looks like slurm.conf got uploaded instead of slurmctld log
Comment 54 Adel Aly 2020-03-05 10:11:18 MST
Created attachment 13280 [details]
slurmctld log file (20203005)

Apologies for the confusion Nate. I have attached the correct file.
Comment 55 Nate Rini 2020-03-05 10:59:26 MST
Adel,

Something looks very wrong with filesystem serving StateSaveLocation:

> error: opendir(/cm/shared/apps/slurm/var/cm/statesave/hash.0/job.8898750): Permission denied

Slurm must always be able to read and write StateSaveLocation otherwise very *unwanted* things can happen.

This needs to be fixed ASAP. Can you please verify that SlurmUser owns all directories/files in StateSaveLocation?

--Nate
Comment 56 Adel Aly 2020-03-06 03:40:14 MST
Hi Nate,

We have fixed the permissions on the statesave of some of the old jobs.

Did you have a chance to look into the "Socket timed out on send/recv operation" issue with the example we have sent?

Let us know if you need any further details.
Comment 57 Nate Rini 2020-03-06 10:03:02 MST
(In reply to Adel Aly from comment #56)
> We have fixed the permissions on the statesave of some of the old jobs.
Any idea how the permissions were changed?
 
> Did you have a chance to look into the "Socket timed out on send/recv
> operation" issue with the example we have sent?
> 
> Let us know if you need any further details.

Please provide another slurmctld log for at least 20 minutes. If possible, please activate higher debug level:
> scontrol setdebug debug3
> scontrol setdebugflags +Network #this may not work on <=18.08

Reverse extra logging with:
> scontrol setdebugflags -Network
> scontrol setdebug info
Comment 58 Adel Aly 2020-03-09 09:02:08 MDT
Created attachment 13306 [details]
slurmctld.log

Hi Nate, Please find attached the slurmctld logs with debug level 3.

The issue is reproducible on the same cluster with just an empty prolog (just adding exit(0) at the very top of the script).

We have an strace of the slurmctld process captured while reproducing the "send/recv operation" timeout issue, but it's about 120G (compressed):

https://support2.brightcomputing.com/adel/schedmd-7946/slurmctld.strace.tar.bz2

The "+Network" flag doesn't seem to be recognized:

xyz01:~ # scontrol setdebugflags +Network
scontrol: error: Invalid DebugFlag: Network
invalid debug flag: +Network

xyz01:~ # slurmctld -V
slurm 18.08.7
Comment 59 Nate Rini 2020-03-09 11:55:14 MDT
(In reply to Adel Aly from comment #58)
> The "+Network" flag doesn't seem to be recognized:
> 
> xyz01:~ # scontrol setdebugflags +Network
> scontrol: error: Invalid DebugFlag: Network
> invalid debug flag: +Network
Yeah, that failing is not unexpected; please ignore it.
Comment 60 Nate Rini 2020-03-09 12:08:54 MDT
(In reply to Adel Aly from comment #58)
> Created attachment 13306 [details]
> slurmctld.log
> 
> Hi Nate, Please find attached the slurmctld logs with debug level 3.
> 
> The issue is reproducible on the same cluster with just an empty prolog
> (just adding exit(0) at the very top of the script).
Is this slurmctld.log from the 'exit(0)' prolog?

There are only 26 connection refused errors and 0 timeout errors now:
> error: _shutdown_bu_thread:send/recv xyz..: Connection refused
These errors are only for host xyz02, which I assume is down?
Comment 61 Nate Rini 2020-03-09 13:18:17 MDT
(In reply to Adel Aly from comment #58)
> Created attachment 13306 [details]
> slurmctld.log
> 
> Hi Nate, Please find attached the slurmctld logs with debug level 3.
This log doesn't look like it includes any debug messages either. Maybe the wrong file was uploaded?
Comment 62 Adel Aly 2020-03-09 17:16:46 MDT
> Is this slurmctld.log from the 'exit(0)' prolog?

Yes, exactly. We tried several cases:

1. no prolog
2. empty prolog (exit(0))
3. full prolog
4. prolog without file writes
5. prolog without file writes and without any system calls

The last 4 tests produced the same errors.

> These errors are only for host xyz02, which I assume is down?

Yes, this server is down and we have discussed this before. This error is not related to the issue which we are seeing.

> This log doesn't look like it includes any debug messages either. Maybe the wrong file was uploaded?

This is the correct file. We used debug level 3 as you have suggested (from the command line and using SlurmctldDebug=3 in the slurm.conf file). We will attach an older log file just in case. Do you want a log file with a higher debug level?
Comment 63 Nate Rini 2020-03-09 17:23:15 MDT
(In reply to Adel Aly from comment #62)
> This log doesn't look like it includes any debug messages either. Maybe the wrong file was uploaded?
> 
> This is the correct file. We used debug level 3 as you have suggested (from
> the command line and using SlurmctldDebug=3 in the slurm.conf file). We will
> attach an older log file just in case.

debug3 should be very chatty in the logs. Is this site using syslog or some kind of intermediary that may be filtering the logs?

> Do you want a log file with a higher debug level?

Might as well try debug5. There should be log lines starting with `debug[1-5]:`, which proves that slurmctld is running at the higher debug level.
Comment 64 Adel Aly 2020-03-10 03:27:04 MDT
Created attachment 13316 [details]
slurmctld log

Please find the slurmctld logs since the beginning of the test on Sunday.

In the logs, you can find:

[2020-03-08T11:31:40.125] Set debug level to 3

In slurm.conf we have also set during the test:

SlurmctldDebug=3
Comment 66 Nate Rini 2020-03-10 09:41:14 MDT
(In reply to Adel Aly from comment #64)
> Created attachment 13316 [details]
> slurmctld log
> 
> Please find the slurmctld logs since the beginning of the test on Sunday.
> 
> In the logs, you can find:
> 
> [2020-03-08T11:31:40.125] Set debug level to 3
> 
> In slurm.conf we have also set during the test:
> 
> SlurmctldDebug=3

Please make sure to include "debug" in the level:
> SlurmctldDebug=debug3

For historical reasons, Slurm still accepts a bare integer, but that form can be confusing and is not suggested.
Comment 68 Adel Aly 2020-03-11 09:49:35 MDT
Created attachment 13347 [details]
slurmctld logs debug5

Please find attached the log files with debug5 enabled.
Comment 70 Jason Booth 2020-03-11 14:45:42 MDT
Hi Adel,

One thing we did see in the logs you uploaded is munge encode failures due to munge's low thread count. Communication will not happen correctly since we are not able to verify packets between slurmctld <-> slurmd. The good news is that you can increase the thread count for munge as explained below. Can you please try this and report back whether there is a change with regard to the timeouts? 

munged --num-threads 10

As explained in the high-throughput guide: https://slurm.schedmd.com/high_throughput.html



[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:00.093] debug:  Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
[2020-03-11T16:29:46.235] debug2: Error connecting slurm stream socket at 10.60.20.72:6818: Connection refused
[2020-03-11T16:29:47.235] debug2: Error connecting slurm stream socket at 10.60.20.72:6818: Connection refused
[2020-03-11T16:29:48.236] debug2: Error connecting slurm stream socket at 10.60.20.72:6818: Connection refused
[2020-03-11T16:29:49.236] debug2: Error connecting slurm stream socket at 10.60.20.72:6818: Connection refused
[2020-03-11T16:29:50.236] debug2: Error connecting slurm stream socket at 10.60.20.72:6818: Connection refused
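
If raising the thread count helps, one way to make it persistent is a systemd drop-in for the munge unit (the unit name, drop-in path, and munged path below are assumptions; copy the ExecStart line from the distribution's own unit file and append the flag):
> # /etc/systemd/system/munge.service.d/threads.conf
> [Service]
> ExecStart=
> ExecStart=/usr/sbin/munged --num-threads 10
then run "systemctl daemon-reload && systemctl restart munge" on the affected hosts.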
Comment 71 Marcin Stolarek 2020-03-12 02:13:25 MDT
Adel,

Can you tell me when you last noticed the "Socket timed out on send/recv operation" error message? The debug5 log contains only ~4 minutes of slurmctld activity; did the issue happen between 16:30 and 16:34 yesterday?

Does the "Socket timed out..." issue happens on all nodes, on a specific submit host? Could you please share OS logs:
-results of dmesg -H
- journalctl --since "2 days ago" (If you're on older SUSE version /var/log/messages will be the equivalent of journalctl output.)
from the host running controller and the one where you see the error messages.

Is the /cm/shared directory local for slurmctld or is it served by external storage? If it's external could you please share the mount options used?

Are you executing some "user" commands from root in an automated manner? Any cron job or service running commands like `squeue`, `scontrol show job`, `scontrol show node`, or `sinfo`? From the sdiag output the main load comes from the root user, which is normally daemon-to-daemon communication; however, I'm surprised that you have so many (compared to other RPCs) REQUEST_PARTITION_INFO or REQUEST_NODE_INFO_SINGLE. Mentioned commands executed frequently can lead to Slurm performance issues. Are you running those commands in Prolog/Epilog scripts?

cheers,
Marcin
Comment 72 Adel Aly 2020-03-12 03:52:21 MDT
Hi Jason, Marcin,

> munged --num-threads 10

We have tried increasing the number of threads to 10 and 25, but we got the same results (the issue is reproducible). Any idea why munged would be an issue with and without the prolog enabled?

I would assume that if munged works fine without a prolog then it would also work fine with a prolog enabled, or is the prolog and munged related to each other?

> Can you tell me when you last noticed the "Socket timed out on send/recv operation" error message? 

It happened between 15:50 and 16:20 but I don't recall the exact time, unfortunately.

> The debug5 log contains only ~4 minutes of slurmctld activity; did the issue happen between 16:30 and 16:34 yesterday?

We enabled debug level 5 starting at 15:45 on March 11, 2020. The logs attached are from March 9, 2020 until the end of the test at 16:34 on March 11. The debug window is definitely larger than 4 minutes.


(these lines are from the slurmctld logs debug5 attached to this case)

[2020-03-11T15:45:20.001] debug:  Log file re-opened
[...]
[2020-03-11T16:34:43.176] debug2: _slurm_rpc_get_fed usec=137

> Does the "Socket timed out..." issue happens on all nodes, on a specific submit host?

It happens on the submit host (management node). We see it when we run the sbatch command and other commands as well like squeue and scancel.

> Is the /cm/shared directory local for slurmctld or is it served by external storage? If it's external could you please share the mount options used?

It's hosted on an NFS server:

(rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1.2.3.4,mountvers=3,mountport=1234,mountproto=udp,local_lock=none,addr=1.2.3.4)

> Are you executing some "user" commands from root in an automated manner?

Please refer to comment 50 (https://bugs.schedmd.com/show_bug.cgi?id=7946#c50). It has all the details on the test we are running.

> I'm surprised that you have so many (compared to other RPCs) REQUEST_PARTITION_INFO or REQUEST_NODE_INFO_SINGLE. Mentioned commands executed frequently can lead to Slurm performance issues. Are you running those commands in Prolog/Epilog scripts?


The "scontrol show partition -o" and "scontrol ping", in addition to "scontrol show job", are run periodically by the management daemon to collect information about the status of the partition and nodes and jobs in slurm.
Comment 73 Marcin Stolarek 2020-03-12 04:33:14 MDT
Adel,

>[...] or is the prolog and munged related to each other?
This may depend on PrologSlurmctld content. Could you please share the script with us?

>The debug window is definitely larger than 4 minutes.
Sorry for that - I missed that log level changed in the middle of the second file in the attachment.

>Please refer to comment 50
Can you reproduce the issue without the "--wait" option?

>The "scontrol show partition -o" and "scontrol[..] management daemon[...]
Are you able to disable this temporarily to verify its impact?

I'd recommend adding nodiratime and noatime options to NFS mount.

cheers,
Marcin
Comment 74 Marcin Stolarek 2020-03-12 04:35:45 MDT
PS. Please ignore "Could you please share the script with us?" I see your point from comment 50.
Comment 75 Adel Aly 2020-03-12 04:48:37 MDT
Hi Marcin,

> This may depend on PrologSlurmctld content. Could you please share the script with us?

The script currently has a single line:

exit(0);


> >Please refer to comment 50
> Can you reproduce the issue without the "--wait" option?

I recall we reproduced it without --wait in a previous attempt, but we also ran squeue and scancel and saw the issue (those commands don't use --wait).

> Are you able to disable this temporarily to verify its impact?

We can indeed disable this by stopping the management daemon. But why don't we notice the issue without PrologSlurmctld set in slurm.conf? If disabling the management daemon would have an impact, then I would expect to see the issue both with and without a PrologSlurmctld.

> I'd recommend adding nodiratime and noatime options to NFS mount.

We can change the mount options, but based on our observations this doesn't seem to have an impact (the issue disappears just by disabling the PrologSlurmctld in slurm.conf, which contains nothing more than exit(0)). If you think the mount options have an impact on the issue, then we can change them and remount.
Comment 76 Marcin Stolarek 2020-03-12 05:21:21 MDT
Adel,

I have a hypothesis about what's happening. If you submit a big job array with relatively short tasks while you have idle nodes, slurmctld will start them immediately, calling the PrologSlurmctld script for every array element. This can create thousands of processes on the box running slurmctld, potentially overloading the platform so the OS can't provide enough CPU cycles for slurmctld to handle all requests within the timeout, while the number of other RPCs related to job start/end increases (the --wait option to sbatch results in additional RPCs - it just requests the job status periodically).

Can you try to reproduce the issue and check the load on the server where slurmctld is running using `uptime` command? Let's say every 30 seconds during the low responsiveness of the controller daemon.

Do I understand correctly that you're only asking out of curiosity, since production use of SlurmctldProlog that only loads Perl interpreter and exits doesn't bring a huge value?

cheers,
Marcin
Comment 77 Adel Aly 2020-03-12 05:31:10 MDT
Hi Marcin,

> I have a hypothesis about what's happening. If you submit a big job array with relatively short tasks while you have idle nodes, slurmctld will start them immediately, calling the PrologSlurmctld script for every array element. This can create thousands of processes on the box running slurmctld, potentially overloading the platform so the OS can't provide enough CPU cycles for slurmctld to handle all requests within the timeout, while the number of other RPCs related to job start/end increases (the --wait option to sbatch results in additional RPCs - it just requests the job status periodically).

> Can you try to reproduce the issue and check the load on the server where slurmctld is running using `uptime` command? Let's say every 30 seconds during the low responsiveness of the controller daemon.

We will re-run the test and record the load average every 30 seconds.

> Do I understand correctly that you're only asking out of curiosity, since production use of SlurmctldProlog that only loads Perl interpreter and exits doesn't bring a huge value?


I'm not sure if I understand the question here. Can you please elaborate on what is required?
Comment 78 Marcin Stolarek 2020-03-12 05:55:18 MDT
>I'm not sure if I understand the question here. Can you please elaborate on what is required?

I'm trying to get the whole picture to find the best solution for you.

Are you using a PrologSlurmctld that just loads perl and exits? If we understand the issue correctly you could just disable it; or is there some reason for having it, i.e. in reality you're using a different perl script that does some real work?

cheers,
Marcin
Comment 79 Adel Aly 2020-03-12 06:05:34 MDT
Hi Marcin,

> Are you using a PrologSlurmctld that just loads perl and exits? 


At the moment this is true, but we did this (i.e. created a perl script which does nothing more than exit(0);) because one of the support engineers suggested that the issue was coming from the contents of the prolog script itself, so we wanted to eliminate this possibility and focus on solving the real issue rather than debugging a perl script which is not related to the issue we are seeing.


> If we understand the issue correctly you could just disable it; or is there some reason for having it, i.e. in reality you're using a different perl script that does some real work?

Yes, the prolog script is supposed to do health checking for the nodes on which the job will run, accepting the job or rejecting/requeueing it if one or more nodes prove to be unhealthy.

Because we suspected that the issue was coming from this script, we created a script with just exit(0);, but in reality we want to use a prolog that does real work instead of just exit(0);, and this is basically the main issue. Without a "PrologSlurmctld" (commented out in slurm.conf) we don't see the issue, but with "PrologSlurmctld=<myscript>" we see the issue regardless of the contents of <myscript>.
Comment 80 Marcin Stolarek 2020-03-12 09:41:09 MDT
Adel,

It depends on your setup, but in general I don't think PrologSlurmctld[1] is a good place for a node health check, since it runs on the server where slurmctld runs, not on the compute node. In relation to nodes, it's a place where you can, for instance, reboot nodes into a different OS image.

The more "Slurm way" to check nodes state is HealthCheckProgram[2].

If you're keen on running some checks on computing node before job start, you can check Prolog[3] - in case of its failure the node will be drained and job requeued.

Looking forward to hearing from you on the verification from comment 76.

cheers,
Marcin
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_PrologSlurmctld
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckProgram
[3] https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog
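
For reference, the three hooks side by side as they would appear in slurm.conf (paths are placeholders):
> # Runs periodically on every node via slurmd; suited to ongoing health checks:
> HealthCheckProgram=/path/to/healthcheck.sh
> HealthCheckInterval=300
>
> # Runs on the allocated compute node(s) at job start; a non-zero exit drains
> # the node and requeues the job:
> Prolog=/path/to/node_prolog.sh
>
> # Runs on the slurmctld host for every job (so once per array task) before the
> # job is launched:
> PrologSlurmctld=/path/to/ctld_prolog.sh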
Comment 81 Adel Aly 2020-03-13 05:50:37 MDT
Created attachment 13369 [details]
uptime+slurmctld logs (20200313)

Hi Marcin,

Please find attached the uptime output during running two tests (captured every 30 seconds) along with slurmctld logs (debug5):

1. with prologslurmctld enabled
2. without prologslurmctld enabled

It seems that the load average increases when running the test with prologslurmctld enabled.

> It depends on your setup, but in general I don't think PrologSlurmctld[1] is a good place for a node health check, since it runs on the server where slurmctld runs, not on the compute node.

The PrologSlurmctld is how we want to run a prejob health check from the management server. The prejob health check runs just before the job starts, and the job is rejected/requeued based on the result of the script.

> The more "Slurm way" to check nodes state is HealthCheckProgram[2].

The health check program runs periodically. The management daemon already runs periodic health checks on the cluster. While this can detect unhealthy nodes, it still doesn't address the case where the node becomes unhealthy just before the job is run. With periodic health checks, there is a chance that the job will be scheduled on an unhealthy node within the period of the health check.


> If you're keen on running some checks on computing node before job start, you can check Prolog[3] - in case of its failure the node will be drained and job requeued.

We are already aware of prolog for slurmd and we make use of it in different cases. The idea of using the prologslurmctld is that it runs on the management server and the script can talk to the management daemon and run health checks just before the job is run (in this case it will not wait for health checks to run periodically). 

Please note that the PrologSlurmctld script at the moment does nothing except run "exit(0);".


Some comments regarding the test case which we use to reproduce the issue:

* The use of --wait was only in conjunction with time so we could capture the time taken to complete the process (without needing to calculate within a job or via Slurm). It was also an indicator that things were taking a long time when a prescribed 3-minute task was taking several more minutes to terminate properly.

* The same socket timed out issue was indeed observed without the --wait option passed to sbatch.
Comment 82 Marcin Stolarek 2020-03-13 07:00:39 MDT
Hi Adel,

>The idea of using the prologslurmctld is that it runs on the management server and the script can talk to the management daemon and run health checks just before the job is run (in this case it will not wait for health checks to run periodically).

If this is what you need you just have to be aware that this approach will limit the system throughput. As you mentioned just loading a perl interpreter and exiting is causing significant impact. What is the hardware spec for the server running slurmctld?

Could you please verify if configuring SchedulerParameters with bf_max_job_start=100,sched_max_job_start=100,sched_min_interval=10000000 will allow you to run a test job without "Socket timed out.." errors?

cheers,
Marcin
Comment 83 Adel Aly 2020-03-13 09:01:44 MDT
Hi Marcin,

> If this is what you need you just have to be aware that this approach will limit the system throughput. As you mentioned just loading a perl interpreter and exiting is causing significant impact. What is the hardware spec for the server running slurmctld?

The management server is PowerEdge R630 which has the following hardware specs:

CPUs: 32 cores ( Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz )
Memory: 512GB


A load average of < 4 on a 32 CPU system is not significant as far as I can tell. Does this mean we can't use "PrologSlurmctld" to do anything?


> Could you please verify if configuring SchedulerParameters with bf_max_job_start=100,sched_max_job_start=100,sched_min_interval=10000000 will allow you to run a test job without "Socket timed out.." errors?


In comment #6, we have been advised to use the following SchedulerParameters:

SchedulerParameters=sched_min_interval=1000000,bf_continue,max_rpc_cnt=100

Shall we try using the following and re-run the test?

SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100

Do you have any recommendation on where we could add a script to be run by slurmctld just before the job runs?
Comment 84 Marcin Stolarek 2020-03-13 09:17:14 MDT
Adel,

>A load average of < 4 on a 32 CPU system is not significant as far as I can tell
I concur. Could you please attach 

>SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100
This looks good; the parameters we're adding limit the number of jobs started, to make us 100% sure that the issue comes from the number of concurrent processes started on the controller server.

Could you please share OS logs:
- result of `dmesg -H`
- journalctl --since "1 days ago" (if you're on an older SUSE version, /var/log/messages will be the equivalent of the journalctl output)

>Do you have any recommendation on where we could add a script to be run by slurmctld just before the job runs?
PrologSlurmctld is the option. My point was that it's probably not optimal to run a node health check there, since it actually has to happen on the nodes.
 
cheers,
Marcin
Comment 85 Adel Aly 2020-03-13 10:19:40 MDT
Created attachment 13376 [details]
journal_dmesg

Hi Marcin,

> I concur. Could you please attach 

I think this message is truncated?

> Could you please share OS logs:
> - result of `dmesg -H`
> - journalctl --since "1 days ago" (If you're on older SUSE version 

Attached.


> My point was that it's probably not optimal to run node healthcheck there since it actually has to happen on nodes.

The health checks are run on the target compute nodes eventually. The prolog connects to the management daemon and asks it to run the prejob checks and return the results. The management daemon takes care of instructing the compute nodes to run the health checks and send back the results, and eventually the management daemon returns the result to the prolog.

Regardless of the contents of the PrologSlurmctld script and how efficient the logic inside it is, we still see the issue.
Comment 86 Marcin Stolarek 2020-03-13 11:01:31 MDT
Adel,

Thanks for the logs. Do you remember the time frame when the test was executed? This will make analyzing the journal logs easier.

>The health checks are run on the target compute nodes eventually. The prolog connects to the management daemon and asks it to run the prejob[...]
As I said, if you need it you can use it that way. This is also what I meant by the "Slurm way", as I understand the management daemon is part of Bright CM. My point was that Prolog will just do the thing - run something required on the compute nodes before the job starts. I think it is more efficient than calling an external script (PrologSlurmctld) to contact the management daemon, and that's why I mentioned the other options. 

Nevertheless, I'm looking forward to hearing from you with the test results with updated SchedulerParameters.

cheers,
Marcin
Comment 87 Adel Aly 2020-03-16 02:15:43 MDT
Hi Marcin,

We are about to schedule a downtime to run the test with the new SchedulerParameters:

SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100

Do you have an estimate of how many other parameters we will need to change, so we can better schedule the downtime?

Any other insight you can share with us on the impact of these changes on the issue (or on performance in general) is very much appreciated.

Kind regards,
adel
Comment 88 Marcin Stolarek 2020-03-16 03:19:18 MDT
Adel,

There will be no downtime related to this parameter adjustment. The change doesn't even require a restart of slurmctld, updating a configuration file and calling `scontrol reconfigure` will be enough. In terms of ITIL I'd expect this to be treated as a reaction to the incident - "Socket timed out on[...]" errors affecting users rather than a scheduled change.

I double-checked the parameters you mentioned in comment 84 and noticed that you skipped `sched_max_job_start` compared to my suggestion from comment 82. I'm sorry I missed that the first time I checked. The line I want you to try is:
SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100

The parameters we're setting limit the number of jobs started by the main and backfill schedulers in a single iteration to 100; the minimum interval of the main scheduler is set to 10s, and the backfill interval stays at its default of 30s. This means that the number of jobs started per minute will be limited to ~800.
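
As a quick check of that figure: the main scheduler capped at 100 starts per run and allowed to run at most every 10s gives 6 * 100 = 600 starts per minute, and backfill capped at 100 starts every 30s adds 2 * 100 = 200, for an upper bound of roughly 800 job starts per minute.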

The goal here is to verify whether limiting the frequency of job starts eliminates the issue. It's a way of confirming that the cause is OS-level resource exhaustion. 

I didn't directly ask before, but to be sure: did you notice "Socket timed out" errors while doing the test you shared in comment 81?
Repeating from the previous message: Do you remember the time frame when the test was executed?

cheers,
Marcin
Comment 89 Adel Aly 2020-03-16 03:59:23 MDT
Hi Marcin,

> There will be no downtime related to this parameter adjustment. The change doesn't even require a restart of slurmctld, updating a configuration file and calling `scontrol reconfigure` will be enough. 

We need to notify users because we will do the test after enabling PrologSlurmctld. So regardless of whether we need to restart slurmctld or just run a reconfigure, there is a chance that users will be affected.


> I double-checked the parameters you mentioned in comment 84 and noticed that you skipped `sched_max_job_start` compared to my suggestion from comment 82. I'm sorry I missed that the first time I checked. The line I want you to try is:
SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100

Ok

> The goal here is to verify whether limiting the frequency of job starts eliminates the issue. It's a way of confirming that the cause is OS-level resource exhaustion. 

As far as I can tell, the management node has plenty of resources and they don't get exhausted when we run a 4000-task job array. The load average increases, but it's less than 4 on a 32 CPU system.

> I didn't directly ask before, but to be sure: did you notice "Socket timed out" errors while doing the test you shared in comment 81?

Whenever we enable PrologSlurmctld we see the issue. If any suggested change or test makes the issue disappear, we will definitely notify you.


> Repeating from the previous message: Do you remember the time frame when the test was executed?

Please check the uptime test file; it has the times during which we ran the tests.
Comment 90 Adel Aly 2020-03-18 14:31:44 MDT
Created attachment 13418 [details]
slurmctld log file (20200318)

Hi Marcin,

We have done some more tests and we have the following observations:

1. The suggested scheduler parameters don't resolve the issue; instead, they just slow down its occurrence and affect the performance and usability of the cluster.

2. With the suggested scheduler parameters, the test case takes > 9 minutes to complete whether PrologSlurmctld is enabled or not (while the actual run time should be 3-4 minutes).

3. The cluster has around 300 compute nodes and when we use the suggested scheduler parameters only 100 jobs get scheduled at a time (every scheduling cycle) and the rest of the jobs stay in a pending state (Reason=None) while there is a large number of idle nodes.

4. Increasing the value of "sched_max_job_start" accelerates the rate of the occurrence of the issue.

We have another observation which seems to be consistent and related to the issue we are seeing. Running the same test case right after restarting the slurmctld, with prologslurmctld enabled and no scheduler parameters, gives a runtime nearly identical to that of the same test case with prologslurmctld disabled. The runtime then increases gradually with each successive run until we see the issue. Once we restart slurmctld again, the same pattern repeats.

Attached are the slurmctld logs (with debug5 enabled) from today.
Comment 91 Adel Aly 2020-03-18 14:42:45 MDT
Created attachment 13419 [details]
slurmctld log file (20200318)
Comment 92 Marcin Stolarek 2020-03-19 10:28:22 MDT
Adel,

I'm reviewing the logs and trying to reproduce the issue on a similar (though much smaller) hardware setup.

In the meantime, could you please verify that:
- slurmctld is not limited to a subset of resources on the server? I mean, it's not running in a cpuset and wasn't started with 'taskset'? For instance, if you add a CPU-ID column and check slurmctld activity under htop, is it bouncing between all cores? (A quick check is sketched right after this list.)
- you're also experiencing the issue if you run the test while the Bright management daemon is stopped?
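
A quick way to check the first point (a sketch; it assumes the util-linux taskset tool is present on the management node):

taskset -cp $(pidof slurmctld)                          # prints the CPU affinity list of the running slurmctld
grep Cpus_allowed_list /proc/$(pidof slurmctld)/status  # the same information straight from /proc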


>they just delay its occurrence and degrade the performance and usability of the cluster.
Do you mean that the issue occurs "later"? Do you "feel" like it's related to the number of jobs running or ending?

>accelerates the occurrence of the issue.
The slowdown was expected, as explained in comment 88.

>Running the same test case right after restarting slurmctld, with prologslurmctld enabled
I'll look into that; this is really interesting. Are you still on slurm-18.08.7? I'm asking because there were changes in the related code and I want to be sure I'm testing exactly what you have.

cheers,
Marcin
Comment 93 Adel Aly 2020-03-19 10:51:22 MDT
Hi Marcin,

> - slurmctld is not limited to a subset of resources on the server? I mean, it's not running in a cpuset and wasn't started with 'taskset'? For instance, if you add a CPU-ID column and check slurmctld activity under htop, is it bouncing between all cores?

This is not the case as far as I can tell, but I will verify.

> - you're also experiencing the issue if you run the test while the Bright management daemon is stopped?

We can certainly try this, but:

1. the issue doesn't happen when prologslurmctld is commented out in slurm.conf, regardless of whether the Bright management daemon is running or not, and
2. the issue happens even with an empty prologslurmctld script (exit(0)).

So I'm wondering: what did you observe in the logs that made you think the management daemon is contributing to this issue?


> Do you mean that the issue occurs "later"? 

Yes, it occurs after several iterations, and we have even observed it when running a smaller job array (1500 tasks).

> Do you "feel" like it's related to a number of jobs running or ending?

This is hard to say because:

1. we don't see the issue with prologslurmctld commented out in slurm.conf, and
2. we don't see the issue right after slurmctld is restarted with prologslurmctld enabled, but
3. the issue does happen with prologslurmctld enabled after running several iterations of the same test case.

> Are you still on slurm-18.08.7? 

Yes, exactly.
Comment 94 Adel Aly 2020-03-19 15:19:11 MDT
Hi Marcin,

We had a chance to access a similar cluster and we were able to reproduce the issue consistently using the same test case.

We have done the following:

1. Run the test with a full prologslurmctld -- issue is reproducible.
2. Run the test with an empty prologslurmctld (exit(0)) -- issue is reproducible.
3. Run the test with prologslurmctld commented out in slurm.conf -- issue doesn't happen at all.
4. Stop the Bright management daemon and run the test with an empty prologslurmctld (exit(0)) -- issue is reproducible.
Comment 95 Marcin Stolarek 2020-03-20 05:42:36 MDT
Adel,

Looking at the logs, I've found that at a certain point the execution of PrologSlurmctld starts to fail with an exit code I'm trying to understand now. The log looks like:
>error: prolog_slurmctld JobId=XXX prolog exit status 99:0

Multiple failures of PrologSlurmctld execution can lead to slurmctld performance issues, since handling such a failure requires a write lock on jobs and nodes. I can't yet explain exit code 99; interpreted as a Linux errno it is:
>#define EADDRNOTAVAIL   99      /* Cannot assign requested address */
and off the top of my head it comes, for instance, from a bind() call when an application is unable to bind to a specified network address.

The code in job_scheduler.c:_run_prolog is basically:
-> fork
  -> close fds 0-1023
  -> execve
-> waitpid 

Is your PrologSlurmctld located on an NFS filesystem? If yes, can you verify whether moving it to a local filesystem fixes the issue?

Could you please apply a patch like the one below, rebuild slurmctld, and verify whether this fixes or at least "slows down" the issue?
--- a/src/slurmctld/job_scheduler.c
+++ b/src/slurmctld/job_scheduler.c
@@ -4519,7 +4519,7 @@ static void *_run_prolog(void *arg)
                goto fini;
        }
        if (cpid == 0) {
-               for (i = 0; i < 1024; i++)
+               for (i = 0; i < 4096; i++)
                        (void) close(i);
                setpgid(0, 0);
                execve(argv[0], argv, my_env);

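For reference, applying it on a source build would look roughly like this (a sketch only: the patch file name is made up, the configure options are site-specific, and RPM-based installations would rebuild their packages instead):

cd slurm-18.08.7                        # unpacked source tree matching the running version
patch -p1 < close-more-fds.patch        # hypothetical file name holding the diff above
./configure --prefix=/usr && make -j    # use your site's usual configure options
make install                            # then restart slurmctld
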
cheers,
Marcin
Comment 96 Adel Aly 2020-03-20 06:24:16 MDT
Hi Marcin,

> Looking at the logs, I've found that at a certain point the execution of PrologSlurmctld starts to fail with an exit code I'm trying to understand now. The log looks like:
>error: prolog_slurmctld JobId=XXX prolog exit status 99:0

This error code is from the full prolog. Again, the issue also happens with an empty PrologSlurmctld script that contains just exit(0).

Please ignore this error. The issue happens with an exit(0) script or with a full script (which sometimes produces this 99 code because of a failed test), and it's reproducible on two different clusters with an empty prolog script (exit(0);).


> Multiple failures of PrologSlurmctld execution can lead to slurmctld performance issues, since handling such a failure requires a write lock on jobs and nodes.

I don't agree with this statement, because we can reproduce the issue with a script that contains nothing except exit(0);, and I don't think exit(0) is too much to run, nor is it a failure.

> I can't yet explain exit code 99; interpreted as a Linux errno it is:

This error comes from the full prolog script. Please ignore it; there are other tests we have run with the exit(0); PrologSlurmctld script. The issue happens with an empty prolog that contains nothing except exit(0).

> Is your PrologSlurmctld located on an NFS filesystem? If yes, can you verify whether moving it to a local filesystem fixes the issue?

We have done the following tests:

1. full script on shared storage
2. full script on local (non-shared) storage
3. empty script (exit(0);) on shared storage
4. empty script (exit(0);) on local (non-shared) storage
5. the same tests on two different clusters
6. running the test with the Bright management daemon (cmd) stopped and an empty script (exit(0);)

In all of these cases the issue is reproducible. The PrologSlurmctld script that produces the issue on both clusters, whether it sits on shared or local storage, looks like the following:

#!/usr/bin/perl

exit(0);


> Could you please apply a patch like the one below, rebuild slurmctld, and verify whether this fixes or at least "slows down" the issue?

Unfortunately, we can't test this in a production environment.

I will check whether we can get access to a lab to do this test, but if you can explain why you think this change will fix or "slow down" the issue, we may be able to ask for downtime to patch the Slurm code, rebuild and install the binary, and re-run the test.
Comment 97 Marcin Stolarek 2020-03-20 08:07:33 MDT
>verify whether this fixes or at least "slows down" the issue?

>I will check whether we can get access to a lab to do this test, but if you can explain why you think this change will fix or "slow down" the issue, we may be able to ask for downtime to patch the Slurm code, rebuild and install the binary, and re-run the test.

In comment 93 you noted that the issue happens after a few iterations; by 'fix or "slow down"' I mean that the patch may eliminate the issue or require more "iterations" before it happens. Closing 1024 file descriptors before execve is based on a number assumed to be reasonable; if for any reason slurmctld had more files open, some of them would not be closed before the process image substitution. In a regular application I wouldn't expect this to lead to issues, but in the context of a fork from a pthread it may be trickier.

Didn't you replicate the issue in a test environment?

The way I'm trying to reproduce it is:
sbatch -a 1-1000 --wrap="sleep 10"
with PrologSlurmctld being a Perl script doing exit(0) on slurm-18.08.7. I'm able to execute this sbatch a number of times without noticing connection timeout errors, with MessageTimeout=100.
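
Roughly, my repeated attempts look like the loop below (a sketch; the iteration count and pauses are arbitrary choices):

for i in $(seq 1 20); do
    sbatch --array=1-1000 --wrap="sleep 10"
    # crude check for the reported symptom: a failing squeue points at an RPC timeout
    squeue -h > /dev/null || echo "squeue failed on iteration $i (possible 'Socket timed out')"
    sleep 30
done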

cheers,
Marcin
Comment 98 Adel Aly 2020-03-20 08:39:32 MDT
Hi Marcin,

> In comment 93 you noted that the issue happens after a few iterations; by 'fix or "slow down"' I mean that the patch may eliminate the issue or require more "iterations" before it happens. Closing 1024 file descriptors before execve is based on a number assumed to be reasonable; if for any reason slurmctld had more files open, some of them would not be closed before the process image substitution. In a regular application I wouldn't expect this to lead to issues, but in the context of a fork from a pthread it may be trickier.

> Didn't you replicate the issue in a test environment?

Yes, we are able to reproduce this issue on a test system, and there the issue happens immediately, so no iterations are needed to reproduce it.

> The way I'm trying to reproduce it is:
> sbatch -a 1-1000 --wrap="sleep 10"
> with PrologSlurmctld being a Perl script doing exit(0) on slurm-18.08.7. I'm able to execute this sbatch a number of times without noticing connection timeout errors, with MessageTimeout=100.

We can easily and consistently reproduce the issue. Also, you're using an array of size 1000, while we are using an array of size 4000.

Depending on how large your system is, you may or may not see the issue.

The test systems have ~300 compute nodes, each with 32 cores, and multiple jobs per node are allowed.

You can probably reproduce the issue on a 100-node cluster (32 cores each). 

If you're trying to reproduce the issue on a smaller cluster, you may not see it.
Comment 100 Marcin Stolarek 2020-03-24 09:21:00 MDT
While waiting for the test results, I realized that you may not be aware of our OS-level tuning recommendations for high-throughput computing[1]. I suspected this because you have only one SlurmctldPort in slurm.conf, while one of the recommendations is to use a port range.

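For illustration, the port-range recommendation is a single slurm.conf line, for example (the exact range is your choice; it has to be reachable from the nodes and consistent across the cluster):

SlurmctldPort=6814-6817    # four listening ports instead of one
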
Can you also attach your slurm.conf? Since the ticket is quite long and we suggested a number of changes I'd like to be sure that we're on the same page.

cheers,
Marcin
[1]https://slurm.schedmd.com/high_throughput.html
Comment 101 Adel Aly 2020-03-24 09:42:54 MDT
Hi Marcin,

> While waiting for the test results

We were waiting for a reply from your side about what the next steps would be.

What test results are required?

> I realized that you may not be aware of our OS-level tuning recommendations for high-throughput computing[1]. I suspected this because you have only one SlurmctldPort in slurm.conf, while one of the recommendations is to use a port range.

No, we weren't aware of those parameters. Sure, we can try to increase the port range and run more tests, but we are wondering: if the issue comes from a lack of slurmctld ports or some low-level OS parameters, wouldn't it affect slurmctld in all cases rather than manifesting only when we enable prologslurmctld?

> Can you also attach your slurm.conf? Since the ticket is quite long and we suggested a number of changes I'd like to be sure that we're on the same page.

Sure, we can attach the slurm.conf, but the only change since last time is that we added the following line you suggested:

SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100
Comment 102 Marcin Stolarek 2020-03-25 03:04:39 MDT
Adel,

I really appreciate your responsiveness, although a substantial part of it is discussion and questions about the reasoning. I fully understand that, but for now I need you to come along with me; once we have a resolution I can prepare a detailed root-cause analysis. Is that OK?

My current understanding is that we're experiencing a noticeable performance issue with PrologSlurmctld configured. This option is in use by customers of ours with 1000+ node clusters and we haven't noticed similar issues on their side, nor can I fully reproduce this on my test environment with ~4000 virtual cores. This points us toward general optimization, but I'm also checking the code for a potential bug.

I'll try to gather the things we have to do below.

1) Please share your current slurm.conf. I'd expect there to be more modifications than just the SchedulerParameters I suggested, since there was other advice earlier in the ticket, such as:
comment 27) LaunchParameters=slurmstepd_memlock (you mentioned you use -M for slurmd, which is also fine)
comment 30) remove FastSchedule
2) Did you increase the number of munge threads as suggested in comment 70?
3) Can you go through the HTC recommendations[1]? I'm mostly focused on:
  -) The System configuration section, with parameters like (a sysctl sketch follows after this list):
    -] /proc/sys/fs/file-max set to 32832
    -] /proc/sys/net/ipv4/tcp_syncookies enabled (set to 1)
    -] /proc/sys/net/ipv4/tcp_max_syn_backlog set to 2048
    -] /proc/sys/net/core/somaxconn set to 1024
    -] /proc/sys/net/ipv4/ip_local_port_range set to "32768 65535"
  -) What are the current ulimits set for slurmctld? Those should be set to appropriately high values, especially for memory, open files, and stack.
  -) For Slurm parameters:
    -] Set SlurmctldPort to a range of 4 ports.
    -] Check that SlurmdLogFile and SlurmctldLogFile are on a local filesystem.
    -] Add bf_yield_interval=1000000 and bf_interval=300 to SchedulerParameters.
    -] Switch to the task/affinity TaskPlugin.
4) Can you verify whether you can reproduce the issue with a simple /bin/sh script instead of Perl?
5) Can you apply the patch from comment 95 and check whether you see any difference?
--- a/src/slurmctld/job_scheduler.c
+++ b/src/slurmctld/job_scheduler.c
@@ -4519,7 +4519,7 @@ static void *_run_prolog(void *arg)
                goto fini;
        }
        if (cpid == 0) {
-               for (i = 0; i < 1024; i++)
+               for (i = 0; i < 4096; i++)
                        (void) close(i);
                setpgid(0, 0);
                execve(argv[0], argv, my_env);

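For reference, the system settings from point 3 can be applied with sysctl like this (a sketch; the keys mirror the /proc/sys paths above, and they should be persisted via /etc/sysctl.conf or a sysctl.d drop-in):

sysctl -w fs.file-max=32832
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=2048
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.ip_local_port_range="32768 65535"
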
Could you please use perf to record and gather stats from slurmctld during the test execution with commands like:
> perf record -s --call-graph dwarf -p `pidof slurmctld`
<keep it running from job submission until the end of the test, then Ctrl+C>
Gather the data and send it back:
> perf archive perf.data


cheers,
Marcin

[1] https://slurm.schedmd.com/high_throughput.html
Comment 103 Marcin Stolarek 2020-03-25 10:21:32 MDT
PS. I just realized that the info about gathering the performance data may be misleading. The command:
> perf archive perf.data
will generate a compressed archive (perf.data.tar.bz2), but I'll need both that archive and the original perf.data file.
Comment 104 Adel Aly 2020-03-25 13:12:46 MDT
Hi Marcin,

> 1) Please share your current slurm.conf. 

I will attach it shortly.

> 2) Did you increase the number of munge threads as suggested in comment 70?

Yes.

> 3) Can you go through the HTC recommendations[1]? I'm mostly focused on:

[root@zenith81 ~]# cat /proc/sys/fs/file-max
32832
[root@zenith81 ~]# cat /proc/sys/net/ipv4/tcp_syncookies
1
[root@zenith81 ~]# cat /proc/sys/net/ipv4/tcp_max_syn_backlog
2048
[root@zenith81 ~]# cat /proc/sys/net/core/somaxconn
1024
[root@zenith81 ~]# cat /proc/sys/net/ipv4/ip_local_port_range
32768   65535
[root@zenith81 ~]# su - slurm
Last login: ma okt 22 10:46:14 CDT 2018 on pts/0
-bash-4.2$ ulimit
unlimited
-bash-4.2$ ulimit  -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030039
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

* Log files, binaries, and configuration files are on a local filesystem.
* Added the extra scheduling parameters.
* Changed task/cgroup to task/affinity.

[root@zenith81 ~]# netstat -tapun | grep slurmctld
tcp        0      0 0.0.0.0:6814            0.0.0.0:*               LISTEN      48994/slurmctld
tcp        0      0 0.0.0.0:6815            0.0.0.0:*               LISTEN      48994/slurmctld
tcp        0      0 0.0.0.0:6816            0.0.0.0:*               LISTEN      48994/slurmctld
tcp        0      0 0.0.0.0:6817            0.0.0.0:*               LISTEN      48994/slurmctld

4) Yes, the issue is reproducible with a /bin/sh script that only calls exit 0 (see the sketch below).
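
For completeness, that shell prolog is essentially just the following (the exact file is not attached, but it amounts to):

#!/bin/sh
exit 0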

These parameters make the issue happen more slowly (and of course they slow down the whole scheduling process and leave resources sitting idle):

SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100,bf_yield_interval=1000000,bf_interval=300

These parameters make the issue appear quickly (but they also slow down the scheduling process):

SchedulerParameters=bf_max_job_start=500,sched_min_interval=10000000,bf_continue,max_rpc_cnt=500,sched_max_job_start=500,bf_yield_interval=1000000,bf_interval=300

5) We will try to find a downtime slot to patch the Slurm code and build a new binary to run the tests.

I will attach perf data with logs from several runs shortly.
Comment 105 Adel Aly 2020-03-26 02:14:14 MDT
Created attachment 13485 [details]
perf data
Comment 106 Adel Aly 2020-03-26 03:27:03 MDT
Created attachment 13486 [details]
slurmctld log file (20200325)
Comment 107 Marcin Stolarek 2020-03-26 04:01:34 MDT
Adel,

> 1) Please share your current slurm.conf. 

> 2) Did you increase the number of munge threads as suggested in comment 70?

Could you please share the result of 'ps aux | grep munge', just to be 100% sure?

> 3) Can you go through the HTC recommendations[1]? I'm mostly focused on:
>ulimit -a
Can you check the result of `cat /proc/$(pidof slurmctld)/limits`? On modern systems the user defaults are sometimes overridden by systemd, etc.


>5) We will try to find a downtime slot to patch the Slurm code and build a new binary to run the tests.
Can you do that on a test system where you reproduced the issue?

6)
> perf archive perf.data
>will generate a compressed archive (perf.data.tar.bz2), but I'll need both that archive and the original perf.data file.
Thanks for the perf.data.bz2; I'll also need the original perf.data. It may be misleading, but the .bz2 file created by perf archive is not a compressed version of perf.data.


7) Could you please add a sleep(2) before exit(0) in the prolog script and check whether this changes the behavior?

cheers,
Marcin
Comment 108 Marcin Stolarek 2020-03-26 04:16:16 MDT
The questions from comment 107 are still open.

(In reply to Adel Aly from comment #106)
> Created attachment 13486 [details]
> slurmctld log file (20200325)

Looking into this log file, I've noticed that it's Slurm 17.11.2...
$ grep -i 'slurmctld version' ./slurmctld
[2020-03-24T13:21:53.945] slurmctld version 17.11.2 started on cluster slurm_cluster
[2020-03-25T11:32:51.245] slurmctld version 17.11.2 started on cluster slurm_cluster

This is an old, unsupported version; is it the right file?

cheers,
Marcin
Comment 109 Adel Aly 2020-03-26 04:35:27 MDT
Created attachment 13487 [details]
slurm.conf latest


daemon   136522  0.2  0.0 1146776 6416 ?        Sl   Mar12  41:38 /usr/sbin/munged --force --num-threads=16


Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2063504              2063504              processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2063504              2063504              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

> Can you do that on a test system where you reproduced the issue?

Unfortunately, the test system is a shared resource, so if we change the binary we need to coordinate with everyone involved.

> Thanks for the perf.data.bz2; I'll also need the original perf.data. It may be misleading, but the .bz2 file created by perf archive is not a compressed version of perf.data.

I created the tarball myself using tar, not perf archive, so it contains the files generated by the perf command.

> 7) Could you please add a sleep(2) before exit(0) in the prologue script and check if this will change the behavior?

In one of the previous attempts we tried sleep(10), among other things, and the result was the same.

> Looking into this log file I've noticed that it's slurm 17.11.2...

The test system is running Slurm 17.11.2 and the production system is running 18.08.7. The issue is the same on both systems, so it is not specific to the Slurm version. Historically, we upgraded Slurm on the production system to 18.08.7 hoping the issue would go away, but it is exactly the same.

Unfortunately, we can't run all the tests on the production system, so we have to use the test system for some of them. The last test round was done on the test system.
Comment 110 Marcin Stolarek 2020-03-26 04:57:20 MDT
>In one of the previous attempts, we tried sleep(10) and other things and the result is the same.
Is that somewhere in the ticket?

Can you call:
perf archive [file]
on one of the perf.data files you sent and attach the .bz2 created by that command?

cheers,
Marcin
Comment 111 Adel Aly 2020-03-27 02:38:27 MDT
Created attachment 13498 [details]
perf archive 100 exit0
Comment 112 Adel Aly 2020-03-27 02:43:48 MDT
Created attachment 13499 [details]
perf archive 500 exit0
Comment 113 Marcin Stolarek 2020-03-30 08:07:14 MDT
Created attachment 13527 [details]
FlameGraph - perf.100 case

Adel,

I analyzed the perf results (you can find a flame graph attached). Interpreting them, we know that all the CPUs in the machine were engaged (perf reports CPU/IO overload), while the graph shows that nearly all the time is spent in kernel code handling forks from slurmctld that load the script and its interpreter, finally spending a lot of time in "exit_mmap", which is the expected end of the "execve" call from slurmctld to the PrologSlurmctld program.

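For reference, a graph like the attached one can be produced from the recorded perf.data roughly as follows (a sketch; it assumes the stackcollapse-perf.pl and flamegraph.pl scripts from the FlameGraph project are available in the working directory):

perf script -i perf.data > out.stacks
./stackcollapse-perf.pl out.stacks | ./flamegraph.pl > slurmctld.svg
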
The main thread of the Slurm scheduling process is also mostly working on the creation of forks; ~33% of all frames are in __GI___fork.

The thread responsible for handling RPCs from clients (rpcmgr) accounts for ~0.6% of the recorded frames. I think it's starved by other threads. Could you please check whether a patch like the one below improves the situation:
diff --git a/src/slurmctld/job_scheduler.c b/src/slurmctld/job_scheduler.c
index e3f1245dc5..ae171954aa 100644
--- a/src/slurmctld/job_scheduler.c
+++ b/src/slurmctld/job_scheduler.c
@@ -4522,6 +4522,7 @@ static void *_run_prolog(void *arg)
                for (i = 0; i < 1024; i++)
                        (void) close(i);
                setpgid(0, 0);
+               nice(10);
                execve(argv[0], argv, my_env);
                exit(127);
        }


For further slurmctld tuning I'd recommend adding "defer" to SchedulerParameters; we should focus on the sched_max_job_start=100 case for now.

The case with max_rpc_cnt=500,sched_max_job_start=500 is difficult to interpret, since 8k of the 36k events collected by perf are out of order. I usually see this when the system is under very heavy load. My interpretation of its flame graph is that a number of processes are stuck in the middle of fork/pthread creation.

The way we create the thread for PrologSlurmctld changed in 19.05 (it is now a joinable thread instead of a detached one). It may be worth verifying whether this has an impact on the issue you see (we're close to 18.08 end-of-life, since 20.02.1 was released last week).

Am I correct that both perf records were collected in the lab environment? What hardware resources do you have there, and is the slurm.conf fully aligned (especially MessageTimeout)?

Questions from previous comments:
- Do we have any logs/info registered in the ticket that were collected with sleep(10) (or any other non-zero value) in PrologSlurmctld before the exit?
- Were you able to test the patch I shared before?

cheers,
Marcin
Comment 114 Marcin Stolarek 2020-04-15 04:01:34 MDT
Adel,

Do you have any updates? Could you please answer the questions from comment 113?

cheers,
Marcin
Comment 115 Adel Aly 2020-04-15 05:02:26 MDT
> Am I correct that both perf records were collected in lab environment? 

Yes,

> What is the hardware resources you have there 

more than 250 compute nodes with 32 cores each.

> is the slurm.conf fully aligned (especially MessageTimeout)?

Yes,

> - Do we have any logs/info registered in the ticket that were collected with sleep(10) (or any other non-zero value) in PrologSlurmctld before the exit?

Yes. It will be hard to find which slurmctld log file includes this test, but I'm sure it's in one of the attached log files.

A test with the full prolog takes more than zero seconds to complete anyway, so if you just need logs where PrologSlurmctld does more than exit(0), you can look at any of the log files.

> - Were you able to test the patch I shared before?

No, not yet. We need to schedule downtime to be able to update or patch Slurm.
Comment 116 Marcin Stolarek 2020-04-15 05:33:26 MDT
Just to double-check: did you add the "defer" option to SchedulerParameters as recommended in comment 113? Did you check whether this improves slurmctld responsiveness? Are you able to reproduce the issue with:

SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100,bf_yield_interval=1000000,bf_interval=300,defer 

If yes, can you try to further decrease sched_max_job_start to find a value that makes slurmctld stable? I'm not suggesting limiting controller throughput as a solution, but it would be interesting to see logs/perf records from the process "at the edge".

>more than 250 compute nodes with 32 cores each.
Does that mean the node running slurmctld has 32 cores as well? From the perf results, it looks like that system is overloaded.

>No, not yet. We need to schedule downtime to be able to update or patch slurm.
Do you plan to upgrade soon? Do you have an ETA for this activity?

cheers,
Marcin
Comment 117 Adel Aly 2020-04-15 06:26:48 MDT
> Just to double-check: did you add the "defer" option to SchedulerParameters as recommended in comment 113? Did you check whether this improves slurmctld responsiveness?

No, we didn't have a chance to run further tests.

> Are you able to reproduce the issue with: SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100,bf_yield_interval=1000000,bf_interval=300,defer 

We haven't tried to add "defer" and run tests.

> If yes, can you try to further decrease sched_max_job_start to find a value that makes slurmctld stable? I'm not suggesting limiting controller throughput as a solution, but it would be interesting to see logs/perf records from the process "at the edge".

In order to run tests, we need to schedule time slots on the test cluster. As it takes significant time to run the tests and collect the logs, we need to make sure the tests are worth the effort. Can you assure us these tests are not just idle curiosity?

> Does that mean the node running slurmctld has 32 cores as well? From the perf results, it looks like that system is overloaded.

Yes, all the nodes have 32 cores. The system is responsive and we don't see any overload. 4 or even 10 load average on a 32 core server is not a significant load.


> Do you plan to upgrade soon? Do you have ETA for this activity?

We don't have a plan to upgrade at the moment, unfortunately.
Comment 118 Marcin Stolarek 2020-04-16 05:11:44 MDT
> Can you assure us these tests are not just idle curiosity?

A few comments above you can find an analysis of the perf record results. The way I read it, the slurmctld thread responsible for handling RPCs coming from user commands is starved, so those RPCs don't get a reply within MessageTimeout, which ends up producing the "Socket timed out on ..." error message.

Quoting slurm.conf manual:
defer
    Setting this option will avoid attempting to schedule each job individually at job submit time, but defer it until a later time when scheduling multiple jobs simultaneously may be possible. This option may improve system responsiveness when large numbers of jobs (many hundreds) are submitted at the same time, but it will delay the initiation time of individual jobs. Also see default_queue_depth above. 

While looking at Bug 8736 (is it the same environment?) I understood that StateSave is on NFS, but you're not running an HA setup. Can you move StateSave to a local filesystem? This should improve Slurm performance as well. Looking at the flame graph, ~50% of the time is also spent opening the PrologSlurmctld script and checking its format (reading the shebang); is that on the same NFS?
My recommendation would be to keep both StateSave and PrologSlurmctld on a local filesystem.
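
Expressed as slurm.conf lines, the StateSave move and the "defer" suggestion would look roughly like this (the StateSaveLocation path is only an example; the SchedulerParameters line is your existing one with defer appended):

StateSaveLocation=/var/spool/slurmctld/state    # local disk instead of NFS
SchedulerParameters=bf_max_job_start=100,sched_min_interval=10000000,bf_continue,max_rpc_cnt=100,sched_max_job_start=100,bf_yield_interval=1000000,bf_interval=300,defer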

The reasoning behind the patch adding "nice" is to decrease the priority of the slurmctld fork responsible for calling the PrologSlurmctld script, so that the operating system gives more resources to the thread handling connections from users.

> 4 or even 10 load average on a 32 core server is not a significant load.
I agree. The "load"/"overload" I'm talking about in comment 113 is an interpretation of the fact that a lot of perf frames are out of order and that kernel-level calls which should complete immediately (internals of process creation) take visible time there. This may be something outside of the standard "system load"; for instance, the creation of new processes may make the OS scheduler starve running processes.

I understand that some of these things were tried before. However, the only systematic way I see to resolve the issue is to remove the performance bottlenecks we're experiencing now and then check what the next limiting factor is.

>We don't have a plan to upgrade at the moment, unfortunately.
I'm not forcing you to do that, since I don't have any evidence that it would fix your issue. At the same time, I have to remind you that staying on 18.08 may limit your support options in the future, since 20.02.1 has been released.

cheers,
Marcin
Comment 125 Adel Aly 2020-04-30 05:08:50 MDT
Hi Marcin,

Please download the collected logs and perf data from the following link:

https://support2.brightcomputing.com/adel/schedmd-7946/collected-logs.tar.bz2
Comment 126 Marcin Stolarek 2020-05-06 02:54:00 MDT
Created attachment 14130 [details]
Docs - add note about Prolog/EpilogSlurmctld to high_throughput.shtml

Adel,

After the analysis of backtraces and perf records, our conclusion is that neither PrologSlurmctld nor EpilogSlurmctld should be used in a high-throughput environment.

In terms of architecture, running a slurmctld-side prolog or epilog script requires a separate thread acquiring a job write lock for every job or array task started. If you have enough resources to start thousands of jobs at a time, this results in the creation of thousands of threads competing for that lock with other operations, like job listing. We see room for improvement in this area; however, it will require a major change in the Slurm architecture, which cannot happen in a released version.

For your specific case of PrologSlurmctld being used to trigger node checking, we recommend using Prolog (slurmd side) to execute those checks and adding nohold_on_prolog_fail to SchedulerParameters. In case of a prolog failure the node gets drained (no new jobs will be scheduled on it) and the job is requeued so it can start on other nodes (assuming neither --no-requeue was specified nor JobRequeue=0 is set in slurm.conf). This way you can verify that the node is healthy just before job execution, spread the load (no need to run a script for every job on the same server), and avoid the limitation described above. A configuration sketch follows below.
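
A minimal sketch of that setup (the script path is an example only; your node checks go into the slurmd-side script):

Prolog=/etc/slurm/healthcheck_prolog.sh      # example path; runs on the compute node before each job
SchedulerParameters=nohold_on_prolog_fail    # append this to your existing SchedulerParameters list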

We were unable to reproduce the issue because on our side we were running slurmctld on a system with far fewer CPU resources. In that case the OS scheduler wasn't starting all "prolog" threads at the same time, which limited the number of threads waiting on locks and, in effect, limited scheduler throughput (the same effect can be achieved with the SchedulerParameters we played with, but that is really not a high-throughput solution).

I'm attaching a patch to our high-throughput documentation. I'll let you know when it gets reviewed.

Feel free to ask additional questions.

cheers,
Marcin
Comment 129 Marcin Stolarek 2020-05-08 01:33:01 MDT
The throughput limitation caused by PrologSlurmctld and EpilogSlurmctld is now documented[1] and will be published in SchedMD's online high-throughput documentation with the next release.

Is there anything I can clarify, or do you agree that we close the bug report as "information given"?

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/8351e2a644f9faae4816742be0078a19803ac4c5
Comment 130 Adel Aly 2020-05-08 01:53:52 MDT
Hi Marcin,

Thank you for the update and for the efforts in debugging this issue.

As far as I understand from the responses, the issue is not going to be addressed in the currently released versions and may be improved in upcoming releases, so as a rule of thumb we have to stop using the PrologSlurmctld script in high-throughput setups that submit a large number of jobs simultaneously.

We have already started using the Prolog as a workaround to avoid using PrologSlurmctld.

I will double-check with all involved partners and I will mark the ticket as closed.

Kind regards,
adel
Comment 131 Marcin Stolarek 2020-05-08 02:07:22 MDT
>[..] so as a rule of thumb we have to stop using the PrologSlurmctld script in high-throughput setups [...]
Correct, that's our recommendation.

>I will double-check with all involved partners [...]
Sure.

cheers,
Marcin
Comment 132 Marcin Stolarek 2020-05-14 02:56:43 MDT
Adel,

Were you able to check whether any of the involved partners have remaining questions?

cheers,
Marcin
Comment 133 Adel Aly 2020-05-14 03:42:18 MDT
Hi Marcin,

We haven't heard back from all partners yet. I would suggest that we close the ticket as "won't fix"; we will open a different ticket if something else pops up.

Thanks for all the help.
Comment 134 Marcin Stolarek 2020-05-14 05:38:19 MDT
I'm closing the case now. Should you have any questions please reopen.

The detailed answer is in comment 126; a short summary is that we don't recommend PrologSlurmctld in a high-throughput environment. This is now documented[1].

cheers,
Marcin


[1]https://github.com/SchedMD/slurm/commit/8351e2a644f9faae4816742be0078a19803ac4c5