Created attachment 4933 [details]
Last 1000 lines of slurmctld.log

We are seeing extremely slow to no responsiveness from slurmctld:

sdb:/var/spool/slurm/log # scontrol show config
slurm_load_ctl_conf error: Socket timed out on send/recv operation
sdb:/var/spool/slurm/log #

The last squeue that completed showed 2500+ jobs queued. sinfo returns on a fairly regular basis, but it shows a significant number of nodes as idle:

sdb:/cray_home/crayadm # sinfo
PARTITION  AVAIL JOB_SIZE TIMELIMIT  CPUS S:C:T  NODES STATE      NODELIST
standard*  up    1-infini 1-00:00:00 64   2:16:2    80 idle       nid00[012-047,076-111,140-147]
ldms       up    1-infini 1-00:00:00 64   2:16:2     4 idle       nid00[160-163]
sierra     up    1-infini 1-00:00:00 64   2:16:2    16 idle       nid00[164-179]
knl        up    1-infini 1-00:00:00 272  1:68:4     1 completing nid00249
knl        up    1-infini 1-00:00:00 272  1:68:4    16 allocated  nid00[256-271]
knl        up    1-infini 1-00:00:00 272  1:68:4    63 idle       nid00[192-248,250-255]
sierra_kn  up    1-infini 1-00:00:00 272  1:68:4    19 allocated  nid00[272-286,288-291]
sierra_kn  up    1-infini 1-00:00:00 272  1:68:4     1 idle       nid00287
all        up    1-infini 1-00:00:00 272  1:68:4     1 completing nid00249
all        up    1-infini 1-00:00:00 272  1:68:4    35 allocated  nid00[256-286,288-291]
all        up    1-infini 1-00:00:00 64+  1+:16+   164 idle       nid00[012-047,076-111,140-147,160-179,192-248,250-255,287]
ccm_queue  up    1-infini 1-00:00:00 272  1:68:4     1 completing nid00249
ccm_queue  up    1-infini 1-00:00:00 272  1:68:4    35 allocated  nid00[256-286,288-291]
ccm_queue  up    1-infini 1-00:00:00 64+  1+:16+   164 idle       nid00[012-047,076-111,140-147,160-179,192-248,250-255,287]
sdb:/cray_home/crayadm #
Does the problem go away if you drop SlurmctldDebug to 'debug'? debug3 is an extremely high value that will considerably impair performance on a production system; it should only be raised when needed. On a very busy system even 'debug' may be too high, but it should be fine in most environments. Was this the debug level before you noticed the problem, or did you only change it to collect logs? If it was just temporary that is fine, just making sure you don't leave it this high :).

I have also noticed this in the logs:

[2017-07-17T11:32:50.461] error: (cookies.c: 166: lease_cookies) alpsc_lease_cookies failed: src/lib/alpscomm_sn/cookie.c:1325 Multiple errors while trying to create a socket:
#1 (src/lib/alpscomm_sn/cookie.c:1378 socket on /var//opt/cray/ncmd/ncmd.uds: No such file or directory)
#2 (src/lib/alpscomm_sn/cookie.c:1708 The server did not respond quickly enough)

I am wondering if there is a problem in the ALPS system; that is what I would go investigate. It looks like LANL also had a similar issue to this. See bug 3928.
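For reference, the controller's log verbosity can be lowered either at runtime with scontrol or persistently in slurm.conf. A minimal sketch, assuming a standard slurm.conf location (these commands require a responsive slurmctld, so they may need to wait until the daemon recovers):

```shell
# Lower slurmctld verbosity immediately (runtime only, not persistent):
scontrol setdebug info

# To persist the change, edit slurm.conf (path varies by site), e.g.:
#   SlurmctldDebug=info
# then tell the daemons to re-read the configuration:
scontrol reconfigure
```

`scontrol setdebug` is useful here because it avoids a controller restart; the slurm.conf edit is what keeps the setting across restarts.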
We had been running at the debug3 level. The log file actually filled up our filesystem and was zeroed out this morning. This also killed the ncmd process, as it could no longer log to its own log file.

- ncmd was restarted
- debug3 -> info

All commands are functioning as expected now and jobs are running. Thanks for the pointer about the ncmd process.
No problem. Please reopen if you have any more issues related to this. I think you will be much happier with 'info'. You might also check whether the slurmd log level (SlurmdDebug) is set any higher than 'debug' and lower it for the same reason.
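A quick way to verify the current log levels once the controller is responsive again (a sketch; requires a live Slurm cluster):

```shell
# Print the configured daemon log levels from the running controller:
scontrol show config | grep -Ei 'SlurmctldDebug|SlurmdDebug'
```

If either value reports debug2 or higher on a production system, lowering it to 'info' (or at most 'debug') avoids the log-volume and performance problems seen above.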