Created attachment 4933 [details]
Last 1000 lines of slurmctld.log

We are seeing extremely slow to no responsiveness from slurmctld:

sdb:/var/spool/slurm/log # scontrol show config
slurm_load_ctl_conf error: Socket timed out on send/recv operation
sdb:/var/spool/slurm/log #

The last squeue that completed showed 2500+ jobs queued. sinfo returns on a fairly regular basis, but it shows a significant number of nodes as idle:

sdb:/cray_home/crayadm # sinfo
PARTITION  AVAIL JOB_SIZE TIMELIMIT  CPUS S:C:T  NODES STATE      NODELIST
standard*  up    1-infini 1-00:00:00 64   2:16:2    80 idle       nid00[012-047,076-111,140-147]
ldms       up    1-infini 1-00:00:00 64   2:16:2     4 idle       nid00[160-163]
sierra     up    1-infini 1-00:00:00 64   2:16:2    16 idle       nid00[164-179]
knl        up    1-infini 1-00:00:00 272  1:68:4     1 completing nid00249
knl        up    1-infini 1-00:00:00 272  1:68:4    16 allocated  nid00[256-271]
knl        up    1-infini 1-00:00:00 272  1:68:4    63 idle       nid00[192-248,250-255]
sierra_kn  up    1-infini 1-00:00:00 272  1:68:4    19 allocated  nid00[272-286,288-291]
sierra_kn  up    1-infini 1-00:00:00 272  1:68:4     1 idle       nid00287
all        up    1-infini 1-00:00:00 272  1:68:4     1 completing nid00249
all        up    1-infini 1-00:00:00 272  1:68:4    35 allocated  nid00[256-286,288-291]
all        up    1-infini 1-00:00:00 64+  1+:16+   164 idle       nid00[012-047,076-111,140-147,160-179,192-248,250-255,287]
ccm_queue  up    1-infini 1-00:00:00 272  1:68:4     1 completing nid00249
ccm_queue  up    1-infini 1-00:00:00 272  1:68:4    35 allocated  nid00[256-286,288-291]
ccm_queue  up    1-infini 1-00:00:00 64+  1+:16+   164 idle       nid00[012-047,076-111,140-147,160-179,192-248,250-255,287]
sdb:/cray_home/crayadm #
Does the problem go away if you drop SlurmctldDebug to 'debug'? debug3 is an extremely high value that will considerably impair performance on a production system; it should only be raised when needed. On a very busy system even 'debug' may be too high, but it should be fine in most environments. Was this the debug level before you noticed the problem, or did you only change it to collect logs? If it was just temporary that is fine, just making sure you don't leave it this high :).

I have also noticed this in the logs:

[2017-07-17T11:32:50.461] error: (cookies.c: 166: lease_cookies) alpsc_lease_cookies failed: src/lib/alpscomm_sn/cookie.c:1325 Multiple errors while trying to create a socket:
#1 (src/lib/alpscomm_sn/cookie.c:1378 socket on /var//opt/cray/ncmd/ncmd.uds: No such file or directory)
#2 (src/lib/alpscomm_sn/cookie.c:1708 The server did not respond quickly enough)

I am wondering if there is a problem in the ALPS system; that is what I would go investigate. It looks like LANL also had a similar issue to this. See bug 3928.
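For reference, the controller's log verbosity can be lowered either at runtime with scontrol or persistently in slurm.conf. A minimal sketch, assuming a standard slurm.conf location (these commands require a responsive slurmctld, so they may need to wait until the daemon recovers):

```shell
# Lower slurmctld verbosity immediately (runtime only, not persistent):
scontrol setdebug info

# To persist the change, edit slurm.conf (path varies by site), e.g.:
#   SlurmctldDebug=info
# then tell the daemons to re-read the configuration:
scontrol reconfigure
```

`scontrol setdebug` is useful here because it avoids a controller restart; the slurm.conf edit is what keeps the setting across restarts.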
We had been running at the debug3 level. The log file actually filled up our filesystem and was zeroed out this morning. This also killed the ncmd process, as it could no longer log to its own log file.

- ncmd was restarted
- debug3 -> info

All commands are functioning as expected now and jobs are running. Thanks for the pointer about the ncmd process.
No problem. Please reopen if you have any more issues related to this. I think you will be much happier with 'info'. You might also check whether the slurmd log level (SlurmdDebug) is set any higher than 'debug' and lower it for the same reason.
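A quick way to verify the current log levels once the controller is responsive again (a sketch; requires a live Slurm cluster):

```shell
# Print the configured daemon log levels from the running controller:
scontrol show config | grep -Ei 'SlurmctldDebug|SlurmdDebug'
```

If either value reports debug2 or higher on a production system, lowering it to 'info' (or at most 'debug') avoids the log-volume and performance problems seen above.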