Bug 471 - Socket timed out on send/recv operation causes job to abort: Error: Transport endpoint is not connected
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 14.03.x
Hardware: Linux
Severity: 2 - High Impact
Assignee: Danny Auble
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-10-21 04:36 MDT by Jason Sollom
Modified: 2014-01-13 02:13 MST
CC List: 1 user

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.03.0pre5
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
attachment-17456-0.html (3.59 KB, text/html) - 2013-10-21 04:40 MDT, Danny Auble
attachment-13824-0.html (2.24 KB, text/html) - 2013-10-21 04:54 MDT, Danny Auble
Slurmd log: nid00248.log (722.30 KB, text/x-log) - 2013-10-21 06:47 MDT, Jason Sollom
strace of slurmd process (2.92 KB, application/octet-stream) - 2013-10-21 06:48 MDT, Jason Sollom
slurm.conf file when this bug occurred (6.23 KB, application/octet-stream) - 2013-10-21 06:48 MDT, Jason Sollom
attachment-10606-0.html (4.75 KB, text/html) - 2013-10-25 03:15 MDT, Danny Auble

Description Jason Sollom 2013-10-21 04:36:28 MDT
Created attachment 459 [details]
attachment-17456-0.html

See also Cray Bug 804155. http://bugzilla.us.cray.com/show_bug.cgi?id=804155

A simple 'ls' command experienced a socket time-out on galaxy today.  Here is the error message.

pavek@galaxy:~> srun --cpus-per-task=1 --ntasks=1 /bin/ls
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 4475.0 failed on node nid00248: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Slurmd log from nid00248:
[2013-10-21T11:05:34.971] AcctGatherProfile NONE plugin loaded
[2013-10-21T11:05:34.972] AcctGatherEnergy NONE plugin loaded
[2013-10-21T11:05:34.972] AcctGatherInfiniband NONE plugin loaded
[2013-10-21T11:05:34.973] AcctGatherFilesystem NONE plugin loaded
[2013-10-21T11:05:34.973] Job accounting gather NOT_INVOKED plugin loaded
[2013-10-21T11:05:34.973] scaling CPU count by factor of 2
[2013-10-21T11:05:34.973] [4475.0] Message thread started pid = 6259
[2013-10-21T11:05:34.973] error: _rpc_launch_tasks: unable to send return code to address:port=10.128.0.195:34026 msg_type=6001: Transport endpoint is not connected
[2013-10-21T11:05:34.974] [4475.0] Reading cgroup.conf file /etc/opt/slurm/cgroup.conf
[2013-10-21T11:05:34.975] [4475.0] task/cgroup: now constraining jobs allocated cores
[2013-10-21T11:05:34.975] [4475.0] task/cgroup/memory: total:32222M allowed:100%(enforced), swap:0%(permissive), max:100%(32222M) max+swap:100%(64444M) min:30M
[2013-10-21T11:05:34.975] [4475.0] task/cgroup: now constraining jobs allocated memory
[2013-10-21T11:05:34.975] [4475.0] task/cgroup: loaded
[2013-10-21T11:05:34.975] [4475.0] task CRAY plugin loaded, really, really loaded.
[2013-10-21T11:05:34.976] [4475.0] Process tracking via Cray job module loaded
[2013-10-21T11:05:34.976] [4475.0] Checkpoint plugin loaded: checkpoint/none
[2013-10-21T11:05:34.976] [4475.0] mpi type = none
[2013-10-21T11:05:34.977] [4475.0] Process tracking via Cray job module loaded
[2013-10-21T11:05:34.977] [4475.0] proctrack_p_plugin_create: created jid 0xf800000004 thread 0x7ffff0b79700
[2013-10-21T11:05:34.977] [4475.0] spank: opening plugin stack /etc/opt/slurm/plugstack.conf
[2013-10-21T11:05:34.977] [4475.0] gres_cnt: 2072 0
[2013-10-21T11:05:34.977] [4475.0] mpi type = (null)
[2013-10-21T11:05:34.977] [4475.0] mpi/none: slurmstepd prefork
[2013-10-21T11:05:34.977] [4475.0] task/cgroup: job abstract cores are '7'
[2013-10-21T11:05:34.977] [4475.0] task/cgroup: step abstract cores are '7'
[2013-10-21T11:05:34.977] [4475.0] task/cgroup: job physical cores are '7,15'
[2013-10-21T11:05:34.977] [4475.0] task/cgroup: step physical cores are '7,15'
[2013-10-21T11:05:35.035] [4475.0] task/cgroup: /slurm/uid_14428/job_4475: alloc=2048MB mem.limit=2048MB memsw.limit=unlimited
[2013-10-21T11:05:35.035] [4475.0] task/cgroup: /slurm/uid_14428/job_4475/step_0: alloc=2048MB mem.limit=2048MB memsw.limit=unlimited
[2013-10-21T11:05:35.035] [4475.0] task_p_pre_setuid: 4475.0
[2013-10-21T11:05:40.070] [4475.0] connect io: Connection refused
[2013-10-21T11:05:40.070] [4475.0] IO setup failed: Connection refused

Slurmctld log:
[2013-10-21T11:04:30.289] sched: _slurm_rpc_allocate_resources JobId=4475 NodeList=nid00248 usec=1579
[2013-10-21T11:04:30.290] debug:  Configuration for job 4475 complete
[2013-10-21T11:04:30.290] debug:  laying out the 1 tasks on 1 hosts nid00248 dist 1
[2013-10-21T11:04:30.293] sched: _slurm_rpc_job_step_create: StepId=4475.0 nid00248 usec=2974
[2013-10-21T11:04:34.000] debug:  backfill: no jobs to backfill
[2013-10-21T11:04:40.318] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 4475.0 nodes 0-0 rc=5004 uid=14428
[2013-10-21T11:04:40.321] debug:  Application 4475 not found in app list
[2013-10-21T11:04:40.321] sched: _slurm_rpc_step_complete StepId=4475.0 usec=3270
[2013-10-21T11:04:40.322] sched: Cancel of StepId=4475.0 by UID=14428 usec=757
[2013-10-21T11:04:40.323] error: (switch_cray.c: 444: switch_p_pack_jobinfo) switch_job was NULL
[2013-10-21T11:04:42.000] job_step_signal step 4475.0 not found
[2013-10-21T11:04:42.000] Signal 9 of StepId=4475.0 by UID=14428: Invalid job id specified
[2013-10-21T11:04:42.001] completing job 4475
[2013-10-21T11:04:42.002] sched: job_complete for JobId=4475 successful, exit code=5004
[2013-10-21T11:04:42.007] debug:  sched: Running job scheduler
Comment 1 Jason Sollom 2013-10-21 04:37:30 MDT
The error comes from the function '_rpc_launch_tasks' in src/slurmd/slurmd/req.c

Code snippet from _rpc_launch_tasks:
	if (slurm_send_rc_msg(msg, errnum) < 0) {
		char addr_str[32];
		slurm_print_slurm_addr(&msg->address, addr_str,
				       sizeof(addr_str));
		error("_rpc_launch_tasks: unable to send return code to "
		      "address:port=%s msg_type=%u: %m",
		      addr_str, msg->msg_type);

		/*
		 * Rewind credential so that srun may perform retry
		 */
		slurm_cred_rewind(conf->vctx, req->cred); /* ignore errors */

	}

----------------
Upon further digging, I believe this error comes from the slurm_send_rc_msg function
in src/common/slurm_protocol_api.c.

Code snippet:
int slurm_send_rc_msg(slurm_msg_t *msg, int rc)
{
	slurm_msg_t resp_msg;
	return_code_msg_t rc_msg;

	if (msg->conn_fd < 0) {
		slurm_seterrno(ENOTCONN);
		return SLURM_ERROR;
	}


-------------
I believe that msg->conn_fd is less than zero.  I have seen no other error messages that would have been produced later in this function if it had been allowed to continue.
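
As a side note, a minimal standalone sketch (not Slurm code) shows that the logged text is just the standard errno string for ENOTCONN, which slurm_send_rc_msg() sets itself when msg->conn_fd < 0, so the log line does not necessarily mean a send() system call ever ran:

-------------------------------
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
	/* The slurmd error text is simply strerror(ENOTCONN); the errno is
	 * set by slurm_send_rc_msg() itself when conn_fd < 0. */
	errno = ENOTCONN;
	printf("%s\n", strerror(errno));  /* "Transport endpoint is not connected" */
	return 0;
}
-------------------------------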

The problem seemed to be transient because it went away on its own.  I was able to run multiple srun commands targeting the same node where the problem was seen without triggering a repeat of the bug.  Admittedly, these attempts came 27 minutes after the initial occurrence, so it is possible that the problem could have been short-lived and cleared itself up.

jasons@galaxy:~> date
Mon Oct 21 11:31:29 CDT 2013

jasons@galaxy:~> srun -w nid00248 /bin/ls
bin
mpit
mpit.c
slurm.conf

Regardless of its transient nature, this is a real problem.
Comment 2 Danny Auble 2013-10-21 04:40:33 MDT
Jason, do you happen to know what

[2013-10-21T11:05:40.070] [4475.0] connect io: Connection refused
[2013-10-21T11:05:40.070] [4475.0] IO setup failed: Connection refused

means?  It appears there is no more output from the stepd after that
point.  Did the slurmstepd segfault?

Danny

Comment 3 Jason Sollom 2013-10-21 04:50:11 MDT
I ran a ps on the node when I first logged on, and there was no slurmstepd running, only slurmd.

There was no core file, but I ran 'ulimit -c', and it returned zero, so I don't think that slurmstepd would have been allowed to drop core.

'connect io: Connection refused' comes from Slurm, either from here:

/*
 * Initiate a TCP connection back to a waiting client (e.g. srun).
 *
 * Create a new eio client object and wake up the eio engine so that
 * it can see the new object.
 */
int
io_client_connect(srun_info_t *srun, stepd_step_rec_t *job)
{
	int sock = -1;
	struct client_io_info *client;
	eio_obj_t *obj;

	debug4 ("adding IO connection (logical node rank %d)", job->nodeid);

	if (srun->ioaddr.sin_addr.s_addr) {
		char         ip[256];
		uint16_t     port;
		slurm_get_ip_str(&srun->ioaddr, &port, ip, sizeof(ip));
		debug4("connecting IO back to %s:%d", ip, ntohs(port));
	}

	if ((sock = (int) slurm_open_stream(&srun->ioaddr)) < 0) {
		error("connect io: %m");
		/* XXX retry or silently fail?
		 *     fail for now.
		 */
		return SLURM_ERROR;
	}

or here:

int
io_initial_client_connect(srun_info_t *srun, stepd_step_rec_t *job,
			  int stdout_tasks, int stderr_tasks)
{
	int sock = -1;
	struct client_io_info *client;
	eio_obj_t *obj;

	debug4 ("adding IO connection (logical node rank %d)", job->nodeid);

	if (srun->ioaddr.sin_addr.s_addr) {
		char         ip[256];
		uint16_t     port;
		slurm_get_ip_str(&srun->ioaddr, &port, ip, sizeof(ip));
		if (ntohs(port) == 0) {
			debug3("No IO connection requested");
			return SLURM_SUCCESS;
		}
		debug4("connecting IO back to %s:%d", ip, ntohs(port));
	}

	if ((sock = (int) slurm_open_stream(&srun->ioaddr)) < 0) {
		error("connect io: %m");
		/* XXX retry or silently fail?
		 *     fail for now.
		 */
		return SLURM_ERROR;
	}

(In reply to Danny Auble from comment #2)
> Jason, do you happen to know what
> 
> [2013-10-21T11:05:40.070] [4475.0] connect io: Connection refused
> [2013-10-21T11:05:40.070] [4475.0] IO setup failed: Connection refused
> 
> Means?  It appears there is no more output from the stepd after that 
> point.  Did the slurmstepd segfault?
> 
> Danny
>
Comment 4 Jason Sollom 2013-10-21 04:50:50 MDT
I revisited the log file on the node, and I saw this interesting info message.  Note that it happens about 15 minutes after the job aborted.

[2013-10-21T11:20:02.227] debug:  Cleaned up stray socket /var/spool/slurmd/nid00248_4475.0
Comment 5 Danny Auble 2013-10-21 04:54:08 MDT
Yeah, I saw this listed in the Cray report.  It really made me
think something funky happened with the stepd.  Based on that log, it
appears this happens very often.

I haven't seen it happen yet in my testing.  Would it be possible for
you to try to recreate it without your switch plugin?  Seeing as that is
the only known difference in our testing, I am wondering if that has
something to do with it.

Danny

Comment 6 Danny Auble 2013-10-21 04:54:10 MDT
Created attachment 460 [details]
attachment-13824-0.html
Comment 7 Jason Sollom 2013-10-21 06:47:07 MDT
Created attachment 464 [details]
Slurmd log: nid00248.log
Comment 8 Jason Sollom 2013-10-21 06:47:29 MDT
Okay, so I have reproduced this issue.

First, let me say that the configuration was not using munge.  It was using 

AuthType=auth/none
CryptoType=crypto/openssl

While we should have been running with munge, this does not invalidate the bug.

I have attached several files which back up my case including the slurmd log, strace of slurmd, and the slurm.conf file.

Here is the sequence of events as I see them on node 248.

First, we enter stepd_connect.

-------------------------------------------------------
stepd_connect(const char *directory, const char *nodename,
	      uint32_t jobid, uint32_t stepid)

...blah,blah...
	/* Create an auth credential */
	auth_cred = g_slurm_auth_create(NULL, 2, NULL);
	if (auth_cred == NULL) {
		error("Creating authentication credential: %s",
		      g_slurm_auth_errstr(g_slurm_auth_errno(NULL)));
		slurm_seterrno(SLURM_PROTOCOL_AUTHENTICATION_ERROR);
		goto fail1;
	}
...blah,blah...
/* Connect to the step */
	fd = _step_connect(directory, nodename, jobid, stepid);
	if (fd == -1)
		goto fail1;
...blah,blah...
-------------------------------
Based on what I've seen in the attached strace, I believe this auth_authd.c function gets called right before the error.  You'll see the calls to geteuid and getegid in the strace.  I'm showing this just to establish the timeline.

-------------------------------
slurm_auth_credential_t *
slurm_auth_create( void *argv[], char *auth_info )
{
	int ttl;
	int rc;
	slurm_auth_credential_t *cred;

	if ( argv == NULL ) {
		plugin_errno = SLURM_AUTH_MEMORY;
		return NULL;
	}

	cred = (slurm_auth_credential_t *)
		xmalloc( sizeof( slurm_auth_credential_t ) );
	cred->cr_errno = SLURM_SUCCESS;
	cred->cred.uid = geteuid();
	cred->cred.gid = getegid();

-------------------------------------------------------


Here in this _step_connect excerpt, you'll see the code that prints the following errors.

From the slurmd log:
debug:  _step_connect: connect: No such file or directory

From the strace:
connect(13, {sa_family=AF_FILE, path="/var/spool/slurmd/nid00248_4525.0"}, 36) = -1 ENOENT (No such file or directory)

------------------------------------------------------
static int
_step_connect(const char *directory, const char *nodename,
	      uint32_t jobid, uint32_t stepid)
{
	int fd;
	int len;
	struct sockaddr_un addr;
	char *name = NULL;

	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
		debug("_step_connect: socket: %m");
		return -1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	xstrfmtcat(name, "%s/%s_%u.%u", directory, nodename, jobid, stepid);
	strcpy(addr.sun_path, name);
	len = strlen(addr.sun_path)+1 + sizeof(addr.sun_family);

	if (connect(fd, (struct sockaddr *) &addr, len) < 0) {
		if (errno == ECONNREFUSED) {
			_handle_stray_socket(name);
		} else {
			debug("_step_connect: connect: %m");
		}
----------------------
Comment 9 Jason Sollom 2013-10-21 06:48:02 MDT
Created attachment 465 [details]
strace of slurmd process
Comment 10 Jason Sollom 2013-10-21 06:48:31 MDT
Created attachment 466 [details]
slurm.conf file when this bug occurred
Comment 11 Jason Sollom 2013-10-21 06:51:02 MDT
The job where the error occurred in Comment #8 is 4525.
Comment 12 Jason Sollom 2013-10-23 05:48:32 MDT
I was able to make this happen on galaxy again when neither the Cray switch plugin nor the Cray task plugin was running.  I had to keep the Cray select plugin loaded because there are safeguards that prevent removing it when running on a Cray.

I put the slurmctld log, slurmd log, and straces of slurmd here:
http://bugzilla.us.cray.com/show_bug.cgi?id=804155

because your bugzilla doesn't accept larger attachments.

***Running Without the Switch Plugin***

*Login Node Srun Error*
jasons@galaxy:~> ./loop.sh 
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 10224.0 failed on node nid00241: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

*Slurmd log on the affected node*

I pulled out the lines that looked particularly troublesome.  The rest of the log, including those lines, is below.

-----------------------------------------------------------
[2013-10-23T10:45:41.786] debug:  _step_connect: connect: No such file or directory
[2013-10-23T10:45:41.786] debug:  signal for nonexistant 10224.0 stepd_connect failed: No such file or directory
...
[2013-10-23T10:47:47.184] error: _rpc_launch_tasks: unable to send return code to address:port=10.128.0.195:38439 msg_type=6001: Transport endpoint is not connected
...
[2013-10-23T10:47:47.245] [10224.0] connect io: Connection refused
[2013-10-23T10:47:47.246] [10224.0] IO setup failed: Connection refused
[2013-10-23T10:54:18.387] debug:  Cleaned up stray socket /var/spool/slurmd/nid00241_10224.0
-----------------------------------------------------------

Log in its entirety for this job launch.

[2013-10-23T10:45:31.764] debug:  task_p_slurmd_launch_request: 10224.0 0
[2013-10-23T10:45:31.764] launch task 10224.0 request from 31329.12790@10.128.0.195 (port 10134)
[2013-10-23T10:45:31.764] debug:  Checking credential with 128 bytes of sig data
[2013-10-23T10:45:31.766] Scaling CPU count by factor of 2 (32/(16-0)
[2013-10-23T10:45:41.786] debug:  Sending signal 9 to step 10224.0
[2013-10-23T10:45:41.786] debug:  _step_connect: connect: No such file or directory
[2013-10-23T10:45:41.786] debug:  signal for nonexistant 10224.0 stepd_connect failed: No such file or directory
[2013-10-23T10:45:43.094] debug:  _rpc_terminate_job, uid = 0
[2013-10-23T10:45:43.094] debug:  task_p_slurmd_release_resources: 10224
[2013-10-23T10:45:43.094] debug:  credential for job 10224 revoked
[2013-10-23T10:47:46.805] debug:  Calling /opt/slurm/13.12.0-1.0000.81bf41b.29.1/sbin/slurmstepd spank prolog
[2013-10-23T10:47:46.807] debug:  Process tracking via Cray job module loaded
[2013-10-23T10:47:46.807] debug:  proctrack_p_plugin_create: created jid 0xf10000001e thread 0x7ffff1523700
[2013-10-23T10:47:46.814] Reading slurm.conf file: /etc/opt/slurm/slurm.conf
[2013-10-23T10:47:46.816] Running spank/prolog for jobid [10224] uid [31329]
[2013-10-23T10:47:46.816] spank: opening plugin stack /etc/opt/slurm/plugstack.conf
[2013-10-23T10:47:46.975] switch NONE plugin loaded
[2013-10-23T10:47:46.976] Received cpu frequency information for 32 cpus
[2013-10-23T10:47:46.976] Consumable Resources (CR) Node Selection plugin loaded with argument 49
[2013-10-23T10:47:47.179] AcctGatherProfile NONE plugin loaded
[2013-10-23T10:47:47.179] AcctGatherEnergy NONE plugin loaded
[2013-10-23T10:47:47.180] AcctGatherInfiniband NONE plugin loaded
[2013-10-23T10:47:47.181] AcctGatherFilesystem NONE plugin loaded
[2013-10-23T10:47:47.181] Job accounting gather NOT_INVOKED plugin loaded
[2013-10-23T10:47:47.183] scaling CPU count by factor of 2 (32/(16-0)
[2013-10-23T10:47:47.183] [10224.0] Message thread started pid = 7575
[2013-10-23T10:47:47.184] error: _rpc_launch_tasks: unable to send return code to address:port=10.128.0.195:38439 msg_type=6001: Transport endpoint is not connected
[2013-10-23T10:47:47.184] [10224.0] Reading cgroup.conf file /etc/opt/slurm/cgroup.conf
[2013-10-23T10:47:47.185] [10224.0] task/cgroup: now constraining jobs allocated cores
[2013-10-23T10:47:47.185] [10224.0] task/cgroup/memory: total:64540M allowed:100%(enforced), swap:0%(permissive), max:100%(64540M) max+swap:100%(129080M) min:30M
[2013-10-23T10:47:47.185] [10224.0] task/cgroup: now constraining jobs allocated memory
[2013-10-23T10:47:47.185] [10224.0] task/cgroup: loaded
[2013-10-23T10:47:47.186] [10224.0] task CRAY plugin loaded, really, really loaded.
[2013-10-23T10:47:47.186] [10224.0] Process tracking via Cray job module loaded
[2013-10-23T10:47:47.187] [10224.0] Checkpoint plugin loaded: checkpoint/none
[2013-10-23T10:47:47.187] [10224.0] mpi type = none
[2013-10-23T10:47:47.187] [10224.0] Process tracking via Cray job module loaded
[2013-10-23T10:47:47.188] [10224.0] proctrack_p_plugin_create: created jid 0xf10000001f thread 0x7ffff31c7700
[2013-10-23T10:47:47.188] [10224.0] spank: opening plugin stack /etc/opt/slurm/plugstack.conf
[2013-10-23T10:47:47.188] [10224.0] mpi type = (null)
[2013-10-23T10:47:47.188] [10224.0] mpi/none: slurmstepd prefork
[2013-10-23T10:47:47.188] [10224.0] task/cgroup: job abstract cores are '15'
[2013-10-23T10:47:47.188] [10224.0] task/cgroup: step abstract cores are '15'
[2013-10-23T10:47:47.188] [10224.0] task/cgroup: job physical cores are '15,31'
[2013-10-23T10:47:47.188] [10224.0] task/cgroup: step physical cores are '15,31'
[2013-10-23T10:47:47.242] [10224.0] task/cgroup: /slurm/uid_31329/job_10224: alloc=2048MB mem.limit=2048MB memsw.limit=unlimited
[2013-10-23T10:47:47.242] [10224.0] task/cgroup: /slurm/uid_31329/job_10224/step_0: alloc=2048MB mem.limit=2048MB memsw.limit=unlimited
[2013-10-23T10:47:47.242] [10224.0] task_p_pre_setuid: 10224.0
[2013-10-23T10:47:47.245] [10224.0] connect io: Connection refused
[2013-10-23T10:47:47.246] [10224.0] IO setup failed: Connection refused
[2013-10-23T10:54:18.387] debug:  Cleaned up stray socket /var/spool/slurmd/nid00241_10224.0

Note that we're still seeing the cleanup of stray sockets even without the switch plugin running.

*Strace files*
Here is some potentially notable output from the strace files.

strace_slurmd.7575 ioctl(9, AGPIOC_ALLOCATE+++ killed by SIGKILL +++

strace_slurmd.7580  connect(10, {sa_family=AF_FILE, path="/var/spool/slurmd/nid00241_10224.0"}, 37) = -1 ECONNREFUSED (Connection refused)

***Running Without the Switch or Task Plugins***
This time I ran without the Cray switch or task plugins.

*Srun output *

jasons@galaxy:~/bug_471> ../loop.sh
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 10253.0 failed on node nid00241: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Count: 1

*Slurmd log*

[2013-10-23T11:22:35.228] debug:  task_p_slurmd_launch_request: 10253.0 0
[2013-10-23T11:22:35.228] launch task 10253.0 request from 31329.12790@10.128.0.195 (port 56475)
[2013-10-23T11:22:35.228] debug:  Checking credential with 128 bytes of sig data
[2013-10-23T11:22:35.230] Scaling CPU count by factor of 2 (32/(16-0)
[2013-10-23T11:22:45.253] debug:  Sending signal 9 to step 10253.0
[2013-10-23T11:22:45.253] debug:  _step_connect: connect: No such file or directory
[2013-10-23T11:22:45.253] debug:  signal for nonexistant 10253.0 stepd_connect failed: No such file or directory
[2013-10-23T11:22:47.149] debug:  _rpc_terminate_job, uid = 0
[2013-10-23T11:22:47.149] debug:  task_p_slurmd_release_resources: 10253
[2013-10-23T11:22:47.150] debug:  credential for job 10253 revoked
[2013-10-23T11:23:06.316] debug:  Calling /opt/slurm/13.12.0-1.0000.81bf41b.29.1/sbin/slurmstepd spank prolog
[2013-10-23T11:23:06.317] debug:  Process tracking via Cray job module loaded
[2013-10-23T11:23:06.318] debug:  proctrack_p_plugin_create: created jid 0xf100000061 thread 0x7ffff1523700
[2013-10-23T11:23:06.322] Reading slurm.conf file: /etc/opt/slurm/slurm.conf
[2013-10-23T11:23:06.325] Running spank/prolog for jobid [10253] uid [31329]
[2013-10-23T11:23:06.325] spank: opening plugin stack /etc/opt/slurm/plugstack.conf
[2013-10-23T11:23:06.486] switch NONE plugin loaded
[2013-10-23T11:23:06.486] Received cpu frequency information for 32 cpus
[2013-10-23T11:23:06.487] Consumable Resources (CR) Node Selection plugin loaded with argument 49
[2013-10-23T11:23:06.701] AcctGatherProfile NONE plugin loaded
[2013-10-23T11:23:06.702] AcctGatherEnergy NONE plugin loaded
[2013-10-23T11:23:06.702] AcctGatherInfiniband NONE plugin loaded
[2013-10-23T11:23:06.703] AcctGatherFilesystem NONE plugin loaded
[2013-10-23T11:23:06.703] Job accounting gather NOT_INVOKED plugin loaded
[2013-10-23T11:23:06.705] scaling CPU count by factor of 2 (32/(16-0)
[2013-10-23T11:23:06.705] [10253.0] Message thread started pid = 8408
[2013-10-23T11:23:06.706] error: _rpc_launch_tasks: unable to send return code to address:port=10.128.0.195:39900 msg_type=6001: Transport endpoint is not connected
[2013-10-23T11:23:06.706] [10253.0] Reading cgroup.conf file /etc/opt/slurm/cgroup.conf
[2013-10-23T11:23:06.707] [10253.0] task/cgroup: now constraining jobs allocated cores
[2013-10-23T11:23:06.707] [10253.0] task/cgroup/memory: total:64540M allowed:100%(enforced), swap:0%(permissive), max:100%(64540M) max+swap:100%(129080M) min:30M
[2013-10-23T11:23:06.707] [10253.0] task/cgroup: now constraining jobs allocated memory
[2013-10-23T11:23:06.707] [10253.0] task/cgroup: loaded
[2013-10-23T11:23:06.708] [10253.0] Process tracking via Cray job module loaded
[2013-10-23T11:23:06.708] [10253.0] Checkpoint plugin loaded: checkpoint/none
[2013-10-23T11:23:06.708] [10253.0] mpi type = none
[2013-10-23T11:23:06.709] [10253.0] Process tracking via Cray job module loaded
[2013-10-23T11:23:06.709] [10253.0] proctrack_p_plugin_create: created jid 0xf100000062 thread 0x7ffff33cb700
[2013-10-23T11:23:06.709] [10253.0] spank: opening plugin stack /etc/opt/slurm/plugstack.conf
[2013-10-23T11:23:06.709] [10253.0] mpi type = (null)
[2013-10-23T11:23:06.709] [10253.0] mpi/none: slurmstepd prefork
[2013-10-23T11:23:06.710] [10253.0] task/cgroup: job abstract cores are '15'
[2013-10-23T11:23:06.710] [10253.0] task/cgroup: step abstract cores are '15'
[2013-10-23T11:23:06.710] [10253.0] task/cgroup: job physical cores are '15,31'
[2013-10-23T11:23:06.710] [10253.0] task/cgroup: step physical cores are '15,31'
[2013-10-23T11:23:06.766] [10253.0] task/cgroup: /slurm/uid_31329/job_10253: alloc=2048MB mem.limit=2048MB memsw.limit=unlimited
[2013-10-23T11:23:06.766] [10253.0] task/cgroup: /slurm/uid_31329/job_10253/step_0: alloc=2048MB mem.limit=2048MB memsw.limit=unlimited
[2013-10-23T11:23:06.769] [10253.0] connect io: Connection refused
[2013-10-23T11:23:06.769] [10253.0] IO setup failed: Connection refused
[2013-10-23T11:29:18.950] debug:  Cleaned up stray socket /var/spool/slurmd/nid00241_10253.0
[2013-10-23T11:29:18.950] debug:  Cleaned up stray socket /var/spool/slurmd/nid00241_10242.0
Comment 13 Jason Sollom 2013-10-23 05:49:22 MDT
For the attachments I mentioned in comment 12, see comment 5 of Cray Bug 804155.
Comment 14 Jason Sollom 2013-10-24 08:47:49 MDT
After the Thursday morning CLE update, neither Travis from Cray nor Danny Auble was able to make this socket time-out problem recur.  The Slurm installation had not changed from yesterday, when this problem was easy to reproduce.

I have two theories:
1.  The CLE upgrade fixed an underlying issue that was causing this problem.
2.  This is a transient error that will come back to bite us later.

We're going to run against Slurm this weekend.  We'll see if the problem resurfaces.
Comment 15 Jason Sollom 2013-10-25 03:01:58 MDT
The bug recurred Friday morning after a reboot of galaxy.  Whatever "fix" happened yesterday seems to have been transient.

I reproduced the bug on galaxy.  

Here's what I did.  I just salloc'ed nid00206.  Then I ran some sruns in a loop until I was able to see the error.  I had a tough time killing the script, so a few more sruns were launched after the initial srun (step 0), which seemed to have triggered the problem.  The job ID was 14414 and the step ID was 0.


I did an strace of slurmd as well as its children on the node.  I also straced srun on the login node.  This analysis only looks at the straces of slurmd.  All of the files are named jasons_strace_slurmd.<pid>.  To save typing, I'll just refer to the pid.

I investigated, and I think I found a race condition.  

Slurmd process 13011 attempts to connect to the socket file /var/spool/slurmd/nid00206_14414.0.  The socket file does not exist.

08:47:41.517772 connect(13, {sa_family=AF_FILE, path="/var/spool/slurmd/nid00206_14414.0"}, 37) = -1 ENOENT (No such file or directory)

Two minutes and five seconds later, slurmstepd process 13156 creates the socket file.  (This seems like a long time.)

08:49:46.902280 stat("/var/spool/slurmd/nid00206_14414.0", 0x7fffffffe8c0) = -1 ENOENT (No such file or directory)
08:49:46.902302 socket(PF_FILE, SOCK_STREAM, 0) = 6
08:49:46.902322 fcntl(6, F_SETFD, FD_CLOEXEC) = 0
08:49:46.902339 bind(6, {sa_family=AF_FILE, path="/var/spool/slurmd/nid00206_14414.0"}, 37) = 0
08:49:46.902362 listen(6, 5)            = 0
08:49:46.902378 chmod("/var/spool/slurmd/nid00206_14414.0", 0777) = 0

Although process 13156 binds and listens on the socket file, it never accepts on the socket.

Then, seemingly due to the connection-refused error listed below, process 13156 kills itself.
08:49:47.151486 write(3, "[2013-10-25T08:49:47.151] [14414.0] connect io: Connection refused\n", 67) = 67
08:49:47.151506 geteuid()               = 31329
08:49:47.151522 tgkill(13156, 13162, SIGRT_1) = 0

1400 milliseconds later, process 13156 is killed.

08:49:47.152967 ioctl(9, AGPIOC_ALLOCATE08:49:47.398982 +++ killed by SIGKILL +++

Process 13187 tries to connect to this socket file forty seconds later.  At that time the socket file still exists, but the process that should have been listening on the socket (process 13156) has been killed.  Processes attempting to access that socket get ECONNREFUSED.  I can only assume this is some sort of end-of-job or clean-up process, because it attempts to connect to the sockets for each job step.  (There were multiple job steps because I ran an srun -w nid00206 hostname command in a loop inside an salloc.)


08:50:27.356566 connect(11, {sa_family=AF_FILE, path="/var/spool/slurmd/nid00206_14414.0"}, 37) = -1 ECONNREFUSED (Connection refused)

It seems like the socket time-out might come from slurmstepd taking too long to create the socket file and not being ready when slurmd expects it to be.  This may lead to the connection refused error.  I'm not sure.
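
For illustration only (this is not Slurm code, and the socket path is a stand-in), a minimal program reproduces the two connect failures seen in the straces: ENOENT while the socket file has not been created yet, and ECONNREFUSED once the file exists but the process that bound it has exited without accepting:

-------------------------------------------------------
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static void try_connect(const char *path)
{
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	struct sockaddr_un addr;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
	if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
		printf("connect(%s): %s\n", path, strerror(errno));
	else
		printf("connect(%s): ok\n", path);
	close(fd);
}

int main(void)
{
	const char *path = "/tmp/nid00206_14414.0";	/* hypothetical path */

	/* Case 1: slurmd connects before the stepd has bound -> ENOENT */
	unlink(path);
	try_connect(path);

	/* Case 2: the "stepd" binds and listens but exits without ever
	 * calling accept(); the stray socket file remains on disk and
	 * later connects fail with ECONNREFUSED. */
	if (fork() == 0) {
		int srv = socket(AF_UNIX, SOCK_STREAM, 0);
		struct sockaddr_un addr;

		memset(&addr, 0, sizeof(addr));
		addr.sun_family = AF_UNIX;
		strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
		bind(srv, (struct sockaddr *) &addr, sizeof(addr));
		listen(srv, 5);
		_exit(0);		/* leaves a stray socket behind */
	}
	sleep(1);			/* crude wait for the child to exit */
	try_connect(path);		/* -> ECONNREFUSED */
	unlink(path);
	return 0;
}
-------------------------------------------------------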

I have put all of the straces here.  
http://bugzilla.us.cray.com/show_bug.cgi?id=804155 Comments 7 and 8.

I'd love to attach them to this bug, but the size restrictions have prevented this enough times in the past that I don't even try anymore.
Comment 16 Danny Auble 2013-10-25 03:15:44 MDT
Could you get backtraces of the slurmd and slurmstepd?

This would be helpful.  Please get them for all threads.

Thanks

Comment 17 Danny Auble 2013-10-25 03:15:46 MDT
Created attachment 482 [details]
attachment-10606-0.html
Comment 18 Jason Sollom 2013-10-25 03:55:37 MDT
(In reply to Jason Sollom from comment #15)
> Then, seemingly due to the connection refused error listed below, the
> process 13156 kills itself.
> 08:49:47.151486 write(3, "[2013-10-25T08:49:47.151] [14414.0] connect io:
> Connection refused\n", 67) = 67
> 08:49:47.151506 geteuid()               = 31329
> 08:49:47.151522 tgkill(13156, 13162, SIGRT_1) = 0
> 
> 1400 milli-seconds later process 13156 is killed.
> 
> 08:49:47.152967 ioctl(9, AGPIOC_ALLOCATE08:49:47.398982 +++ killed by
> SIGKILL +++
> 

I was wrong above.  The tgkill sent a SIGRT_1, which is not fatal, so the process did not kill itself.  Something else did.
Comment 19 Danny Auble 2013-10-25 10:34:23 MDT
OK, it appears this is caused by the launch process taking too long.  We believe we have it narrowed down to container_g_create() in src/slurmd/slurmd/req.c; when srun is used to launch the tasks, the parent function is _rpc_launch_tasks().

That call sometimes appears to take longer than MessageTimeout, which causes srun to abandon the socket; when the launch finally does finish, the step no longer has anyone to talk to.

Just doing a sleep for longer than MessageTimeout in this spot will reproduce the problem immediately.

We are looking into what can be done to avoid this scenario where the launch takes much longer than expected.

Cray is studying the container code to see what is taking so long.
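
As a rough standalone illustration of that timing interaction (this is not Slurm code and not a patch from this ticket; the two-second timeout and four-second delay are arbitrary stand-ins for MessageTimeout and a slow container_g_create()):

-------------------------------------------------------
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int srv = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = { 0 };
	socklen_t len = sizeof(addr);

	signal(SIGPIPE, SIG_IGN);	/* report send() failures via errno */
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(srv, (struct sockaddr *) &addr, sizeof(addr));
	listen(srv, 1);
	getsockname(srv, (struct sockaddr *) &addr, &len);

	if (fork() == 0) {		/* plays the role of srun */
		int c = socket(AF_INET, SOCK_STREAM, 0);
		struct timeval tv = { 2, 0 };	/* "MessageTimeout" = 2 s */
		char buf;

		setsockopt(c, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
		connect(c, (struct sockaddr *) &addr, sizeof(addr));
		if (recv(c, &buf, 1, 0) < 0)
			printf("client: recv: %s -> gives up and closes\n",
			       strerror(errno));
		close(c);
		_exit(0);
	}

	/* plays the role of slurmd in _rpc_launch_tasks() */
	int conn = accept(srv, NULL, NULL);
	sleep(4);			/* stand-in for a slow launch step */

	/* The first send() usually still succeeds locally, but the closed
	 * peer answers with a reset; the next attempt then fails. */
	send(conn, "x", 1, 0);
	usleep(100000);
	if (send(conn, "x", 1, 0) < 0)
		printf("server: send after delay: %s\n", strerror(errno));
	return 0;
}
-------------------------------------------------------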
Comment 20 Moe Jette 2013-10-25 11:23:57 MDT
Since Danny was kind enough to assign this to me, I have to ask Cray: what is the worst-case scenario for job container creation?  Danny has seen 2.5 minutes.

In my experience, any user operation that takes more than a few seconds (and preferably milliseconds) will result in user complaints (5 seconds seems to be the limit at NASA).
Comment 21 Jason Sollom 2013-10-25 23:10:18 MDT
(In reply to Moe Jette from comment #20)
> Since Danny was kind enough to assign this to me, I have to ask Cray what is
> the worst case scenario for job container creation? Danny has seen 2.5
> minutes.
> 
> In my experience, any user operation that takes more than a few seconds (and
> preferably milliseconds) will result in user complaints (5 seconds seems to
> be the limit at NASA).

I opened Cray bug http://bugzilla.us.cray.com/show_bug.cgi?id=804432 to track this issue.
Comment 22 Jason Sollom 2013-10-29 07:00:29 MDT
Some time-out statements were put into the Slurm job_container_cncu code to record whether the call to job_create_reservation took longer than a set number of seconds.

These statements never triggered.  I verified that the correct code was running.

This weakens the case against the job_create_reservation call significantly in my mind.  It is possible the cause is elsewhere.
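
A hedged sketch of that kind of instrumentation follows (the macro name, threshold, and stand-in function are hypothetical; this is not the code that was actually added to the cncu plugin):

-------------------------------------------------------
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: run a call and log a warning if it took longer
 * than threshold_sec.  Illustrative only, not Slurm source. */
#define WARN_IF_SLOW(threshold_sec, call)				\
	do {								\
		time_t _t0 = time(NULL);				\
		(call);							\
		time_t _dt = time(NULL) - _t0;				\
		if (_dt > (threshold_sec))				\
			fprintf(stderr, "warning: %s took %ld s\n",	\
				#call, (long) _dt);			\
	} while (0)

/* Stand-in for job_create_reservation(), just to make the sketch run. */
static void fake_reservation_call(void) { /* ... */ }

int main(void)
{
	WARN_IF_SLOW(5, fake_reservation_call());
	return 0;
}
-------------------------------------------------------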
Comment 23 Danny Auble 2013-10-29 07:55:00 MDT
I think the system is unstable.  We have seen it before where we run for hours and never see it, and then when the system is rebooted it happens all the time.  Since we have already seen delays in the log file with only this call in between, I don't know what else it could be.  My suggestion would be to lower your MessageTimeout back to 10 or 30.  If the socket timeouts come back and the messages never trigger, then we can point at something else.
Comment 24 Danny Auble 2013-10-31 09:29:26 MDT
We now believe this is the result of an LDAP issue that makes getpwnam_r hang for some reason.  Sorry for blaming you, CNCU; it looks like you are clean for now.  More on this if it turns out to be something else.
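
For anyone chasing a similar hang, a small standalone timer around getpwnam_r (illustrative only; not part of Slurm or of this ticket's fix) makes a stalled name-service lookup easy to spot:

-------------------------------------------------------
#include <stdio.h>
#include <pwd.h>
#include <time.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	const char *name = (argc > 1) ? argv[1] : "root";
	struct passwd pwd, *result = NULL;
	char buf[16384];
	struct timespec t0, t1;

	/* Time a single lookup; a slow or unresponsive LDAP backend behind
	 * nsswitch shows up here as a multi-second elapsed time. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	int rc = getpwnam_r(name, &pwd, buf, sizeof(buf), &result);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e6;
	printf("getpwnam_r(%s): rc=%d found=%s elapsed=%.1f ms\n",
	       name, rc, result ? "yes" : "no", ms);
	return 0;
}
-------------------------------------------------------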
Comment 25 Danny Auble 2013-11-12 08:22:32 MST
Jason, can we close this?
Comment 26 Danny Auble 2014-01-10 03:48:58 MST
ping?  I am fairly sure this is fixed.
Comment 27 Moe Jette 2014-01-13 02:13:33 MST
Closing this ticket for bookkeeping. Please re-open if necessary.