2002 – error: slurm_receive_msg: Zero Bytes were transmitted or received

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	14.11.5
Hardware:	Linux Linux

Importance:	--- 4 - Minor Issue
Assignee:	Moe Jette

Reported:	2015-10-02 04:47 MDT by William Schadlich
Modified:	2021-09-30 11:05 MDT
CC List:	4 users

Site:	EM
Version Fixed:	14.11.10, 15.08.2

Attachments
scntrl_show_nodes.txt (231.21 KB, text/plain) 2015-10-02 05:50 MDT, William Schadlich

Description William Schadlich 2015-10-02 04:47:58 MDT

Noticed the following error almost every second in slurmctld.log on the Slurm master node.  Log output at a higher debug level has already been forwarded to Brian.

 error: slurm_receive_msg: Zero Bytes were transmitted or received

Comment 1 Brian Christiansen 2015-10-02 05:00:59 MDT

We are currently on Slurm 14.11.5 and have been running it with the following power-save settings in slurm.conf:
 
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
# Script that shuts down the nodes
SuspendProgram=/etc/slurm/suspend.sh
# Script that starts nodes back up
ResumeProgram=/etc/slurm/restart.sh
# Max time it takes a node to shutdown
SuspendTimeout=120
# Max time it takes to restart a node
ResumeTimeout=480
# Number of nodes to restart at one time
ResumeRate=100
# Nodes/partitions excluded from shutdown
SuspendExcNodes=n0[1-3][01-84]
#SuspendExcParts=
# Number of nodes shutdown at once
SuspendRate=50
# Amount of time a node is idle before it is shutdown
SuspendTime=1800
 
 
Regards,
Bill
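
The SuspendExcNodes value above uses bracketed host ranges. As a rough illustration of which nodes it excludes from power saving, here is a minimal Python sketch. It assumes each bracket group expands on its own and that the groups combine as a Cartesian product, which matches the four-digit node names seen elsewhere in this ticket. The helper name expand_hostlist is hypothetical; on a live cluster, "scontrol show hostnames" is the tool to check against.

```python
import itertools
import re


def expand_hostlist(expr):
    """Expand bracketed ranges such as n0[1-3][01-84] into host names.

    Assumption: bracket groups are independent and combine as a product.
    """
    # re.split with a capture group alternates literal text and group bodies.
    parts = re.split(r"\[([^\]]+)\]", expr)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])  # literal text between bracket groups
            continue
        values = []
        for item in part.split(","):
            if "-" in item:
                low, high = item.split("-")
                width = len(low)  # keep zero padding, e.g. 01..84
                values.extend(f"{n:0{width}d}" for n in range(int(low), int(high) + 1))
            else:
                values.append(item)
        choices.append(values)
    return ["".join(combo) for combo in itertools.product(*choices)]


hosts = expand_hostlist("n0[1-3][01-84]")
print(len(hosts), hosts[0], hosts[-1])
```

Under that assumption the expression covers n0101 through n0384, so the n04xx to n06xx nodes remain eligible for suspension.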

Comment 2 Brian Christiansen 2015-10-02 05:20:48 MDT

Here is the output at the higher debug level:

[2015-10-02T12:36:25.874] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:25.874] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:25.884] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:28.877] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:28.877] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:28.887] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:31.881] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:31.881] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:31.891] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:34.884] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:34.884] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:34.894] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:37.888] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:37.888] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:37.898] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:40.891] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:40.891] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:40.901] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:43.251] debug2: Testing job time limits and checkpoints
[2015-10-02T12:36:43.895] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:43.895] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:43.905] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:46.898] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:46.898] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:46.908] error: slurm_receive_msg: Zero Bytes were transmitted or received
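
To put a number on how often the errors recur, the timestamps can be compared directly. The following is a small Python sketch, not a Slurm tool; the function name burst_gaps is hypothetical, and the sample lines are copied from the log above. It measures the time between consecutive "_slurm_recv_timeout" debug lines, each of which starts one burst of errors.

```python
from datetime import datetime

# Sample lines copied from the slurmctld log excerpt in this comment.
LOG_LINES = [
    "[2015-10-02T12:36:25.874] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes",
    "[2015-10-02T12:36:25.874] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2015-10-02T12:36:25.884] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2015-10-02T12:36:28.877] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes",
    "[2015-10-02T12:36:31.881] debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes",
]


def burst_gaps(lines, marker="_slurm_recv_timeout"):
    """Return the seconds between consecutive log lines containing marker."""
    stamps = []
    for line in lines:
        if marker in line and line.startswith("["):
            text = line[1:line.index("]")]  # timestamp between the brackets
            stamps.append(datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


print(burst_gaps(LOG_LINES))
```

On this excerpt the bursts are roughly three seconds apart, with two error lines per burst.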

Comment 3 Brian Christiansen 2015-10-02 05:24:52 MDT

Bill,

Will you send the outputs of sinfo and scontrol show nodes?

Thanks,
Brian

Comment 4 William Schadlich 2015-10-02 05:49:57 MDT

Created attachment 2271
scntrl_show_nodes.txt

Info requested.  The scontrol show nodes output is attached because of size limitations.

[root@clnxcat02 slurm]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
short*       up   16:00:00      1  down* n0677
short*       up   16:00:00     44  idle~ n[0548-0584,0601-0603,0671-0674]
short*       up   16:00:00      1   down n0475
short*       up   16:00:00    298  alloc n[0101-0167,0201-0203,0231-0244,0247-0254,0401-0474,0476-0484,0501-0547,0604-0670,0675-0676,0678-0684]
short*       up   16:00:00    160   idle n[0168-0184,0204-0230,0245-0246,0255-0284,0301-0384]
small        up   infinite     25  alloc n[0201-0203,0231-0244,0247-0254]
small        up   infinite    143   idle n[0204-0230,0245-0246,0255-0284,0301-0384]
large        up   infinite      1  down* n0677
large        up   infinite     44  idle~ n[0548-0584,0601-0603,0671-0674]
large        up   infinite      1   down n0475
large        up   infinite    206  alloc n[0401-0474,0476-0484,0501-0547,0604-0670,0675-0676,0678-0684]





Regards,
Bill

Comment 5 Brian Christiansen 2015-10-02 06:11:20 MDT

It looks like the issue may be with communicating with n0677.

One possible cause of the messages is that there might be a pending RPC that keeps trying to communicate with n0677 even though it's in the down state.

If you restart the controller, we would expect it to stop trying to communicate with the node, i.e. the restart would clear out the pending RPC.

Moe said he would look into it.
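
For reference, n0677 stands out in the sinfo output because its state is "down*": in sinfo, a trailing "*" marks a node that is not responding, and a trailing "~" marks a node in power-saving mode. A minimal Python sketch of picking such nodes out of sinfo text follows. The function name not_responding is hypothetical, and the sample is a shortened copy of the output posted in comment 4.

```python
# Shortened copy of the sinfo output posted in this ticket.
SINFO_OUTPUT = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
short*       up   16:00:00      1  down* n0677
short*       up   16:00:00     44  idle~ n[0548-0584,0601-0603,0671-0674]
short*       up   16:00:00      1   down n0475
large        up   infinite      1  down* n0677
"""


def not_responding(sinfo_text):
    """Return node lists whose STATE column ends with '*' (not responding)."""
    nodes = set()
    for line in sinfo_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a data row in the default six-column layout
        state, nodelist = fields[4], fields[5]
        if state.endswith("*"):
            nodes.add(nodelist)
    return sorted(nodes)


print(not_responding(SINFO_OUTPUT))
```

The sketch assumes the default six-column sinfo layout shown in this ticket; a custom output format would need different column indexes.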

Comment 6 William Schadlich 2015-10-02 06:21:02 MDT

The node is down in slurm and now powered off but still reporting the same message.  Little nervous to be restarting the scheduler at 2:30 on a Friday.



Regards,
Bill

Comment 7 Moe Jette 2015-10-02 06:35:20 MDT

(In reply to William Schadlich from comment #6)
> The node is down in slurm and now powered off but still reporting the same
> message.  Little nervous to be restarting the scheduler at 2:30 on a Friday.

Alternatives are:
1. Power node back up and return to service to clear out the RPC
2. Leave node down and make sure you have sufficient storage for the logs
3. Restart slurmctld

Your call...

Comment 8 William Schadlich 2015-10-02 06:37:36 MDT

Understood, but that is somewhat harsh, especially if, say, 100 nodes were offline for any particular reason.



Regards,
Bill

Comment 9 Moe Jette 2015-10-02 11:27:39 MDT

I am able to reproduce what looks like the root problem here. I needed to add some well-placed sleep() calls in the code to do that, but it's clearly possible with the following sequence of operations:

1. Some RPC is queued to get sent from the slurmctld daemon to compute nodes. This RPC is of a type that gets requeued if it fails so as to ensure that it gets sent. There are relatively few RPCs of this type.
2. The compute node goes DOWN.
3. The RPC repeatedly tries getting sent and upon failing, gets requeued.

The fix I'm working on introduces a check for DOWN nodes in step #3 and avoids requeuing requests for those specific nodes.

You'll need to either restart the slurmctld daemon or return the DOWN nodes to service to clear this up.
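
The sequence in this comment can be modelled with a toy retry queue. This is only a sketch in Python, not Slurm's code (Slurm itself is written in C), and every name in it (drain_queue, simulate, send, node_state) is hypothetical. It shows how an RPC bound for a DOWN node keeps cycling when failures are always requeued, and how a DOWN check at the requeue step ends the loop.

```python
from collections import deque


def drain_queue(pending, node_state, send, passes=5, skip_down=True):
    """Retry queued RPCs for a number of passes; return what is still queued."""
    queue = deque(pending)
    for _ in range(passes):
        retry = deque()
        while queue:
            rpc = queue.popleft()
            if send(rpc):
                continue  # delivered, nothing to requeue
            if skip_down and node_state.get(rpc["node"]) == "DOWN":
                continue  # the described check: do not requeue for DOWN nodes
            retry.append(rpc)  # failed send goes back on the queue
        queue = retry
    return list(queue)


def simulate(skip_down):
    """Count send attempts for one reachable node and one DOWN node."""
    node_state = {"n0101": "IDLE", "n0677": "DOWN"}  # illustrative states
    attempts = []

    def send(rpc):
        attempts.append(rpc["node"])
        return node_state[rpc["node"]] != "DOWN"  # sends to DOWN nodes fail

    pending = [{"node": "n0101"}, {"node": "n0677"}]
    left = drain_queue(pending, node_state, send, skip_down=skip_down)
    return len(attempts), left


print(simulate(skip_down=False))  # the RPC for n0677 is retried on every pass
print(simulate(skip_down=True))   # the RPC for n0677 is dropped after one failure
```

Without the check, the DOWN node's RPC is attempted once per pass and is still queued at the end; with it, each RPC is attempted once and the queue empties.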

Comment 10 Moe Jette 2015-10-06 06:06:11 MDT

You are not likely to see this problem again, but I've added logic to Slurm so that RPCs bound for DOWN nodes are not requeued. You may see the errors for a couple of minutes while retries occur, but they will stop once the node is deemed to be DOWN. Here's the commit:
https://github.com/martbhell/slurm/commit/f4ea9dec83e2b6a17be914d4c89cbbaf84168b66

I'm closing this ticket. You can re-open the ticket or open a new one if you see this problem after applying the patch or upgrading to the next release of Slurm with the patch.

Comment 11 Wolfgang Resch 2017-05-12 06:45:23 MDT

Hi,

we are seeing Zero Bytes RPC messages as well as the accompanying
'invalid type trying to be freed 65534' even though we are running
slurm 16.05.9. These RPCs seem to be targeted to DOWN nodes and get
re-queued over and over. Has there been a regression? Or are we
looking at a different problem?


Details:

Huge numbers of log messages like this in ctld.log:

[2017-05-12T08:35:45.696] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-05-12T08:35:45.706] error: slurm_receive_msg [10.2.5.40:41612]: Zero Bytes were transmitted or received
[2017-05-12T08:35:45.706] error: invalid type trying to be freed 65534

These errors move to the backup controller when the main controller is shut
down and back to the main controller when it comes back up, so Slurm is presumably
saving those pending RPCs to its state files as it goes down.

Tracking down to what nodes they are coming from:

perl -nle 'if (m/slurm_receive_msg \[([0-9.:]+)\]/) {print $1}' ctld.log \
    | cut -f1 -d':' | sort -V | uniq -c | sort -k1,1n | tail -10

   138 10.1.22.71
  4517 10.2.4.103
 11037 10.2.4.205
 12957 10.2.4.75
 29451 10.2.13.19
 29455 10.2.13.26
 38316 10.2.8.52
 81765 10.2.1.74
148092 10.2.5.40
218913 10.2.4.231

These nodes are either DOWN or were recently DOWN so it seems to fit the
pattern described in this bug of slurm trying to requeue these RPC calls.
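
The same per-address count can be done in Python where a Perl and coreutils pipeline is inconvenient. This is a minimal sketch; the function name count_peers is hypothetical, and the sample lines are the three log lines quoted above. It assumes the peer is logged as [IPv4:port], as in those lines.

```python
import re
from collections import Counter

# Matches the peer address in lines like "slurm_receive_msg [10.2.5.40:41612]".
PEER = re.compile(r"slurm_receive_msg \[(\d+\.\d+\.\d+\.\d+):\d+\]")


def count_peers(lines):
    """Count slurm_receive_msg errors per source IP address."""
    counts = Counter()
    for line in lines:
        match = PEER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[2017-05-12T08:35:45.696] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2017-05-12T08:35:45.706] error: slurm_receive_msg [10.2.5.40:41612]: Zero Bytes were transmitted or received",
    "[2017-05-12T08:35:45.706] error: invalid type trying to be freed 65534",
]

# Print the ten most frequent sources, highest count first.
for address, total in count_peers(sample).most_common(10):
    print(total, address)
```

For a real log, pass an open file object instead of the sample list, for example count_peers(open("ctld.log")).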

Regards,
Wolfgang

Comment 12 Gonzalo 2018-10-19 10:28:03 MDT

Check for out-of-sync clocks! I solved the same problem by restarting the NTP services. Regards
