Noticed the following error almost every second in slurmctld.log on the Slurm master node. A higher level of debug messages has already been forwarded to Brian.

error: slurm_receive_msg: Zero Bytes were transmitted or received
We are currently on Slurm 14.11.5 and have been running it with the following power-save settings in slurm.conf:

# POWER SAVE SUPPORT FOR IDLE NODES (optional)
# Script that shuts down the nodes
SuspendProgram=/etc/slurm/suspend.sh
# Script that starts nodes back up
ResumeProgram=/etc/slurm/restart.sh
# Max time it takes a node to shutdown
SuspendTimeout=120
# Max time it takes to restart a node
ResumeTimeout=480
# Number of nodes to restart at one time
ResumeRate=100
# Nodes/partitions excluded from shutdown
SuspendExcNodes=n0[1-3][01-84]
#SuspendExcParts=
# Number of nodes shutdown at once
SuspendRate=50
# Amount of time a node is idle before it is shutdown
SuspendTime=1800

Regards,
Bill
Here is the output from the high debug:

[2015-10-02T12:36:25.874] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:25.874] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:25.884] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:28.877] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:28.877] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:28.887] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:31.881] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:31.881] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:31.891] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:34.884] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:34.884] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:34.894] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:37.888] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:37.888] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:37.898] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:40.891] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:40.891] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:40.901] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:43.251] debug2: Testing job time limits and checkpoints
[2015-10-02T12:36:43.895] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:43.895] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:43.905] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:46.898] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2015-10-02T12:36:46.898] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-02T12:36:46.908] error: slurm_receive_msg: Zero Bytes were transmitted or received
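
In the excerpt above the errors arrive in pairs roughly three seconds apart, which suggests a periodic retry rather than random traffic. A minimal Python sketch for measuring that spacing from a slurmctld log follows. It assumes the timestamp layout shown in this excerpt; the function names and the 0.5 second burst window are illustrative choices, not part of Slurm.

```python
import re
from datetime import datetime

# Matches lines such as "[2015-10-02T12:36:25.874] error: <message>".
LOG_LINE = re.compile(r"^\[(?P<ts>[0-9T:.\-]+)\] (?P<level>\w+): (?P<msg>.*)$")
ZERO_BYTES = "Zero Bytes were transmitted or received"


def zero_byte_times(lines):
    """Return the timestamps of all 'Zero Bytes' error lines."""
    times = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") == "error" and ZERO_BYTES in m.group("msg"):
            times.append(datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"))
    return times


def burst_gaps(times, burst_window=0.5):
    """Seconds between the starts of successive bursts of errors.

    Errors closer than `burst_window` seconds to the start of the
    current burst are counted as part of that burst (assumed value).
    """
    starts = []
    for t in times:
        if not starts or (t - starts[-1]).total_seconds() > burst_window:
            starts.append(t)
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Hypothetical path; point this at the real slurmctld.log.
    with open("slurmctld.log") as fh:
        print(burst_gaps(zero_byte_times(fh)))
```

A steady gap in the output would be consistent with a single queued request being retried on a timer.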
Bill, Will you send the outputs of sinfo and scontrol show nodes? Thanks, Brian
Created attachment 2271 [details]
scntrl_show_nodes.txt

Info requested. The "scontrol show nodes" output is in the attachment due to size limitations.

[root@clnxcat02 slurm]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
short*       up   16:00:00      1  down* n0677
short*       up   16:00:00     44  idle~ n[0548-0584,0601-0603,0671-0674]
short*       up   16:00:00      1   down n0475
short*       up   16:00:00    298  alloc n[0101-0167,0201-0203,0231-0244,0247-0254,0401-0474,0476-0484,0501-0547,0604-0670,0675-0676,0678-0684]
short*       up   16:00:00    160   idle n[0168-0184,0204-0230,0245-0246,0255-0284,0301-0384]
small        up   infinite     25  alloc n[0201-0203,0231-0244,0247-0254]
small        up   infinite    143   idle n[0204-0230,0245-0246,0255-0284,0301-0384]
large        up   infinite      1  down* n0677
large        up   infinite     44  idle~ n[0548-0584,0601-0603,0671-0674]
large        up   infinite      1   down n0475
large        up   infinite    206  alloc n[0401-0474,0476-0484,0501-0547,0604-0670,0675-0676,0678-0684]

Regards,
Bill
It looks like the issue may be with communicating with n0677. One possible cause of the messages is a pending RPC that keeps trying to communicate with n0677 even though the node is in the down state. If you restart the controller, we would expect it to stop trying to communicate with the node, i.e. the restart would clear out the pending RPC. Moe said he would look into it.
The node is down in slurm and now powered off but still reporting the same message. Little nervous to be restarting the scheduler at 2:30 on a Friday.

Regards,
Bill
(In reply to William Schadlich from comment #6)
> The node is down in slurm and now powered off but still reporting the same
> message. Little nervous to be restarting the scheduler at 2:30 on a Friday.

Alternatives are:
1. Power node back up and return to service to clear out the RPC
2. Leave node down and make sure you have sufficient storage for the logs
3. Restart slurmctld

Your call...
Understand, but somewhat harsh especially if say 100 nodes were offline for any particular reason.

Regards,
Bill
I am able to reproduce what looks like the root problem here. I needed to add some well-placed sleep() calls in the code to do that, but it's clearly possible with the following sequence of operations:
1. Some RPC is queued to get sent from the slurmctld daemon to compute nodes. This RPC is of a type that gets requeued if it fails, so as to ensure that it gets sent. There are relatively few RPCs of this type.
2. The compute node goes DOWN.
3. The RPC repeatedly tries getting sent and, upon failing, gets requeued.

The fix I'm working on introduces a check for DOWN nodes in step #3 and avoids requeuing requests for those specific nodes. Until that fix is in place, you'll need to either restart the slurmctld daemon or return the DOWN nodes to service to clear this up.
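
The sequence above, and the effect of checking for DOWN nodes before requeuing, can be illustrated with a small Python simulation. This is only a sketch of the idea and not the Slurm source, which is written in C. The names PendingRpc, run_retry_cycles and skip_down, and the send callback, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingRpc:
    node: str          # destination compute node
    attempts: int = 0  # how many times delivery has been tried


def run_retry_cycles(queue, down_nodes, send, cycles, skip_down=True):
    """Run `cycles` passes over the retry queue and return what is left.

    `send(rpc)` returns True on delivery. With skip_down=False a failed
    RPC is always requeued (the behaviour described in steps 1-3).
    With skip_down=True a failed RPC bound for a DOWN node is dropped.
    """
    for _ in range(cycles):
        retry = deque()
        while queue:
            rpc = queue.popleft()
            rpc.attempts += 1
            if send(rpc):
                continue  # delivered, nothing more to do
            if skip_down and rpc.node in down_nodes:
                continue  # node is DOWN: do not requeue
            retry.append(rpc)  # requeue so it is tried again next pass
        queue.extend(retry)
    return queue


if __name__ == "__main__":
    down = {"n0677"}

    def send(rpc):
        # Assumed behaviour: delivery to a DOWN node always fails.
        return rpc.node not in down

    old = run_retry_cycles(deque([PendingRpc("n0677")]), down, send, 5, skip_down=False)
    new = run_retry_cycles(deque([PendingRpc("n0677")]), down, send, 5, skip_down=True)
    print(len(old), len(new))
```

Without the check, the request for the DOWN node stays in the queue on every pass, and each failed attempt would be one more error in the log. With the check, it is dropped after the first failure.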
You are not likely to see this problem again, but I've added logic to Slurm so that RPCs bound for DOWN nodes are not requeued. You may see the errors for a couple of minutes while retries occur, but they will stop once the node is deemed to be DOWN. Here's the commit: https://github.com/martbhell/slurm/commit/f4ea9dec83e2b6a17be914d4c89cbbaf84168b66 I'm closing this ticket. You can re-open the ticket or open a new one if you see this problem after applying the patch or upgrading to the next release of Slurm with the patch.
Hi,

we are seeing Zero Bytes RPC messages, as well as the accompanying 'invalid type trying to be freed 65534', even though we are running Slurm 16.05.9. These RPCs seem to be targeted at DOWN nodes and get requeued over and over. Has there been a regression? Or are we looking at a different problem?

Details:

Huge numbers of log messages like this in ctld.log:

[2017-05-12T08:35:45.696] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-05-12T08:35:45.706] error: slurm_receive_msg [10.2.5.40:41612]: Zero Bytes were transmitted or received
[2017-05-12T08:35:45.706] error: invalid type trying to be freed 65534

These errors move to the backup controller when the main controller is shut down, and back to the main controller when it comes back up, so Slurm is presumably saving those pending RPC calls to its state files as it goes down.

Tracking down which nodes they are coming from:

perl -nle 'if (m/slurm_receive_msg \[([0-9.:]+)\]/) {print $1}' ctld.log \
  | cut -f1 -d':' | sort -V | uniq -c | sort -k1,1 | tail -10

    138 10.1.22.71
   4517 10.2.4.103
  11037 10.2.4.205
  12957 10.2.4.75
  29451 10.2.13.19
  29455 10.2.13.26
  38316 10.2.8.52
  81765 10.2.1.74
 148092 10.2.5.40
 218913 10.2.4.231

These nodes are either DOWN or were recently DOWN, so it seems to fit the pattern described in this bug of Slurm trying to requeue these RPC calls.

Regards,
Wolfgang
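
The same per-address count can be sketched in Python for anyone who prefers it to the perl pipeline. It assumes the "slurm_receive_msg [IP:port]" layout shown in the log lines above. The function names and the ctld.log path are illustrative.

```python
import re
from collections import Counter

# Captures the IP address from "slurm_receive_msg [10.2.5.40:41612]: ...".
SOURCE = re.compile(r"slurm_receive_msg \[([0-9.]+):[0-9]+\]")


def count_sources(lines):
    """Count slurm_receive_msg errors per source IP address."""
    counts = Counter()
    for line in lines:
        m = SOURCE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


def busiest(counts, n=10):
    """Return the n addresses with the most errors, largest first."""
    return counts.most_common(n)


if __name__ == "__main__":
    # Assumed log path, as in the pipeline above.
    with open("ctld.log") as fh:
        for addr, hits in busiest(count_sources(fh)):
            print(f"{hits:8d} {addr}")
```

Lines without a bracketed address, such as the first error in the sample, are ignored, so the totals cover only the messages that identify their sender.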
Check for out-of-sync clocks! I solved the same problem by restarting the NTP services.

Regards