Bug 14011 - Nodes flapping between available and not responding after update.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.8
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-05-05 13:04 MDT by Michael Robbert
Modified: 2022-05-12 10:57 MDT
CC List: 1 user

See Also:
Site: Colorado School of Mines
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurmctld log (5.36 MB, text/plain)
2022-05-05 13:28 MDT, Michael Robbert
slurmd log from node running old version 21.08.6 (5.09 MB, text/plain)
2022-05-05 13:30 MDT, Michael Robbert
slurmd from node running new version 21.08.8 (17.96 KB, text/plain)
2022-05-05 13:31 MDT, Michael Robbert
slurmd with errors Debugflags=network active (3.40 MB, text/plain)
2022-05-06 09:11 MDT, Michael Robbert
Output of "scontrol show nodes" (186.95 KB, text/plain)
2022-05-06 10:04 MDT, Michael Robbert
slurmctld logs from today. Debugflags=network enabled at 08:40:56 (1.38 MB, text/plain)
2022-05-06 10:07 MDT, Michael Robbert

Description Michael Robbert 2022-05-05 13:04:02 MDT
We updated slurmctld and slurmdbd from 21.08.6 to 21.08.8 yesterday and have been updating slurmd on the compute nodes on a rolling basis, as they become idle after being set to a drain state.
Since we started the rolling update we've been seeing large blocks of nodes showing up as "not responding", but then coming back online ("now responding") after a while.
I saw the note in the email announcement about a fix to communications between patched and unpatched slurmd processes. We did not get any pre-release patches and are using only the official releases, so I thought this method of updating would be fine, but it looks like there are still problems with it.

Mike Robbert
Comment 1 Jason Booth 2022-05-05 13:18:33 MDT
Would you please attach the slurmctld.log and slurmd.log from one of the nodes? Also, let us know if you see errors like the following:

> RPC:REQUEST_PING : Protocol authentication error
Comment 2 Michael Robbert 2022-05-05 13:28:10 MDT
Created attachment 24857 [details]
Slurmctld log
Comment 3 Jason Booth 2022-05-05 13:29:06 MDT
Please also note that slurmds on an older version will run into this issue. Mixing a new slurmctld with older slurmds will cause those nodes to hit the Protocol authentication error. The new slurmd should work just fine. Please let us know if the new slurmds have communication issues.
Comment 4 Michael Robbert 2022-05-05 13:30:31 MDT
Created attachment 24858 [details]
slurmd log from node running old version 21.08.6
Comment 5 Michael Robbert 2022-05-05 13:31:00 MDT
Created attachment 24859 [details]
slurmd from node running new version 21.08.8
Comment 6 Michael Robbert 2022-05-05 13:34:23 MDT
We are not seeing RPC:REQUEST_PING messages, but we are seeing "Protocol authentication error" messages like this:

[2022-05-05T13:19:28.606] error: slurm_receive_msgs: [[c004]:6818] failed: Protocol authentication error

I thought that updates within a major release were supposed to be protocol compatible, so that 21.08.6 and 21.08.8 would work fine together and we could do a rolling update. Is that never the case, or does it break with this particular update?

Mike
Comment 7 Jason Booth 2022-05-05 13:41:25 MDT
We have found the issue, and it relates to mixing fixed and unfixed 21.08 installations. Currently, this is broken. We do apologize. I will have Chad follow up with you shortly (in the next hour or two) with a patch and/or a link to 21.08.8-2.
Comment 8 Chad Vizino 2022-05-05 14:44:20 MDT
(In reply to Jason Booth from comment #7)
> We have found the issue, and it relates to mixed fixed+unfixed 21.08.
> Currently, this is broken. We do apologize. I will have Chad follow up with
> you shortly (in the next hour or two) with a patch and / or  link to
> 21.08.8-2.
Please see the related note just posted to slurm-announce and slurm-users. You can download the fixed release from https://www.schedmd.com/downloads.php.
Comment 9 Michael Robbert 2022-05-05 17:01:06 MDT
I wanted to test the patch, so I downloaded 21.08.8-2, compiled the RPMs, and got them installed on all the nodes that I had already updated with 21.08.8-1, and I am still seeing the same problem.

Is this due to jobs running that were started under 21.08.8-1 or is there maybe still a problem with the patch?
Comment 10 Nate Rini 2022-05-05 18:48:03 MDT
(In reply to Michael Robbert from comment #9)
> Is this due to jobs running that were started under 21.08.8-1 or is there
> maybe still a problem with the patch?

We are actively working to determine the answer to that question.

Is it possible to set DebugFlags=network on the controller and node052? A restart of the slurmd will be required. `scontrol setdebugflags +network` can activate the logging. Have it log until the 'RPC:REQUEST_PING : Protocol authentication error' errors are logged.
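For reference, a rough sketch of the sequence (assuming slurmd runs under a systemd unit named slurmd and that DebugFlags is added to the node's slurm.conf; adjust to your setup):
> scontrol setdebugflags +network      # on the controller
> # on node052: add DebugFlags=network to slurm.conf, then restart slurmd:
> systemctl restart slurmd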
Comment 11 Nate Rini 2022-05-05 18:50:17 MDT
Please also provide the logs from slurmd on node501, node505, and node602, going back to at least the start of May.
Comment 12 Michael Robbert 2022-05-05 22:26:15 MDT
I ran `scontrol setdebugflags +network` on the controller. I also tried running it on a compute node. I also tried adding the line "DebugFlags=Network" to slurm.conf on a few compute nodes and then restarted the slurmd there. None of that changed the messages that I'm seeing in the logs.
Further, we do not have nodes named "node501, node505, node602". The nodes in that partition whose names start with node* run from 001 through 140.

Should I proceed to update the rest of the nodes in the cluster to see if that clears it up?

Mike
Comment 13 Nate Rini 2022-05-05 23:08:05 MDT
Please attach the logs from the nodes with the network flag active. It is not ideal to leave it active on the controller for long periods.
Comment 14 Michael Robbert 2022-05-06 08:55:47 MDT
Here are the logs from node052 running 21.08.6 with Debugflags=Network active. You can see the restart where I activated it and the restart where I deactivated it. During this time the controller showed several nodes that went offline and online again, but this node logged nothing.

[2022-05-06T01:40:09.033] [11591547.extern] done with job
[2022-05-06T08:42:48.988] Slurmd shutdown completing
[2022-05-06T08:42:49.119] slurmd version 21.08.6 started
[2022-05-06T08:42:49.120] slurmd started on Fri, 06 May 2022 08:42:49 -0600
[2022-05-06T08:42:49.752] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=64400 TmpDisk=32200 Uptime=3492078 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-05-06T08:49:25.697] Slurmd shutdown completing
[2022-05-06T08:49:25.811] slurmd version 21.08.6 started
[2022-05-06T08:49:25.812] slurmd started on Fri, 06 May 2022 08:49:25 -0600
[2022-05-06T08:49:26.444] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=64400 TmpDisk=32200 Uptime=3492474 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

When you asked for node052 I wasn't sure whether you might have meant c052, so I did the same thing there and got the same results. That node is running the latest code, 21.08.8-2:

[2022-05-06T07:42:35.775] [11600775.0] task/cgroup: _memcg_initialize: job: alloc=187740MB mem.limit=187740MB memsw.limit=unlimited
[2022-05-06T07:42:35.775] [11600775.0] task/cgroup: _memcg_initialize: step: alloc=187740MB mem.limit=187740MB memsw.limit=unlimited
[2022-05-06T08:44:45.768] Slurmd shutdown completing
[2022-05-06T08:44:45.877] slurmd version 21.08.8-2 started
[2022-05-06T08:44:45.877] slurmd started on Fri, 06 May 2022 08:44:45 -0600
[2022-05-06T08:44:47.153] CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=385414 TmpDisk=9990 Uptime=3492448 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-05-06T08:48:54.091] Slurmd shutdown completing
[2022-05-06T08:48:54.209] slurmd version 21.08.8-2 started
[2022-05-06T08:48:54.210] slurmd started on Fri, 06 May 2022 08:48:54 -0600
[2022-05-06T08:48:55.486] CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=385414 TmpDisk=9990 Uptime=3492696 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

It does appear that neither of these nodes is showing up in the list of nodes that are flapping up and down. I'll now try the same thing on one of the nodes that is flapping.
Comment 15 Michael Robbert 2022-05-06 09:11:53 MDT
Created attachment 24883 [details]
slurmd with errors Debugflags=network active

I am not seeing any logging from node125, which is flapping, even with Debugflags=network active. I am attaching the slurmd.log from a node that is throwing errors but isn't flapping. I activated and then deactivated the network debug flag for several minutes this morning, but I didn't see much difference in the output. That output is in the attachment.

I feel like the network debug flag isn't doing what it should.

Mike
Comment 16 Nate Rini 2022-05-06 10:01:00 MDT
Please also provide the output of:
> scontrol show nodes
Comment 17 Nate Rini 2022-05-06 10:02:57 MDT
(In reply to Michael Robbert from comment #15)
> I feel like the network debug flag isn't doing what it should.

Do you have the logs from slurmctld for this time period too, with the network flag active?

Note that this timestamp is just zero (the Unix epoch):
> ENCODED: Wed Dec 31 17:00:00 1969

It could be that munge is very unhappy and is refusing to sign or verify anything on this node. Can you please also restart munged on the nodes with this error?
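A minimal sketch of that step, assuming munged runs under a systemd unit named munge (adjust to your distribution):
> systemctl restart munge
> munge -n | unmunge    # quick check that the local daemon can sign and decode a credential again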
Comment 18 Michael Robbert 2022-05-06 10:04:57 MDT
Created attachment 24884 [details]
Output of "scontrol show nodes"
Comment 19 Michael Robbert 2022-05-06 10:07:07 MDT
Created attachment 24885 [details]
slurmctld logs from today. Debugflags=network enabled at 08:40:56
Comment 22 Nate Rini 2022-05-06 10:10:56 MDT
(In reply to Michael Robbert from comment #19)
> Created attachment 24885 [details]
> slurmctld logs from today. Debugflags=network enabled at 08:40:56

It looks like the debug flag was not enabled. There should be a ton of NET: logs, but I don't see even a single one.

Please also set the log level to at least debug on the controller:
> scontrol setdebug debug
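Both can be reverted later once we have the logs; a sketch (assuming info is your normal log level):
> scontrol setdebug info
> scontrol setdebugflags -network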
Comment 23 Nate Rini 2022-05-06 10:11:40 MDT
Please also note that a work-around for the issue is to set:
> SlurmdTimeout=0

Note that a restart of all daemons is required for it to be effective.
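As an illustration, a sketch of applying the work-around (assumes a shared slurm.conf and systemd-managed daemons; SlurmdTimeout=0 stops slurmctld from marking unresponsive nodes DOWN, so remember to revert it afterwards):
> # slurm.conf (temporary)
> SlurmdTimeout=0
> systemctl restart slurmctld    # on the controller
> systemctl restart slurmd       # on every compute node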
Comment 24 Michael Robbert 2022-05-06 10:19:00 MDT
Setting the log level to debug has enabled the NET: messages. On the compute nodes, setting SlurmdDebug to debug has gotten the messages to show up there as well.

Now that we have that debugging working, is there a particular node that you'd like to see network debugging from? node052 is what you asked for last night.

I only found the munge errors on a small handful of nodes. Restarting munge on those nodes didn't help. I double-checked the md5sum of our munge.key and verified that the time on those nodes is in sync with the rest of the cluster.
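For reference, the checks were roughly along these lines (a sketch; assumes the default key path /etc/munge/munge.key and that timedatectl is available for the clock check):
> md5sum /etc/munge/munge.key    # compare the hash against the controller's copy
> timedatectl                    # confirm the system clock is synchronized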
Comment 25 Nate Rini 2022-05-06 10:27:59 MDT
(In reply to Michael Robbert from comment #24)
> Setting the log level to debug has enabled NET: messages. On the compute
> nodes setting slurmddebug to debug has gotten the messages to show up there
> as well. 
Great

> Now that we have that debugging working is there a particular node that
> you'd like to see network deubugging from? node052 is what you asked for
> last night.
I only need it from one of the flapping nodes.

> I only found the munge errors on a small handful of nodes. Restarting munge
> on those nodes didn't help. I double checked the md5sum of our munge.key and
> the verified that the time on those nodes is in sync with the rest of the
> cluster.
How many threads is munged running with on those nodes?
Comment 26 Michael Robbert 2022-05-06 10:36:58 MDT
It doesn't look like we are specifying a number of threads for munged, so I am assuming it is the default. Is there a way to query it to know for sure?
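For what it's worth, one way I could probably check, as a sketch (assumes munged is already running; NLWP is the thread count as reported by ps):
> ps -o nlwp= -p $(pidof munged)    # number of threads
> ps -o args= -p $(pidof munged)    # how the daemon was started, including any --num-threads argument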

I did find a total of 5 nodes that were logging munge errors. All of those nodes were still running slurmd 21.08.6, so I tried updating them to 21.08.8-2, and the munge errors stopped after that update.

It looks like the node availability stopped flapping too.

I can't think of anything special about these 5 nodes. There were 3 from the compute partition and 2 from the aun partition, so they were two different node types. Other nodes that didn't get the update aren't having the same problem.

Any thoughts? 
Should I go ahead and continue with updating the rest of the nodes?
Comment 27 Michael Robbert 2022-05-06 10:41:44 MDT
In case it helps, these were the only nodes with the Munge errors:
c[001,022,061],node[037,109]
Comment 28 Nate Rini 2022-05-06 10:47:08 MDT
(In reply to Michael Robbert from comment #26)
> It doesn't look like we are specifying a number of threads for munged so I
> am assuming it is the default. Is there a way to query to know for sure?

Then I suggest adding `--num-threads=10` as an argument to munged.
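A sketch of one way to apply that, assuming a RHEL-style layout where the munge systemd unit reads an OPTIONS variable from /etc/sysconfig/munge (the path and variable name may differ on your distribution):
> # /etc/sysconfig/munge
> OPTIONS="--num-threads=10"
> systemctl restart munge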

> I did find a total of 5 nodes that were logging munge errors. All of those
> nodes were still running slurmd 21.08.6 so I tried updating them to
> 21.08.8-2 and the munge errors stopped after that update.
> 
> It looks like the node availability stopped flapping too.
Good.
 
> I can't think of anything special about these 5 nodes. There were 3 from the
> compute partition and 2 from the aun partition so it was 2 different node
> types. Other nodes that didn't get the update aren't having the same
> problem. 

The bug in 21.08.8-1 is a cross-compatibility issue, so it could simply be that these nodes had jobs running long enough to span all of the versions. Since the bug is fixed, bringing all nodes to a uniform version should be sufficient to get rid of the node flapping. If not, please provide more logs.

> Should I go ahead and continue with updating the rest of the nodes?

Yes; we have not seen this issue on clusters where all daemons have been updated and restarted.
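If it helps, one way to confirm that every node reports the same slurmd version after the rollout, as a sketch (relies on the Version field in the scontrol node records):
> scontrol show nodes | grep -o "Version=[^ ]*" | sort | uniq -c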
Comment 29 Nate Rini 2022-05-12 10:16:32 MDT
(In reply to Michael Robbert from comment #26)
> Should I go ahead and continue with updating the rest of the nodes?

Any updates? The problem should have been resolved by updating all nodes.
Comment 30 Michael Robbert 2022-05-12 10:19:11 MDT
Sorry, I thought I had updated the ticket. When we updated all the nodes, the problem went away. We are all good now!
Comment 31 Nate Rini 2022-05-12 10:57:59 MDT
Great. Closing out ticket.