Summary: | Dialing back node registration alerts | ||
---|---|---|---|
Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | cinek, mdidomenico, tim |
Version: | 20.11.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9949 | ||
Site: | Harvard University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | | CLE Version: | |
Version Fixed: | 20.11pre1 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Paul Edmon 2020-05-12 07:52:01 MDT
Paul,
You can set the node state to FUTURE, which will stop ping/registration RPCs from being executed for those nodes.
As a side effect, those nodes will also be hidden in the output of commands like scontrol show job/sinfo, which is generally desirable for resources that won't be available soon.
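For illustration, a minimal slurm.conf sketch of this approach; the node names, CPU count, and memory size below are hypothetical:

    # Hypothetical slurm.conf fragment: nodes defined with State=FUTURE are
    # skipped by ping/registration handling and hidden from normal command
    # output until their state is changed back in the configuration.
    NodeName=node[01-04] CPUs=32 RealMemory=192000 State=FUTURE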
> That's useful info as it tells me if I have a dead disk or dead DIMM.
I see your point, but whether the DIMM is dead or was removed on purpose without the config being updated may just be a matter of interpreting the same fact: the node has less memory than configured.
cheers,
Marcin
Paul,

Does the suggested solution work for you?

cheers,
Marcin

Sorry, I was out on vacation. Just got back.

This solution feels like a kludge to me. If we have a node that has a DIMM that fails and deactivates while open, it should go into DRAIN state and then stay there as jobs exit. A more natural solution to me would be to have the alert silenced if the node is in DRAIN or DOWN state, as obviously a problem is known. While we could go with FUTURE, it just doesn't feel like the right solution to this issue. Also, would FUTURE survive cluster restarts?

Beyond that, it would hide the node from the tools we normally use for monitoring the cluster, as we lean on scontrol show node and others to get info on node state and the reasons why nodes are down. From an admin point of view it's good to have the nodes there but down so we can see the reason. It's also good for the users, as they may have an impacted node in their partition and will want to know why it is down.

-Paul Edmon-

Paul,

> A more natural solution to me would be to have the alert silenced if the node is in DRAIN or DOWN state as obviously a problem is known.
After an internal discussion we decided to go in that direction for 20.11. I'll keep you posted on the progress.

> Also would FUTURE survive cluster restarts?
Yes - you can just set the nodes to FUTURE in slurm.conf.

cheers,
Marcin

Excellent, good to hear. I look forward to 20.11.

-Paul Edmon-

Paul,

The log level for repeated error messages was demoted to debug[1] on our master/development branch. The change will take effect with the Slurm 20.11 major release.

cheers,
Marcin

[1] commit 0f4985219611625c769e4883c770e78f4e644dab (HEAD -> master, origin/master, origin/HEAD, bug9035)
    Author:     Marcin Stolarek <cinek@schedmd.com>
    AuthorDate: Thu May 21 09:44:37 2020 +0000
    Commit:     Brian Christiansen <brian@schedmd.com>
    CommitDate: Thu Oct 22 14:27:56 2020 -0600

        Demote error messages in validate_node_specs to debug

        Instead of throwing an error on every node registration, do that only
        when draining a node.
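For illustration only, a standalone C sketch (not the actual Slurm source) of the behavior the commit describes: the registration that drains the node still logs at error level, while repeated registrations of an already-draining node log at debug level. The function name, node name, and memory figures are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Report a node registering with less memory than the configuration
     * expects. Only the first detection (the one that drains the node) is
     * logged as an error; later registrations of the already-draining node
     * are logged at debug level only. */
    static void report_low_memory(const char *node, uint64_t reported_mb,
                                  uint64_t configured_mb, bool already_draining)
    {
        const char *level = already_draining ? "debug" : "error";
        fprintf(stderr, "%s: Node %s has low real_memory size (%llu < %llu)\n",
                level, node,
                (unsigned long long)reported_mb,
                (unsigned long long)configured_mb);
    }

    int main(void)
    {
        /* A node with a failed DIMM keeps re-registering with less memory
         * than configured; only the first registration raises an error. */
        report_low_memory("node01", 190000, 192000, false); /* drains -> error */
        report_low_memory("node01", 190000, 192000, true);  /* repeat -> debug */
        return 0;
    }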
Awesome. Thank you.

-Paul Edmon-