Yesterday I turned off config_overrides in order to allow the scheduler to compare node configs in the conf versus the actual config. I was doing this because with the MemSpecLimit feature we don't have to lie about node memory size in the conf any more. When I put this in place the scheduler complained and flagged bad nodes on restart and then went on its merry way. This morning I did a global restart and all the nodes that had mismatches began spewing errors all over the logs. For example:

May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy7c20305: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy2a09303 has low tmp_disk size (51175 < 233849)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy2a09303: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy7c16204 has low real_memory size (176765 < 192892)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy7c16204: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy7c26601 has low real_memory size (192886 < 192892)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy7c26601: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy7c18607 has low real_memory size (192887 < 192892)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy7c18607: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holystat09 has low real_memory size (64182 < 64214)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holystat09: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy7c20603 has low real_memory size (192884 < 192892)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy7c20603: Invalid argument
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy7c22102 has low real_memory size (176765 < 192892)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: _slurm_rpc_node_registration node=holy7c22102: Invalid argument

These are all known bad nodes that we need to fix. We have them flagged as drain or down in the scheduler. Since we won't be able to fix these nodes for a while, and since I don't want the logs filled with error messages we can do nothing about, I'm going to turn config_overrides back on. However, what I would like is to have config_overrides off and have these alerts silenced if the node is set to DRAIN or DOWN in the scheduler. I don't need the logs constantly complaining, as it makes it very hard to track what is going on when there are real problems. A single compromised node shouldn't spew all over the logs like this.

So in summary: could the alert frequency for this be dialed back, or even better, silenced if a node is set to DRAIN or DOWN? That way we still know the node is bad and we don't have the logs complaining. In addition, could Slurm set the node to drain or down with a more descriptive reason? Currently it just says "Low TmpDisk" but not the other info the alert has, namely the amount it's low by. That's useful info, as it tells me whether I have a dead disk or a dead DIMM.

Until the alerting for this is dialed back I'm going to have to run with config_overrides on, as the log is basically unreadable at this point, and we will always have busted nodes in our conf. So I can't really clean it of bad nodes, as we would have nodes going in and out all the time unnecessarily.
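As a stopgap, the "amount it's low by" info can be recovered from the log lines themselves and folded into a drain reason by hand. A rough sketch (not anything Slurm does itself): the sample input lines below are copied from the log excerpt above, and the generated scontrol commands are only printed, not executed.

```shell
# Sketch: parse "has low real_memory/tmp_disk" slurmctld errors and emit
# scontrol drain commands whose Reason records the actual deficit.
# In practice you would feed the real slurmctld log file (path varies per site)
# instead of the heredoc sample, and pipe the output to sh to run it.
out=$(awk '/has low/ {
    for (i = 1; i <= NF; i++) if ($i == "Node") node = $(i + 1)
    what     = $(NF - 4)                # real_memory or tmp_disk
    actual   = $(NF - 2); expected = $NF
    gsub(/[()]/, "", actual); gsub(/[()]/, "", expected)
    printf "scontrol update NodeName=%s State=DRAIN Reason=\"low %s: %s < %s (short %d)\"\n",
           node, what, actual, expected, expected - actual
}' <<'EOF'
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy2a09303 has low tmp_disk size (51175 < 233849)
May 12 09:46:36 holy-slurm02 slurmctld[10902]: error: Node holy7c16204 has low real_memory size (176765 < 192892)
EOF
)
printf '%s\n' "$out"
```

This only automates what the admin would type anyway; `scontrol update NodeName=... State=DRAIN Reason="..."` is the standard way to drain a node with a reason string.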
Paul,

You can set the node state to "future", which will stop ping/registration RPCs from being executed for those nodes. As a side effect, those nodes will also be hidden in the output of commands like scontrol show job/sinfo, which is generally desirable for resources that won't be available soon.

> That's useful info as it tells me if I have a dead disk or dead DIMM.

I see your point, but checking whether the "DIMM is dead" vs. it was removed on purpose and the config wasn't updated appropriately may just be an interpretation of the fact that the node has less memory than configured?

cheers,
Marcin
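For reference, the state Marcin suggests can be set statically in the node definition. A sketch of such a slurm.conf fragment, using one of the node names from the log excerpt above (the CPU/memory/disk specs here are illustrative placeholders, not the site's real values):

```conf
# slurm.conf fragment (sketch): park a known-broken node so slurmctld
# stops expecting registrations from it. Specs are illustrative only.
NodeName=holy7c16204 CPUs=48 RealMemory=192892 TmpDisk=233849 State=FUTURE
```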
Paul,

Does the suggested solution work for you?

cheers,
Marcin
Sorry, I was out on vacation, just got back.

This solution feels like a kludge to me. If we have a node that has a DIMM that fails and deactivates while the node is up, it should go into DRAIN state and then stay there as jobs exit. A more natural solution to me would be to have the alert silenced if the node is in DRAIN or DOWN state, as obviously a problem is known. While we could go with FUTURE, it just doesn't feel like the right solution to this issue. Also, would FUTURE survive cluster restarts?

Beyond that, it would hide the node from the tools we normally use for monitoring the cluster, as we lean on scontrol show node and others to get info on node state and the reasons nodes are down. From an admin point of view it's good to have the nodes there but down, so we can see the reason. It's also good for the users, as they may have an impacted node in their partition and they will want to know why it is down.

-Paul Edmon-
Paul,

> A more natural solution to me would be to have the alert silenced if the node is in DRAIN or DOWN state as obviously a problem is known.

After an internal discussion we decided to go in that direction for 20.11. I'll keep you posted on the progress.

> Also would FUTURE survive cluster restarts?

Yes - you can just set the nodes to FUTURE in slurm.conf.

cheers,
Marcin
Excellent, good to hear. I look forward to 20.11.

-Paul Edmon-
Paul,

The log level for the repeated error messages was demoted to debug[1] on our master/development branch. The change will take effect with the Slurm 20.11 major release.

cheers,
Marcin

[1] commit 0f4985219611625c769e4883c770e78f4e644dab (HEAD -> master, origin/master, origin/HEAD, bug9035)
Author:     Marcin Stolarek <cinek@schedmd.com>
AuthorDate: Thu May 21 09:44:37 2020 +0000
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Oct 22 14:27:56 2020 -0600

    Demote error messages in validate_node_specs to debug

    Instead of throwing an error on every node registration, do that only
    when draining a node.
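One operational note, based on my reading of the change (worth verifying on 20.11): since the messages are now logged at debug level, they will only appear if slurmctld's log level is raised above the default. For example:

```conf
# slurm.conf fragment (sketch): raise slurmctld verbosity so the demoted
# registration-mismatch messages are still visible when you want them.
# SlurmctldDebug defaults to info; "debug" is one level more verbose.
SlurmctldDebug=debug
```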
Awesome, thank you.

-Paul Edmon-

On 10/23/2020 3:02 AM, bugs@schedmd.com wrote:
> Marcin Stolarek <cinek@schedmd.com> changed bug 9035:
>   Status:        OPEN -> RESOLVED
>   Resolution:    --- -> FIXED
>   Version Fixed: 20.11pre1