Summary: | Dialing back node registration alerts | ||
---|---|---|---|
Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | cinek, mdidomenico, tim |
Version: | 20.11.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9949 | ||
Site: | Harvard University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | | CLE Version: | |
Version Fixed: | 20.11pre1 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Paul Edmon 2020-05-12 07:52:01 MDT
Paul,
You can set the node state to FUTURE, which will stop ping/registration RPCs from being executed for those nodes.
As a side effect, those nodes will also be hidden in the output of commands like scontrol show job/sinfo, which is generally desirable for resources that won't be available soon.
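For illustration, a minimal slurm.conf sketch of this approach; the node names, CPU count, and memory size below are hypothetical:

    # Hypothetical slurm.conf fragment: nodes defined with State=FUTURE are
    # skipped by ping/registration handling and hidden from normal command
    # output until their state is changed back in the configuration.
    NodeName=node[01-04] CPUs=32 RealMemory=192000 State=FUTURE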
> That's useful info as it tells me if I have a dead disk or dead DIMM.
I see your point, but whether the DIMM is dead or was removed on purpose without the config being updated may just be a matter of interpreting the same fact: the node has less memory than configured.
cheers,
Marcin
Paul,

Does the suggested solution work for you?

cheers,
Marcin

Sorry, I was out on vacation. Just got back.

This solution feels like a kludge to me. If we have a node that has a DIMM that fails and deactivates while open, it should go into DRAIN state and then stay there as jobs exit. A more natural solution to me would be to have the alert silenced if the node is in DRAIN or DOWN state, as obviously a problem is known. While we could go with FUTURE, it just doesn't feel like the right solution to this issue. Also, would FUTURE survive cluster restarts?

Beyond that, it would hide the node from the tools we normally use for monitoring the cluster, as we lean on scontrol show node and others to get info on node state and the reasons why nodes are down. From an admin point of view it's good to have the nodes there but down so we can see the reason. It's also good for the users, as they may have an impacted node in their partition and will want to know why it is down.

-Paul Edmon-

Paul,

> A more natural solution to me would be to have the alert silenced if the node is in DRAIN or DOWN state as obviously a problem is known.
After an internal discussion we decided to go in that direction for 20.11. I'll keep you posted on the progress.

> Also would FUTURE survive cluster restarts?
Yes - you can just set the nodes to FUTURE in slurm.conf.

cheers,
Marcin

Excellent, good to hear. I look forward to 20.11.

-Paul Edmon-

Paul,

The log level for repeated error messages was demoted to debug[1] on our master/development branch. The change will take effect with the Slurm 20.11 major release.

cheers,
Marcin

[1] commit 0f4985219611625c769e4883c770e78f4e644dab (HEAD -> master, origin/master, origin/HEAD, bug9035)
    Author:     Marcin Stolarek <cinek@schedmd.com>
    AuthorDate: Thu May 21 09:44:37 2020 +0000
    Commit:     Brian Christiansen <brian@schedmd.com>
    CommitDate: Thu Oct 22 14:27:56 2020 -0600

        Demote error messages in validate_node_specs to debug

        Instead of throwing an error on every node registration, do that only
        when draining a node.
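For illustration only, a standalone C sketch (not the actual Slurm source) of the behavior the commit describes: the registration that drains the node still logs at error level, while repeated registrations of an already-draining node log at debug level. The function name, node name, and memory figures are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Report a node registering with less memory than the configuration
     * expects. Only the first detection (the one that drains the node) is
     * logged as an error; later registrations of the already-draining node
     * are logged at debug level only. */
    static void report_low_memory(const char *node, uint64_t reported_mb,
                                  uint64_t configured_mb, bool already_draining)
    {
        const char *level = already_draining ? "debug" : "error";
        fprintf(stderr, "%s: Node %s has low real_memory size (%llu < %llu)\n",
                level, node,
                (unsigned long long)reported_mb,
                (unsigned long long)configured_mb);
    }

    int main(void)
    {
        /* A node with a failed DIMM keeps re-registering with less memory
         * than configured; only the first registration raises an error. */
        report_low_memory("node01", 190000, 192000, false); /* drains -> error */
        report_low_memory("node01", 190000, 192000, true);  /* repeat -> debug */
        return 0;
    }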
Awesome. Thank you.

-Paul Edmon-