Bug 3973

Summary: slurmd error messages find_node_record and _find_alias_node_record when adding nodes to Slurm
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmd Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: adam.huffman
Version: 17.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=13805
Site: DTU Physics Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Ole.H.Nielsen@fysik.dtu.dk 2017-07-10 01:16:09 MDT
I've added a set of nodes to Slurm (nodes were migrated from our old Torque cluster).  According to https://slurm.schedmd.com/scontrol.html it is required to restart slurmctld:

reconfigure
    Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. ... The slurmctld daemon must be restarted if nodes are added to or removed from the cluster. 

I've updated slurm.conf with the new nodes and distributed the file to all nodes. However, after restarting slurmctld and then running "scontrol reconfigure", the slurmd.log files on the existing compute nodes contain error messages related to the new nodes:

...
[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c129
[2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130
[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c130
[2017-07-10T08:53:43.566] error: WARNING: Invalid hostnames in switch configuration: c[001-130]

When I restart slurmd on the node, the error messages no longer appear in slurmd.log.
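
To check which node names a running slurmd fails to resolve, the log lines quoted above can be scanned for the two lookup-failure messages. This is only a minimal sketch: it assumes the log format is exactly as in the excerpt, and the function name `failed_lookups` is made up for illustration.

```python
import re

# Matches lines such as:
# [2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130
LOOKUP_FAILURE = re.compile(
    r"error: (_find_alias_node_record|find_node_record): lookup failure for (\S+)"
)


def failed_lookups(log_lines):
    """Return the sorted, de-duplicated node names that had lookup failures."""
    nodes = set()
    for line in log_lines:
        match = LOOKUP_FAILURE.search(line)
        if match:
            nodes.add(match.group(2))
    return sorted(nodes)


# Sample lines copied from the slurmd.log excerpt in this report.
sample = [
    "[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c129",
    "[2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130",
    "[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c130",
    "[2017-07-10T08:53:43.566] error: WARNING: Invalid hostnames in switch configuration: c[001-130]",
]

print(failed_lookups(sample))
```

An empty result after restarting slurmd on a node would match the observation that the messages stop appearing.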

Question: What is the correct procedure when adding nodes to Slurm?
Comment 2 Dominik Bartkiewicz 2017-07-10 09:33:21 MDT
Hi

I can reproduce this,
and I am working on a fix.

Dominik
Comment 6 Dominik Bartkiewicz 2017-07-14 06:41:41 MDT
Hi
 
We are still working on this.
For now, the safe solution after adding new nodes is to restart slurmctld and slurmd.
I will update the documentation. In the future, in version 17.11, we will try to modify the slurmd code so that reconfigure will be sufficient.

Dominik
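
The workaround described in this comment (restart slurmctld and slurmd after adding nodes) could be scripted roughly as follows. This is a sketch under several assumptions that are not stated in this report: the daemons run as systemd units named `slurmctld` and `slurmd`, the script runs on the controller host with ssh access to the compute nodes, the updated slurm.conf has already been distributed, and the controller is restarted before the compute nodes. The helper names and the node names `c001` and `c002` are hypothetical.

```python
import subprocess


def restart_plan(compute_nodes):
    """Build the command list: slurmctld first, then slurmd on each node.

    The controller-first ordering is an assumption of this sketch.
    """
    plan = [["systemctl", "restart", "slurmctld"]]
    for node in compute_nodes:
        plan.append(["ssh", node, "systemctl", "restart", "slurmd"])
    return plan


def run(plan, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


# Dry run only: prints the commands without touching any daemon.
run(restart_plan(["c001", "c002"]))
```

Keeping the dry run as the default makes it possible to review the commands before any daemon is restarted.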
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2017-07-14 07:22:09 MDT
(In reply to Dominik Bartkiewicz from comment #6)
> We are still working on this.
> For now, the safe solution after adding new nodes is to restart slurmctld
> and slurmd.
> I will update the documentation. In the future, in version 17.11, we will
> try to modify the slurmd code so that reconfigure will be sufficient.

Thanks, I figure it's a tricky problem!  My slurmctld was hanging pretty badly after adding the new nodes, apparently having trouble communicating with nodes.  So it became a critical issue for us, and I was forced to restart slurmd on all nodes. Now Slurm is stable again!

I strongly agree that the slurmctld documentation quoted above must be changed into "The slurmctld daemon as well as all slurmd daemons must be restarted"...

I'm really pleased that you have decided to try to alleviate this problem in 17.11.  Restarting the slurmd daemons should be avoided if possible.

Thanks a lot,
Ole
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2017-07-24 15:37:39 MDT
I'm out of the office until August 14.

Best regards,
Ole Holm Nielsen
Comment 12 Dominik Bartkiewicz 2017-08-09 10:39:27 MDT
Hi

In commit 71a8c6e76491 we documented that slurmd must be restarted after adding nodes.
I am changing the severity to "Enhancement".

Dominik
Comment 13 Adam Huffman 2017-08-09 10:39:39 MDT
On the afternoon of Wednesday 9th August, I shall be on annual leave. I will deal with any queries upon my return. Best Wishes, Adam

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT
Comment 15 Adam Huffman 2017-08-15 07:32:17 MDT
On Tuesday 15th August, I shall be on annual leave. I will deal with any queries upon my return.

Best Wishes,
Adam

Comment 23 Dominik Bartkiewicz 2020-02-04 05:59:06 MST
Hi

I'm closing this enhancement as TIMEDOUT.
Live adding and removing of nodes is still
an open issue, but it is documented.

Dominik