3973 – slurmd error messages find_node_record and _find_alias_node_record when adding nodes to Slurm

Status:	RESOLVED TIMEDOUT

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmd
Version:	17.02.6
Hardware:	Linux Linux

Importance:	--- 5 - Enhancement
Assignee:	Dominik Bartkiewicz
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2017-07-10 01:16 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified:	2023-03-15 10:27 MDT

See Also:	13805
Site:	DTU Physics


Description Ole.H.Nielsen@fysik.dtu.dk 2017-07-10 01:16:09 MDT

I've added a set of nodes to Slurm (nodes were migrated from our old Torque cluster).  According to https://slurm.schedmd.com/scontrol.html it is required to restart slurmctld:

reconfigure
    Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. ... The slurmctld daemon must be restarted if nodes are added to or removed from the cluster. 

I've updated slurm.conf with the new nodes and distributed the file to all nodes. However, after restarting slurmctld and then doing "scontrol reconfigure", the current compute nodes' slurmd.log have error messages related to the new nodes:

...
[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c129
[2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130
[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c130
[2017-07-10T08:53:43.566] error: WARNING: Invalid hostnames in switch configuration: c[001-130]

When I restart slurmd on the node, the error messages no longer appear in slurmd.log.
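To see which node names a running slurmd no longer recognizes, the lookup-failure lines in slurmd.log can be collected with a short script. The following is only a minimal sketch that assumes the log format shown in the excerpt above; the function name `nodes_with_lookup_failures` and the sample list are illustrative and not part of Slurm.

```python
import re

# Matches slurmd.log lines of the form shown in this report, e.g.
# [2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130
LOOKUP_FAILURE = re.compile(
    r"error: (?:_find_alias_node_record|find_node_record): "
    r"lookup failure for (\S+)"
)


def nodes_with_lookup_failures(log_lines):
    """Return the sorted, de-duplicated node names named in lookup failures."""
    nodes = set()
    for line in log_lines:
        match = LOOKUP_FAILURE.search(line)
        if match:
            nodes.add(match.group(1))
    return sorted(nodes)


# Sample lines copied from the log excerpt in this report.
sample = [
    "[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c129",
    "[2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130",
    "[2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c130",
    "[2017-07-10T08:53:43.566] error: WARNING: Invalid hostnames in switch configuration: c[001-130]",
]

print(nodes_with_lookup_failures(sample))
```

If the reported names are exactly the nodes just added to slurm.conf, that suggests the slurmd on that host is still running with the old node table.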

Question: What is the correct procedure when adding nodes to Slurm?

Comment 2 Dominik Bartkiewicz 2017-07-10 09:33:21 MDT

Hi

I can reproduce this,
and I am working on a fix.

Dominik

Comment 6 Dominik Bartkiewicz 2017-07-14 06:41:41 MDT

Hi
 
We are still working on this.
For now, the safe solution after adding new nodes is to restart slurmctld and slurmd.
I will update the documentation. In the future, in version 17.11, we will try to modify the slurmd code so that a reconfigure will be sufficient.

Dominik
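The workaround in comment 6 (restart slurmctld, then slurmd on the nodes, once the updated slurm.conf is in place everywhere) can be sketched as an ordered list of commands. This is an illustration under stated assumptions, not a procedure taken from the Slurm documentation: it assumes systemd units named slurmctld and slurmd and ssh access from an admin host, and the function name `restart_plan` and the host names are hypothetical. The script only prints the commands and does not run them.

```python
def restart_plan(compute_nodes, controller="localhost"):
    """Return shell commands, in order, for the restart workaround.

    Assumptions (not from the bug report): the daemons run as the systemd
    units "slurmctld" and "slurmd", every host is reachable over ssh, and
    the updated slurm.conf has already been distributed to all hosts.
    """
    # Restart the controller first so it loads the new node definitions.
    plan = [f"ssh {controller} systemctl restart slurmctld"]
    # Then restart slurmd on every compute node, old and new alike.
    plan += [f"ssh {node} systemctl restart slurmd" for node in compute_nodes]
    return plan


# Hypothetical host names, used for illustration only.
for command in restart_plan(["c001", "c002"], controller="ctl01"):
    print(command)
```

Printing the plan instead of executing it leaves room to review the order and to run the commands through whatever parallel shell tool the site already uses.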

Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2017-07-14 07:22:09 MDT

(In reply to Dominik Bartkiewicz from comment #6)
> We are still working on this.
> For now, the safe solution after adding new nodes is to restart slurmctld
> and slurmd.
> I will update the documentation. In the future, in version 17.11, we will
> try to modify the slurmd code so that a reconfigure will be sufficient.

Thanks, I figure it's a tricky problem!  My slurmctld was hanging pretty badly after adding the new nodes, apparently having trouble communicating with the nodes.  So it became a critical issue for us, and I was forced to restart slurmd on all nodes. Now Slurm is stable again!

I strongly agree that the slurmctld documentation quoted above must be changed into "The slurmctld daemon as well as all slurmd daemons must be restarted"...

I'm really pleased that you have decided to try to alleviate this problem in 17.11.  Restarting the slurmd daemons should be avoided if at all possible.

Thanks a lot,
Ole

Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2017-07-24 15:37:39 MDT

I'm out of the office until August 14.

Best regards,
Ole Holm Nielsen

Comment 12 Dominik Bartkiewicz 2017-08-09 10:39:27 MDT

Hi

In commit 71a8c6e76491 we documented that slurmd must be restarted after adding nodes.
I am changing the severity to "Enhancement".

Dominik

Comment 13 Adam Huffman 2017-08-09 10:39:39 MDT

On the afternoon of Wednesday 9th August, I shall be on annual leave. I will deal with any queries upon my return. Best Wishes, Adam

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

Comment 15 Adam Huffman 2017-08-15 07:32:17 MDT

On Tuesday 15th August, I shall be on annual leave. I will deal with any queries upon my return.

Best Wishes,
Adam

Comment 23 Dominik Bartkiewicz 2020-02-04 05:59:06 MST

Hi

I'm closing this enhancement as TIMEDOUT.
Live adding and removing of nodes is still
an open issue, but it is documented.

Dominik