I've added a set of nodes to Slurm (the nodes were migrated from our old Torque cluster). According to https://slurm.schedmd.com/scontrol.html, slurmctld must be restarted:

    reconfigure
        Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. ... The slurmctld daemon must be restarted if nodes are added to or removed from the cluster.

I've updated slurm.conf with the new nodes and distributed the file to all nodes. However, after restarting slurmctld and then running "scontrol reconfigure", the slurmd.log files on the existing compute nodes contain error messages related to the new nodes:

    ...
    [2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c129
    [2017-07-10T08:53:43.565] error: find_node_record: lookup failure for c130
    [2017-07-10T08:53:43.565] error: _find_alias_node_record: lookup failure for c130
    [2017-07-10T08:53:43.566] error: WARNING: Invalid hostnames in switch configuration: c[001-130]

When I restart slurmd on a node, the error messages no longer appear in its slurmd.log.

Question: What is the correct procedure when adding nodes to Slurm?
Hi, I can reproduce this, and I am working on a fix. Dominik
Hi, We are still working on this. For now, the safe solution after adding new nodes is to restart slurmctld and slurmd. I will update the documentation. In the future, in version 17.11, we will try to modify the slurmd code so that a reconfigure will be sufficient. Dominik
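The interim procedure described in this comment can be sketched as a dry-run shell helper that only prints the restart commands in order. This is a rough illustration, not an official SchedMD procedure. It assumes the updated slurm.conf has already been distributed to all nodes, as the reporter did. The use of ssh and systemctl, the unit names slurmctld and slurmd, the function name restart_plan, and the hostnames ctl01, c001 and c002 are all assumptions for illustration and do not come from this ticket.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# Assumptions (not from this ticket): the daemons are systemd units named
# slurmctld and slurmd, and every host is reachable over ssh.
restart_plan() {
    ctl_host="$1"   # hypothetical controller hostname
    shift           # remaining arguments: compute node hostnames
    # Step 1: restart the controller so that it picks up the new nodes.
    echo "ssh $ctl_host systemctl restart slurmctld"
    # Step 2: restart slurmd on every compute node, existing and new.
    for node in "$@"; do
        echo "ssh $node systemctl restart slurmd"
    done
}

# Example with placeholder hostnames.
restart_plan ctl01 c001 c002
```

Printing the commands first makes it possible to review the plan before touching a production cluster. The printed lines can then be run by hand or through whatever parallel shell tool the site already uses.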
(In reply to Dominik Bartkiewicz from comment #6)
> We are still working on this.
> For now safe solution, after adding new nodes, is to restart slurmctld and
> slurmd.
> I will update documentation, in the future in 17.11 version we will try to
> modify slurmd code that reconfigure will be sufficient.

Thanks, I figure it's a tricky problem! My slurmctld was hanging pretty badly after adding the new nodes, apparently having trouble communicating with nodes. So it became a critical issue for us, and I was forced to restart slurmd on all nodes. Now Slurm is stable again!

I strongly agree that the slurmctld documentation quoted above must be changed to "The slurmctld daemon as well as all slurmd daemons must be restarted"...

I'm really pleased that you have decided to try to alleviate this problem in 17.11. Restarting the slurmd daemons should be avoided if possible.

Thanks a lot, Ole
Hi, In commit 71a8c6e76491 we added information about the need to restart slurmd after adding nodes. I am changing the severity to "Enhancement". Dominik
Hi, I'm closing this enhancement as TIMEDOUT. Live adding and removing of nodes is still an open issue, but it is documented. Dominik