Summary: | slurmd error messages find_node_record and _find_alias_node_record when adding nodes to Slurm | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED TIMEDOUT | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | adam.huffman |
Version: | 17.02.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13805 | ||
Site: | DTU Physics | Alineos Sites: | --- |
Bull/Atos Sites: | --- | Confidential Site: | --- |
Cray Sites: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2017-07-10 01:16:09 MDT
Hi,
I can reproduce this, and I am working on a fix.
Dominik

Hi,
We are still working on this.
For now, the safe solution after adding new nodes is to restart slurmctld and slurmd.
I will update the documentation; in the future, in version 17.11, we will try to modify the slurmd code so that a reconfigure will be sufficient.
Dominik

(In reply to Dominik Bartkiewicz from comment #6)
> We are still working on this.
> For now, the safe solution after adding new nodes is to restart slurmctld and
> slurmd.
> I will update the documentation; in the future, in version 17.11, we will try to
> modify the slurmd code so that a reconfigure will be sufficient.

Thanks, I figure it's a tricky problem! My slurmctld was hanging pretty badly after adding new nodes, apparently having trouble communicating with the nodes. So it became a critical issue for us, and I was forced to restart slurmd on all nodes. Now Slurm is stable again!

I strongly agree that the slurmctld documentation quoted above must be changed to "The slurmctld daemon as well as all slurmd daemons must be restarted"...

I'm really pleased that you have decided to try to alleviate this problem in 17.11. Restarting the slurmd daemons should be avoided if at all possible.

Thanks a lot,
Ole

I'm out of the office until August 14.
Best regards,
Ole Holm Nielsen

Hi,
In commit 71a8c6e76491 we added information about the need to restart slurmd after adding nodes.
I am changing the severity to "Enhancement".
Dominik

On the afternoon of Wednesday 9th August, I shall be on annual leave. I will deal with any queries upon my return.
Best Wishes,
Adam
The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

On Tuesday 15th August, I shall be on annual leave. I will deal with any queries upon my return.
Best Wishes,
Adam

Hi,
I'm closing this enhancement as TIMEDOUT.
Live adding and removing of nodes is still an open issue, but it is documented.
Dominik
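The workaround discussed in this report (after adding nodes, restart slurmctld and every slurmd rather than relying on a reconfigure) could be scripted roughly as below. This is only a sketch under stated assumptions: the daemons are managed by systemd units named `slurmctld` and `slurmd`, the hosts are reachable over `ssh`, and the host names are hypothetical. The controller-first order simply follows the order in which the daemons are named in the thread; it is not confirmed here as the required sequence. The function only builds the commands and does not run them.

```python
from typing import List


def restart_plan(controller: str, nodes: List[str]) -> List[List[str]]:
    """Build (but do not execute) the commands that restart the Slurm daemons.

    Assumptions, not taken from the bug report: systemd units named
    'slurmctld' and 'slurmd', and ssh access to every host.
    """
    # Restart the controller daemon first (order is an assumption).
    plan = [["ssh", controller, "systemctl", "restart", "slurmctld"]]
    # Then restart slurmd on every compute node, old and new alike.
    for node in nodes:
        plan.append(["ssh", node, "systemctl", "restart", "slurmd"])
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for argv in restart_plan("ctl01", ["node001", "node002"]):
        print(" ".join(argv))
```

Each entry is an argument list that can be reviewed first and then passed to `subprocess.run(argv, check=True)` once the updated slurm.conf is in place on all hosts.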