When I type 'sinfo' from one of the login nodes, I get the following:

[root@grault-l01 ~]# sinfo
sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurm_load_partitions: Zero Bytes were transmitted or received

/var/log/messages:
Jun 12 16:54:04 grault-l01 systemd: PID file /var/run/slurmd.pid not readable (yet?) after start.

/var/log/slurmd.log:
[2018-06-12T18:14:40.662] error: Unable to register: Zero Bytes were transmitted or received

Everything works from the admin node, grault-adm01.

Jay
Some usual causes for these problems:

- Are the clocks in sync? A skew of more than 300 seconds (by default; this could be shorter depending on configuration) will lead to this type of issue.
- Is MUNGE running on the login node, and set up with the same key as the rest of the cluster?
- Is the version of the Slurm commands on the login node the same or older than that running on the slurmctld?

Note that the slurmd process should not usually be running on the login nodes - it only needs to be present on compute nodes.

- Tim
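The clock check in the list above can be scripted. A minimal sketch, assuming the 300-second default quoted above; the function name skew_ok is made up for illustration and is not part of Slurm or MUNGE:

```shell
# skew_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SKEW]
# Succeeds (exit 0) when the two Unix timestamps differ by at most
# MAX_SKEW seconds. MAX_SKEW defaults to 300, the default quoted above;
# a site may be configured with a shorter limit.
# The body runs in a subshell so its variables do not leak into the caller.
skew_ok() (
    max=${3:-300}
    diff=$(( $1 - $2 ))
    # Take the absolute value so the order of the arguments does not matter.
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$max" ]
)
```

From a login node it could be used along these lines, assuming ssh access to the admin node is available: skew_ok "$(date +%s)" "$(ssh grault-adm01 date +%s)" || echo "clock skew too large". The ssh round trip adds a little latency to the measurement, so treat a result near the limit with caution.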
*** Bug 5300 has been marked as a duplicate of this bug. ***
We did have some clock skew issues earlier, which were resolved. Also, you mentioned that the slurmd process didn't need to be running on the login nodes. So should I 'systemctl stop slurmd'?

Jay
(In reply to ellingjd.ctr from comment #3)
> We did have some clock skew issues earlier which were resolved. Also
> you mentioned that the slurmd process didn't need to be running on the
> login nodes. So should I 'systemctl stop slurmd' ?

It shouldn't be running - if there's no matching Node definition in slurm.conf it would refuse to start up. But yes, you can certainly stop it, and you may want to 'systemctl disable slurmd' on those machines as well.

Are you still seeing issues with just running 'sinfo' and other commands?
Yes, the admin node and login nodes were out of sync. Once I can get them in sync, what should I do on the login nodes to get sinfo working again?

Jay
(In reply to ellingjd.ctr from comment #5)
> Yes, the admin node and login nodes were out of sync. Once I can get
> them sync what should I do on the login nodes to get sinfo working
> again?
>
> Jay

Once the clocks are all back in sync, you shouldn't need to do anything further.
That worked. Thank you.

Jay
(In reply to ellingjd.ctr from comment #7)
> That worked. Thank you.

Certainly.

Is there anything else I can address, or can I close this out?

- Tim
Yes, you can close this ticket out.

Thanks again,
Jay
Tim,

It's showing all compute nodes down on the admin node. I'm not able to restart the slurmd service. Is this because everything was out of sync?

Jay
(In reply to ellingjd.ctr from comment #10)
> Tim,
> Its showing all compute nodes down on the admin node. I'm not able to
> restart slurmd service.
> Is this because everything was out of sync.

Potentially, yes. If the clocks were far enough off to cause communication issues like what you saw on the login nodes, this would likely have impacted some or all of the compute nodes as well.

'scontrol update nodename=(nodes) state=resume' should be enough to bring them back up.
It gives me a syntax error. Do I need to substitute each compute node for (nodes)?

Jay
(In reply to ellingjd.ctr from comment #12)
> It gives me a syntax error. Do I need to substitute each compute node
> for (nodes)?

Yes, or the range. E.g., you can update one at a time:

    scontrol update nodename=node001 state=resume

Or a range:

    scontrol update nodename=node[001-100] state=resume
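If the list of down nodes is long, the node list for the command above can be taken from sinfo rather than typed by hand. A rough sketch only: the function name print_resume_cmd is made up for illustration, and it assumes that 'sinfo -h -t down -o %N' lists the down nodes on your Slurm version. It prints the scontrol command instead of running it, so the node list can be reviewed first.

```shell
# print_resume_cmd
# Prints (does not run) an scontrol command that would resume every node
# sinfo currently reports as down.
print_resume_cmd() {
    # -h: no header, -t down: only nodes in a down state, -o '%N': node names only.
    # The same node list can appear once per partition; sort -u drops
    # identical lines, and paste joins what is left with commas.
    nodes=$(sinfo -h -t down -o '%N' | sort -u | paste -sd, -)
    if [ -z "$nodes" ]; then
        echo "no down nodes reported" >&2
        return 0
    fi
    echo "scontrol update nodename=$nodes state=resume"
}
```

Run print_resume_cmd on the admin node, check that the printed node list matches what you expect, then paste the command to apply it. Nodes that are down for a real hardware reason will be included too, so it is worth looking at 'sinfo -R' before resuming everything.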
Hey Jay -

I'm assuming this has been sorted out at this point, and tagging this as resolved/infogiven. Please reopen if there's anything further I can assist with.

cheers,
- Tim
Yes, everything worked. You can close the ticket.

Thanks,
Jay