Bug 5302 - Slurm not running on login nodes
Summary: Slurm not running on login nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 2 - High Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-12 16:15 MDT by ellingjd.ctr
Modified: 2018-06-13 11:25 MDT
CC List: 0 users

See Also:
Site: AFRL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description ellingjd.ctr 2018-06-12 16:15:27 MDT
When I type 'sinfo' from one of the login nodes, I get the following:

[root@grault-l01 ~]# sinfo
sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurm_load_partitions: Zero Bytes were transmitted or received

/var/log/messages:

Jun 12 16:54:04 grault-l01 systemd: PID file /var/run/slurmd.pid not readable (yet?) after start.

/var/log/slurmd.log

[2018-06-12T18:14:40.662] error: Unable to register: Zero Bytes were transmitted or received

Everything works from the admin node, grault-adm01.

Jay
Comment 1 Tim Wickberg 2018-06-12 16:22:41 MDT
Some usual causes for these problems:

- Are the clocks in sync? A skew of more than 300 seconds (by default, this could be shorter depending on configuration) will lead to this type of issue.

- Is MUNGE running on the login node, and set up with the same key as the rest of the cluster?

- Is the version of the Slurm commands on the login node the same or older than that running on the slurmctld?

Note that the slurmd process should not usually be running on the login nodes - it only needs to be present on compute nodes.
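For reference, here are quick ways to run those three checks from a login node. This is only a sketch: it assumes grault-adm01 (the admin node mentioned in the report) is the host running slurmctld, and that ssh between the nodes works.

# Clock skew: compare the login node's clock with the controller's
date; ssh grault-adm01 date

# MUNGE: confirm the daemon is running and a locally generated credential decodes remotely
systemctl status munge
munge -n | ssh grault-adm01 unmunge

# Versions: the login node's commands should be the same as or older than the controller's
sinfo --version
ssh grault-adm01 sinfo --version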

- Tim
Comment 2 Tim Wickberg 2018-06-12 16:26:16 MDT
*** Bug 5300 has been marked as a duplicate of this bug. ***
Comment 3 ellingjd.ctr 2018-06-12 16:32:26 MDT
We did have some clock skew issues earlier which were resolved.  Also 
you mentioned that the slurmd process didn't need to be running on the 
login nodes.  So should I 'systemctl stop slurmd'?


Jay

Comment 4 Tim Wickberg 2018-06-12 16:35:38 MDT
(In reply to ellingjd.ctr from comment #3)
> We did have some clock skew issues earlier which were resolved.  Also 
> you mentioned that the slurmd process didn't need to be running on the 
> login nodes.  So should I 'systemctl stop slurmd'?

It shouldn't be running - if there's no matching Node definition in slurm.conf it would refuse to start up. But yes, you can certainly stop them, and you may want to 'systemctl disable slurmd' on those machines as well.
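For example, on each login node where slurmd was mistakenly enabled (a minimal sketch of the two commands mentioned above):

systemctl stop slurmd
systemctl disable slurmd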

Are you still seeing issues with just running 'sinfo' and other commands?
Comment 5 ellingjd.ctr 2018-06-12 16:37:08 MDT
Yes, the admin node and login nodes were out of sync.  Once I get
them back in sync, what should I do on the login nodes to get sinfo
working again?

Jay

Comment 6 Tim Wickberg 2018-06-12 16:38:10 MDT
(In reply to ellingjd.ctr from comment #5)
> Yes, the admin node and login nodes were out of sync.  Once I get
> them back in sync, what should I do on the login nodes to get sinfo
> working again?
> 
> Jay

Once the clocks are all back in sync, you shouldn't need to do anything further.
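One way to confirm the clocks have actually converged, assuming chrony is the time daemon in use (an ntpd-based setup would use 'ntpq -p' instead):

chronyc tracking      # the 'System time' offset should be close to zero
chronyc sources -v    # confirm an upstream time source is selected and reachable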
Comment 7 ellingjd.ctr 2018-06-12 16:42:10 MDT
That worked.  Thank you.

Jay

Comment 8 Tim Wickberg 2018-06-12 16:45:51 MDT
(In reply to ellingjd.ctr from comment #7)
> That worked.  Thank you.
> 

Certainly.

Is there anything else I can address, or can I close this out?

- Tim
Comment 9 ellingjd.ctr 2018-06-12 16:48:41 MDT
Yes, you can close this ticket out.

Thanks again,
Jay

Comment 10 ellingjd.ctr 2018-06-12 16:56:07 MDT
Tim,
It's showing all compute nodes down on the admin node.  I'm not able to 
restart the slurmd service.
Is this because everything was out of sync?

Jay

Comment 11 Tim Wickberg 2018-06-12 17:04:59 MDT
(In reply to ellingjd.ctr from comment #10)
> Tim,
> It's showing all compute nodes down on the admin node.  I'm not able to 
> restart the slurmd service.
> Is this because everything was out of sync?

Potentially, yes. If the clocks were far enough off to cause communication issues like what you saw on the login nodes, this would likely have impacted some or all of the compute nodes as well.

'scontrol update nodename=(nodes) state=resume' should be enough to bring them back up.
Comment 12 ellingjd.ctr 2018-06-12 17:11:09 MDT
It gives me a syntax error.  Do I need to substitute each compute node 
for (nodes)?

Jay

Comment 13 Tim Wickberg 2018-06-12 17:13:21 MDT
(In reply to ellingjd.ctr from comment #12)
> It gives me a syntax error.  Do I need to substitute each compute node 
> for (nodes)?

Yes, or you can specify a range.

E.g., you can update one at a time:

scontrol update nodename=node001 state=resume

Or a range:

scontrol update nodename=node[001-100] state=resume
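Before resuming, it can also help to list which nodes are still down and why; something along these lines should work (the format string is just an example):

sinfo -R                          # down/drained nodes with the recorded reason
sinfo -N -t down -o "%N %T %E"    # node, state, and reason for nodes in the DOWN state

Then resume the affected nodes or ranges as above.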
Comment 14 Tim Wickberg 2018-06-13 11:00:54 MDT
Hey Jay -

I'm assuming this has been sorted out at this point, and tagging this as resolved/infogiven. Please reopen if there's anything further I can assist with.

cheers,
- Tim
Comment 15 ellingjd.ctr 2018-06-13 11:25:54 MDT
Yes, everything worked.  You can close the ticket.

Thanks,
Jay
