Bug 5302 - Slurm not running on login nodes
Summary: Slurm not running on login nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 2 - High Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-12 16:15 MDT by ellingjd.ctr
Modified: 2018-06-13 11:25 MDT
CC List: 0 users

See Also:
Site: AFRL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description ellingjd.ctr 2018-06-12 16:15:27 MDT
When I type 'sinfo' from one of the login nodes, I get the following:

[root@grault-l01 ~]# sinfo
sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurm_load_partitions: Zero Bytes were transmitted or received

/var/log/messages:

Jun 12 16:54:04 grault-l01 systemd: PID file /var/run/slurmd.pid not readable (yet?) after start.

/var/log/slurmd.log

[2018-06-12T18:14:40.662] error: Unable to register: Zero Bytes were transmitted or received

Everything works from the admin node, grault-adm01.

Jay
Comment 1 Tim Wickberg 2018-06-12 16:22:41 MDT
Some usual causes for these problems:

- Are the clocks in sync? A skew of more than 300 seconds (by default, this could be shorter depending on configuration) will lead to this type of issue.

- Is MUNGE running on the login node, and set up with the same key as the rest of the cluster?

- Is the version of the Slurm commands on the login node the same or older than that running on the slurmctld?

Note that the slurmd process should not usually be running on the login nodes - it only needs to be present on compute nodes.
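For reference, here are quick ways to run those three checks from a login node. This is only a sketch: it assumes grault-adm01 (the admin node mentioned in the report) is the host running slurmctld, and that ssh between the nodes works.

# Clock skew: compare the login node's clock with the controller's
date; ssh grault-adm01 date

# MUNGE: confirm the daemon is running and a locally generated credential decodes remotely
systemctl status munge
munge -n | ssh grault-adm01 unmunge

# Versions: the login node's commands should be the same as or older than the controller's
sinfo --version
ssh grault-adm01 sinfo --version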

- Tim
Comment 2 Tim Wickberg 2018-06-12 16:26:16 MDT
*** Bug 5300 has been marked as a duplicate of this bug. ***
Comment 3 ellingjd.ctr 2018-06-12 16:32:26 MDT
We did have some clock skew issues earlier which were resolved.  Also 
you mentioned that the slurmd process didn't need to be running on the 
login nodes.  So should I 'systemctl stop slurmd'?


Jay

Comment 4 Tim Wickberg 2018-06-12 16:35:38 MDT
(In reply to ellingjd.ctr from comment #3)
> We did have some clock skew issues earlier which were resolved.  Also 
> you mentioned that the slurmd process didn't need to be running on the 
> login nodes.  So should I 'systemctl stop slurmd'?

It shouldn't be running - if there's no matching Node definition in slurm.conf it would refuse to start up. But yes, you can certainly stop them, and you may want to 'systemctl disable slurmd' on those machines as well.
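For example, on each login node where slurmd was mistakenly enabled (a minimal sketch of the two commands mentioned above):

systemctl stop slurmd
systemctl disable slurmd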

Are you still seeing issues with just running 'sinfo' and other commands?
Comment 5 ellingjd.ctr 2018-06-12 16:37:08 MDT
Yes, the admin node and login nodes were out of sync.  Once I get
them back in sync, what should I do on the login nodes to get sinfo
working again?

Jay

Comment 6 Tim Wickberg 2018-06-12 16:38:10 MDT
(In reply to ellingjd.ctr from comment #5)
> Yes, the admin node and login nodes were out of sync.  Once I get
> them back in sync, what should I do on the login nodes to get sinfo
> working again?
> 
> Jay

Once the clocks are all back in sync, you shouldn't need to do anything further.
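One way to confirm the clocks have actually converged, assuming chrony is the time daemon in use (an ntpd-based setup would use 'ntpq -p' instead):

chronyc tracking      # the 'System time' offset should be close to zero
chronyc sources -v    # confirm an upstream time source is selected and reachable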
Comment 7 ellingjd.ctr 2018-06-12 16:42:10 MDT
That worked.  Thank you.

Jay

Comment 8 Tim Wickberg 2018-06-12 16:45:51 MDT
(In reply to ellingjd.ctr from comment #7)
> That worked.  Thank you.
> 

Certainly.

Is there anything else I can address, or can I close this out?

- Tim
Comment 9 ellingjd.ctr 2018-06-12 16:48:41 MDT
Yes, you can close this ticket out.

Thanks again,
Jay

Comment 10 ellingjd.ctr 2018-06-12 16:56:07 MDT
Tim,
It's showing all compute nodes down on the admin node.  I'm not able to 
restart the slurmd service.
Is this because everything was out of sync?

Jay

Comment 11 Tim Wickberg 2018-06-12 17:04:59 MDT
(In reply to ellingjd.ctr from comment #10)
> Tim,
> It's showing all compute nodes down on the admin node.  I'm not able to 
> restart the slurmd service.
> Is this because everything was out of sync?

Potentially, yes. If the clocks were far enough off to cause communication issues like what you saw on the login nodes, this would likely have impacted some or all of the compute nodes as well.

'scontrol update nodename=(nodes) state=resume' should be enough to bring them back up.
Comment 12 ellingjd.ctr 2018-06-12 17:11:09 MDT
It gives me a syntax error.  Do I need to substitute each compute node 
for (nodes)?

Jay

Comment 13 Tim Wickberg 2018-06-12 17:13:21 MDT
(In reply to ellingjd.ctr from comment #12)
> It gives me a syntax error.  Do I need to substitute each compute node 
> for (nodes)?

Yes, or you can specify a range.

E.g., you can update one at a time:

scontrol update nodename=node001 state=resume

Or a range:

scontrol update nodename=node[001-100] state=resume
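Before resuming, it can also help to list which nodes are still down and why; something along these lines should work (the format string is just an example):

sinfo -R                          # down/drained nodes with the recorded reason
sinfo -N -t down -o "%N %T %E"    # node, state, and reason for nodes in the DOWN state

Then resume the affected nodes or ranges as above.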
Comment 14 Tim Wickberg 2018-06-13 11:00:54 MDT
Hey Jay -

I'm assuming this has been sorted out at this point, and tagging this as resolved/infogiven. Please reopen if there's anything further I can assist with.

cheers,
- Tim
Comment 15 ellingjd.ctr 2018-06-13 11:25:54 MDT
Yes, everything worked.  You can close the ticket.

Thanks,
Jay
