| Summary: | scontrol reboot action fails when nodes are in State=IDLE+POWERED_DOWN | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | User Commands | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 23.02.7 | CC: | felip.moll |
| Hardware: | Linux | OS: | Linux |
| Site: | DTU Physics | Version Fixed: | 24.08.0rc1 |
Description
Ole.H.Nielsen@fysik.dtu.dk 2023-12-19 05:41:06 MST
**Comment (Marcin Stolarek):**

Ole,

I see that scontrol completing without any error, with a return code of zero, may be confusing. I'll work on an improvement to send back an error message to scontrol when a specific subset of the requested nodes cannot be rebooted. I'll keep you posted on the progress.

cheers, Marcin

**Comment (Marcin Stolarek):**

Ole,

I'm sending a patch with an improvement - show an error when we can't fulfill the request to reboot the specified nodes - to our review queue. I'll keep you posted on the progress.

cheers, Marcin

**Comment (Ole.H.Nielsen):**

Hi Marcin,

Thanks for the quick answer!

(In reply to Marcin Stolarek from comment #3)
> I see that scontrol completing without any error, with return code of zero
> may be confusing. I'll work on the improvement to send back an error message
> to scontrol that specific subset of requested nodes cannot be rebooted.

It seems to me that there may be additional states in which "scontrol reboot ASAP" will not work correctly and scontrol should print a warning message. This command is defined in the manual page as "Reboot the nodes in the system when they become idle using the RebootProgram...", meaning that slurmctld should take the reboot action as soon as the node enters the IDLE state.

IMHO, there are a number of cases where a node can't be expected to enter an IDLE state unless special actions are taken. Reading the scontrol manual page section "NODES - SPECIFICATIONS FOR SHOW COMMAND", I think there are states and flags which ought to prevent the "scontrol reboot ASAP" action and result in a warning message being printed, namely:

1. Do not attempt to reboot nodes in these states: DOWN, ERROR, FUTURE, UNKNOWN.
2. Do not attempt to reboot nodes with these flags: NOT_RESPONDING, POWER_DOWN, POWERED_DOWN, POWERING_UP, REBOOT_ISSUED, REBOOT_REQUESTED.

For the last 3 flags I don't think it makes sense to force a new reboot.

I hope the "scontrol reboot ASAP" command can be modified as suggested.

Thanks, Ole

**Comment (Marcin Stolarek):**

Ole,

Thanks for pointing that out.
I'll try to make it clear in the docs when scontrol reboot can be requested.

The implementation we're considering now is to ignore the request, with an appropriate error message from scontrol, for the following states/state flags[1]: FUTURE, REBOOT_REQUESTED, REBOOT_ISSUED, POWER_DOWN, POWERED_DOWN, POWERING_DOWN.

Does that look reasonable to you?

cheers, Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-23.11/src/slurmctld/proc_req.c#L5647-L5652

**Comment (Ole.H.Nielsen):**

(In reply to Marcin Stolarek from comment #8)
> Ole,
>
> Thanks for pointing that out. I'll try to make it clear in the docs when
> scontrol reboot can be requested.
>
> The implementation we're considering now is to ignore the request with
> appropriate error message from scontrol for the following states/stage
> flags[1]:
> FUTURE, REBOOT_REQUESTED, REBOOT_ISSUED, POWER_DOWN,
> POWERED_DOWN, POWERING_DOWN.
>
> Does that look reasonable for you?

I think that NOT_RESPONDING nodes should cause a warning message, because we don't know why a node isn't responding! It could be down due to hardware/power/network issues, and it may not become IDLE anytime soon.

Regarding State=DOWN: we use this for OS and firmware upgrades with "scontrol reboot ASAP NextState=Down", so that a crontab job can perform the upgrades without disturbance and run "scontrol reboot ASAP NextState=Resume" when completed. A DOWN node isn't IDLE, so how should "scontrol reboot ASAP" handle DOWN nodes correctly? I think a warning message should be issued here as well.

Thanks, Ole

**Comment (Ole.H.Nielsen):**

Hi Marcin,

Since the fix only adds warning messages and no changed functionality, could you kindly make your patches available for 23.02.8 as well?

Thanks, Ole

**Comment (Marcin Stolarek):**

Ole,

Let me explain how this works in a bit more detail. There are basically 3 steps here:

1) scontrol reboot sends an RPC to slurmctld. The handler sets the REBOOT_REQUESTED flag on the node if it's not in one of the states mentioned in comment 8.
In this case I think we should send a message about the request being ignored back to scontrol. (This is missing today.)

2) Slurmctld periodically checks if there are hosts that require a reboot:
- If the node is COMPLETING, it skips the reboot in the current cycle (and checks again in the next iteration).
- It keeps the node's REBOOT_REQUESTED flag for the next cycle unless (any of the below):
  * the node is idle, responding, not powering up, and has no suspended jobs;
  * the node is DOWN (this was added in Bug 5544 - it was part of a customer-requested feature, commit 4baa6d6a241).
- If any of the above evaluated to true, we add the node reboot request to the agent queue.

3) The slurmctld agent thread sends the reboot request directly to the nodes (without the over-the-tree communication) or, if SlurmctldParameters=reboot_from_controller is set, runs the script via the slurmscriptd thread.

> I think that NOT_RESPONDING nodes should cause a warning message, because we
> don't know why it's not responding! [...] A DOWN node isn't IDLE, so how
> should "scontrol reboot ASAP" handle DOWN nodes correctly? [...]

Just looking at the code, we'll queue a request to a down and non-responding node, which may look wrong, but since this behavior has been in place since Slurm 18.08 and we don't have any clear bug indication coming from it, I don't feel confident enough to change it. In all other cases a non-responding node is queued for reboot and we'll send the reboot request to the node when it fulfills the above-mentioned criteria.

We have some customers relying on the CLI commands' output and return codes in automated workloads; because of that we usually avoid any changes to command output and return codes on a released branch. Having that in mind, I think the master branch will be most appropriate, although I'll discuss your request with the reviewer. The patch should be easy to backport, though.

cheers, Marcin

**Comment (Ole.H.Nielsen) [auto-reply]:**

I'm out of the office, back on Jan. 2nd.
Best regards, Ole Holm Nielsen

**Comment (Marcin Stolarek):**

Ole,

The improvement to return an error message when a node reboot request is ignored was merged to our public repository, commit dc75901f9f[1]. It will be part of the Slurm 24.08 release. Additionally, a NOTE was added to the scontrol man page[2].

I'll go ahead and close the bug as fixed now. Should you have any questions, please reopen.

cheers, Marcin

[1] https://github.com/SchedMD/slurm/commit/dc75901f9fa7a7136d03888c3f41e0a321eee4fa
[2] https://github.com/SchedMD/slurm/commit/def39cd10ef86404a98d59b4430cdab2636ea0aa