| Summary: | scontrol reboot action fails when nodes are in State=IDLE+POWERED_DOWN | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | User Commands | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 23.02.7 | CC: | felip.moll |
| Hardware: | Linux | OS: | Linux |
| Site: | DTU Physics | Version Fixed: | 24.08.0rc1 |
Description
Ole.H.Nielsen@fysik.dtu.dk 2023-12-19 05:41:06 MST
**Comment (Marcin Stolarek):**

Ole,

I see that scontrol completing without any error, with a return code of zero, may be confusing. I'll work on an improvement to send back an error message to scontrol when a specific subset of the requested nodes cannot be rebooted. I'll keep you posted on the progress.

cheers, Marcin

**Comment (Marcin Stolarek):**

Ole,

I'm sending a patch with an improvement - show an error when we can't fulfill the request to reboot the specified nodes - to our review queue. I'll keep you posted on the progress.

cheers, Marcin

**Comment (Ole.H.Nielsen):**

Hi Marcin,

Thanks for the quick answer!

(In reply to Marcin Stolarek from comment #3)
> I see that scontrol completing without any error, with return code of zero
> may be confusing. I'll work on the improvement to send back an error message
> to scontrol that specific subset of requested nodes cannot be rebooted.

It seems to me that there may be additional states in which "scontrol reboot ASAP" will not work correctly and scontrol should print a warning message. This command is defined in the manual page as "Reboot the nodes in the system when they become idle using the RebootProgram...", meaning that slurmctld should take the reboot action as soon as the node enters the IDLE state.

IMHO, there are a number of cases where a node can't be expected to enter an IDLE state unless special actions are taken. Reading the scontrol manual page section "NODES - SPECIFICATIONS FOR SHOW COMMAND", I think there are states and flags which ought to prevent the "scontrol reboot ASAP" action and result in a warning message being printed, namely:

1. Do not attempt to reboot nodes in these states: DOWN, ERROR, FUTURE, UNKNOWN.
2. Do not attempt to reboot nodes with these flags: NOT_RESPONDING, POWER_DOWN, POWERED_DOWN, POWERING_UP, REBOOT_ISSUED, REBOOT_REQUESTED.

For the last 3 flags I don't think it makes sense to force a new reboot.

I hope the "scontrol reboot ASAP" command can be modified as suggested.

Thanks, Ole

**Comment (Marcin Stolarek):**

Ole,

Thanks for pointing that out.
I'll try to make it clear in the docs when scontrol reboot can be requested.

The implementation we're considering now is to ignore the request, with an appropriate error message from scontrol, for the following states/state flags[1]: FUTURE, REBOOT_REQUESTED, REBOOT_ISSUED, POWER_DOWN, POWERED_DOWN, POWERING_DOWN.

Does that look reasonable to you?

cheers, Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-23.11/src/slurmctld/proc_req.c#L5647-L5652

**Comment (Ole.H.Nielsen):**

(In reply to Marcin Stolarek from comment #8)
> Ole,
>
> Thanks for pointing that out. I'll try to make it clear in the docs when
> scontrol reboot can be requested.
>
> The implementation we're considering now is to ignore the request with
> appropriate error message from scontrol for the following states/stage
> flags[1]:
> FUTURE, REBOOT_REQUESTED, REBOOT_ISSUED, POWER_DOWN,
> POWERED_DOWN, POWERING_DOWN.
>
> Does that look reasonable for you?

I think that NOT_RESPONDING nodes should cause a warning message, because we don't know why a node isn't responding! It could be down due to hardware/power/network issues, and it may not become IDLE anytime soon.

Regarding State=DOWN: we use this for OS and firmware upgrades with "scontrol reboot ASAP NextState=Down", so that a crontab job can perform the upgrades without disturbance and run "scontrol reboot ASAP NextState=Resume" when completed. A DOWN node isn't IDLE, so how should "scontrol reboot ASAP" handle DOWN nodes correctly? I think a warning message should be issued here as well.

Thanks, Ole

**Comment (Ole.H.Nielsen):**

Hi Marcin,

Since the fix only adds warning messages and no changed functionality, could you kindly make your patches available for 23.02.8 as well?

Thanks, Ole

**Comment (Marcin Stolarek):**

Ole,

Let me explain how this works in a bit more detail. There are basically 3 steps here:

1) scontrol reboot sends an RPC to slurmctld. The handler sets the REBOOT_REQUESTED flag on the node if it's not in one of the states mentioned in comment 8.
In this case I think we should send a message about the request being ignored back to scontrol. (This is missing today.)

2) Slurmctld periodically checks if there are hosts that require a reboot:
- If the node is COMPLETING, it skips the reboot in the current cycle (and checks again in the next iteration).
- It keeps the node's REBOOT_REQUESTED flag for the next cycle unless (any of the below):
  * the node is idle, responding, not powering up, and has no suspended jobs;
  * the node is DOWN (this was added in Bug 5544 - it was part of a customer-requested feature, commit 4baa6d6a241).
- If any of the above evaluated to true, we add the node reboot request to the agent queue.

3) The slurmctld agent thread sends the reboot request directly to the nodes (without the over-the-tree communication) or, if SlurmctldParameters=reboot_from_controller is set, runs the script via the slurmscriptd thread.

> I think that NOT_RESPONDING nodes should cause a warning message, because we
> don't know why it's not responding! [...] A DOWN node isn't IDLE, so how
> should "scontrol reboot ASAP" handle DOWN nodes correctly? [...]

Just looking at the code, we'll queue a request to a down and non-responding node, which may look wrong, but since this behavior has been in place since Slurm 18.08 and we don't have any clear bug indication coming from it, I don't feel confident enough to change it. In all other cases a non-responding node is queued for reboot and we'll send the reboot request to the node when it fulfills the above-mentioned criteria.

We have some customers relying on the CLI commands' output and return codes in automated workloads; because of that we usually avoid any changes to command output and return codes on a released branch. Having that in mind, I think the master branch will be most appropriate, although I'll discuss your request with the reviewer. The patch should be easy to backport, though.

cheers, Marcin

**Comment (Ole.H.Nielsen) [auto-reply]:**

I'm out of the office, back on Jan. 2nd.
Best regards, Ole Holm Nielsen

**Comment (Marcin Stolarek):**

Ole,

The improvement to return an error message when a node reboot request is ignored was merged to our public repository, commit dc75901f9f[1]. It will be part of the Slurm 24.08 release. Additionally, a NOTE was added to the scontrol man page[2].

I'll go ahead and close the bug as fixed now. Should you have any questions, please reopen.

cheers, Marcin

[1] https://github.com/SchedMD/slurm/commit/dc75901f9fa7a7136d03888c3f41e0a321eee4fa
[2] https://github.com/SchedMD/slurm/commit/def39cd10ef86404a98d59b4430cdab2636ea0aa