Ticket 14989 - power_save mode: do not suspend/resume offline nodes
Summary: power_save mode: do not suspend/resume offline nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 21.08.8
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Ben Glines
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-09-19 04:08 MDT by Chrysovalantis Paschoulas
Modified: 2022-09-26 10:33 MDT

See Also:
Site: Jülich
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Chrysovalantis Paschoulas 2022-09-19 04:08:52 MDT
We enabled the power saving mechanism but we have run into some big issues. Currently Slurm will also suspend and resume nodes that are offline (drained, down). This is a big problem because, if a node is incapable of being resumed, e.g. due to damaged hardware, the resume action will fail (timeout) and Slurm will overwrite the reason why the node is drained.

We keep important information in the node reasons and we don't want to lose it. Right now the reasons get overwritten and we have to manually check the node events and restore the old reasons, which is really not user friendly.
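Today, restoring an overwritten reason means digging it out of the node event history by hand, with something like this (node name and time window are just examples):
```
sacctmgr show event nodes=jwtc10n002 start=2022-09-01 \
    format=NodeName,Start,End,State,Reason%60
```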

It also doesn't make sense to try to resume drained nodes; they have issues and the resume will most probably fail.

So we are asking for the following fixes:
- Slurm shouldn't try to resume drained nodes.
- It probably also makes sense not to suspend drained nodes, since the admins may be working on them.
- Make that the default behavior, or at least configurable; if it is configurable, the admin should be able to decide which node states the power_save mechanism excludes.
- A colleague also asked not to suspend/resume nodes that are reserved, especially nodes in a MAINT reservation, which makes sense. Maybe make this configurable as well? See the sketch below for what we have in mind.
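To illustrate the request, something like the following (these option names are hypothetical, they do not exist in slurm.conf today):
```
# Hypothetical: exclude nodes in these states from power_save suspend/resume
SuspendExcNodeStates=DOWN,DRAIN
# Hypothetical: do not suspend/resume nodes held by reservations with these flags
SuspendExcReservationFlags=MAINT
```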

This is very important for us because we want to reduce the power consumption of our clusters when nodes are idling, but the current state of the power_save mechanism is not really usable for us.

This ticket is similar to https://bugs.schedmd.com/show_bug.cgi?id=13643 but in our case we have a support contract and we expect this to be fixed ;)

Currently we are using 21.08; would it be possible to backport the fixes to this version?

Best Regards,
Valantis
Comment 1 Ben Glines 2022-09-20 11:52:20 MDT
Hello,

First of all, I am reducing this to a sev 3 based on our definitions here: https://www.schedmd.com/support.php. Please review those guidelines.

As for resuming/suspending drained/down nodes, I am not seeing all the behavior you are describing.

Slurm won't schedule jobs on a drained/down node, so it should never try to resume one. In my testing, I have not been able to get Slurm to try to resume a drained/down node. A job requesting such a node will remain pending until the node is manually set back to the idle state with scontrol. Can you reproduce this manually on your system?
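If it helps, the way I tested this on my end was roughly the following (node name and job are just examples):
```
# Drain a node with a reason
scontrol update nodename=node001 state=drain reason="testing power_save"

# Request that node explicitly; the job should stay pending rather than
# triggering ResumeProgram for the drained node
srun -w node001 hostname &
squeue --states=PD -o "%i %T %R"
```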

It is expected that Slurm will suspend drained/down nodes. Once a node is drained/down, though, Slurm should not try to resume it again, and you should be able to perform any maintenance needed.

I would suggest considering SuspendExcNodes or SuspendExcParts for your needs as well. 
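For example, something like this in slurm.conf (node and partition names are placeholders) would keep those nodes out of the power_save mechanism entirely:
```
# Never suspend these nodes (so they will also never need to be resumed)
SuspendExcNodes=node[001-004]
# Or exclude whole partitions, e.g. one used for maintenance work
SuspendExcParts=maint
```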

As for your suggestion for configuring power saving features for nodes based on their state (i.e. not suspending drained/down nodes) or based on what reservations they are in, I am not sure if these are things we would want to implement. I'm discussing this internally and I'll get back to you on this.

Again, Slurm should not be resuming drained/down nodes currently, so hopefully we can track down the issue you're seeing. Please let me know if you can manually reproduce this, and if so, I would appreciate any steps required to reproduce it so I can work on this issue on my end.
Comment 2 Chrysovalantis Paschoulas 2022-09-21 09:35:36 MDT
OK, I can confirm that no drained/down node is ever used for a job allocation. A colleague had enabled `idle_on_node_suspend`, and I think this was the culprit for some unexpected behavior; he will repeat his tests now and I will report back to you soon about that.

But we found the bug!

So we have a small test cluster where node `jwtc10n002` has state `DOWN+DRAIN+POWERED_DOWN` (or in some tests `DOWN+POWERED_DOWN(+NOT_RESPONDING)`). That node belongs to the batch partition:
```
PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
batch*          up   infinite          0/5/1/6 jwtc10n[000-002,048-050]
```
Now, when we run a job that requests all nodes before and after that node in the batch partition, e.g.:
```
srun -A root -p batch -w jwtc10n[000-001,048-050] bash -c 'sleep 10; hostname'
```
We get the following logs:
```
2022-09-21_16-36-03 resume_program[9521] resuming jwtc10n[000-002]
2022-09-21_16-36-23 resume_program[10197] resuming jwtc10n048
2022-09-21_16-36-43 resume_program[10703] resuming jwtc10n049
2022-09-21_16-37-03 resume_program[11223] resuming jwtc10n050
...
```
In our resume_program we simply print the above line at the beginning, with the nodelist argument that slurmctld passes to the script. As you can see, node `jwtc10n002` is in the first nodelist, even though that node was NOT requested in the allocation! I am also concerned about the resume rate: we currently have it set to 3 nodes/min, but the following nodes, e.g. jwtc10n048, were resumed before the first batch had actually finished its resume_program!
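For reference, the logging part of our resume_program is essentially just this sketch (the log path is an example; the real script of course also does the actual power-on):
```
#!/bin/bash
# slurmctld calls this with a single hostlist expression, e.g. jwtc10n[000-002]
NODELIST="$1"
LOG=/var/log/slurm/power_save.log   # example path

echo "$(date +%Y-%m-%d_%H-%M-%S) resume_program[$$] resuming ${NODELIST}" >> "${LOG}"

# ... real power-on logic (BMC/IPMI calls etc.) follows here ...
```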

In this case the resume program exits with a non-zero code, but the job runs successfully on the healthy nodes that were requested. That means the return code of the resume program is ignored. Is this intended?

Here is our related config:
```
ResumeFailProgram=/etc/slurm/resume_fail_program.sh
ResumeProgram=/etc/slurm/resume_program.sh
ResumeRate=3
ResumeTimeout=300
SuspendExcNodes=jwtc10n002,jwtc10n096
#SuspendExcParts=
SuspendProgram=/etc/slurm/suspend_program.sh
SuspendRate=3
SuspendTime=1800
SuspendTimeout=300
```

Ha! I just realized that jwtc10n002 shouldn't even be suspended; it is in the SuspendExcNodes list! I am not sure if we added that setting after the node was suspended. Maybe Slurm tries to resume it because it is in the SuspendExcNodes list? No, after removing that setting it still tries to resume that node.

Another odd behavior we noticed: when the suspend program exits successfully, the nodes are not set to POWERED_DOWN immediately. Instead they stay in the POWERING_DOWN state until SuspendTimeout is reached (or a bit earlier).
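We watched the state transition with something like:
```
# Poll the node state every 30 seconds to see POWERING_DOWN -> POWERED_DOWN
watch -n 30 'sinfo -N -n jwtc10n002 -o "%N %T"'
```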

Anyway, I still believe that down/drained nodes shouldn't be suspended at all, since admins may be working on them and there is no nice interface (e.g. via scontrol) for modifying SuspendExcNodes and SuspendExcParts at runtime. It would also be nice to have a `SuspendExcNodeStates` option in slurm.conf ;)
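For now, the only way we know to change that exclusion list at runtime is to edit slurm.conf on the controller and reconfigure, roughly:
```
# after editing SuspendExcNodes / SuspendExcParts in slurm.conf:
scontrol reconfigure
```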
Comment 3 Chrysovalantis Paschoulas 2022-09-21 09:40:19 MDT
Even after removing `idle_on_node_suspend`, my colleague reported that they get the same behavior. I guess they hit the same or a similar problem as the bug I reported above...

They request a job with nodes `n[01-06,08-11,13-20,31-34,38-45]` and they see that the resume program runs for nodes `n[01-30]`.
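To double-check, we expanded and compared the two nodelists roughly like this (the job ID is a placeholder):
```
# nodes the job was actually allocated
scontrol show hostnames "$(squeue -j <jobid> -h -o %N)" | sort > requested.txt
# nodes the resume program was called with (taken from our log)
scontrol show hostnames "n[01-30]" | sort > resumed.txt
# nodes that were resumed but never requested
comm -13 requested.txt resumed.txt
```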

So on two different clusters we could reproduce that the resume program receives wrong nodelists. Please fix it :)
Comment 4 Ben Glines 2022-09-23 16:12:19 MDT
I am able to reproduce this issue on 21.08, but not on the latest 22.05 version. Looks like things are working just fine on 22.05 as far as I can tell right now. I'm discussing this with other engineers internally to see when this issue may have been fixed.

Could you tell me what version your colleague was running when they saw a similar problem?
Comment 5 Chrysovalantis Paschoulas 2022-09-26 05:56:43 MDT
(In reply to Ben Glines from comment #4)
> I am able to reproduce this issue on 21.08, but not on the latest 22.05
> version. Looks like things are working just fine on 22.05 as far as I can
> tell right now. I'm discussing this with other engineers internally to see
> when this issue may have been fixed.
> 
> Could you tell me what version your colleague was running when they saw a
> similar problem?

In both cases we use `21.08.8-2`.

We would really appreciate it if you could provide us with a patch for 21.08.8-2; we are not able to upgrade to 22.05 right now because of some third-party dependencies.
Comment 6 Ben Glines 2022-09-26 10:33:15 MDT
The fix for this is in the following commit: https://github.com/SchedMD/slurm/commit/fd1ea63a6d43407ea5c64bbce36b81ec96128025.

The first official tagged release that contains this commit is the first 22.05 release. We don't actively backport fixes, but you're free to apply the changes in this commit as a patch at your own risk.
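As a rough sketch (the directory name is an example, and the patch may need manual adjustment on 21.08), you could fetch the commit in patch form from GitHub and apply it to your source tree before rebuilding:
```
curl -L -o fd1ea63.patch \
  https://github.com/SchedMD/slurm/commit/fd1ea63a6d43407ea5c64bbce36b81ec96128025.patch

cd slurm-21.08.8-2                      # example source directory
patch -p1 --dry-run < ../fd1ea63.patch  # first check that it applies cleanly
patch -p1 < ../fd1ea63.patch            # then apply, rebuild, and reinstall
```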

I'm going to close this one out now. Feel free to reopen if you have any further questions on this. Your requests for enhancement would also be best tracked in a separate bug if you are still interested in them.