| Summary: | power_save module should not suspend drained nodes | | |
| --- | --- | --- | --- |
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | slurmctld | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | c.paschoulas |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DTU Physics | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description

Ole.H.Nielsen@fysik.dtu.dk, 2022-12-06 04:04:49 MST
**Skyler Malinowski (comment #1):**

Hi Ole,

The good news is that we agree that drained nodes should not be powered down by power_save. It looks like bug #15184 will be addressing this; that enhancement could land in 23.11 (not confirmed).

In the meantime I can look into this as a bug fix. Given the change to Slurm power_save behavior, this bug fix would likely land in 23.02.

For now, there are some workarounds to this issue (with 'idle_on_node_suspend' on):
- Update SuspendExc with those nodes and reconfigure.
- (22.05) Use a new partition with SuspendTime=INFINITE and move the node into this partition with scontrol.

Best,
Skyler

**Ole.H.Nielsen@fysik.dtu.dk (comment #2):**

Hi Skyler,

(In reply to Skyler Malinowski from comment #1)
> The good news is that we agree that drained nodes should not be powered down
> by power_save. It looks like bug #15184 will be addressing this; that
> enhancement could land in 23.11 (not confirmed).
>
> In the meantime I can look into this as a bug fix. Given the change to Slurm
> power_save behavior, this bug fix would likely land in 23.02.

Yeah, this is a pretty bad bug in the power_save module :-( I hope you can get the fix into 23.02. Nodes in the drained state, as well as maint, should be exempted.

> For now, there are some workarounds to this issue (with
> 'idle_on_node_suspend' on):
> - Update SuspendExc with those nodes and reconfigure.
> - (22.05) Use a new partition with SuspendTime=INFINITE and move the node
>   into this partition with scontrol.

I like the idea of adding nodes that require maintenance to SuspendExc in slurm.conf. This adds one extra step for maintenance work, but that would be acceptable until a proper bug fix is in place.

Thanks,
Ole

**Valantis:**

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #2)
> I like the idea of adding nodes that require maintenance to SuspendExc in
> slurm.conf. This adds one extra step for maintenance work, but that would
> be acceptable until a proper bug fix is in place.
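For reference, the two workarounds described above could be sketched as the following slurm.conf excerpt. This is illustrative only: the node and partition names are made up, and the exact parameter name is SuspendExcNodes (the comments above abbreviate it as "SuspendExc").

```
# (a) Exempt specific nodes from power_save suspension, then apply the
#     change with `scontrol reconfigure`:
SuspendExcNodes=node[17-18]

# (b) 22.05 and later: a partition whose nodes are never suspended;
#     nodes needing maintenance are moved into it with scontrol:
PartitionName=maint Nodes=node[17-18] SuspendTime=INFINITE
```

Both approaches require an extra administrative step per maintenance event, which is the drawback Ole notes below.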
Dear Ole, please check bug 15185 ;) If you agree with that, please push from your side too to convince SchedMD to implement that functionality.

Best Regards,
Valantis

**Ole.H.Nielsen@fysik.dtu.dk (comment #4):**

Based also on discussions in bug 15184 and bug 15185, I would like to urge SchedMD to consider, for the 23.02 release, some fixes for the power_save plugin so that it handles on-premise nodes in a sensible way. This is really important for all customers in Europe and other regions where electricity prices have soared recently, and HPC centers are being asked to save money by cutting power consumption as much as possible.

IMHO, how the power_save plugin treats on-premise nodes should be reconsidered. Among the node states listed in, e.g., the sinfo manual page, it seems to me that nodes in the following states MUST be exempted automatically from suspension by slurmctld: DOWN, DRAIN (for nodes in the DRAINING or DRAINED states), DRAINED, DRAINING, FAIL, MAINT, NO_RESPOND, REBOOT_ISSUED, REBOOT_REQUESTED, RESV, RESERVED, UNK, and UNKNOWN.

Nodes that are "drained" must obviously *NOT* be powered off, but the "down" state is also used at our site whenever we perform software and firmware updates. In fact, I would like to propose that "idle" should be the only state eligible for suspend/powering-down when dealing with on-premise (non-cloud) nodes! Hopefully this could be implemented in the slurmctld code for all nodes that do *not* have a state=cloud. If this is not feasible, yet another slurm.conf parameter might have to be introduced, for example:

    SuspendExcStates=down,drained,fail,maint,reboot_issued,reserved,unknown

Furthermore, please consider also bug 15184 comment 8, where it has been found that slurmctld will start a job even if not all nodes assigned to the job have yet been resumed successfully.
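To show how the proposed parameter would sit alongside the existing power_save configuration, here is a hypothetical slurm.conf excerpt. The SuspendProgram/ResumeProgram paths and the SuspendTime value are illustrative, and SuspendExcStates is the proposal above, not a parameter that exists in 21.08/22.05:

```
# Existing power_save knobs (paths and timing are illustrative):
SuspendProgram=/usr/local/sbin/node_suspend.sh
ResumeProgram=/usr/local/sbin/node_resume.sh
SuspendTime=1800          # seconds idle before a node is suspended

# Proposed: never suspend nodes in any of these states
SuspendExcStates=down,drained,fail,maint,reboot_issued,reserved,unknown
```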
The Jülich customer site has had to develop a lot of complex logic to work around problems with the power_save plugin :-(

I hope this request makes sense.

**Skyler Malinowski (comment #6):**

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #4)
> Nodes that are "drained" must obviously *NOT* be powered off, but the
> "down" state is also used at our site whenever we perform software and
> firmware updates. In fact, I would like to propose that "idle" should be
> the only state eligible for suspend/powering-down when dealing with
> on-premise (non-cloud) nodes!

I can see the benefit of only considering IDLE nodes for power_save. DOWN and DRAIN in particular would make sense not to suspend when nodes are in those states or have those flags. Certain interactions will need to be reconsidered (e.g. POWER_DOWN_FORCE, POWER_DOWN_ASAP) and fixed/adjusted accordingly.

It is entirely possible that this cannot be fixed as a bug due to the scope, or due to knock-on effects that would ripple into bug 15184. Instead this ticket would be marked as a duplicate of bug 15184 or, at minimum, closed as info-given because of the workaround. I will need to talk internally about how we want to handle this.

> Hopefully this could be implemented in the slurmctld code for all nodes that
> do *not* have a state=cloud.

I do not like branching/conditional handling for CLOUD vs non-CLOUD nodes. I want both to be handled in the same way; otherwise admin/user expectations and understanding can get muddled. Besides, the rationale for not powering down IDLE and DRAIN nodes is the same regardless of CLOUD or non-CLOUD nodes -- not having power_save interact with the node for debugging or maintenance reasons.

> Furthermore, please consider also bug 15184 comment 8, where it has been
> found that slurmctld will start a job even if not all nodes assigned to the
> job have yet been resumed successfully.
> The Jülich customer site has had to develop a lot of complex logic to work
> around problems with the power_save plugin :-(

There certainly is room for Slurm improvements! And I would imagine that this will be addressed in a future release of Slurm.

Your suggestions do make sense and are appreciated, but some are out of the scope of this ticket. We will take your words into consideration for bug 15184 and bug 15185. Thank you for voicing similarly felt shortcomings of Slurm.

**Ole.H.Nielsen@fysik.dtu.dk:**

Hi Skyler,

(In reply to Skyler Malinowski from comment #6)
> It is entirely possible that this cannot be fixed as a bug due to the scope,
> or due to knock-on effects that would ripple into bug 15184. Instead this
> ticket would be marked as a duplicate of bug 15184 or, at minimum, closed as
> info-given because of the workaround. I will need to talk internally about
> how we want to handle this.

OK, I appreciate that the power_save plugin is quite complex and needs to be fixed very carefully in a coming release, hopefully in 23.02.

> I do not like branching/conditional handling for CLOUD vs non-CLOUD nodes. I
> want both to be handled in the same way; otherwise admin/user expectations
> and understanding can get muddled. Besides, the rationale for not powering
> down IDLE and DRAIN nodes is the same regardless of CLOUD or non-CLOUD
> nodes -- not having power_save interact with the node for debugging or
> maintenance reasons.

I agree with this argument.

> > Furthermore, please consider also bug 15184 comment 8, where it has been
> > found that slurmctld will start a job even if not all nodes assigned to
> > the job have yet been resumed successfully. The Jülich customer site has
> > had to develop a lot of complex logic to work around problems with the
> > power_save plugin :-(
>
> There certainly is room for Slurm improvements! And I would imagine that
> this will be addressed in a future release of Slurm.

Yes, please! The soaring prices of electricity in Europe make this a high-priority concern for HPC sites.
> Your suggestions do make sense and are appreciated, but some are out of the
> scope of this ticket. We will take your words into consideration for bug
> 15184 and bug 15185. Thank you for voicing similarly felt shortcomings of
> Slurm.

I appreciate your attentiveness! I hope that SchedMD will act to help customers save on their increasing electricity bills. At this time the power_save plugin is unfortunately somewhat lacking.

Best regards,
Ole

**Skyler Malinowski:**

Hi Ole,

I will mark this ticket as a duplicate of bug 15184. That ticket encompasses your request and should be the one where the changes are made. It does not make sense for me to make intermediate changes for this ticket.

Thanks,
Skyler

*** This ticket has been marked as a duplicate of ticket 15184 ***