Bug 15184 - SuspendProgram: do not suspend drained/offline nodes
Summary: SuspendProgram: do not suspend drained/offline nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 21.08.8
Hardware: Linux Linux
Importance: --- 5 - Enhancement
Assignee: Scott Hilton
QA Contact:
URL:
Duplicates: 15561
Depends on:
Blocks:
 
Reported: 2022-10-16 09:16 MDT by Chrysovalantis Paschoulas
Modified: 2023-01-24 09:48 MST
CC List: 3 users

See Also:
Site: Jülich
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02pre1
Target Release: 23.11
DevPrio: 1 - Paid
Emory-Cloud Sites: ---


Attachments
SuspendProgram (4.98 KB, application/x-shellscript), 2022-12-09 07:53 MST, Chrysovalantis Paschoulas
ResumeProgram (5.29 KB, application/x-shellscript), 2022-12-09 07:53 MST, Chrysovalantis Paschoulas
ResumeFailProgram (704 bytes, application/x-shellscript), 2022-12-09 07:54 MST, Chrysovalantis Paschoulas
Part of PrologSlurmctld script (1.80 KB, application/x-shellscript), 2022-12-09 07:55 MST, Chrysovalantis Paschoulas

Description Chrysovalantis Paschoulas 2022-10-16 09:16:08 MDT
On our site we have reached the conclusion that we don't want SuspendProgram to power down nodes that are in state DRAIN, and maybe also DOWN.

I had opened a similar ticket (Bug 14989), but in the end what we discussed and solved there was the issue of a wrong nodelist being sent to ResumeProgram.

Now I want to ask you to change the current behavior of Slurm where offline nodes are powered down. In Bug 14989 you told us:
```
It is expected that Slurm will suspend drained/down nodes. Once a node is drained/down though, Slurm should not try and resume it again, and you should be able to perform any maintenance needed.
```
But this is not good for us. We have diskless compute nodes; we do of course send syslogs to remote service nodes, but when we power a node off we lose its whole state, and debugging is sometimes impossible once we power it back on.

Currently we have implemented a workaround where SuspendProgram does not power off nodes in state DRAIN (and maybe DOWN; we are still investigating whether we want this), but it's not a perfect solution: slurmctld thinks those nodes are POWERED_DOWN, which is not true, we just fake it. We also made ResumeProgram perform the correct steps to bring the nodes successfully to the POWERED_ON state even if they were already up.

So we are asking for a proper way to tell slurmctld not to power off drained nodes. One way that I can think of, which shouldn't need much effort, is to add a new parameter in slurm.conf where we would define all node states to exclude from suspending, e.g. "SuspendExcStates=..". A more advanced solution would be to skip suspending nodes whose reason contains a certain substring, e.g. "slurm_skip_suspend"; this tag could perhaps be made configurable in slurm.conf.
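
For example, something like this in slurm.conf (the parameter name and syntax here are only a suggestion at this point):
```
# Proposal: never suspend nodes that are in any of these states
SuspendExcStates=DRAIN,DOWN
```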

What do you think? Can you implement it? This is one of our requirements; it is not just a suggestion for improving Slurm.
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2022-12-07 12:53:22 MST
Hi Chrysovalantis,

(In reply to Chrysovalantis Paschoulas from comment #0)
> Currenlty we implemented a workaround where in SuspendProgram we do not
> power off nodes in state DRAIN (and maybe DOWN - we are still investigating
> if we want this) but it's not a perfect solution, slurmctld thinks that
> those nodes are POWERED_DOWN which is not true, we just fake it. And we made
> also the ResumeProgram to do the correct steps to bring the nodes
> successfully to POWERED_ON state even if nodes were already up.
I reported a similar issue in Bug 15561 because we don't want drained nodes to be suspended and then resumed.  As a workaround I removed idle_on_node_suspend from SlurmctldParameters so that drained nodes won't be resumed.
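
In slurm.conf terms the workaround is simply dropping that flag (keeping any other flags that were configured on the same line), roughly:
```
# Before: suspended nodes are marked idle, so drained nodes can be resumed again
SlurmctldParameters=idle_on_node_suspend
# After the workaround: the flag is removed, so drained nodes stay drained
```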

Would you be willing to share your code modifications for SuspendProgram and ResumeProgram?  No need for me to reinvent the wheel :-)  If possible, I would like to add these workarounds to my scripts in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

Thanks,
Ole
Comment 2 Chrysovalantis Paschoulas 2022-12-09 07:53:22 MST
Created attachment 28112 [details]
SuspendProgram
Comment 3 Chrysovalantis Paschoulas 2022-12-09 07:53:57 MST
Created attachment 28113 [details]
ResumeProgram
Comment 4 Chrysovalantis Paschoulas 2022-12-09 07:54:22 MST
Created attachment 28114 [details]
ResumeFailProgram
Comment 5 Chrysovalantis Paschoulas 2022-12-09 07:55:20 MST
Created attachment 28115 [details]
Part of PrologSlurmctld script
Comment 6 Chrysovalantis Paschoulas 2022-12-09 08:02:07 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #1)
> Hi Chrysovalantis,
> 
> (In reply to Chrysovalantis Paschoulas from comment #0)
> > Currenlty we implemented a workaround where in SuspendProgram we do not
> > power off nodes in state DRAIN (and maybe DOWN - we are still investigating
> > if we want this) but it's not a perfect solution, slurmctld thinks that
> > those nodes are POWERED_DOWN which is not true, we just fake it. And we made
> > also the ResumeProgram to do the correct steps to bring the nodes
> > successfully to POWERED_ON state even if nodes were already up.
> I reported a similar issue in Bug 15561 because we don't want drained nodes
> to be suspended and then resumed.  As a workaround I removed
> idle_on_node_suspend from SlurmctldParameters so that drained nodes won't be
> resumed.
> 
> Would you be willing to share your code modifications for SuspendProgram and
> ResumeProgram?  No need for me to reinvent the wheel :-)  If possible, I
> would like to add these workarounds to my scripts in
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
> 
> Thanks,
> Ole

Dear Ole,

I have uploaded our scripts. I hope they will help you..

Some extra infos:
- we run slurmd with the "-b" option, which means every time it's restarted the system boot time (inside slurmd state) is updated
- part of our logic is implemented in PrologSlurmctld, where we wait until all nodes of the job have been powered on and then we start the job
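
A very simplified sketch of that waiting logic (not the actual attached script, which does more; the timeout value here is only illustrative):
```
#!/bin/bash
# Wait until no node of the job is still powering up / unreachable.
# SLURM_JOB_NODELIST is provided to PrologSlurmctld by slurmctld.
deadline=$((SECONDS + 600))
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    while scontrol show node "$node" | grep -Eq 'POWERING_UP|POWERED_DOWN|NOT_RESPONDING'; do
        if (( SECONDS >= deadline )); then
            echo "Timed out waiting for $node to power up" >&2
            exit 1   # non-zero exit makes slurmctld requeue the job
        fi
        sleep 10
    done
done
exit 0
```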

As we can see, the Slurm power-save mechanism is not perfect and there is a lot of room for improvement. We don't power down DOWN or DRAINED nodes, and we could extend this to nodes in MAINT too. And we are doing stupid workarounds where we restart slurmd in various places to make sure slurmctld clears the POWERING_UP/POWERING_DOWN states, otherwise we hit the resume/suspend timeouts.

Best Regards,
Valantis
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2022-12-12 07:33:09 MST
Dear Valantis,

(In reply to Chrysovalantis Paschoulas from comment #6)
> I have uploaded our scripts. I hope they will help you..

Thanks a lot!  The scripts contain some pretty complex logic which will take some time for me to figure out :-)

> Some extra infos:
> - we run slurmd with the "-b" option, which means every time it's restarted
> the system boot time (inside slurmd state) is updated

Can I ask why you need this?

> - part of our logic is implemented in PrologSlurmctld, where we wait until
> all nodes of the job have been powered on and then we start the job

OK, but slurmctld ought to do this, right?

> As we can see Slurm powersave mechanism is not perfect and there is a lot, a
> lot, of space for improvement. We don't powered down downed or drained
> nodes, we could extend this for nodes in MAINT too. And we are doing stupid
> workarounds where we restart slurmd in various places to make sure slurmctld
> stops the POWERING_UP|DOWN states otherwise we will reach the resume|suspend
> timeouts..

I agree that the power_save plugin needs significant improvement for working correctly with IPMI power managed nodes.  The power_save Slurm plugin was probably developed mainly for cloud nodes.  But with current electricity prices, there is some motivation to save power with on-premise nodes.

Best regards,
Ole
Comment 8 Chrysovalantis Paschoulas 2022-12-12 10:47:03 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> > - we run slurmd with the "-b" option, which means every time it's restarted
> > the system boot time (inside slurmd state) is updated
> 
> Can I ask why you need this?
> 
Because, for example, we don't power down nodes that are drained, but when we later want to bring them online we want the resume program to make it look like they were just powered on; hence we need that -b option. For the old drained nodes (which we fake to slurmctld as powered down) we want to be able to undrain them and then run the resume program (which restarts slurmd at the end to give a new boot time); no power-down or reboot is involved here. The power-save mechanism on the slurmctld side needs the slurmds to report a new "node" boot time.
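
As a minimal sketch, the relevant final step of such a ResumeProgram could be just this (assuming slurmd runs under systemd with SLURMD_OPTIONS=-b and the controller can ssh to the nodes):
```
#!/bin/bash
# ResumeProgram receives the hostlist of nodes to resume as $1.
for node in $(scontrol show hostnames "$1"); do
    # Restarting slurmd (started with -b) reports a fresh boot time,
    # so slurmctld's power-save logic treats the node as newly powered on.
    ssh "$node" systemctl restart slurmd &
done
wait
```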

> > - part of our logic is implemented in PrologSlurmctld, where we wait until
> > all nodes of the job have been powered on and then we start the job
> 
> OK, but slurmctld ought to do this, right?
> 
Unfortunately no. First, we saw that PrologSlurmctld starts immediately after the first resume program has been spawned, whereas I would expect/like slurmctld to wait until all resume programs of a job have finished and only then run the prolog script. Slurm doesn't even care about the exit code of the resume programs at all. I am not sure whether `sbatch_wait_nodes` affects PrologSlurmctld; I will have to check, but I think not. Second, we have some other node health checks in PrologSlurmctld, and we want to wait for the nodes to come up correctly before running the checks on them.

> I agree that the power_save plugin needs significant improvement for working
> correctly with IPMI power managed nodes.  The power_save Slurm plugin was
> probably developed mainly for cloud nodes.  But with current electricity
> prices, there is some motivation to save power with on-premise nodes.
> 
Exactly!

Best Regards,
Valantis
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2022-12-14 07:35:02 MST
Hi Valantis,

We keep having nodes that end up in the Slurm "down" state with a reason of "ResumeTimeout reached" because in reality they were never powered down by the slurmctld power_save plugin :-(  So I want to test your "slurmd -b" trick so that we can simply resume the nodes instead of rebooting them.

(In reply to Chrysovalantis Paschoulas from comment #8)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> > > - we run slurmd with the "-b" option, which means every time it's restarted
> > > the system boot time (inside slurmd state) is updated
> > 
> > Can I ask why you need this?
> > 
> Because e.g. we don't power down nodes that are drained, but then when we
> want to online them we want the resume program to make it look like they
> were just powered on, hence we need that -b option. So the old drained nodes
> (that we are faking them to slurmctld that are powered down) we want to be
> able to undrain them and then run the resume program (which will run restart
> slurmd in the end to give new boot time), no powerdown/reboot is involved
> here. The powersave mechanism from slurmctld side needs the slurmds to have
> a new "node" boot time.

For the record, I had to find out how to modify Systemd on the CentOS 7 nodes to add "-b" automatically (this is non-trivial to me at least).  The solution is to create the file /etc/systemd/system/slurmd.service.d/override.conf with this content:

[Service]
Environment="SLURMD_OPTIONS=-b"

and then restart slurmd as follows:

$ systemctl daemon-reload
$ systemctl restart slurmd
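
After that, the boot time that slurmctld has recorded for the node should move forward; one quick way to check (the node name is just an example):

$ scontrol show node node001 | grep BootTime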

Do you have any comments or suggestions?

Thanks,
Ole
Comment 11 Chrysovalantis Paschoulas 2022-12-14 08:57:03 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #10)
> 
> For the record, I had to find out how to modify Systemd on the CentOS 7
> nodes to add "-b" automatically (this is non-trivial to me at least).  The
> solution is to create the file
> /etc/systemd/system/slurmd.service.d/override.conf with this content:
> 
> [Service]
> Environment="SLURMD_OPTIONS=-b"
> 
> and then restart slurmd as follows:
> 
> $ systemctl daemon-reload
> $ systemctl restart slurmd
> 
> Do you have any comments or suggestions?
> 

Hi Ole!

That's one possible way to do that ;)

The RPMs include unit files under /usr/lib/systemd/system/, and you correctly override them with drop-in files under /etc/systemd/system/<name>.service.d/.

In the slurmd unit file they have:
```
...
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
EnvironmentFile=-/etc/default/slurmd
ExecStart=<path>/slurmd -D -s $SLURMD_OPTIONS
...
```

So you could achieve the same by having a drop-in file like this (note that ExecStart has to be cleared before it can be redefined in a drop-in):
```
[Service]
ExecStart=
ExecStart=<path>/slurmd -D -s -b
```

But as you can see, the original unit file already has the line:
```
EnvironmentFile=-/etc/sysconfig/slurmd
```

So the proper way to do it is to create/edit the file "/etc/sysconfig/slurmd" and set:
```
SLURMD_OPTIONS=-b
```
In this way you don't need to create any drop-in file.
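
For example (this overwrites any existing content of the file, so adapt it if you already use the file for other options; a daemon-reload is not needed because the unit file itself is unchanged):
```
echo 'SLURMD_OPTIONS=-b' > /etc/sysconfig/slurmd
systemctl restart slurmd
```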

Now you learnt something new :P

Best Regards,
Valantis
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2022-12-14 09:44:32 MST
Hi Valantis,

(In reply to Chrysovalantis Paschoulas from comment #11)
> But as you can see in the original unit file they have already the line:
> ```
> EnvironmentFile=-/etc/sysconfig/slurmd
> ```
> 
> So the proper way to do it is by creating/editing file
> "/etc/sysconfig/slurmd" where you will set:
> ```
> SLURMD_OPTIONS=-b
> ```
> In this way you don't need to create any drop-in file.

Yes, this is even simpler than the Systemd solution.

> Now you learnt something new :P

Yes indeed :-)

Thanks a lot,
Ole
Comment 14 Skyler Malinowski 2023-01-04 07:45:23 MST
*** Bug 15561 has been marked as a duplicate of this bug. ***
Comment 19 Scott Hilton 2023-01-16 17:19:28 MST
Chrysovalantis,

We added SuspendExcStates as an option in slurm.conf. The change is on the master branch and should be part of the 23.02 release. (See commits: 6838bee3a0 through 94ce25675b)

Valid options include Down, Drain and Planned. (Planned is a state set by the backfill scheduler for nodes it plans on using).

-Scott
Comment 20 Chrysovalantis Paschoulas 2023-01-17 01:33:21 MST
(In reply to Scott Hilton from comment #19)
> Chrysovalantis,
> 
> We added SuspendExcStates as an option in slurm.conf. The change is on the
> master branch and should be part of the 23.02 release. (See commits:
> 6838bee3a0 through 94ce25675b)
> 
> Valid options include Down, Drain and Planned. (Planned is a state set by
> the backfill scheduler for nodes it plans on using).
> 
> -Scott

Hi Scott!

That's great news, thanks!

-Valantis
Comment 21 Ole.H.Nielsen@fysik.dtu.dk 2023-01-17 02:28:32 MST
Hi Scott,

(In reply to Scott Hilton from comment #19)
> We added SuspendExcStates as an option in slurm.conf. The change is on the
> master branch and should be part of the 23.02 release. (See commits:
> 6838bee3a0 through 94ce25675b)
> 
> Valid options include Down, Drain and Planned. (Planned is a state set by
> the backfill scheduler for nodes it plans on using).

This is great!  My suggestion in bug 15561 comment 4 seems to be accommodated by the new SuspendExcStates option.  I look forward to trying this out with 23.02.

I am unable to find the commits in https://github.com/SchedMD/slurm/commits/master; can you give more info?

Thanks,
Ole
Comment 22 Ole.H.Nielsen@fysik.dtu.dk 2023-01-17 04:02:55 MST
It seems to me that we need additional states to be exempted from power_saving.  In bug 15561 comment 4 I suggested that we need to exempt nodes with the following states (plus "planned"):

SuspendExcStates=down,drained,fail,maint,reboot_issued,reserved,unknown

Thanks,
Ole
Comment 23 Chrysovalantis Paschoulas 2023-01-17 04:12:45 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #22)
> It seems to me that we need additional states to be exempted from
> power_saving.  In bug 15561 comment 4 I suggested that we need to exempt
> nodes with the following states (plus "planned"):
> 
> SuspendExcStates=down,drained,fail,maint,reboot_issued,reserved,unknown
> 
> Thanks,
> Ole

I also agree with that. In general, all states should be supported except ALLOCATED, IDLE, MIXED, CLOUD and all POWER_* states. Okay, maybe I missed some states that don't make sense to exclude from suspension... :P

-Valantis
Comment 29 Scott Hilton 2023-01-23 11:13:20 MST
Chrysovalantis,

We added many more options for SuspendExcStates. 

Valid states include CLOUD, DOWN, DRAIN, DYNAMIC_FUTURE, DYNAMIC_NORM, FAIL,
INVALID_REG, MAINTENANCE, NOT_RESPONDING, PERFCTRS, PLANNED, and RESERVED.
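
For example, a site could now keep drained, down and maintenance nodes out of power saving with a line like:
```
SuspendExcStates=DOWN,DRAIN,MAINTENANCE
```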

-Scott
Comment 30 Scott Hilton 2023-01-23 11:13:43 MST
See commit 5f2cba52c6
Comment 31 Ole.H.Nielsen@fysik.dtu.dk 2023-01-23 12:16:13 MST
(In reply to Scott Hilton from comment #30)
> See commit 5f2cba52c6

Sorry, this commit doesn't come up in https://github.com/SchedMD/slurm/commits/master
Can you provide the correct link?
Comment 33 Chrysovalantis Paschoulas 2023-01-24 02:37:14 MST
(In reply to Scott Hilton from comment #29)
> Chrysovalantis,
> 
> We added many more options for SuspendExcStates. 
> 
> Valid states include CLOUD, DOWN, DRAIN, DYNAMIC_FUTURE, DYNAMIC_NORM, FAIL,
> INVALID_REG, MAINTENANCE, NOT_RESPONDING, PERFCTRS, PLANNED, and RESERVED.
> 
> -Scott

Thank you Scott!
Comment 34 Scott Hilton 2023-01-24 09:48:54 MST
Glad we could help.