On our site we have reached the conclusion that we don't want the SuspendProgram to power down nodes that are in state DRAIN, and maybe also DOWN. I had opened a similar ticket (Bug 14989), but in the end we discussed and solved there the issue of sending the wrong nodelist to ResumeProgram. Now I want to ask you to change the current behavior of Slurm where offline nodes are powered down. In Bug 14989 you told us:

```
It is expected that Slurm will suspend drained/down nodes. Once a node is
drained/down though, Slurm should not try and resume it again, and you should
be able to perform any maintenance needed.
```

But this is not good for us. We have diskless compute nodes, where of course we send syslogs to remote service nodes, but when we power them off we lose their whole state, and debugging is sometimes impossible if we power them back on.

Currently we have implemented a workaround where the SuspendProgram does not power off nodes in state DRAIN (and maybe DOWN; we are still investigating whether we want this), but it's not a perfect solution: slurmctld thinks that those nodes are POWERED_DOWN, which is not true, we just fake it. We also made the ResumeProgram perform the correct steps to bring the nodes successfully to the POWERED_ON state even if the nodes were already up.

So we are asking for a proper way/implementation to tell slurmctld not to power off drained nodes. One way that I can think of, which shouldn't need much effort, is to add a new parameter in slurm.conf where we define all node states that we want to exclude from suspending, e.g. "SuspendExcStates=..". A more advanced solution would be to skip suspending nodes that have a certain substring in their reason, e.g. "slurm_skip_suspend", and perhaps make this tag configurable in slurm.conf.

What do you think? Can you implement it? This is one of our requirements, not just a suggestion for improving Slurm.
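The workaround described above could be sketched as a SuspendProgram wrapper along these lines. This is a hypothetical illustration, not the site's actual script: the `skip_state` helper, the use of `sinfo`/`scontrol` to expand and inspect the nodelist, and the commented-out IPMI power-off command are all assumptions.

```shell
#!/bin/bash
# Hypothetical SuspendProgram wrapper: refuse to power off nodes whose state
# string contains DRAIN or DOWN, and power off the rest.

# Return success (0) if a node state string should be excluded from suspend.
skip_state() {
    case "$1" in
        *DRAIN*|*DOWN*) return 0 ;;
        *)              return 1 ;;
    esac
}

# slurmctld invokes SuspendProgram with the nodelist as $1.
if [ -n "${1:-}" ]; then
    for node in $(scontrol show hostnames "$1"); do
        state=$(sinfo -h -n "$node" -o '%T' | tr '[:lower:]' '[:upper:]')
        if skip_state "$state"; then
            logger -t slurm-suspend "skipping $node (state $state)"
            continue
        fi
        # Site-specific power-off goes here, e.g. via IPMI:
        # ipmitool -H "$node-bmc" chassis power off
    done
fi
```

Note that slurmctld still marks the skipped nodes as powered down, which is exactly the "faking it" problem described above; the wrapper only prevents the physical power-off.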
Hi Chrysovalantis,

(In reply to Chrysovalantis Paschoulas from comment #0)
> Currenlty we implemented a workaround where in SuspendProgram we do not
> power off nodes in state DRAIN (and maybe DOWN - we are still investigating
> if we want this) but it's not a perfect solution, slurmctld thinks that
> those nodes are POWERED_DOWN which is not true, we just fake it. And we made
> also the ResumeProgram to do the correct steps to bring the nodes
> successfully to POWERED_ON state even if nodes were already up.

I reported a similar issue in Bug 15561 because we don't want drained nodes to be suspended and then resumed. As a workaround I removed idle_on_node_suspend from SlurmctldParameters so that drained nodes won't be resumed.

Would you be willing to share your code modifications for SuspendProgram and ResumeProgram? No need for me to reinvent the wheel :-) If possible, I would like to add these workarounds to my scripts in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

Thanks,
Ole
Created attachment 28112 [details] SuspendProgram
Created attachment 28113 [details] ResumeProgram
Created attachment 28114 [details] ResumeFailProgram
Created attachment 28115 [details] Part of PrologSlurmctld script
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #1)
> Hi Chrysovalantis,
>
> (In reply to Chrysovalantis Paschoulas from comment #0)
> > Currenlty we implemented a workaround where in SuspendProgram we do not
> > power off nodes in state DRAIN (and maybe DOWN - we are still investigating
> > if we want this) but it's not a perfect solution, slurmctld thinks that
> > those nodes are POWERED_DOWN which is not true, we just fake it. And we made
> > also the ResumeProgram to do the correct steps to bring the nodes
> > successfully to POWERED_ON state even if nodes were already up.
>
> I reported a similar issue in Bug 15561 because we don't want drained nodes
> to be suspended and then resumed. As a workaround I removed
> idle_on_node_suspend from SlurmctldParameters so that drained nodes won't be
> resumed.
>
> Would you be willing to share your code modifications for SuspendProgram and
> ResumeProgram? No need for me to reinvent the wheel :-) If possible, I
> would like to add these workarounds to my scripts in
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

Dear Ole,

I have uploaded our scripts; I hope they will help you. Some extra info:
- we run slurmd with the "-b" option, which means every time it is restarted the system boot time (inside the slurmd state) is updated
- part of our logic is implemented in PrologSlurmctld, where we wait until all nodes of the job have been powered on before we start the job

As you can see, the Slurm powersave mechanism is not perfect and there is a lot of room for improvement. We don't power down DOWN or DRAIN nodes, and we could extend this to nodes in MAINT too. We are also doing awkward workarounds where we restart slurmd in various places to make sure slurmctld clears the POWERING_UP/DOWN states, otherwise we would hit the resume/suspend timeouts.

Best Regards,
Valantis
Dear Valantis,

(In reply to Chrysovalantis Paschoulas from comment #6)
> I have uploaded our scripts. I hope they will help you..

Thanks a lot! The scripts contain some pretty complex logic which will take some time for me to figure out :-)

> Some extra infos:
> - we run slurmd with the "-b" option, which means every time it's restarted
> the system boot time (inside slurmd state) is updated

Can I ask why you need this?

> - part of our logic is implemented in PrologSlurmctld, where we wait until
> all nodes of the job have been powered on and then we start the job

OK, but slurmctld ought to do this, right?

> As we can see Slurm powersave mechanism is not perfect and there is a lot, a
> lot, of space for improvement. We don't powered down downed or drained
> nodes, we could extend this for nodes in MAINT too. And we are doing stupid
> workarounds where we restart slurmd in various places to make sure slurmctld
> stops the POWERING_UP|DOWN states otherwise we will reach the resume|suspend
> timeouts..

I agree that the power_save plugin needs significant improvement to work correctly with IPMI power-managed nodes. The power_save Slurm plugin was probably developed mainly for cloud nodes, but with current electricity prices there is some motivation to save power with on-premise nodes too.

Best regards,
Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> > - we run slurmd with the "-b" option, which means every time it's restarted
> > the system boot time (inside slurmd state) is updated
>
> Can I ask why you need this?

Because, for example, we don't power down nodes that are drained, but when we later want to bring them online we want the resume program to make it look as if they were just powered on; hence we need the -b option. So for the old drained nodes (which we fake to slurmctld as powered down), we want to be able to undrain them and then run the resume program (which restarts slurmd at the end to set a new boot time); no power-down/reboot is involved here. The powersave mechanism on the slurmctld side needs the slurmds to report a new "node" boot time.

> > - part of our logic is implemented in PrologSlurmctld, where we wait until
> > all nodes of the job have been powered on and then we start the job
>
> OK, but slurmctld ought to do this, right?

Unfortunately not. First, we saw that PrologSlurmctld starts immediately after the first resume program has been spawned, whereas I would expect slurmctld to wait until all resume programs of a job have finished and then run the prolog script. Slurm doesn't even care about the exit code of the resume programs at all. I am not sure whether `sbatch_wait_nodes` affects PrologSlurmctld at all; I will have to check, but I think not. Second, we have some other node health checks in PrologSlurmctld, and we want to wait first for the nodes to come up correctly and then run the checks on them.

> I agree that the power_save plugin needs significant improvement for working
> correctly with IPMI power managed nodes. The power_save Slurm plugin was
> probably developed mainly for cloud nodes. But with current electricity
> prices, there is some motivation to save power with on-premise nodes.

Exactly!

Best Regards,
Valantis
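The PrologSlurmctld waiting logic described above could be sketched roughly as follows. This is a hypothetical fragment, not the attached site script: the timeout value, the 10-second poll interval, and the exact state string printed by `sinfo -o '%T'` (which varies across Slurm versions) are all assumptions.

```shell
#!/bin/bash
# Hypothetical PrologSlurmctld fragment: poll until no node of the job is
# still in a POWERING_UP state, or give up after a timeout. slurmctld sets
# SLURM_JOB_NODELIST in the prolog environment.

wait_for_power_up() {
    local timeout=${1:-600} waited=0 still_up
    while [ "$waited" -lt "$timeout" ]; do
        # Count job nodes whose state (%T) still contains "powering_up".
        still_up=$(sinfo -h -N -n "$SLURM_JOB_NODELIST" -o '%T' \
                   | grep -ci 'powering_up')
        [ "$still_up" -eq 0 ] && return 0
        sleep 10
        waited=$((waited + 10))
    done
    return 1    # timed out; a non-zero prolog exit fails/requeues the job
}

if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    wait_for_power_up 600 || exit 1
fi
```

After this wait succeeds, the node health checks mentioned above would run against nodes that are actually up, rather than racing against the resume programs.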
Hi Valantis,

We keep having nodes that are in the Slurm "down" state with a reason of "ResumeTimeout reached" because they were never powered down in reality by the slurmctld power_save plugin :-( So I want to test your "slurmd -b" trick so that we can simply resume the nodes instead of rebooting them.

(In reply to Chrysovalantis Paschoulas from comment #8)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> > > - we run slurmd with the "-b" option, which means every time it's restarted
> > > the system boot time (inside slurmd state) is updated
> >
> > Can I ask why you need this?
>
> Because e.g. we don't power down nodes that are drained, but then when we
> want to online them we want the resume program to make it look like they
> were just powered on, hence we need that -b option. So the old drained nodes
> (that we are faking them to slurmctld that are powered down) we want to be
> able to undrain them and then run the resume program (which will run restart
> slurmd in the end to give new boot time), no powerdown/reboot is involved
> here. The powersave mechanism from slurmctld side needs the slurmds to have
> a new "node" boot time.

For the record, I had to find out how to modify Systemd on the CentOS 7 nodes to add "-b" automatically (this is non-trivial, to me at least). The solution is to create the file /etc/systemd/system/slurmd.service.d/override.conf with this content:

[Service]
Environment="SLURMD_OPTIONS=-b"

and then restart slurmd as follows:

$ systemctl daemon-reload
$ systemctl restart slurmd

Do you have any comments or suggestions?

Thanks,
Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #10)
> For the record, I had to find out how to modify Systemd on the CentOS 7
> nodes to add "-b" automatically (this is non-trivial to me at least). The
> solution is to create the file
> /etc/systemd/system/slurmd.service.d/override.conf with this content:
>
> [Service]
> Environment="SLURMD_OPTIONS=-b"
>
> and then restart slurmd as follows:
>
> $ systemctl daemon-reload
> $ systemctl restart slurmd
>
> Do you have any comments or suggestions?

Hi Ole!

That's one possible way to do it ;) The rpms include unit files under /usr/lib/systemd/system/, and you correctly override them with drop-in files under /etc/systemd/system/<name>.service.d/

The slurmd unit file contains:
```
...
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
EnvironmentFile=-/etc/default/slurmd
ExecStart=<path>/slurmd -D -s $SLURMD_OPTIONS
...
```

So you could achieve the same with a drop-in file like this (the empty ExecStart= line is needed to clear the original ExecStart before replacing it):
```
[Service]
ExecStart=
ExecStart=<path>/slurmd -D -s -b
```

But as you can see, the original unit file already has the line:
```
EnvironmentFile=-/etc/sysconfig/slurmd
```

So the proper way to do it is to create/edit the file "/etc/sysconfig/slurmd" and set:
```
SLURMD_OPTIONS=-b
```
This way you don't need to create any drop-in file.

Now you learnt something new :P

Best Regards,
Valantis
Hi Valantis,

(In reply to Chrysovalantis Paschoulas from comment #11)
> But as you can see in the original unit file they have already the line:
> ```
> EnvironmentFile=-/etc/sysconfig/slurmd
> ```
>
> So the proper way to do it is by creating/editing file
> "/etc/sysconfig/slurmd" where you will set:
> ```
> SLURMD_OPTIONS=-b
> ```
> In this way you don't need to create any drop-in file.

Yes, this is even simpler than the Systemd solution.

> Now you learnt something new :P

Yes indeed :-)

Thanks a lot,
Ole
*** Bug 15561 has been marked as a duplicate of this bug. ***
Chrysovalantis, We added SuspendExcStates as an option in slurm.conf. The change is on the master branch and should be part of the 23.02 release. (See commits: 6838bee3a0 through 94ce25675b) Valid options include Down, Drain and Planned. (Planned is a state set by the backfill scheduler for nodes it plans on using). -Scott
(In reply to Scott Hilton from comment #19) > Chrysovalantis, > > We added SuspendExcStates as an option in slurm.conf. The change is on the > master branch and should be part of the 23.02 release. (See commits: > 6838bee3a0 through 94ce25675b) > > Valid options include Down, Drain and Planned. (Planned is a state set by > the backfill scheduler for nodes it plans on using). > > -Scott Hi Scott! That's great news, thanks! -Valantis
Hi Scott,

(In reply to Scott Hilton from comment #19)
> We added SuspendExcStates as an option in slurm.conf. The change is on the
> master branch and should be part of the 23.02 release. (See commits:
> 6838bee3a0 through 94ce25675b)
>
> Valid options include Down, Drain and Planned. (Planned is a state set by
> the backfill scheduler for nodes it plans on using).

This is great! My suggestion in bug 15561 comment 4 seems to be accommodated by the new SuspendExcStates option. I look forward to trying this out with 23.02.

I am unable to find the commits in https://github.com/SchedMD/slurm/commits/master. Can you give more info?

Thanks,
Ole
It seems to me that we need additional states to be exempted from power_saving. In bug 15561 comment 4 I suggested that we need to exempt nodes with the following states (plus "planned"): SuspendExcStates=down,drained,fail,maint,reboot_issued,reserved,unknown Thanks, Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #22)
> It seems to me that we need additional states to be exempted from
> power_saving. In bug 15561 comment 4 I suggested that we need to exempt
> nodes with the following states (plus "planned"):
>
> SuspendExcStates=down,drained,fail,maint,reboot_issued,reserved,unknown

I agree with that. In general all states should be supported except ALLOCATED, IDLE, MIXED, CLOUD and all POWER_* states. Okay, maybe I missed some states that don't make sense to exclude from suspension... :P

-Valantis
Chrysovalantis,

We added many more options for SuspendExcStates.

Valid states include CLOUD, DOWN, DRAIN, DYNAMIC_FUTURE, DYNAMIC_NORM, FAIL, INVALID_REG, MAINTENANCE, NOT_RESPONDING, PERFCTRS, PLANNED, and RESERVED.

-Scott
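For reference, enabling the new option might look like the following slurm.conf fragment. This is only an illustration of one possible configuration: the chosen state list, the SuspendTime value, and the script paths are placeholders, not values recommended in this ticket.

```
# Never suspend nodes an admin has taken offline, so their in-memory
# state survives for debugging (the original request in this ticket):
SuspendExcStates=DOWN,DRAIN,MAINTENANCE

# Surrounding power-save settings (example values and paths):
SuspendTime=1800
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeProgram=/usr/local/sbin/slurm_resume.sh
```

With this in place, slurmctld itself skips nodes in the excluded states, so the SuspendProgram workaround of faking a power-off is no longer needed.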
See commit 5f2cba52c6
(In reply to Scott Hilton from comment #30) > See commit 5f2cba52c6 Sorry, this commit doesn't come up in https://github.com/SchedMD/slurm/commits/master Can you provide the correct link?
https://github.com/SchedMD/slurm/commit/5f2cba52c6e030840990f1677f50363137123606
(In reply to Scott Hilton from comment #29) > Chrysovalantis, > > We added many more options for SuspendExcStates. > > Valid states include CLOUD, DOWN, DRAIN, DYNAMIC_FUTURE, DYNAMIC_NORM, FAIL, > INVALID_REG, MAINTENANCE, NOT_RESPONDING, PERFCTRS, PLANNED, and RESERVED. > > -Scott Thank you Scott!
Glad we could help.