Ticket 17848 - power_save module doesn't work for the documented "no action" suspend/resume programs
Summary: power_save module doesn't work for the documented "no action" suspend/resume programs
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 23.02.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-10-06 04:13 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2023-11-16 01:14 MST
CC List: 2 users

See Also:
Site: DTU Physics
Version Fixed: 23.02.7, 23.11.0


Description Ole.H.Nielsen@fysik.dtu.dk 2023-10-06 04:13:55 MDT
The power_save module documentation https://slurm.schedmd.com/power_save.html states that:

"You can also configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram to assess the potential impact of power saving mode before enabling it."

We have tested just such a feature, in addition to our normal IPMI-based power saving.  This "no action" question came up in a thread on the slurm-users mailing list.

This is our testing setup:

I created a power_noaction script available in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save.  

We have configured suspend/resume scripts in slurm.conf:

ResumeProgram=/usr/local/bin/noderesume
SuspendProgram=/usr/local/bin/nodesuspend

These scripts will call a power_noaction script for our nodes which are configured with the "power_noaction" node feature:

NodeName=i[005-030] Weight=10313 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=128000 TmpDisk=198000 Feature=xeon2650v2,infiniband,xeon16,power_noaction
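In this spirit, the core of a "no action" script is tiny. This is a hedged sketch for illustration only, not the actual power_noaction script linked above:

```shell
#!/bin/bash
# Hypothetical "no action" suspend/resume sketch: record the request and
# deliberately do nothing, so only Slurm's power-save state machine is
# exercised. $1 is the hostlist expression that Slurm passes in.
power_noaction() {
    echo "power_noaction: called for nodes: $1"
    return 0
}

# A real SuspendProgram/ResumeProgram built on this would just run:
#   power_noaction "$1"
```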

Now we enable power saving for the partition xeon16_test with SuspendTime=60:

PartitionName=xeon16_test Nodes=i[005-006] DefaultTime=10:00 MaxTime=10:00 DefMemPerCPU=3900 MaxMemPerCPU=4050 State=UP OverSubscribe=NO TRESBillingWeights="CPU=0.75" SuspendTime=60
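For reference, the power-save knobs this test exercises, collected in one place. This is a sketch, not our complete slurm.conf; the partition line is abbreviated, and ResumeTimeout is the 2000-second value that shows up in the log below:

```
SuspendProgram=/usr/local/bin/nodesuspend
ResumeProgram=/usr/local/bin/noderesume
ResumeTimeout=2000
PartitionName=xeon16_test Nodes=i[005-006] State=UP SuspendTime=60
```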

Results of the testing:

1. After an "scontrol reconfig" the nodes i[005-006] correctly get "suspended" by the power_noaction script and get a new status of IDLE+POWERED_DOWN:

$ scontrol show node i005
NodeName=i005 Arch=x86_64 CoresPerSocket=8 
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.01
   AvailableFeatures=xeon2650v2,infiniband,xeon16,power_noaction
   ActiveFeatures=xeon2650v2,infiniband,xeon16,power_noaction
   Gres=(null)
   NodeAddr=i005 NodeHostName=i005 Version=23.02.5
   OS=Linux 3.10.0-1160.99.1.el7.x86_64 #1 SMP Wed Sep 13 14:19:20 UTC 2023 
   RealMemory=128000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=1 TmpDisk=198000 Weight=10313 Owner=N/A MCS_label=N/A
   Partitions=xeon16_test 
   BootTime=2023-09-21T18:40:14 SlurmdStartTime=2023-09-26T20:24:42
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=16,mem=125G,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Please note that the node is NOT powered down in reality!!  The node is still up and running, and slurmd is running since 2023-09-26T20:24:42.

2. I submit a job to the xeon16_test partition, and now node i005 changes to State=ALLOCATED+NOT_RESPONDING+POWERING_UP, which sounds OK.

3. But after the ResumeTimeout = 2000 sec has been exceeded, the node gets a new State=DOWN+POWERED_DOWN:

$ scontrol show node i005
NodeName=i005 Arch=x86_64 CoresPerSocket=8 
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.01
   AvailableFeatures=xeon2650v2,infiniband,xeon16,power_noaction
   ActiveFeatures=xeon2650v2,infiniband,xeon16,power_noaction
   Gres=(null)
   NodeAddr=i005 NodeHostName=i005 Version=23.02.5
   OS=Linux 3.10.0-1160.99.1.el7.x86_64 #1 SMP Wed Sep 13 14:19:20 UTC 2023 
   RealMemory=128000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN+POWERED_DOWN ThreadsPerCore=1 TmpDisk=198000 Weight=10313 Owner=N/A MCS_label=N/A
   Partitions=xeon16_test 
   BootTime=2023-09-21T18:40:14 SlurmdStartTime=2023-09-26T20:24:42
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=16,mem=125G,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=ResumeTimeout reached [slurm@2023-10-06T09:52:04]

The job submission and failed resume are logged by slurmctld:

$ grep i005 /var/log/slurm/slurmctld.log
[2023-10-06T09:18:34.456] sched: Allocate JobId=6738703 NodeList=i005 #CPUs=16 Partition=xeon16_test
[2023-10-06T09:52:04.636] node i005 not resumed by ResumeTimeout(2000) - marking down and power_save
[2023-10-06T09:52:04.636] Killing JobId=6738703 on failed node i005

Conclusion: This seems to prove that the power_save module doesn't work if we configure the "no action" suspend/resume method, which is documented in the power_save manual page.

Question: Is the documented "no action" suspend/resume method not valid any longer?  If it ought to work, the above test would indicate an issue with the power_save module.

Thanks,
Ole
Comment 1 Davide DelVento 2023-10-06 06:54:25 MDT
Thanks Ole!

As mentioned in the mailing list, I confirm this same unexpected behavior in slurm 23.02.3 too.
Comment 2 Marshall Garey 2023-10-06 09:27:39 MDT
This is working as expected. You verified that the scripts ran when they were supposed to run and that slurmctld changed node states, waiting for action by the powersave scripts. Because the powersave scripts took no action, when the appropriate timeouts happened slurmctld then changed node states again. This shows the powersave module working.

https://slurm.schedmd.com/power_save.html
```
SuspendProgram: Program to be executed to place nodes into power saving mode.
```

https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendProgram
```
This program is expected to place the node into some power save mode.
```

SuspendProgram is expected to place the node into a power saving mode. slurmctld just sets the node state to powering down, then to powered down. slurmctld cannot verify that the node was powered off, so it does not try.


https://slurm.schedmd.com/power_save.html
```
ResumeProgram: Program to be executed to remove nodes from power saving mode....
This program may use the scontrol show node command to ensure that a node has booted and the slurmd daemon started. If the slurmd daemon fails to respond within the configured ResumeTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued. If the node isn't actually rebooted (i.e. when multiple-slurmd is configured) you can start slurmd with the "-b" option to report the node boot time as now.
```

https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeProgram
```
ResumeProgram
If ResumeProgram is unable to restore a node to service with a responding slurmd and an updated BootTime, it should set the node state to DOWN, which will result in a requeue of any job associated with the node - this will happen automatically if the node doesn't register within ResumeTimeout. If the node isn't actually rebooted (i.e. when multiple-slurmd is configured) starting slurmd with "-b" option might be useful.
```

Right before this program runs, slurmctld places the node into a powering-up state. ResumeProgram is expected to either reboot the node and restart slurmd, or restart slurmd with the -b flag. If neither of these things happens before ResumeTimeout expires, then slurmctld places the node into a down state and requeues the job if possible. Then the ResumeFailProgram is executed.
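A minimal sketch of a ResumeProgram along the lines of that second option. Everything here is an assumption rather than something from this ticket: the remote-exec mechanism (ssh as a user allowed to run slurmd), the use of `scontrol show hostnames` to expand the hostlist, and running `slurmd -b` directly on each node:

```shell
#!/bin/bash
# Hypothetical ResumeProgram sketch. resume_nodes takes a remote-exec
# command (e.g. ssh) followed by already-expanded node names, and
# restarts slurmd with -b on each node so that slurmd reports BootTime
# as "now", satisfying slurmctld's ResumeTimeout check without a reboot.
resume_nodes() {
    local remote=$1; shift
    local node
    for node in "$@"; do
        "$remote" "$node" "slurmd -b"
    done
}

# In a real script, Slurm passes a hostlist expression as $1, so the
# entry point would look something like:
#   resume_nodes ssh $(scontrol show hostnames "$1")
```

Substituting `echo` for `ssh` dry-runs the loop, printing one `<node> slurmd -b` line per node, which is handy for testing the script outside the cluster.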


https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram
```
ResumeFailProgram
The program that will be executed when nodes fail to resume by ResumeTimeout. The argument to the program will be the names of the failed nodes (using Slurm's hostlist expression format). Programs will be killed if they run longer than the largest configured, global or partition, ResumeTimeout or SuspendTimeout.
```

What did you expect to happen?
Comment 3 Davide DelVento 2023-10-06 09:47:46 MDT
Thanks for the quick follow up Marshall.

> ResumeProgram is expected to either reboot the node and restart slurmd, 
> or restart slurmd with the -b flag.

> What did you expect to happen?

Since slurmd was never stopped in these tests (the node was never shut down), we were expecting that it would not be necessary to restart it: slurmctld would simply attempt to communicate with slurmd on the pretend-rebooted node, succeed, and carry on.

If I understand correctly, a restart of slurmd is absolutely necessary even for the tests to succeed. I think this makes the tests a little more invasive, but that's okay.

I recommend correcting the sentence about it on the page https://slurm.schedmd.com/power_save.html, because one can *not* "configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram" as the documentation currently says. Such programs must restart slurmd, which is definitely not "no action".

Thanks!
Comment 4 Marshall Garey 2023-10-06 10:22:52 MDT
(In reply to davide.quantum from comment #3)
> I recommend correcting the sentence about it on the page
> https://slurm.schedmd.com/power_save.html because one can *not* "configure
> Slurm with programs that perform no action as SuspendProgram and
> ResumeProgram" as the documentation currently say. Such programs must
> restart slurmd which is definitely not "no action".
> 
> Thanks!

We can clarify the documentation. However, I disagree that you cannot configure "no action" powersave scripts. You demonstrated that configuring Slurm with powersave scripts that do nothing tests that Slurm goes through the correct steps and will hit the ResumeTimeout and perform the appropriate action. Don't expect the "powersaving" part of the powersave module to work if your scripts do nothing. I think it is a matter of expectations that we can clarify in the documentation.

What do you think?
Comment 5 Davide DelVento 2023-10-06 11:04:01 MDT
I agree with you that it's a matter of expectations.

My expectation was that a "no action" script would be something I could test in a production deployment: it would have allowed me first to verify that I had not configured something causing unexpected behavior, and second to gather some statistics on how much power I could potentially save. I cannot do that with a "no action" script, because instead it slowly brings down the partition I applied it to, for the reason you explained.

I think that as long as that is clarified in the documentation, the current behavior is okay; for the test I intended, the scripts must restart slurmd on the suspended nodes.

I will test this next week. As far as I am concerned, you may close this ticket.

Thanks again for the quick and useful discussion!
Comment 6 Marshall Garey 2023-10-06 11:32:46 MDT
Thanks for your input, Ole. It helps us improve our documentation, which in turn helps everyone else.

I'll keep this ticket open until we get the documentation updated.
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2023-10-06 13:40:54 MDT
(In reply to Marshall Garey from comment #4)
> (In reply to davide.quantum from comment #3)
> > I recommend correcting the sentence about it on the page
> > https://slurm.schedmd.com/power_save.html because one can *not* "configure
> > Slurm with programs that perform no action as SuspendProgram and
> > ResumeProgram" as the documentation currently say. Such programs must
> > restart slurmd which is definitely not "no action".
> > 
> > Thanks!
> 
> We can clarify the documentation. However, I disagree that you cannot
> configure "no action" powersave scripts. You demonstrated that configuring
> Slurm with powersave scripts that do nothing tests that Slurm goes through
> the correct steps and will hit the ResumeTimeout and perform the appropriate
> action. Don't expect the "powersaving" part of the powersave module to work
> if your scripts do nothing. I think it is a matter of expectations that we
> can clarify in the documentation.

It's great if the power save documentation can be clarified!  As I read the current manual, "no action" powersave scripts should do exactly nothing at all, just as a test!  To me "no action" is equivalent to your "do nothing".  

I'm confused why "no action" would ever be useful in connection with suspend and resume.  What's the point of documenting "no action" scripts "to assess the potential impact of power saving mode before enabling it"?  I fail to appreciate this logic.

If we wanted to emulate power saving with the "no action" scripts (without actually powering nodes down), then perhaps the resume script should restart slurmd on the nodes, so that the powersave module would *think* the nodes had actually been rebooted and slurmd restarted? If that is a valid scenario, then I don't understand how the slurm user running the resume script on the controller would be authorized to restart the slurmd service on the nodes.

I look forward to reading your updated documentation and see if "no action" becomes clearer to me and its usefulness becomes evident to readers.

Thanks,
Ole
Comment 8 Davide DelVento 2023-11-08 15:17:41 MST
I still cannot get this to work.

I created a script (and a symlink to it) as follows:

$ cat poweroff-sudo
#!/bin/bash
set -e # make sure slurm sees a failure when there is one
log_file=/somewhere/slurm_power/$(date '+%Y')-$(date '+%m')-$(date '+%d').txt
date >> $log_file 2>&1
echo "Attempting to (fakely for now) $0 the following node(s): $1" >> $log_file 2>&1
cv-power -n $1 status >> $log_file 2>&1
invocation=$(basename $0)
if [ "$invocation" = "poweron-sudo" ]; then
    cv-exec -n $1 'systemctl start slurmd' >> $log_file 2>&1
elif [ "$invocation" = "poweroff-sudo" ]; then
    cv-exec -n $1 'systemctl stop slurmd' >> $log_file 2>&1
else
    echo "Error: Invalid invocation: $0" >> $log_file 2>&1
fi
echo -e "Done\n==============================" >> $log_file 2>&1

I can invoke those scripts passwordless as user slurm with no problems, achieving the desired results, namely seeing the affected node(s) go "down" (tilde suffix on their idle status in sinfo output). These outcomes are almost immediate. Note: cv-power and cv-exec are scripts from our cluster management software; cv-exec runs the single-quoted command on the specified node.

So far, so good.

If I enable the powersave options in slurm with the timeouts as discussed, the power-downs appear to be working fine; however, the power-ups do not. The problematic thing I see in /var/log/slurm/slurmctld.log is entries like

[2023-11-08T14:12:03.058] sched/backfill: _start_job: Started JobId=296089 in compute256 on co49svnode04 
[2023-11-08T14:14:16.685] _slurm_rpc_submit_batch_job: JobId=296090 InitPrio=4294750896 usec=242 
[2023-11-08T14:14:17.062] sched/backfill: _start_job: Started JobId=296090 in compute256 on co49svnode08 
[2023-11-08T14:16:10.199] node co49svnode04 not resumed by ResumeTimeout(240) - marking down and power_save 

The corresponding entries from the log of my own script are 

Wed Nov 8 14:12:03 MST 2023 
Attempting to (fakely for now) /opt/slurm/poweron-sudo the following node(s): co49svnode04 
co49svnode04 : on 
Done 
============================== 
Wed Nov 8 14:14:17 MST 2023 
Attempting to (fakely for now) /opt/slurm/poweron-sudo the following node(s): co49svnode08 
co49svnode08 : on 
Done 
============================== 

And if I have no other nodes in the partition, the affected job will stay PD or CF forever.

Once I get tired and remove the SuspendTime from the partition, the nodes often go back into production on their own (showing there was no problem with slurmd or anything else), or at most after invoking that /opt/slurm/poweron-sudo script once again (as user slurm).


I suspect slurm is expecting something else to happen during a real power-on, and I would like to simulate that so that I can tweak my various knobs to values reasonable for our cluster before actually starting to shut nodes down and up (with the power and time consequences). If that is correct, what else should I do other than stopping and starting slurmd?

By the way, the documentation still claims that "You can also configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram to assess the potential impact of power saving mode before enabling it" (which I am really sold on doing).
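Per the ResumeProgram documentation quoted in comment 2, a fake resume must still give slurmctld a responding slurmd with an updated BootTime, which a plain `systemctl start slurmd` does not provide. One candidate tweak to the poweron branch of the script above — untested, and assuming cv-exec can chain commands as sketched:

```
# poweron-sudo branch (hypothetical): restart slurmd with -b so it
# reports BootTime as "now" and satisfies slurmctld's ResumeTimeout check
cv-exec -n $1 'systemctl stop slurmd; slurmd -b' >> $log_file 2>&1
```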
Comment 10 Marshall Garey 2023-11-08 15:51:42 MST
Davide,

I have not yet updated the documentation. That is why the bug is still open. I'm keeping this bug open until the documentation is done.

This ticket was opened by Ole, and I'm going to reserve it for any specific questions that he has for getting powersave to work for his site. If you have any questions regarding your site, please open a new ticket.

Note that if you do not have a support contract, then we will not be able to help you. I do not see you on our supported users list; if that is a mistake, contact Jesser Arrington or whoever handles your support contracts with us.
Comment 11 Davide DelVento 2023-11-08 19:00:42 MST
Thanks for letting me know about the documentation, Marshall.

The fact that I don't have a support contract (yet) is correct, and I do understand that because of not having support you will not be helping me directly with this ticket (or any other problem).

However, I thought (and please correct me if I am mistaken here) that you might still be interested in my reports, if well verified and documented as I think this one is, because they can help you understand the issue and/or clarify the documentation. Of course you will prioritize solving issues based on requests from paying customers.

If instead you would prefer that I refrain from chiming in on this and other tickets, do let me know, and I'll avoid creating unnecessary noise here and discuss on the mailing list only (in which case, I apologize for the confusion).

Speaking of "well documented", I forgot to mention that the behavior described in my previous update occurred on slurm v23.02.6.

Thanks!
Comment 19 Ben Roberts 2023-11-15 08:55:43 MST
Hi Ole,

We have checked in a fix that removes the reference to running "no action" scripts to test the power save functionality.  With the descriptions for the SuspendProgram and ResumeProgram I think it is clear that these scripts need to perform some action.  This change will be reflected on the website with the release of 23.02.7 or 23.11.0, whichever comes first.  If you're interested you can see the commit here:
https://github.com/SchedMD/slurm/commit/39087aaa73c44d0f29a50cf8849548be140ed7c9

I'll close this ticket, but let us know if there's anything else we can do to help.

Thanks,
Ben
Comment 20 Ole.H.Nielsen@fysik.dtu.dk 2023-11-16 01:14:46 MST
Hi Ben,

(In reply to Ben Roberts from comment #19)
> We have checked in a fix that removes the reference to running "no action"
> scripts to test the power save functionality.  With the descriptions for the
> SuspendProgram and ResumeProgram I think it is clear that these scripts need
> to perform some action.  This change will be reflected on the website with
> the release of 23.02.7 or 23.11.0, whichever comes first.  If you're
> interested you can see the commit here:
> https://github.com/SchedMD/slurm/commit/39087aaa73c44d0f29a50cf8849548be140ed7c9

Thanks a lot for updating and clarifying the documentation, it looks good now.

Best regards,
Ole