Ticket 15185 - Dynamically update SuspendExcNodes and SuspendExcParts during runtime
Summary: Dynamically update SuspendExcNodes and SuspendExcParts during runtime
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 21.08.8
Hardware: Linux Linux
Importance: --- 5 - Enhancement
Assignee: Scott Hilton
 
Reported: 2022-10-16 09:23 MDT by Chrysovalantis Paschoulas
Modified: 2023-02-02 02:08 MST
Site: Jülich
Version Fixed: 23.02.0rc1
Target Release: 23.11
DevPrio: 1 - Paid


Description Chrysovalantis Paschoulas 2022-10-16 09:23:19 MDT
I have already opened Bug 15184, where I presented our hard requirement to be able to skip powering down nodes that are drained. Here I would like to discuss some other ideas I have for improving the power saving mechanism as a whole.

I think it would be really nice if SuspendExcNodes and SuspendExcParts could be updated at runtime, and if those updates survived a restart of slurmctld.

The interface could be like:
```
scontrol suspendexcnodes=+/-<nodelist>
scontrol suspendexcparts=+/-<partitions>
```

What is configured in SuspendExcNodes and SuspendExcParts should be static and unchangeable, but we could add nodes and partitions dynamically (and be able to remove them later). This data should also be stored in state files so that it can be restored after restarting slurmctld.
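
The semantics requested above can be sketched as follows. This is only an illustrative model and not how slurmctld implements it: the class name `SuspendExclusions`, its methods, and the JSON state file format are all assumptions made for the example.

```python
import json
from pathlib import Path


class SuspendExclusions:
    """Hypothetical model: static entries from slurm.conf plus runtime additions."""

    def __init__(self, static_nodes):
        # Entries from the configuration file; never modified at runtime.
        self.static_nodes = frozenset(static_nodes)
        # Entries added (and possibly removed again) at runtime.
        self.dynamic_nodes = set()

    def add(self, nodes):
        self.dynamic_nodes.update(nodes)

    def remove(self, nodes):
        # Only runtime additions can be removed; static entries stay in place.
        self.dynamic_nodes.difference_update(nodes)

    def effective(self):
        # Nodes excluded from suspension: static entries plus runtime additions.
        return self.static_nodes | self.dynamic_nodes

    def save(self, path):
        # Persist only the dynamic part; the static part comes from the config.
        Path(path).write_text(json.dumps(sorted(self.dynamic_nodes)))

    def load(self, path):
        # Restore runtime additions after a restart, if a state file exists.
        state = Path(path)
        if state.exists():
            self.dynamic_nodes = set(json.loads(state.read_text()))
```

Storing only the dynamic part keeps the configuration file authoritative for the static entries, so a changed slurm.conf and a restored state file cannot contradict each other.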

What do you think?
Comment 1 Bas van der Vlies 2022-10-18 05:17:53 MDT
This is a really nice idea: dynamically updating the exclude configuration for the power saving mechanism.
Comment 6 Scott Hilton 2023-01-16 16:05:13 MST
SuspendExcNodes has a peculiar kind of nodelist with the optional ":" separator. This is used to specify groups of nodes from which a certain number should stay online.
See: https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendExcNodes
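
For illustration, a value such as `nid[10-20]:4,nid[80-90]:2` asks for 4 nodes from the first set and 2 from the second to stay online. The sketch below shows one way such a value could be split into (nodelist, count) pairs. It is not Slurm's parser: the function name `parse_suspend_exc_nodes` is hypothetical, and the code assumes that commas inside brackets belong to the node range rather than separating entries.

```python
def parse_suspend_exc_nodes(value):
    """Split a SuspendExcNodes-style value into (nodelist, count) pairs.

    count is None when an entry has no ":" suffix.
    """
    entries, depth, start = [], 0, 0
    for i, ch in enumerate(value):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            # A top-level comma separates two entries.
            entries.append(value[start:i])
            start = i + 1
    entries.append(value[start:])

    result = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        # An optional ":<count>" suffix gives the number of nodes to keep online.
        nodes, sep, count = entry.partition(":")
        result.append((nodes, int(count) if sep else None))
    return result
```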

Is it important to you to be able to add and remove from lists with this special ":" syntax? If so what is specifically needed for your workflow?

-Scott
Comment 7 Chrysovalantis Paschoulas 2023-01-17 01:25:32 MST
(In reply to Scott Hilton from comment #6)
> SuspendExcNodes has a peculiar kind of nodelist with the optional ":"
> separator. This is used to specify groups of nodes from which a certain
> number should stay online.
> See: https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendExcNodes
> 
> Is it important to you to be able to add and remove from lists with this
> special ":" syntax? If so what is specifically needed for your workflow?
> 
> -Scott

Hi Scott!

No, for our case I would say that the ":" feature is not needed. As far as I can imagine, we will only need to exclude specific nodes from suspension, e.g. because we want to use them for various purposes (reserving them for a course, doing some tests on them, running the testsuite, etc.) or to keep them online for maintenance, HW work, etc.

Cheers,
Valantis
Comment 13 Scott Hilton 2023-02-01 14:59:07 MST
Valantis,

We have completed this feature and it should be part of release 23.02. See commits fc5ec8c83f - 77c1c7d7ae.


-Scott
Comment 14 Chrysovalantis Paschoulas 2023-02-02 02:08:16 MST
(In reply to Scott Hilton from comment #13)
> Valantis,
> 
> We have completed this feature and it should be part of release 23.02. See
> commits fc5ec8c83f - 77c1c7d7ae.
> 
> 
> -Scott

Hi Scott, that's great!

I see that we will be able to dynamically update the excluded states too :)

Thank you very much!

-Valantis