Summary: | power_save mode: do not suspend/resume offline nodes | |
---|---|---|---
Product: | Slurm | Reporter: | Chrysovalantis Paschoulas <c.paschoulas>
Component: | Other | Assignee: | Ben Glines <ben.glines>
Status: | RESOLVED INFOGIVEN | QA Contact: |
Severity: | 3 - Medium Impact | |
Priority: | --- | |
Version: | 21.08.8 | |
Hardware: | Linux | |
OS: | Linux | |
Site: | Jülich | Alineos Sites: | ---
Atos/Eviden Sites: | --- | Confidential Site: | ---
Coreweave sites: | --- | Cray Sites: | ---
DS9 clusters: | --- | HPCnow Sites: | ---
HPE Sites: | --- | IBM Sites: | ---
NOAA Site: | --- | OCF Sites: | ---
Recursion Pharma Sites: | --- | SFW Sites: | ---
SNIC sites: | --- | Linux Distro: | ---
Machine Name: | | CLE Version: |
Version Fixed: | | Target Release: | ---
DevPrio: | --- | Emory-Cloud Sites: | ---
Description
Chrysovalantis Paschoulas
2022-09-19 04:08:52 MDT
Hello. First of all, I am reducing this to a sev 3 based on our definitions here: https://www.schedmd.com/support.php. Please review those guidelines.

As for resuming/suspending drained/down nodes, I am not seeing all the behavior you are describing. Slurm won't schedule jobs on a drained/down node, so it should never try to resume one. In my testing, I have not been able to get Slurm to try to resume a drained/down node. A job requesting such a node will remain pending until the node is manually updated with scontrol to the idle state. Can you manually reproduce this on your system?

It is expected that Slurm will suspend drained/down nodes. Once a node is drained/down, though, Slurm should not try to resume it again, and you should be able to perform any maintenance needed. I would also suggest considering SuspendExcNodes or SuspendExcParts for your needs.

As for your suggestion to configure power saving features for nodes based on their state (i.e. not suspending drained/down nodes) or based on what reservations they are in, I am not sure these are things we would want to implement. I'm discussing this internally and I'll get back to you on it.

Again, Slurm should not currently be resuming drained/down nodes, so hopefully we can track down the issue you're seeing. Please let me know if you can manually reproduce this, and if so, I would appreciate any steps required to reproduce it so I can work on this issue on my end.

OK, I can confirm that no drained/down node is ever used for a job allocation. A colleague had enabled `idle_on_node_suspend`, and I think this was the culprit for some unexpected behavior; he will now repeat his tests and I will report back to you soon about that.

But we found the bug! We have a small test cluster where node `jwtc10n002` has state `DOWN+DRAIN+POWERED_DOWN` (or in some tests `DOWN+POWERED_DOWN(+NOT_RESPONDING)`).
That node belongs to the batch partition:

```
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
batch*    up    infinite  0/5/1/6        jwtc10n[000-002,048-050]
```

Now, when we run a job that requests all nodes before and after that node in the batch partition, e.g.:

```
srun -A root -p batch -w jwtc10n[000-001,048-050] bash -c 'sleep 10; hostname'
```

we get the following logs:

```
2022-09-21_16-36-03 resume_program[9521] resuming jwtc10n[000-002]
2022-09-21_16-36-23 resume_program[10197] resuming jwtc10n048
2022-09-21_16-36-43 resume_program[10703] resuming jwtc10n049
2022-09-21_16-37-03 resume_program[11223] resuming jwtc10n050
...
```

In our resume_program we just print the above line at the beginning, with the arg/nodelist that slurmctld passes to the script. As you can see, node `jwtc10n002` is in the first nodelist, even though that node was NOT requested in the allocation!

I am also concerned about the resume rate. We currently have it set to 3 nodes/min, but the following nodes, e.g. jwtc10n048, were resumed before the first ones had actually finished their resume_program! In this case the resume program exits with a non-zero code, but the job runs successfully on the healthy nodes that were requested. That means the return code of the resume program is ignored. Is this desired?

Here is our related config:

```
ResumeFailProgram=/etc/slurm/resume_fail_program.sh
ResumeProgram=/etc/slurm/resume_program.sh
ResumeRate=3
ResumeTimeout=300
SuspendExcNodes=jwtc10n002,jwtc10n096
#SuspendExcParts=
SuspendProgram=/etc/slurm/suspend_program.sh
SuspendRate=3
SuspendTime=1800
SuspendTimeout=300
```

Ha! I just realized that jwtc10n002 shouldn't even be suspended, it's in the SuspendExcNodes list! I am not sure if we added that config after the node was suspended. Maybe Slurm tries to resume it because it is in the SuspendExcNodes list? No, after removing that config it still tries to resume that node.
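For reference, a minimal ResumeProgram sketch along the lines described above: it only logs the nodelist argument that slurmctld passes in, matching the quoted log format. The log path is an illustrative placeholder, not the site's actual script.

```shell
#!/bin/bash
# Minimal ResumeProgram sketch: slurmctld invokes this with the nodelist
# to power on as $1 (e.g. "jwtc10n[000-002]").
LOGFILE=${LOGFILE:-/tmp/resume_program.log}   # placeholder path

log_resume() {
    # Record a timestamp, our PID, and the nodelist slurmctld passed in.
    printf '%s resume_program[%s] resuming %s\n' \
        "$(date +%Y-%m-%d_%H-%M-%S)" "$$" "$1" >> "$LOGFILE"
}

log_resume "$1"
# Site-specific power-on logic would follow here. As observed in this
# ticket, a non-zero exit from this script does not stop the job; the
# healthy requested nodes still run it.
```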
Another weird behavior we noticed: when the suspend program exits successfully, the nodes are not set to POWERED_DOWN immediately. Instead they stay in the POWERING_DOWN state until SuspendTimeout is reached (or a bit earlier).

Anyway, I still believe that suspending down/drained nodes shouldn't happen, since admins may be working on them and there is no nice interface (e.g. with scontrol) for modifying the runtime config for SuspendExcNodes and SuspendExcParts. It would also be nice to have `SuspendExcNodeStates` in slurm.conf ;)

Even after removing `idle_on_node_suspend`, my colleague reported that they get the same behavior. I guess they have the same/similar problem as the bug I reported above. They request a job with nodes `n[01-06,08-11,13-20,31-34,38-45]` and they see that the resume program runs for nodes `n[01-30]`. So on two different clusters we could reproduce that the resume program receives wrong nodelists. Please fix it :)

I am able to reproduce this issue on 21.08, but not on the latest 22.05 version. Things look to be working just fine on 22.05 as far as I can tell right now. I'm discussing this with other engineers internally to see when this issue may have been fixed.

Could you tell me what version your colleague was running when they saw a similar problem?

(In reply to Ben Glines from comment #4)
> I am able to reproduce this issue on 21.08, but not on the latest 22.05
> version. Looks like things are working just fine on 22.05 as far as I can
> tell right now. I'm discussing this with other engineers internally to see
> when this issue may have been fixed.
>
> Could you tell me what version your colleague was running when they saw a
> similar problem?

In both cases we use `21.08.8-2`. We would really appreciate it if you could provide us a patch for 21.08.8-2; we are not able to upgrade to 22.05 right now because of some third-party deps.
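To make the mismatch reported above easy to spot, here is a self-contained sketch that diffs a requested node set against the set the ResumeProgram actually received. Real tooling would expand Slurm's bracket syntax with `scontrol show hostnames`; the helper below handles only a single simple range and is purely illustrative, with made-up example node sets.

```shell
#!/bin/bash
# Expand a single-range nodelist like "n[01-06]" into "n01 n02 ... n06".
# (A real setup would run: scontrol show hostnames 'n[01-06,08-11]',
# which also handles comma-separated multi-range lists.)
expand_nodes() {
    local list=$1
    if [[ $list =~ ^([^[]+)\[([0-9]+)-([0-9]+)\]$ ]]; then
        # eval lets bash perform the zero-padded sequence expansion
        eval echo "${BASH_REMATCH[1]}{${BASH_REMATCH[2]}..${BASH_REMATCH[3]}}"
    else
        echo "$list"
    fi
}

requested=$(expand_nodes 'n[01-06]')
resumed=$(expand_nodes 'n[01-08]')   # what slurmctld handed to ResumeProgram

# Print any node that was resumed but never requested.
for node in $resumed; do
    [[ " $requested " == *" $node "* ]] || echo "unexpected resume: $node"
done
```

With the example sets above, the loop flags n07 and n08 as resumed without being requested, which is the shape of the symptom seen on both clusters.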
The fix for this is in the following commit: https://github.com/SchedMD/slurm/commit/fd1ea63a6d43407ea5c64bbce36b81ec96128025. The first official tagged release that contains this commit is the first 22.05 release. We don't actively backport fixes, but you're free to apply the changes in this commit as a patch at your own risk.

I'm going to close this one out now. Feel free to reopen if you have any further questions. Your requests for enhancement would also be best tracked in a separate bug if you are still interested in them.
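For anyone backporting at their own risk, the usual route is to cherry-pick that commit onto a 21.08 source tree. The sketch below only prints the commands (a dry run), since remote setup and build steps vary per site; the `slurm-21-08-8-2` tag name is assumed from the version reported in this ticket.

```shell
#!/bin/bash
# Dry run: print the steps for backporting the fix onto a 21.08 checkout.
# Remove the echo wrappers to actually run them, at your own risk.
COMMIT=fd1ea63a6d43407ea5c64bbce36b81ec96128025   # commit linked above

step1="git clone https://github.com/SchedMD/slurm.git"
step2="cd slurm && git checkout slurm-21-08-8-2"   # assumed tag name
step3="git cherry-pick $COMMIT"

echo "$step1"
echo "$step2"
echo "$step3"
```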