Ticket 9734 - Jobs sent to higher weight idle node instead of starting lower weight node
Summary: Jobs sent to higher weight idle node instead of starting lower weight node
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.02.4
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Broderick Gardner
QA Contact:
URL:
Duplicates: 10195
Depends on:
Blocks:
 
Reported: 2020-09-03 09:06 MDT by Brian Andrus
Modified: 2022-11-04 14:45 MDT
CC List: 5 users

See Also:
Site: Lam


Description Brian Andrus 2020-09-03 09:06:26 MDT
I have several nodes in the same partition. Node A has a weight of 500; all others have a weight of 1.
These are cloud-based nodes, so they get deallocated (powered_off) when not in use.
Node A has the higher weight because it has a node-locked license: we want users who need it to request it by name, and we do not want it in use unless all the other nodes are busy.

I have seen that if Node A is idle and the other nodes are powered down, a generic job gets assigned to Node A rather than resuming a lower-weight powered-down node.

This kind of defeats the purpose of weights in a cloud environment.
Comment 1 Broderick Gardner 2020-09-23 09:03:45 MDT
Yes, as currently designed, nodes in a powered-down state are placed in a lower tier altogether, which makes weights useless for cloud nodes. I'm looking into what the scope of an enhancement would be in this regard.
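
For illustration only, the behavior described above can be modeled as a two-level sort. This is a toy Python sketch, not Slurm's actual selection code; the `Node` class, both function names, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    weight: int        # lower weight is preferred
    powered_up: bool   # False models a powered-down cloud node


def pick_node_tiered(nodes):
    """Toy model of the reported behavior: powered-up nodes form a
    preferred tier, and weight only orders nodes within a tier."""
    return min(nodes, key=lambda n: (not n.powered_up, n.weight))


def pick_node_weight_only(nodes):
    """Toy model of the requested behavior: weight alone decides,
    regardless of power state."""
    return min(nodes, key=lambda n: n.weight)


nodes = [
    Node("nodeA", weight=500, powered_up=True),   # idle, node-locked license
    Node("node1", weight=1, powered_up=False),    # powered down
    Node("node2", weight=1, powered_up=False),    # powered down
]

print(pick_node_tiered(nodes).name)       # nodeA
print(pick_node_weight_only(nodes).name)  # node1
```

In the tiered model the idle weight-500 node wins over the powered-down weight-1 nodes, which matches what the description reports.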

Thanks
Comment 4 Nate Rini 2020-11-11 10:12:00 MST
*** Ticket 10195 has been marked as a duplicate of this ticket. ***
Comment 6 Nick Ihli 2021-11-12 15:46:43 MST
Brian,

I am investigating the use cases around this requirement. Can you provide more details on why you would prefer nodes that are powered off over nodes that are already up?

Is this a situation where the powered-on nodes are more expensive, so it would be more cost effective to let those power down and use less expensive nodes instead?

Any further details are greatly appreciated.

Thanks,
Nick
Comment 7 Brian Andrus 2021-11-12 15:49:26 MST
Nick,

Simple:
The nodes have a configured feature that is not available on the other nodes.
In my case it is a node-locked license (the vendor does not provide floating licenses)


Comment 8 Nick Ihli 2021-11-12 18:19:01 MST
How is the user asking for that special feature/license? If the feature is configured only on the nodes that have it, then only those nodes should be used for jobs that request it. If they are powered down, then Slurm would power those nodes up and use them instead of powered-on nodes without the feature. Am I tracking this properly or missing anything?
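
As a rough illustration of the point about features, the toy Python sketch below filters candidate nodes by a requested feature before any power state or weight is looked at. It is not Slurm code; the `eligible_nodes` function, the `locked_lic` feature name, and the node names are hypothetical.

```python
def eligible_nodes(nodes, constraint=None):
    """Return the names of nodes that satisfy a single-feature constraint.
    With no constraint, every node is eligible."""
    return [
        n["name"]
        for n in nodes
        if constraint is None or constraint in n["features"]
    ]


nodes = [
    {"name": "nodeA", "features": {"locked_lic"}, "powered_up": False},
    {"name": "node1", "features": set(), "powered_up": True},
    {"name": "node2", "features": set(), "powered_up": True},
]

# A job that requests the feature can only land on nodeA, even though
# nodeA is powered down and the other nodes are up.
print(eligible_nodes(nodes, "locked_lic"))  # ['nodeA']

# A generic job is still eligible for every node, including nodeA.
print(eligible_nodes(nodes))  # ['nodeA', 'node1', 'node2']
```

The second call shows why the constraint alone does not keep generic jobs off the licensed node: for those jobs the choice still comes down to power state and weight.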
Comment 9 Brian Andrus 2021-11-15 10:29:46 MST
This had been sitting so long, I hadn't checked things on updates and such.

I have tested and using constraint works as expected and is appropriate for our use case.

The 'bug' still exists in that node weights are not considered for nodes that are powered down. I would suggest a note in the documentation stating that node weights are first considered only among currently available nodes. I am not sure whether it starts up the heaviest weight node first if they are all powered down, but I would expect that to be the case.




Comment 10 Nick Ihli 2021-11-16 11:33:04 MST
Great suggestion on the documentation. I will get that added.

We do see some use cases where it makes sense for node weights to be honored for powered-down nodes. Thanks for your insight in further clarifying your use case.
Comment 11 Jason Booth 2021-11-17 17:37:36 MST
> I have tested and using constraint works as expected and is appropriate for our use case.

Thank you for the feedback. As Nick stated, we will add a note about this.


> The 'bug' still exists in that the node weights are not considered if they are powered down. I would suggest a note in the documentation that states node weights are first considered only among currently available nodes. Not sure if it does start up the heaviest weight node first if they are all powered down, but would expect that to be the case.

We have looked into this and do not consider it a bug; we do, however, consider it a feature improvement. We will take your feedback on node priorities into consideration, but at this time we have no active plans to address it.
Comment 12 Skyler Malinowski 2022-03-04 13:07:18 MST
*** Ticket 13566 has been marked as a duplicate of this ticket. ***