I have several nodes in the same partition. Node A has a weight of 500; all others have a weight of 1. These are cloud-based nodes, so they get deallocated (powered_off) when not in use. Node A has the higher weight because it has a node-locked license: we want users who need it to request it by name, and we don't want it in use unless the others are all busy. I have seen that if Node A is idle and the other nodes are powered down, a generic job is assigned to Node A rather than resuming one of the lower-weight (preferred) powered-down nodes. This kind of defeats the purpose of weights in a cloud environment.
Yes, as currently designed, nodes in a powered-down state are placed in a lower tier altogether, which makes weights ineffective for cloud nodes. I'm looking into what the scope of an enhancement in this regard would be. Thanks
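To make the tiering concrete, here is a small Python model of the behavior described above. It is a sketch under stated assumptions, not Slurm's scheduler code. The `Node` class, the two functions, and the node names are hypothetical. It assumes a single-node job where the only factors are power state and `Weight`, where a lower `Weight` is preferred, as in slurm.conf. It contrasts the current tiered ordering with a hypothetical weight-first ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str           # hypothetical node name
    weight: int         # models slurm.conf Weight: lower values are preferred
    powered_down: bool  # True while the cloud node is deallocated


def pick_tiered(nodes):
    """Simplified model of the current behavior: powered-up nodes form a
    higher tier than powered-down nodes, and Weight only orders nodes within
    a tier. The node name is a final tie-breaker."""
    return min(nodes, key=lambda n: (n.powered_down, n.weight, n.name))


def pick_weight_first(nodes):
    """Hypothetical enhancement: honor Weight first and use power state only
    to break ties between equally weighted nodes."""
    return min(nodes, key=lambda n: (n.weight, n.powered_down, n.name))


# Scenario from this ticket: the licensed node is idle and up, the rest are down.
nodes = [
    Node("nodeA", 500, powered_down=False),
    Node("node1", 1, powered_down=True),
    Node("node2", 1, powered_down=True),
]

print(pick_tiered(nodes).name)        # nodeA
print(pick_weight_first(nodes).name)  # node1
```

In the tiered model, a weight of 500 versus 1 never matters while nodeA is the only powered-up node. Weights only decide between nodes in the same power state.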
*** Ticket 10195 has been marked as a duplicate of this ticket. ***
Brian, I am investigating the use cases around this requirement. Can you provide more details on why you would prefer nodes that are powered off over those that are already up? Is this a situation where the powered-on nodes are more expensive, so it would be more cost-effective to let them power down and use less expensive nodes instead? Any further details are greatly appreciated. Thanks, Nick
Nick,
Simple: the nodes have a configured feature that is not available on the other nodes. In my case it is a node-locked license (the vendor does not provide floating licenses).
Brian Andrus - HPC Systems
brian.andrus@lamresearch.com
From: bugs@schedmd.com Sent: Friday, November 12, 2021 2:47 PM To: Andrus, Brian <Brian.Andrus@lamresearch.com> Subject: [Bug 9734] Jobs sent to higher weight idle node instead of starting lower weight node
How is the user asking for that special feature/license? If only the nodes with the feature are configured with it, then only those nodes should be used. If they are powered down, Slurm would use those nodes (powering them up) instead of powered-on nodes without the feature. Am I tracking this properly, or am I missing anything?
This had been sitting so long that I hadn't rechecked it against updates. I have tested, and using a constraint works as expected and is appropriate for our use case. The 'bug' still exists in that node weights are not considered for nodes that are powered down. I would suggest a note in the documentation stating that node weights are first considered only among currently available nodes. I'm not sure whether it starts up the heaviest weight node first if they are all powered down, but I would expect that to be the case.
Brian Andrus - HPC Systems
From: bugs@schedmd.com Sent: Friday, November 12, 2021 5:19 PM To: Andrus, Brian
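Nick's constraint-based suggestion can be illustrated with a minimal Python model. It is a sketch, not Slurm's implementation. The `Node` class, `eligible`, `pick`, the node names, and the feature name `nodelocked_lic` are all hypothetical. In a real cluster the feature would be set with `Features=` on the node's `NodeName` line in slurm.conf and requested with `sbatch --constraint`. The model assumes a single-node job, powered-up nodes ranked ahead of powered-down ones, and then `Weight` (lower preferred).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str                          # hypothetical node name
    weight: int                        # lower Weight is preferred
    powered_down: bool                 # True while the cloud node is deallocated
    features: frozenset = frozenset()  # models Features= in slurm.conf


def eligible(nodes, constraint=None):
    """A job submitted with --constraint=<feature> may only run on nodes that
    advertise that feature; without a constraint every node qualifies."""
    if constraint is None:
        return list(nodes)
    return [n for n in nodes if constraint in n.features]


def pick(nodes, constraint=None):
    """Pick one node for a single-node job, or None if no node qualifies.
    Simplified ordering: powered-up nodes before powered-down ones, then
    Weight, then name as a tie-breaker."""
    candidates = eligible(nodes, constraint)
    if not candidates:
        return None
    return min(candidates, key=lambda n: (n.powered_down, n.weight, n.name))


# "nodelocked_lic" is a hypothetical feature name for the licensed node.
nodes = [
    Node("nodeA", 500, powered_down=True, features=frozenset({"nodelocked_lic"})),
    Node("node1", 1, powered_down=False),
    Node("node2", 1, powered_down=True),
]

print(pick(nodes, "nodelocked_lic").name)  # nodeA: only it qualifies, so it is powered up
print(pick(nodes).name)                    # node1: already up, lowest weight
```

The constraint only steers the licensed jobs. In this model, an unconstrained job would still land on nodeA if it were the only powered-up node, which is the behavior reported at the start of this ticket.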
Great suggestion on the documentation; I will get that added. We do see some use cases where it would make sense for node weights to be honored for powered-down nodes. Thanks for your insight in further clarifying your use case.
> I have tested, and using a constraint works as expected and is appropriate for our use case.

Thank you for the feedback. As Nick stated, we will add a note about this to the documentation.

> The 'bug' still exists in that the node weights are not considered if they are powered down. I would suggest a note in the documentation that states node weights are first considered only among currently available nodes. Not sure if it does start up the heaviest weight node first if they are all powered down, but would expect that to be the case.

We have looked into this and do not consider it a bug; however, we do consider it a feature improvement. We will take your feedback regarding node priorities into consideration, but we do not currently have any active plans to address it.
*** Ticket 13566 has been marked as a duplicate of this ticket. ***