Bug 3881 - Floating Partitions and Pending Jobs
Summary: Floating Partitions and Pending Jobs
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.02.4
Hardware: Linux
Importance: --- 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-06-08 11:52 MDT by Stephen Fralich
Modified: 2020-04-28 15:21 MDT
CC List: 3 users

See Also:
Site: University of Washington
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Stephen Fralich 2017-06-08 11:52:27 MDT
In bug 3871 we had the following exchange ("Me" => SchedMD):
==
"Despite there being many nodes free on our cluster, Slurm is not starting preemptable jobs in our excess capacity"

This is by design. If there are _any_ jobs pending (regardless of the reason for the job still pending) in a partition with a higher Priority, no jobs from a lower Priority will be launched on nodes that are shared in common.

This is not likely to change soon.
==
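
For concreteness, the quoted behavior applies to overlapping-partition layouts along these lines. This is only a sketch; the partition names, node range, and values below are hypothetical, not taken from our configuration:

# Hypothetical slurm.conf excerpt: a floating "owner" partition overlapping a
# lower-tier partition on the same nodes. With PreemptType=preempt/partition_prio,
# a single job pending in owner_float is enough to keep scavenge jobs from
# starting on any of the shared nodes.
PartitionName=owner_float Nodes=n[001-100] PriorityTier=10 PreemptMode=off
PartitionName=scavenge    Nodes=n[001-100] PriorityTier=1  PreemptMode=requeue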


Even though it's not likely to change soon, I'd like to enter this as the UW's vote for this behavior to change, or for there to be some option for different behavior. I'm going to switch our cluster to use individual partitions for each customer group, but being able to use floating partitions would be a lot better.

Floating partitions are a huge asset to condo-style clusters like the one we operate. They allow us to manage the entire cluster as a pool of resources rather than splitting it up into a bunch of partitions. We have numerous customers who own only one to a few nodes. Floating partitions get us out of the business of having to worry about specific nodes and their statuses, since everyone has a large number of nodes available to them. We have an agreement with our customers that the capacity they purchased will always be available to them. In theory, floating partitions should also cut down on how much preemption occurs and let the scheduler preempt the least-impacting job rather than a specific one based on node usage. That's also a really good thing.

It's critical, though, that there's some way we can use idle cycles. At our site, at least, only about 50% of the cycles are used by the node owners. Another 30% are used by UW researchers, as idle capacity. Usage of idle capacity is critical; it's one of the reasons central HPC exists. While I realize that sites like ours probably don't make up much of your customer base, the scheduler policy we need is based on the organizational reality here, and I don't see that changing soon either.
Comment 1 Tim Wickberg 2017-06-08 15:37:39 MDT
I can classify this as an enhancement request if you'd like; as I'd indicated, though, this is not a trivial issue but rather a side effect of the architecture of Slurm's preemption and prioritization model.

Unfortunately, I don't have a great workaround for the combination of preemption and floating partitions - as the floating partitions "cast a shadow" across all nodes in the lower priority partition, any pending job in the floating partition would cause this issue.

The only suggestions I can make at the moment are to allow the floating partitions to only "float a little bit" over a reduced node count, or to look into creating a short-wall-time partition with the same PriorityTier as the condo partitions but with a reduced PriorityJobFactor to ensure the owner jobs take precedence.

- Tim
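
To make the second suggestion concrete, here is a minimal sketch; the partition names, node ranges, times, and factor values are illustrative only, not a verified recipe. Because "short" shares the condo partitions' PriorityTier, its pending jobs no longer shadow the owner nodes, while its lower PriorityJobFactor keeps owner jobs ahead in the priority order:

# Hypothetical condo (owner) partition and a short-wall-time partition at the
# same PriorityTier; the reduced PriorityJobFactor lets owner jobs sort first.
PartitionName=condo_a Nodes=n[001-004] PriorityTier=5 PriorityJobFactor=100 MaxTime=14-00:00:00
PartitionName=short   Nodes=n[001-100] PriorityTier=5 PriorityJobFactor=1   MaxTime=04:00:00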
Comment 2 Stephen Fralich 2017-06-08 16:28:49 MDT
Yes, an enhancement for sure, but it'll never happen if we don't ask, so this is us asking. I didn't know what the proper avenue was for making such a request.

Yeah, I'm probably going to have one floating partition where the small-node customers' nodes will all be pooled, and then separate fixed partitions for large-node customers or customers that queue a lot of work. I'll have to read up on PriorityJobFactor and then I'll have to think it over. We'll see how it goes.

Thanks for the reply and suggestions.
Comment 3 Tim Wickberg 2017-06-08 16:34:06 MDT
(In reply to Stephen Fralich from comment #2)
> Yes, an enhancement for sure, but it'll never happen if we don't ask, so
> this is us asking. I didn't know what the proper avenue was for making such
> a request.

Tagging appropriately.

> Yeah, I'm probably going to have one floating partition where small node
> customers nodes will all be pooled and then separate fixed partitions for
> large node customers or customers that queue a lot of work. I'll have to
> read up on PriorityJobFactor and then I'll have to think it over. We'll see
> how it goes.
>
> Thanks for the reply and suggestions.

Certainly happy to help.