Bug 13410 - Inconsistent Preemption Behavior
Summary: Inconsistent Preemption Behavior
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.8
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Duplicates: 9375
Depends on:
Blocks:
 
Reported: 2022-02-10 17:36 MST by Alex Mamach
Modified: 2022-06-06 10:15 MDT
CC List: 2 users

See Also:
Site: Northwestern
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (qos preemption) (18.45 KB, text/plain)
2022-02-10 17:36 MST, Alex Mamach
slurm.conf (partition_prio preemption) (18.46 KB, text/plain)
2022-02-10 17:36 MST, Alex Mamach

Description Alex Mamach 2022-02-10 17:36:13 MST
Created attachment 23425 [details]
slurm.conf (qos preemption)

Hi,

I've attempted to implement job preemption on our cluster but have run into some difficulties when using PreemptMode=SUSPEND,GANG, and was hoping you could assist me.

I've tried two different configurations (both attached), but ran into different issues with each. For both tests, I used the following parameters:

SchedulerParameters=preempt_strict_order
PreemptMode=SUSPEND,GANG
PreemptExemptTime=00:03:00

For my first attempt, I set PreemptType=preempt/qos. When submitting test jobs on the same partition, jobs with a higher QOS will share resources with the lower QOS job as expected. However, when submitting on different partitions with shared nodes, I see the same behavior, even though the documentation indicates that:

"If PreemptType=preempt/qos is configured and if the preempted job(s) and the preemptor job from are on the same partition, then they will share resources with the Gang scheduler (time-slicing). If not (i.e. if the preemptees and preemptor are on different partitions) then the preempted jobs will remain suspended until the preemptor ends."

I also attempted to use PreemptType=preempt/partition_prio, but found that the job submitted on the higher priority tier immediately pre-empted jobs from the lower priority tier, ignoring PreemptExemptTime.

I may have misunderstood the documentation, but my goal was to have jobs submitted on different partitions suspend jobs submitted with a lower QOS/priority tier, but only after PreemptExemptTime had elapsed. When using qos preemption, the lower-QOS jobs weren't suspended and the jobs ended up running concurrently; when using partition_prio preemption, the lower-priority-tier jobs were suspended immediately instead of first waiting out PreemptExemptTime.
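
For reference, an abridged sketch of how the two attached configs differ (summarized from memory, not the full files; the partition lines are illustrative):

# First attempt: qos preemption
PreemptType=preempt/qos

# Second attempt: partition_prio preemption
PreemptType=preempt/partition_prio
PartitionName=normal Default=YES PriorityTier=1
PartitionName=long   Default=NO  PriorityTier=2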

Could you help me understand what I'm doing wrong in both of these configurations?

Thank you!

Alex
Comment 1 Alex Mamach 2022-02-10 17:36:56 MST
Created attachment 23426 [details]
slurm.conf (partition_prio preemption)
Comment 2 Ben Roberts 2022-02-14 14:30:17 MST
Hi Alex,

It looks like I can reproduce the behaviors you are seeing.  I've begun looking into them but I need to do some more investigation.  Let me keep looking into this and get back to you with more details.

Thanks,
Ben
Comment 4 Ben Roberts 2022-02-15 20:13:47 MST
Hi Alex,

I looked at this some more today and have some additional information for you. First I'll talk about the scenario you described with QOS preemption, where the jobs run concurrently even though they are in different partitions. I thought I had reproduced that behavior by testing with sleep jobs: a 2 minute sleep job in the preemptible QOS and a 1 minute sleep for the preemptor. I saw that the preempted job only ran for a minute after it resumed from being suspended, which made it look like it had been running concurrently. That was just a side effect of using sleep to test this; the job was suspended, but suspension doesn't pause the sleep countdown. I did some more testing with debug logging and I don't see the gang scheduler doing any time slicing, and when I tested with a job that did actual work, the job took the amount of time I would expect after resuming. So I can't actually reproduce the behavior you're describing. Can you go into more detail about the behavior you're seeing?

For the issue with PreemptExemptTime not being honored, I discussed this with my colleagues and it is by design. When a job is going to be requeued or cancelled, it makes sense to guarantee it at least X minutes of run time first. When you are doing gang scheduling, it makes less sense to reserve a window for the job to run by itself, and requiring a delay before each suspension would cause problems for jobs that are time-slicing. The behavior is actually the same with QOS and partition based preemption when using gang scheduling.
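
As a point of comparison, here is a minimal sketch of a setup where PreemptExemptTime would be honored (illustrative only, not your attached config):

PreemptType=preempt/qos
PreemptMode=REQUEUE
PreemptExemptTime=00:03:00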

Let me know if you have additional detail about the jobs in the different partitions running concurrently and I can look into it further.  Also feel free to let me know if you have any questions about the PreemptExemptTime.

Thanks,
Ben
Comment 5 Alex Mamach 2022-02-17 23:44:20 MST
Hi Ben,

Thanks for the response! That makes sense about the PreemptExemptTime setting; would it be possible to request a documentation update to reflect that this parameter is only compatible with requeue/cancel, and not with gang scheduling/suspension? Additionally, do you have any suggestions on how I could implement a time delay before a job is suspended, effectively allowing the job to gracefully checkpoint or close before the preemption takes place?

On the QOS pre-emption, I did some more testing and it looks like the behavior I described only occurs when the partitions in question do not have PriorityTier values assigned. If you take the QOS config I provided and remove the PriorityTier values from the queues and try QOS preemption from two different partitions, I believe you’ll see the behavior I mentioned. I apologize for not realizing the root of the issue earlier! Is this intentional?

For an example, when I submitted to the long partition in this configuration:

PartitionName=normal Default=YES QOS=normal PriorityTier=1
PartitionName=long  Default=NO  QOS=high PriorityTier=2

Pre-emption with suspension occurred as expected (the high QOS has a priority significantly higher than the normal QOS).

However, submitting with the following partition configuration resulted in jobs submitted to the long partition running simultaneously with jobs on the normal partition on the same core, even though oversubscription is disabled:

PartitionName=normal Default=YES QOS=normal
PartitionName=long  Default=NO  QOS=high

It seems a little surprising since this is QOS preemption rather than partition_prio based preemption.

Thanks!

Alex
Comment 6 Ben Roberts 2022-02-18 11:44:50 MST
Hi Alex,

I've put together a patch to update the documentation, clarifying that the PreemptExemptTime only applies when the PreemptMode is CANCEL or REQUEUE.  It has to go through our review process, but I'll let you know as there is progress with that.

You can add a delay before a job is preempted with the GraceTime parameter. This causes signals (SIGCONT and SIGTERM) to be sent to a job when it is selected for preemption; Slurm then waits the specified amount of time before the job is fully preempted. You can set the grace time on a partition or on a QOS.
https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime
https://slurm.schedmd.com/sacctmgr.html#OPT_GraceTime
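
For example (illustrative values, 120 seconds of grace shown), on a partition:

PartitionName=normal Default=YES QOS=normal GraceTime=120

or on a QOS:

sacctmgr modify qos normal set GraceTime=120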

I can reproduce the behavior when I set the PriorityTier to the same value for my partitions. Thanks for identifying that as a relevant part of reproducing the issue.  This does look like a bug.  The jobs shouldn't be running concurrently when they're on different partitions, regardless of the PriorityTier of the partitions.  We will look into that issue as well.

In the meantime, I would like to make sure you have a solution that works for you. One additional thing I would point out is that when doing suspension, resources held by the suspended job (including memory and generic resources) are not made available to preemptor jobs. For example, if a node has a preemptable job running on it, but the preemptable job requested 12G of the available 16G of RAM on the node, then a job would only be able to preempt the existing job if it requested 4G or less of RAM. If the combined amount of RAM requested by the two jobs exceeds the amount available on the node, the preemptor will be blocked because of resources. When the PreemptMode is REQUEUE, the amount of RAM requested by the preemptable job doesn't matter. The downside of requeuing jobs rather than suspending them is that they have to start from the beginning when they are rescheduled.
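
To put numbers on that example (hypothetical job scripts and QOS names, on a node with 16G of RAM):

sbatch --mem=12G --qos=low  long_job.sh    # preemptable job, keeps its 12G while suspended
sbatch --mem=4G  --qos=high short_job.sh   # 12G + 4G <= 16G, so this preemptor can start
sbatch --mem=8G  --qos=high short_job.sh   # 12G + 8G > 16G, so this preemptor is blocked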

With that in mind, does using partition preemption with Gang scheduling and a GraceTime configured sound like the way you would like to proceed?

Thanks,
Ben
Comment 9 miwalls 2022-02-23 16:18:58 MST
I've noticed other issues as well; this seems severely broken for PreemptType=qos and PreemptMode=SUSPEND,GANG. Jobs in cancel and requeue QOSes get scheduled into what looks like a suspend state, but then get cancelled immediately.

Using a QOS with PreemptMode=cancel, the job tries to start (as it would for suspend) but gets instantly cancelled rather than staying queued. Both requeue and cancel work if you request --mem=0.

If you use a QOS with PreemptMode=requeue, the job immediately gets preempted and requeued, after which it sits in the queue with reason BeginTime and essentially never runs.
 

I'm interested in whether you all see this as well. I'd like to employ all three PreemptModes for our lower-priority jobs, but at the moment it seems we have to choose between REQUEUE/CANCEL and SUSPEND.
Comment 10 miwalls 2022-02-23 16:29:21 MST
I forgot to mention that this probably needs a cluster with real utilization to reproduce. Jobs in the suspend, requeue, and cancel QOSes try to fit onto nodes that are not fully utilizing their RAM (which makes sense), and they don't try to run on nodes whose memory is fully utilized. What happens that shouldn't is that cancel and requeue jobs try to gang schedule on the node with partially utilized RAM and get immediately preempted, while a suspend QOS job gang scheduled on that same node just gets suspended. To me, requeue and cancel jobs should simply wait in the queue rather than being gang scheduled, and either preempt lower-priority jobs or wait for a node to clear up. One other thing to note: if you submit a job with --mem=0 and a requeue or cancel QOS, it schedules more or less as expected, but requiring a 1-core job to request all of a node's memory doesn't make sense. I hope this information helps. For now I've just disabled suspend, at least until cancel and requeue stop gang scheduling in this situation.
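
For context, the QOS setup I'm describing is roughly this (QOS names here are placeholders):

sacctmgr modify qos low-suspend set PreemptMode=suspend
sacctmgr modify qos low-requeue set PreemptMode=requeue
sacctmgr modify qos low-cancel  set PreemptMode=cancel

# the workaround mentioned above: request all of a node's memory
sbatch --mem=0 --qos=low-requeue job.sh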
Comment 13 Ben Roberts 2022-02-25 09:04:20 MST
Hi Alex,

The update to the documentation has been checked in with commit f1456cfffd.  Unfortunately it didn't quite make the cutoff for 21.08.6, so it will be visible on our website with the release of 21.08.7.

I also wanted to verify that you were able to get things working in your environment.  

miwalls@siue.edu, the issues you are describing don't sound exactly like the issues reported in this ticket.  If you have a support contract with SchedMD I would ask that you log these issues in a separate ticket and we would be glad to go over them with you.

Thanks,
Ben
Comment 14 miwalls 2022-02-25 09:08:06 MST
I'm certain it is related. We get the exact same behavior, except I also know the bug goes deeper and affects QOSes with PreemptMode cancel and requeue, not just suspend. I also get asked about support contracts a lot; we do not need or want one. I'm strictly here to report the bug and say we have the same issue. We have reverted from using suspend entirely on our cluster until this is fixed.
Comment 16 Ben Roberts 2022-03-01 13:00:07 MST
Hi Alex,

I wanted to follow up and make sure that you were able to come up with a configuration that meets your needs. The edge case you ran into with QOS preemption and the partitions having the same PriorityTier has an easy workaround, so if you don't have a use case that requires a configuration like this, we may document it as a known limitation.

Thanks,
Ben
Comment 17 Ben Roberts 2022-03-03 13:59:48 MST
*** Bug 9375 has been marked as a duplicate of this bug. ***
Comment 18 Alex Mamach 2022-03-10 13:58:30 MST
Hi Ben,

Thanks for following up! I had a few follow up questions to try and get this set up in a way that works for our researchers.

1. Can you confirm whether my understanding of GraceTime is correct?

My understanding is that GraceTime would be in effect for every SchedulerTimeSlice. So the workflow would be submit -> pending -> suspended -> running -> suspended -> running, etc., meaning that at each transition after 'running' you would get GraceTime before the job was suspended.

2. Given the discussion about PreemptExemptTime not being honored with suspension/gang scheduling, do you have any suggestions on how we might achieve the goal below?


Goal: Supporting two job types:

-a Long-running, low-priority jobs (could run for months): can be preempted and suspended, but should not be preempted until the first 24 hours are up.

-b Short, high-priority jobs (total runtime 24 hours max): cannot be preempted, and will preempt long-running/low-priority jobs for their duration.

Researchers do not want to, or are unable to, write their jobs in a way that lets them be stopped and resumed later, which is why we were looking to implement PreemptExemptTime together with job suspension; researchers have said they are ok with the impact on resource availability that would come with that kind of arrangement.

3. One of my colleagues has proposed a solution to the goals outlined in #2, but we're unsure if it would achieve what we want. Do you think the below would help us achieve the goals we outlined in #2?

In this example we have one partition with all nodes.

preempt/qos

PartitionName=DEFAULT OverSubscribe=FORCE:1 Nodes=qnode[201-203] DefMemPerCPU=800
PartitionName=active Default=YES

qos: normal and expedite


- userA submits a job with QOS=normal (this is the default)

- userB submits a job with QOS=expedite
    - QOS=expedite cannot be preempted
    - QOS=expedite: set MaxTRESMinsPerJob to N minutes (24 hour limit per job)

- userA's job is guaranteed to run for $PreemptExemptTime without being gang scheduled.*

- Once $PreemptExemptTime on userA's job expires it can be preempted and gang scheduled with a maximum of one other job (OverSubscribe=FORCE:1).


Ref: "SchedulerTimeSlice: Number of seconds in each time slice when gang
scheduling is enabled (PreemptMode=SUSPEND,GANG). The value must be
between 5 seconds and 65533 seconds. The default value is 30 seconds."


From observation, the process after job submit for QOS=expedite goes like this (legend: -> = one SchedulerTimeSlice interval):

   job submit PENDING(0) -> SUSPEND(1) -> RUNNING(2) (go to 1, as needed) -> END

As you can see, we waited SchedulerTimeSlice*2 until our job started, so we need an effective PreemptExemptTime.


Example: We want an effective "PreemptExemptTime" of ~3min:

SchedulerTimeSlice=90
PreemptExemptTime=01:30

0: 0:00: QOS=expedite submit job: state==PENDING
1: 1:30: state==SUSPENDED
2: 3:00: state==RUNNING
Now every SchedulerTimeSlice it will swap with the preempted job until one of them finishes.

Meaning we now have an effective PreemptExemptTime of 3 min.

You could now imagine setting the following values to 12 hours for an effective PreemptExemptTime of 24 hours:
SchedulerTimeSlice=43200
PreemptExemptTime=12:00:00

Once PreemptExemptTime expired, the jobs would take turns using the resources every 12 hours.

I believe this means that an expedited 24 hour job may take up to 36 hours to complete in the worst case.
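
Put together, a rough sketch of the configuration this proposal implies (our reading of it, not a tested config; N is the per-job TRES-minutes limit left unspecified above):

# slurm.conf
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=43200
PreemptExemptTime=12:00:00
PartitionName=DEFAULT OverSubscribe=FORCE:1 Nodes=qnode[201-203] DefMemPerCPU=800
PartitionName=active Default=YES

# QOS side (sacctmgr)
sacctmgr modify qos expedite set Preempt=normal MaxTRESMins=cpu=N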
Comment 19 Ben Roberts 2022-03-30 13:21:17 MDT
Hi Alex,

My apologies for the delayed response.  I let your update get lost in the shuffle of other updates.  I hope I'm not responding too late to be of help.

Regarding the GraceTime parameter, it is only honored the first time a job is preempted, not for each SchedulerTimeslice. And I have to walk back my suggestion of using it with the gang scheduler: it looks like GraceTime is only honored when the preempt mode is CANCEL or REQUEUE.

Most of the mechanisms we have in place are meant to accommodate situations where the high-priority jobs are the long ones and you want to allow short, low-priority jobs to fill in some of the spare cycles. But the plan you outline for your situation does line up with how I would approach it. The only modification I would suggest is to decrease the SchedulerTimeSlice value. If you only have the jobs alternate every 12 hours, that can lead to unnecessarily long wait times for jobs to finish. As an example, if there is a job that can complete in 13 hours, the first 12 hours will run, but then it will have to wait for the other job on the node to run for 12 hours before it can finish its final hour. If you have the jobs alternate every 10 minutes, the 13 hour job wouldn't have to wait as long to finish its last hour of run time. This is just a matter of preference though, since it would average out to be close to the same in the end.
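
For example (illustrative value only), alternating every 10 minutes instead of every 12 hours would be:

SchedulerTimeSlice=600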

Again, my apologies for the lack of response on this one.  Please let me know if you have any additional questions about this.

Thanks,
Ben
Comment 20 Alex Mamach 2022-05-02 09:08:21 MDT
Thanks Ben, this is great!