Ticket 2468 - Jobs that exceed account GrpCPURunMins limit have Reason=Resources
Summary: Jobs that exceed account GrpCPURunMins limit have Reason=Resources
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
 
Reported: 2016-02-22 09:23 MST by Ryan Cox
Modified: 2018-11-07 11:52 MST

Site: BYU - Brigham Young University


Attachments
slurm.conf-combined (13.98 KB, text/plain)
2016-02-23 02:49 MST, Ryan Cox

Description Ryan Cox 2016-02-22 09:23:37 MST
We have set a GrpCPURunMins limit of 17280000 for each account.  For some reason, jobs that exceed that limit on their own (e.g. 5000 cores for 5 days) usually have a reason of Resources (assuming high enough priority) rather than something specific to a limit.  Occasionally I have seen them have a status of something like "ReqNodeNotAvail" but it's usually "Priority".  Lower priority jobs are unable to start in that partition if Reason=Resources.

The fix Tim suggested in bug 2465 (assoc_limit_continue) seems to work if the job changes its reason to reflect that it would violate a limit.  If it doesn't, the job just holds up other jobs while never having any possibility of starting itself.  I assume that the Reason is a symptom of the problem.

So far we can reproduce this by submitting a job with "-p m8 -t 5-0 -n 5000" under an account with GrpTRESRunMin=cpu=17280000.  After submitting that job, we use a different user to submit lower priority jobs of various sizes and time limits in the m8 partition.  If the 5000 core job has a reason of Resources at the time the other jobs are submitted, the other jobs won't start even if sufficient resources free up.
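
Roughly, the reproduction looks like this (the account name acct1 and the --wrap payloads are just placeholders, not what we actually run; the sacctmgr spelling of the limit is GrpTRESRunMins):

    # set the association limit on the account
    sacctmgr modify account name=acct1 set GrpTRESRunMins=cpu=17280000

    # submit the large job that exceeds the limit on its own
    sbatch -p m8 -t 5-0 -n 5000 --wrap="sleep infinity"

    # then, as a different lower-priority user, submit smaller jobs in the same partition
    sbatch -p m8 -t 1:00:00 -n 16 --wrap="sleep infinity"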

One thing I tried is to hold and release the job.  After that it gets BadConstraints and doesn't seem to affect other jobs even though it still has a high priority.  After an scontrol reconfigure, the priority drops to 0.

An scontrol reconfigure does not seem to affect a job that has Reason=Resources and hasn't been held/released before.

A restart of slurmctld didn't seem to have any effect on the 5000 core, 5 day job's reason or on lower priority jobs' ability to start.
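
For reference, the hold/release and reconfigure checks above were along these lines (the job id 12345 is a placeholder, and the slurmctld restart method depends on the init system):

    scontrol hold 12345       # after a subsequent release, the job showed Reason=BadConstraints
    scontrol release 12345
    scontrol reconfigure      # after this, the held/released job's priority dropped to 0
    # restarting slurmctld had no visible effect on the job's reason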

I'm not quite sure what else to try but it's looking like it's easy to reproduce on our system.
Comment 1 Ryan Cox 2016-02-22 09:25:18 MST
That first paragraph had a sentence that read 'Occasionally I have seen them have a status of something like "ReqNodeNotAvail" but it's usually "Priority"'. It should have read 'Occasionally I have seen them have a status of something like "ReqNodeNotAvail" but it's usually "Resources"'.
Comment 2 Tim Wickberg 2016-02-23 02:34:32 MST
Do you have AccountingStorageEnforce=safe? Note there are some side-effects of that (the man page can describe it better than I can), but I believe it should block that job from being submitted at all.

Does that partition have 5000 cores available? Is Resources a valid pending reason here? If you don't mind, can you attach a recent slurm.conf?

assoc_limit_continue only continues when the job has an Assoc*Limit reason - if the job is stuck on resources, the normal blocking mechanics still remain to prevent smaller jobs from stalling it indefinitely.
Comment 3 Ryan Cox 2016-02-23 02:43:27 MST
(In reply to Tim Wickberg from comment #2)
> Do you have AccountingStorageEnforce=safe ? Note there are some side-effects
> of that (the man page can describe it better than I), but I believe it
> should block that job from being submitted at all.

AccountingStorageEnforce=associations,limits,qos

I have to admit, the safe option isn't quite clear to me.  The manpage sounds specific to GrpCPUMins and not GrpCPURunMins, but I'm not sure.  I should add that I haven't looked at AccountingStorageEnforce since setting up Slurm in 2013 so I may need to read some more about what it does and play with it on a test system.
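
For my own notes, a quick way to check the current value and what adding safe would look like (sketch only, untested here):

    # current value on the running controller
    scontrol show config | grep -i AccountingStorageEnforce

    # slurm.conf with the safe flag added
    AccountingStorageEnforce=associations,limits,qos,safe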

> Does that partition have 5000 cores available? Is resources a valid pending
> reason here? If you don't mind, can you attach a recent slurm.conf?

Yes, that partition has 7680 cores.  Not all are available due to hardware failure, but much more than 5000 are available.  Resources would be a valid reason for the job if not for the job being over the association's GrpCPURunMins limit of 17280000.

I'll attach our slurm.conf in a minute.

> assoc_limit_continue only continues when the job has Assoc*Limit as a reason
> - if the job is stuck on resources the normal blocking mechanics still
> remain to prevent smallers job from stalling it indefinitely.

Understood.
Comment 4 Ryan Cox 2016-02-23 02:49:08 MST
Created attachment 2761 [details]
slurm.conf-combined

This is our slurm.conf.  Normally it includes external files with Include but I just appended those files to the bottom.
Comment 5 Tim Wickberg 2016-02-23 04:02:10 MST
>> Do you have AccountingStorageEnforce=safe ? Note there are some side-effects
>> of that (the man page can describe it better than I), but I believe it
>> should block that job from being submitted at all.
>
> AccountingStorageEnforce=associations,limits,qos
>
> I have to admit, the safe option isn't quite clear to me.  The manpage sounds
> specific to GrpCPUMins and not GrpCPURunMins, but I'm not sure.  I should add
> that I haven't looked at AccountingStorageEnforce since setting up Slurm in
> 2013 so I may need to read some more about what it does and play with it on a
> test system.

I now regret saying the man page can describe it better; I'll see if I 
can clean that description up, or at least break the run-on sentences 
into smaller chunks. QOS limits, association limits, and their interplay 
could all stand to have better documentation.

I believe you're right on GrpCPURunMins not being affected there.

I'm guessing the QOS in question doesn't have the DenyOnLimit flag set? 
Although it looks like that won't enforce GrpCPURunMins, only Max* limits.

>> Does that partition have 5000 cores available? Is resources a valid pending
>> reason here? If you don't mind, can you attach a recent slurm.conf?
>
> Yes, that partition has 7680 cores.  Not all are available due to hardware
> failure, but much more than 5000 are available.  Resources would be a valid
> reason for the job if not for the job being over the association's
> GrpCPURunMins limit of 17280000.

I'm looking into reproducing that now.
Comment 6 Ryan Cox 2016-02-23 04:05:23 MST
(In reply to Tim Wickberg from comment #5)
> I'm guessing the QOS in question doesn't have the DenyOnLimit flag set? 
> Although it looks like that won't enforce GrpCPURunMins, only Max* limits.

Correct.
Comment 7 Ryan Cox 2016-02-23 05:08:27 MST
One other thing I just noticed is that if there are 5000 cores actually available (as opposed to the total configured) for the job to start, the job correctly gets a reason of AssocGrpCPURunMinutesLimit.  If the job doesn't have 5000 cores available to use, the reason is Resources.  The status does change over time as resources are freed up or consumed.

In other words:
If the job would violate the GrpCPURunMins limit AND there are free resources, the job gets Reason=AssocGrpCPURunMinutesLimit.
If the job would violate the GrpCPURunMins limit BUT there are insufficient free resources anyway, the job gets Reason=Resources.
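
For anyone trying to reproduce this, the transition is easy to watch with something like the following (job id 12345 is a placeholder):

    # state, priority, and pending reason for the 5000 core job
    squeue -j 12345 -o "%.10i %.12T %.10Q %r"

    # or the whole partition at once
    squeue -p m8 -o "%.10i %.9u %.12T %.10Q %r"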
Comment 8 Tim Wickberg 2016-02-23 06:40:52 MST
The main scheduler checks each job in the partition in order of priority 
- if the highest priority eligible job can't start due to resources, it 
will mark which future resources it expects to assign, then skip past 
the rest of the jobs in the partition, and the backfill scheduler then 
gets a chance to fill in the gaps. Without that mechanism larger jobs 
would never be able to start.

Only after resource assignment does it check the association limits 
(which weren't originally expected to be constantly in force, whereas 
people are now building extensive policies with them, so they end up 
limiting more frequently than expected). It sounds like that order may 
need to change, with association limits checked more frequently, but 
that would have a significant impact on how the core scheduler operates 
and is not something we'd do lightly.

So until that job passes the resource check, it won't fall through to 
the association limits, and thus the assoc_limit_continue flag won't 
have any impact. The rest of the state transitions you described I'm 
less sure of, but I believe once the job has made it to 
AssocGrpCPURunMinutesLimit there may be more extensive checks before it 
becomes eligible again.

I suspect some of that is due to subtle differences between the backfill 
and main schedulers, and I will have to look into this further.

Comment 9 Tim Wickberg 2016-02-24 07:31:37 MST
Bug 2472 describes why the jobs jumped to 'BadConstraints'; the fix for that is in commits bd9fa8300b1 and de28c13a159d.

That covers only part of this though - it does not alleviate the issue of the job pending Resources rather than the Assoc*Limit. We're still looking into that, although at least now the behavior is more predictable.

One approach may be to extend DenyOnLimit to at least deny jobs exceeding the Grp limits, but that only solves part of the issue as well - large jobs that are near but not exceeding the limit would likely still cause the queue to starve. Whether that's intended behavior at that point I'm not sure; I will look into it further.
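
For reference, DenyOnLimit is a per-QOS flag, set along these lines (the QOS name is a placeholder, and as noted it currently only applies to Max* limits, not Grp*):

    sacctmgr modify qos name=normal set flags=DenyOnLimit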
Comment 10 Ryan Cox 2016-02-24 07:52:44 MST
(In reply to Tim Wickberg from comment #9)
> One approach may be to extend DenyOnLimit to at least deny jobs exceeding
> the Grp limits, but that only solves part of the issue as well - large jobs
> that are near but not exceeding the limit would likely still cause the queue
> to starve. Whether that's intended behavior at that point I'm not sure, and
> will look into further.

Thanks for the update.  That could be a partial solution if jobs that single-handedly exceed a limit are denied since there is no way they can ever run, assuming the limits stay the same.  If you do that I suggest you make the behavior optional (though probably the default) since someone somewhere will want to change limits every 5 minutes up or down just to make things more complicated :)

What I wouldn't want is for job submissions to be denied for temporary conditions, such as when GrpCpus=10 and the user already has 10 CPUs in use.  That usage will drop over time so I don't think the submission should be denied.  That's different than if the user submits a job requesting 15 CPUs, a submission that should probably be denied since it can't run.

As you mention, that's just a partial solution that could help in particular circumstances.
Comment 11 Tim Wickberg 2016-02-24 07:58:23 MST
> Thanks for the update.  That could be a partial solution if jobs that
> single-handedly exceed a limit are denied since there is no way they can ever
> run, assuming the limits stay the same.  If you do that I suggest you make the
> behavior optional (though probably the default) since someone somewhere will
> want to change limits every 5 minutes up or down just to make things more
> complicated :)

I've got to discuss it a bit, but my assumption (which is a warning flag 
to Danny when he reads this) is that if you set DenyOnLimit, you should 
be ready for that to Deny things on limit. Why it doesn't do that right 
now for Grp vs Max limits I need to look into.

But if you're playing games by moving those values around constantly you 
probably should steer clear of that flag anyways.

> What I wouldn't want is for job submissions to be denied for temporary
> conditions, such as when GrpCpus=10 and the user already has 10 CPUs in use.
> That usage will drop over time so I don't think the submission should be
> denied.  That's different than if the user submits a job requesting 15 CPUs, a
> submission that should probably be denied since it can't run.

Yeah, that's what I'd expect it to do - only deny if it could never be 
run given the current Grp limits, ignoring any current usage which would 
decrease over time.
Comment 12 Ryan Cox 2016-02-24 08:06:06 MST
One other thing I should clarify just in case it got lost through the series of comments above is that these are association, not QOS limits.  We do have some QOS limits in place for some QOS's, some with DenyOnLimit and some without.  I haven't tested the QOS behavior.  I'm pretty sure we're talking about the same thing currently but it never hurts to make sure :)
Comment 13 Tim Wickberg 2016-03-15 08:18:27 MDT
... sorry for the delay, my fault on this.

The QOS / Associations are closely related, and should both lead to this. assoc_limit_continue should override both of those.
Comment 14 Ryan Cox 2016-03-15 10:05:08 MDT
(In reply to Tim Wickberg from comment #13)
> ... sorry for the delay, my fault on this.
> 
> The QOS / Associations are closely related, and should both lead to this.
> assoc_limit_continue should override both of those.

OK. Thanks.
Comment 15 Danny Auble 2016-03-29 08:50:25 MDT
Ryan, is there anything else on this?
Comment 16 Tim Wickberg 2016-04-26 05:58:17 MDT
Ryan -

Please reopen if you have further issues on this... I haven't seen anything since Danny asked for an update last month, so I'm assuming it's okay for now.

assoc_limit_continue will be the default behavior in 16.05; assoc_limit_stop replaces it and inverts the logic.
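
For reference, the relevant slurm.conf setting looks roughly like this (any other SchedulerParameters you already use would be listed alongside it):

    # 15.08: opt in to the new behavior
    SchedulerParameters=assoc_limit_continue

    # 16.05 and later: that behavior is the default; to restore the old behavior
    SchedulerParameters=assoc_limit_stop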

- Tim