We have set a GrpCPURunMins limit of 17280000 for each account. For some reason, jobs that exceed that limit on their own (e.g. 5000 cores for 5 days) usually have a reason of Resources (assuming high enough priority) rather than something specific to a limit. Occasionally I have seen them have a status of something like "ReqNodeNotAvail" but it's usually "Priority". Lower priority jobs are unable to start in that partition if Reason=Resources.

The fix Tim suggested in bug 2465 (assoc_limit_continue) seems to work if the job changes its reason to reflect that it would violate a limit. If it doesn't, the job just holds up other jobs while never having any possibility of starting. I assume the Reason is a symptom of the problem.

So far we can reproduce this by submitting a job with "-p m8 -t 5-0 -n 5000" under an account with GrpTRESRunMin=cpu=17280000. After submitting that job, we use a different user to submit lower priority jobs of various sizes and time limits in the m8 partition. If the 5000 core job has a reason of Resources at the time the other jobs are submitted, the other jobs won't start even if sufficient resources free up.

One thing I tried is to hold and release the job. After that it gets BadConstraints and doesn't seem to affect other jobs, even though it still has a high priority. After an scontrol reconfigure, its priority drops to 0. An scontrol reconfigure does not seem to affect a job that has Reason=Resources and hasn't been held/released before. A restart of slurmctld didn't seem to have any effect on the 5000 core, 5 day job's reason or on lower priority jobs' ability to start. I'm not quite sure what else to try, but it looks like it's easy to reproduce on our system.
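For reference, the arithmetic behind why the reproduction job can never run under this limit, as a quick sketch (numbers taken from the report above; the variable names are just for illustration):

```python
# CPU-minutes requested by the reproduction job: 5000 cores for 5 days.
cores = 5000
days = 5
requested_cpu_minutes = cores * days * 24 * 60  # 36,000,000

# The account's GrpCPURunMins (GrpTRESRunMin=cpu=...) limit.
grp_cpu_run_mins = 17_280_000

# The single job alone more than doubles the limit, so it can never
# start while the limit stays in place.
print(requested_cpu_minutes > grp_cpu_run_mins)  # True
```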
That first paragraph had a sentence that read 'Occasionally I have seen them have a status of something like "ReqNodeNotAvail" but it's usually "Priority"' but should read 'Occasionally I have seen them have a status of something like "ReqNodeNotAvail" but it's usually "Resources"'
Do you have AccountingStorageEnforce=safe ? Note there are some side-effects of that (the man page can describe it better than I can), but I believe it should block that job from being submitted at all.

Does that partition have 5000 cores available? Is Resources a valid pending reason here? If you don't mind, can you attach a recent slurm.conf?

assoc_limit_continue only continues past a job when it has Assoc*Limit as its reason - if the job is stuck on Resources, the normal blocking mechanics still apply to prevent smaller jobs from stalling it indefinitely.
(In reply to Tim Wickberg from comment #2)

> Do you have AccountingStorageEnforce=safe ? Note there are some side-effects
> of that (the man page can describe it better than I), but I believe it
> should block that job from being submitted at all.

AccountingStorageEnforce=associations,limits,qos

I have to admit, the safe option isn't quite clear to me. The manpage sounds specific to GrpCPUMins and not GrpCPURunMins, but I'm not sure. I should add that I haven't looked at AccountingStorageEnforce since setting up Slurm in 2013, so I may need to read more about what it does and play with it on a test system.

> Does that partition have 5000 cores available? Is resources a valid pending
> reason here? If you don't mind, can you attach a recent slurm.conf?

Yes, that partition has 7680 cores. Not all are available due to hardware failure, but much more than 5000 are available. Resources would be a valid reason for the job if not for the job being over the association's GrpCPURunMins limit of 17280000. I'll attach our slurm.conf in a minute.

> assoc_limit_continue only continues when the job has Assoc*Limit as a reason
> - if the job is stuck on resources the normal blocking mechanics still
> remain to prevent smaller jobs from stalling it indefinitely.

Understood.
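For reference, the difference under discussion is a single token in slurm.conf. An illustrative fragment (per the discussion here, the safe flag is believed not to cover GrpCPURunMins; see the slurm.conf man page for the authoritative description):

```
# Current setting from this report:
AccountingStorageEnforce=associations,limits,qos

# With the safe option added (illustrative only):
AccountingStorageEnforce=associations,limits,qos,safe
```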
Created attachment 2761 [details] slurm.conf-combined This is our slurm.conf. Normally it includes external files with Include but I just appended those files to the bottom.
>> Do you have AccountingStorageEnforce=safe ? Note there are some side-effects
>> of that (the man page can describe it better than I), but I believe it
>> should block that job from being submitted at all.
>
> AccountingStorageEnforce=associations,limits,qos
>
> I have to admit, the safe option isn't quite clear to me. The manpage sounds
> specific to GrpCPUMins and not GrpCPURunMins, but I'm not sure. I should add
> that I haven't looked at AccountingStorageEnforce since setting up Slurm in
> 2013 so I may need to read some more about what it does and play with it on a
> test system.

I now regret saying the man page can describe it better; I'll see if I can clean that description up, or at least break the run-on sentences into smaller chunks. QOS limits, association limits, and the interplay between them could all stand to have better documentation.

I believe you're right that GrpCPURunMins is not affected there.

I'm guessing the QOS in question doesn't have the DenyOnLimit flag set? Although it looks like that won't enforce GrpCPURunMins, only Max* limits.

>> Does that partition have 5000 cores available? Is resources a valid pending
>> reason here? If you don't mind, can you attach a recent slurm.conf?
>
> Yes, that partition has 7680 cores. Not all are available due to hardware
> failure, but much more than 5000 are available. Resources would be a valid
> reason for the job if not for the job being over the association's
> GrpCPURunMins limit of 17280000.

I'm looking into reproducing that now.
(In reply to Tim Wickberg from comment #5) > I'm guessing the QOS in question doesn't have the DenyOnLimit flag set? > Although it looks like that won't enforce GrpCPURunMins, only Max* limits. Correct.
One other thing I just noticed: if there are 5000 cores actually available (as opposed to the total configured) for the job to start, the job correctly gets a reason of AssocGrpCPURunMinutesLimit. If the job doesn't have 5000 cores available to use, the reason is Resources. The reason does change over time as resources are freed up or consumed.

In other words:
- If the job would violate the GrpCPURunMins limit AND there are free resources, the job gets Reason=AssocGrpCPURunMinutesLimit.
- If the job would violate the GrpCPURunMins limit BUT there are insufficient free resources anyway, the job gets Reason=Resources.
The main scheduler checks each job in the partition in order of priority. If the highest priority eligible job can't start due to resources, the scheduler marks which future resources it expects to assign to that job, then skips past the rest of the jobs in the partition; the backfill scheduler then gets a chance to fill in the gaps. Without that mechanism, larger jobs would never be able to start.

Only after resource assignment does it check the association limits (which weren't originally expected to be constantly in force, whereas people are now building extensive policies with them, leading to the limits kicking in more frequently than expected). It sounds like that order may need to change, with association limits checked earlier, but that would be a significant change to how the core scheduler operates and is not something we'd do lightly. So until that job passes the resource check, it won't fall through to the association limit checks, and thus the assoc_limit_continue flag won't have any effect.

The rest of the state transitions you described I'm less sure of, but I believe once the job made it to AssocGrpCPURunMinutesLimit there may be more extensive checks before it becomes eligible again. I suspect some of that is due to subtle differences between the backfill and main schedulers; I'll have to look into this further.
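The check ordering Tim describes, and the two observed Reasons it produces, can be sketched as a toy function. This is illustrative only; the function name and parameters are invented for this sketch and are not Slurm internals:

```python
def pending_reason(free_cores, requested_cores, requested_cpu_minutes,
                   grp_cpu_run_mins_remaining):
    """Illustrative only: mimics the check order described above."""
    # 1. The main scheduler evaluates resource fit first.
    if requested_cores > free_cores:
        return "Resources"
    # 2. Only a job that fits resource-wise falls through to the
    #    association-limit checks.
    if requested_cpu_minutes > grp_cpu_run_mins_remaining:
        return "AssocGrpCPURunMinutesLimit"
    return "None"  # the job can start

# The 5000-core, 5-day job (36,000,000 CPU-minutes) vs the 17,280,000 limit:
print(pending_reason(4000, 5000, 36_000_000, 17_280_000))  # Resources
print(pending_reason(6000, 5000, 36_000_000, 17_280_000))  # AssocGrpCPURunMinutesLimit
```

This matches the observation in comment #7: the limit-specific reason only surfaces once enough cores are actually free for the job.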
Bug 2472 describes why the jobs jumped to 'BadConstraints'; the fix for that is in commits bd9fa8300b1 and de28c13a159d.

That covers only part of this, though - it does not alleviate the issue of the job pending on Resources rather than the Assoc*Limit. We're still looking into that, although at least now the behavior is more predictable.

One approach may be to extend DenyOnLimit to deny jobs exceeding the Grp limits as well, but that only solves part of the issue too - large jobs that are near but not exceeding the limit would likely still cause the queue to starve. Whether that's intended behavior at that point I'm not sure; I will look into it further.
(In reply to Tim Wickberg from comment #9)

> One approach may be to extend DenyOnLimit to at least deny jobs exceeding
> the Grp limits, but that only solves part of the issue as well - large jobs
> that are near but not exceeding the limit would likely still cause the queue
> to starve. Whether that's intended behavior at that point I'm not sure, and
> will look into further.

Thanks for the update. That could be a partial solution: if jobs that single-handedly exceed a limit are denied, nothing of value is lost, since there is no way they could ever run, assuming the limits stay the same. If you do that, I suggest you make the behavior optional (though probably the default), since someone somewhere will want to change limits every 5 minutes up or down just to make things more complicated :)

What I wouldn't want is for job submissions to be denied for temporary conditions, such as when GrpCpus=10 and the user already has 10 CPUs in use. That usage will drop over time, so I don't think the submission should be denied. That's different from a user submitting a job requesting 15 CPUs, a submission that should probably be denied since it can never run. As you mention, that's just a partial solution that could help in particular circumstances.
> Thanks for the update. That could be a partial solution if jobs that
> single-handedly exceed a limit are denied since there is no way they can ever
> run, assuming the limits stay the same. If you do that I suggest you make the
> behavior optional (though probably the default) since someone somewhere will
> want to change limits every 5 minutes up or down just to make things more
> complicated :)

I've got to discuss it a bit, but my assumption (which is a warning flag to Danny when he reads this) is that if you set DenyOnLimit, you should be ready for it to deny things on limit. Why it doesn't currently do that for Grp limits, as opposed to Max limits, I need to look into. But if you're playing games by constantly moving those values around, you probably should steer clear of that flag anyway.

> What I wouldn't want is for job submissions to be denied for temporary
> conditions, such as when GrpCpus=10 and the user already has 10 CPUs in use.
> That usage will drop over time so I don't think the submission should be
> denied. That's different than if the user submits a job requesting 15 CPUs, a
> submission that should probably be denied since it can't run.

Yeah, that's what I'd expect it to do - only deny if the job could never run given the current Grp limits, ignoring any current usage, which will decrease over time.
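The submit-time rule both sides converge on can be sketched as a small check: reject only requests that could never fit under the Grp limit, regardless of current usage (hypothetical logic for illustration, not actual Slurm code):

```python
def deny_at_submit(requested_cpus, grp_cpus_limit):
    """Deny only if the job could never run even with zero other usage.

    Current usage is deliberately ignored: it will drain over time,
    so a job that fits under the bare limit should be accepted.
    """
    return requested_cpus > grp_cpus_limit

# GrpCpus=10: a 15-CPU request can never run; a 10-CPU request eventually
# can, even if the user currently has all 10 CPUs busy.
print(deny_at_submit(15, 10))  # True  -> reject at submission
print(deny_at_submit(10, 10))  # False -> accept; it can run once usage drops
```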
One other thing I should clarify, just in case it got lost in the series of comments above: these are association limits, not QOS limits. We do have QOS limits in place for some QOSes, some with DenyOnLimit and some without. I haven't tested the QOS behavior. I'm pretty sure we're currently talking about the same thing, but it never hurts to make sure :)
... sorry for the delay, my fault on this.

QOS and association limits are closely related, and both should lead to this behavior. assoc_limit_continue should override both of those.
(In reply to Tim Wickberg from comment #13)

> ... sorry for the delay, my fault on this.
>
> The QOS / Associations are closely related, and should both lead to this.
> assoc_limit_continue should override both of those.

OK. Thanks.
Ryan, is there anything else on this?
Ryan - Please reopen if you have further issues on this... I haven't seen anything since Danny asked for an update last month, so I'm assuming it's okay for now.

assoc_limit_continue will be the default behavior in 16.05; assoc_limit_stop replaces it and inverts the logic.

- Tim
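For readers arriving at this bug later, the two SchedulerParameters flags mentioned here would appear in slurm.conf roughly as follows (illustrative fragment based on the versions discussed above):

```
# Pre-16.05: opt in to letting the scheduler skip past jobs that are
# pending on an Assoc*Limit reason so lower-priority jobs can start.
SchedulerParameters=assoc_limit_continue

# 16.05 and later: that behavior is the default, and the inverse
# opt-out flag is used instead.
SchedulerParameters=assoc_limit_stop
```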