Created attachment 2755 [details]
squeue_2016-02-22T09:16

We have a fairly high-priority job (9535820) that isn't starting despite free resources, and I can't figure out why. It's a pretty large job for our system and, as far as I can tell, it is preventing other jobs from starting or even backfilling. I think it has to do with the fact that there is another job ahead of it (9530664) from the same user that exceeds the per-account GrpCPURunMins limit of 17280000.

root@sched1:~# scontrol show job 9530664
JobId=9530664 JobName=m4webb-hcc-test-000
   UserId=m4webb(20680) GroupId=m4webb(20812)
   Priority=101372 Nice=0 Account=jchpde QOS=normal
   JobState=PENDING Reason=AssocGrpCPURunMinutesLimit Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2016-02-16T10:25:30 EligibleTime=2016-02-16T10:25:30
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m8 AllocNode:Sid=m7int01:6957
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=219 NumCPUs=5000 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5000,mem=10240000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bluehome3/m4webb/slurmwork/sampler.5.sh
   WorkDir=/bluehome3/m4webb/slurmwork
   StdErr=/bluehome3/m4webb/slurmwork/slurm-9530664.out
   StdIn=/dev/null
   StdOut=/bluehome3/m4webb/slurmwork/slurm-9530664.out
   Power= SICP=0

root@sched1:~# scontrol show job 9535820
JobId=9535820 JobName=m4webb-hcc-test-000
   UserId=m4webb(20680) GroupId=m4webb(20812)
   Priority=101372 Nice=0 Account=jchpde QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2016-02-16T17:50:48 EligibleTime=2016-02-16T17:50:48
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m6,m8 AllocNode:Sid=m8int01:34630
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=167 NumCPUs=2000 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2000,mem=1024000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bluehome3/m4webb/slurmwork/sampler.sh
   WorkDir=/bluehome3/m4webb/slurmwork
   StdErr=/bluehome3/m4webb/slurmwork/slurm-9535820.out
   StdIn=/dev/null
   StdOut=/bluehome3/m4webb/slurmwork/slurm-9535820.out
   Power= SICP=0

I'll attach logs with the +backfill and +backfillmap debug flags enabled.
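For what it's worth, the arithmetic alone shows why 9530664 trips the limit: GrpCPURunMins caps the sum of (CPUs × remaining run minutes) over the account's running jobs, and this job's own request at its full time limit already exceeds the cap. A quick sanity check of the numbers from the scontrol output above (the `cpu_run_mins` helper here is purely illustrative, not part of any Slurm API):

```python
# Sanity-check the GrpCPURunMins arithmetic for job 9530664.
# This ignores any other running jobs in the account and only
# computes the pending job's own contribution to the limit.

def cpu_run_mins(num_cpus: int, time_limit_days: float) -> int:
    """CPU-minutes a job would consume if it ran to its full time limit."""
    return num_cpus * int(time_limit_days * 24 * 60)

GRP_CPU_RUN_MINS_LIMIT = 17_280_000  # per-account limit from this report

# Job 9530664: NumCPUs=5000, TimeLimit=5-00:00:00
demand = cpu_run_mins(5000, 5)
print(demand)                           # 36000000
print(demand > GRP_CPU_RUN_MINS_LIMIT)  # True: blocked even on an idle cluster
```

So the job can never start under this limit regardless of cluster load, which matches the AssocGrpCPURunMinutesLimit reason shown above.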
Created attachment 2756 [details] sinfo_2016-02-22T09:16
Created attachment 2757 [details] slurmctld.log.2016-02-22T09:38.gz
I haven't looked through the logs yet, but can review them further if this doesn't answer your question.

The reason 9530664 won't start is AssocGrpCPURunMinutesLimit. That part is working as designed: once you hit the limit, the job won't start.

What I suspect you're really asking is why other jobs won't run, and why the partition isn't scheduling anything else. For each partition, once the scheduler finds (in priority order) the job it expects to start next, it skips past all lower-priority jobs for that partition, even if that job is blocked by an association limit and will not actually launch immediately. There's a tradeoff here: if we don't block on that job, a series of smaller jobs could keep launching and stall the large job from ever starting.

This behavior changed around 15.08.5. Before that, a job waiting on an association limit had its start time moved five minutes into the future, which allowed backfill to schedule and launch some jobs (though not all, depending on the state of the queue; it was a loose compromise between stalling the queue and letting small jobs through, and didn't satisfy either case well), but it could leave the limited job being pushed back indefinitely.

I added a flag to SchedulerParameters in 15.08.8 that controls this behavior. If you set "assoc_limit_continue", jobs held due to association limits will be skipped over during the scheduling pass, and lower-priority jobs in that partition will be eligible to start. This may prevent those larger jobs from ever launching, but it restores the older behavior.
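For reference, a minimal sketch of what enabling the flag looks like in slurm.conf, assuming the backfill scheduler and no other SchedulerParameters already set (if you have existing values, the options are comma-separated on the same line):

```
# slurm.conf -- illustrative fragment; merge with your existing settings
SchedulerType=sched/backfill
SchedulerParameters=assoc_limit_continue
```

After editing the file, `scontrol reconfigure` (or a slurmctld restart) applies the change.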
Thanks, Tim. I'll upgrade and enable that flag. assoc_limit_continue seems like the most logical choice to me, since if the job can't run, the job can't run. Sure, you could do calculations to figure out when the user, association, QOS, or whatever will drop below its limit in the future, but that seems far too expensive and error-prone.
Upgrade complete, feature enabled, and jobs have started. Thanks.
(In reply to Ryan Cox from comment #5) > Upgrade complete, feature enabled, and jobs have started. Thanks. Glad that fixed it for you. Marking as a duplicate of 2388 where this first came up. *** This ticket has been marked as a duplicate of ticket 2388 ***