Ticket 2465 - Job not starting
Summary: Job not starting
Status: RESOLVED DUPLICATE of ticket 2388
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.7
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-02-22 02:37 MST by Ryan Cox
Modified: 2016-02-22 04:47 MST

See Also:
Site: BYU - Brigham Young University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
squeue_2016-02-22T09:16 (313.01 KB, text/plain) - 2016-02-22 02:37 MST, Ryan Cox
sinfo_2016-02-22T09:16 (4.05 KB, text/plain) - 2016-02-22 02:38 MST, Ryan Cox
slurmctld.log.2016-02-22T09:38.gz (130.77 KB, application/gzip) - 2016-02-22 02:40 MST, Ryan Cox

Description Ryan Cox 2016-02-22 02:37:35 MST
Created attachment 2755
squeue_2016-02-22T09:16

We have a fairly high priority job (9535820) that isn't starting in spite of free resources and I can't figure out why.  It's a pretty large job for our system and is preventing other jobs from starting or even backfilling, as far as I can tell.  I think it has to do with the fact that there is another job before it (9530664) from the same user that exceeds the per-account GrpCPURunMins limit of 17280000.
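
(For reference, the per-account limit can be checked with sacctmgr; this is illustrative, since the exact format fields vary a bit by version:)

sacctmgr show assoc where account=jchpde format=Account,User,GrpCPURunMins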


root@sched1:~# scontrol show job 9530664
JobId=9530664 JobName=m4webb-hcc-test-000
   UserId=m4webb(20680) GroupId=m4webb(20812)
   Priority=101372 Nice=0 Account=jchpde QOS=normal
   JobState=PENDING Reason=AssocGrpCPURunMinutesLimit Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2016-02-16T10:25:30 EligibleTime=2016-02-16T10:25:30
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m8 AllocNode:Sid=m7int01:6957
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=219 NumCPUs=5000 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5000,mem=10240000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bluehome3/m4webb/slurmwork/sampler.5.sh
   WorkDir=/bluehome3/m4webb/slurmwork
   StdErr=/bluehome3/m4webb/slurmwork/slurm-9530664.out
   StdIn=/dev/null
   StdOut=/bluehome3/m4webb/slurmwork/slurm-9530664.out
   Power= SICP=0

root@sched1:~# scontrol show job 9535820
JobId=9535820 JobName=m4webb-hcc-test-000
   UserId=m4webb(20680) GroupId=m4webb(20812)
   Priority=101372 Nice=0 Account=jchpde QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2016-02-16T17:50:48 EligibleTime=2016-02-16T17:50:48
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m6,m8 AllocNode:Sid=m8int01:34630
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=167 NumCPUs=2000 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2000,mem=1024000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bluehome3/m4webb/slurmwork/sampler.sh
   WorkDir=/bluehome3/m4webb/slurmwork
   StdErr=/bluehome3/m4webb/slurmwork/slurm-9535820.out
   StdIn=/dev/null
   StdOut=/bluehome3/m4webb/slurmwork/slurm-9535820.out
   Power= SICP=0

I'll attach logs with the +backfill and +backfillmap debug flags enabled.
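
(For completeness, these can be toggled at runtime with scontrol; illustrative, assuming the 15.08 flag names Backfill and BackfillMap:)

scontrol setdebugflags +backfill
scontrol setdebugflags +backfillmap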
Comment 1 Ryan Cox 2016-02-22 02:38:03 MST
Created attachment 2756
sinfo_2016-02-22T09:16
Comment 2 Ryan Cox 2016-02-22 02:40:46 MST
Created attachment 2757
slurmctld.log.2016-02-22T09:38.gz
Comment 3 Tim Wickberg 2016-02-22 03:57:28 MST
I haven't looked through the logs yet, but can review them further if this doesn't answer your question:

The reason 9530664 won't start is AssocGrpCPURunMinutesLimit. That part is working as designed - once you hit the limit it won't start.
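
(As a rough illustration, assuming GrpCPURunMins counts allocated CPUs times the remaining wall-clock minutes of running jobs: 9530664 alone asks for 5000 CPUs for 5 days, i.e. 5000 x 7200 min = 36,000,000 CPU-minutes, well over the 17,280,000 limit, so it cannot clear that limit as requested.)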

What I suspect you're really asking is: why won't other jobs run, and why is the partition not scheduling anything else?

For each partition, once the scheduler finds (walking in priority order) the job it expects to start next, it will skip past all lower-priority jobs for that partition - even if that job is blocked by an association limit and will not actually launch immediately.

There's a tradeoff here - if we don't block on this request, a series of smaller jobs could keep launching instead and stall the large job from ever launching. This behavior changed around 15.08.5. Before that, a job waiting on an association limit had its start time pushed five minutes into the future, which allowed backfill to schedule and launch some jobs (though not all, depending on the queue state... it was a loose compromise between stalling and allowing small jobs through that didn't satisfy either case well), but it could leave the limited job being pushed back indefinitely.

I added a flag to SchedulerParameters in 15.08.8 that controls this behavior. If you set "assoc_limit_continue", jobs held due to association limits will be skipped over during the scheduling pass, and lower-priority jobs in that partition will be eligible to start. This may prevent those larger jobs from ever launching, but it restores the older behavior.
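
(A minimal sketch of the change, assuming SchedulerParameters is managed in slurm.conf; merge the flag into whatever comma-separated list is already there:)

# slurm.conf, 15.08.8 or later
SchedulerParameters=assoc_limit_continue

# push the change to slurmctld without a restart
scontrol reconfigure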
Comment 4 Ryan Cox 2016-02-22 04:08:04 MST
Thanks, Tim.  I'll upgrade and enable that flag.

assoc_limit_continue seems like the most logical option to me since, if the job can't run, the job can't run.  Sure, you could do calculations to figure out when the user, association, QOS, or whatever will drop below its limit in the future, but that seems way too expensive and error-prone.
Comment 5 Ryan Cox 2016-02-22 04:28:46 MST
Upgrade complete, feature enabled, and jobs have started.  Thanks.
Comment 6 Tim Wickberg 2016-02-22 04:47:36 MST
(In reply to Ryan Cox from comment #5)
> Upgrade complete, feature enabled, and jobs have started.  Thanks.

Glad that fixed it for you. Marking as a duplicate of 2388 where this first came up.

*** This ticket has been marked as a duplicate of ticket 2388 ***