Ticket 3871 - Preemption behaving improperly when using Partition QOS
Summary: Preemption behaving improperly when using Partition QOS
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.02.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-06-06 12:50 MDT by Stephen Fralich
Modified: 2017-07-06 04:38 MDT
1 user

See Also:
Site: University of Washington
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm configuration (6.72 KB, text/plain)
2017-06-06 12:51 MDT, Stephen Fralich
scontrol show job output (487.90 KB, text/plain)
2017-06-06 12:51 MDT, Stephen Fralich
sacctmgr show qos output (6.20 KB, text/plain)
2017-06-06 12:52 MDT, Stephen Fralich
Debug slurmctld log (69.93 KB, text/plain)
2017-06-06 12:52 MDT, Stephen Fralich

Description Stephen Fralich 2017-06-06 12:50:17 MDT
Despite there being many nodes free on our cluster, Slurm is not starting preemptable jobs in our excess capacity, and in some cases it is incorrectly preempting jobs for a preemptor (job 21687) that cannot run because it is blocked by a partition QOS GrpCpuLimit.

I'll attach the usual requests for information.

[2017-06-06T11:13:47.815] backfill: Started JobId=21735 in ckpt on n2038
[2017-06-06T11:13:47.816] backfill: Started JobId=21736 in ckpt on n2039
[2017-06-06T11:13:47.817] backfill: Started JobId=21737 in ckpt on n2040
[2017-06-06T11:13:47.818] backfill: Started JobId=21738 in ckpt on n2041
[2017-06-06T11:14:17.835] preempted job 21735 has been requeued to reclaim resources for job 21687
[2017-06-06T11:14:17.839] backfill: Started JobId=21739 in ckpt on n2021
[2017-06-06T11:14:17.840] backfill: Started JobId=21740 in ckpt on n2022
[2017-06-06T11:14:19.894] Requeuing JobID=21735 State=0x0 NodeCnt=0
[2017-06-06T11:14:47.853] preempted job 21736 has been requeued to reclaim resources for job 21687
[2017-06-06T11:14:47.853] preempted job 21737 has been requeued to reclaim resources for job 21687
[2017-06-06T11:14:47.857] backfill: Started JobId=21741 in ckpt on n2128
[2017-06-06T11:14:49.903] Requeuing JobID=21736 State=0x0 NodeCnt=0
[2017-06-06T11:14:49.904] Requeuing JobID=21737 State=0x0 NodeCnt=0
[2017-06-06T11:15:17.871] preempted job 21738 has been requeued to reclaim resources for job 21687
[2017-06-06T11:15:19.926] Requeuing JobID=21738 State=0x0 NodeCnt=0


             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             21736      ckpt test-sle     sjf4 PD       0:00      1 (Priority)
             21737      ckpt test-sle     sjf4 PD       0:00      1 (Priority)
             21738      ckpt test-sle     sjf4 PD       0:00      1 (Priority)
             21742      ckpt test-sle     sjf4 PD       0:00      1 (Priority)
             21743      ckpt test-sle     sjf4 PD       0:00      1 (Priority)
             21744      ckpt test-sle     sjf4 PD       0:00      1 (Priority)
             21735      ckpt test-sle     sjf4 PD       0:00      1 (Resources)
             21741      ckpt test-sle     sjf4  R       5:04      1 n2128
             21739      ckpt test-sle     sjf4  R       5:34      1 n2021
             21740      ckpt test-sle     sjf4  R       5:34      1 n2022
             21734      ckpt test-sle     sjf4  R      35:36      1 n2188
             21728      ckpt test-sle     sjf4  R      37:36      1 n2185
             21729      ckpt test-sle     sjf4  R      37:36      1 n2186
             21730      ckpt test-sle     sjf4  R      37:36      1 n2187
             21731      ckpt test-sle     sjf4  R      37:36      1 n2189
             21732      ckpt test-sle     sjf4  R      37:36      1 n2190
             21733      ckpt test-sle     sjf4  R      37:36      1 n2191
             21727      ckpt test-sle     sjf4  R      38:36      1 n2194
             21247      ckpt RT_YNVC1 apetrone  R   17:51:21      1 n2204
             21246      ckpt RTzPure_ apetrone  R   17:56:07      1 n2203
             21248      ckpt RT_XNVC1 apetrone  R   17:48:34      1 n2205
             21242      ckpt Opt_NeTr apetrone  R   18:01:43      1 n2193

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ckpt         up   infinite     10 drain* n[2011,2023,2046,2073,2090,2101,2112,2122,2140,2192]
ckpt         up   infinite      3  drain n[2174,2177,2183]
ckpt         up   infinite    158  alloc n[2001,2004-2005,2008,2010,2012-2014,2021-2022,2024-2034,2040-2045,2047-2059,2063-2072,2074-2081,2091-2100,2102-2111,2113-2115,2118-2121,2128-2139,2141-2173,2175-2176,2184-2191,2193-2210]
ckpt         up   infinite     39   idle n[2002-2003,2006-2007,2009,2015-2020,2035-2039,2060-2062,2082-2089,2116-2117,2123-2127,2178-2182]
Comment 1 Stephen Fralich 2017-06-06 12:51:25 MDT
Created attachment 4703 [details]
Slurm configuration
Comment 2 Stephen Fralich 2017-06-06 12:51:45 MDT
Created attachment 4704 [details]
scontrol show job output
Comment 3 Stephen Fralich 2017-06-06 12:52:07 MDT
Created attachment 4705 [details]
sacctmgr show qos output
Comment 4 Stephen Fralich 2017-06-06 12:52:28 MDT
Created attachment 4706 [details]
Debug slurmctld log
Comment 5 Stephen Fralich 2017-06-06 12:53:20 MDT
I guess I should mention we're running a patched version of the slurm-17.02 branch from bug 3859.
Comment 6 Tim Wickberg 2017-06-06 13:21:47 MDT
I just added the 17.02.4 version to the tracker and am switching this ticket to it - what you're running should be almost identical to that maintenance release, which is due out shortly.

To unpack your concerns a bit:

"Despite there being many nodes free on our cluster, Slurm is not starting preemptable jobs in our excess capacity"

This is by design. If there are _any_ jobs pending (regardless of the reason they are still pending) in a partition with a higher Priority, no jobs from a lower-Priority partition will be launched on the nodes the two partitions share.

This is not likely to change soon.
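
The blocking behavior described above can be sketched as a slurm.conf fragment. This is a hypothetical illustration, not the site's actual configuration: the partition names, node list, and tier values are invented, and it assumes partition-priority-based preemption (PreemptType=preempt/partition_prio) rather than the QOS-based setup the reporter uses.

```
# Hypothetical sketch: two partitions sharing the same nodes.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# A pending job in the higher-tier "batch" partition blocks the
# lower-tier "ckpt" partition from starting jobs on the shared nodes,
# even if the pending job itself cannot run yet.
PartitionName=batch Nodes=n[2001-2210] PriorityTier=2 Default=YES
PartitionName=ckpt  Nodes=n[2001-2210] PriorityTier=1 PreemptMode=REQUEUE
```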

" and in some cases is incorrectly preempting jobs for preemptors (21687) that can't run because they are blocked by a partition qos GrpCpuLimit."

This is a bit trickier - the Partition QOS limits (and other limits) are meant as last-chance restrictions, and are only enforced after nodes for a job have become available.
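
The kind of Partition QOS setup involved here can be sketched roughly as follows. This is a hypothetical example, not the reporter's configuration: the QOS name and CPU limit are invented.

```
# Create a QOS with a group-wide CPU cap; jobs held by this limit show
# the QOSGrpCpuLimit pending reason in squeue:
sacctmgr add qos ckpt_qos
sacctmgr modify qos ckpt_qos set GrpTRES=cpu=512

# Attach it as the Partition QOS in slurm.conf:
PartitionName=ckpt Nodes=... QOS=ckpt_qos PreemptMode=REQUEUE
```

Because such limits are checked last, the scheduler can select and preempt victim jobs for a preemptor before discovering that the preemptor is itself blocked by the GrpTRES cap.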

I'm sending this ticket over to Dominik for his input; I know he's been working on some related issues in the backfill scheduler and can provide current advice on that.

- Tim
Comment 7 Dominik Bartkiewicz 2017-06-09 08:26:25 MDT
I agree with Tim about the first issue.

I have some idea of what happened in the second case, but to be sure I need to see the full history of job 21687 (or another job with this problem).

Does this problem still occur after applying the patch from bug 3859?

Dominik
Comment 8 Dominik Bartkiewicz 2017-06-20 06:52:21 MDT
Hi

I tried to recreate this without success.
Can you confirm whether this still occurs after applying the patch from bug 3859?

Dominik
Comment 9 Stephen Fralich 2017-06-26 16:47:04 MDT
I was out on vacation.

Yes, this happens with the patch for bug 3859.

Please explain in more detail what you mean by, "I have some idea what happened in second problem but to be sure I need to see full history of 21687 or other job with this problem."
Comment 10 Dominik Bartkiewicz 2017-06-27 08:05:16 MDT
Hi

I need you to send me the slurmctld log containing the history of job 21687 from submission to allocation.

Is there any chance that some of these jobs were submitted before applying patch from bug 3859?

Dominik
Comment 11 Stephen Fralich 2017-06-27 11:21:05 MDT
Yes, they were most certainly submitted before I installed the patch.

You can go ahead and close this since we're not going forward with this scheduler configuration anyway.