Despite there being many nodes free on our cluster, Slurm is not starting preemptable jobs in our excess capacity, and in some cases it is incorrectly preempting jobs for preemptors (21687) that can't run because they are blocked by a partition QOS GrpCpuLimit. I'll attach the usual requested information.

[2017-06-06T11:13:47.815] backfill: Started JobId=21735 in ckpt on n2038
[2017-06-06T11:13:47.816] backfill: Started JobId=21736 in ckpt on n2039
[2017-06-06T11:13:47.817] backfill: Started JobId=21737 in ckpt on n2040
[2017-06-06T11:13:47.818] backfill: Started JobId=21738 in ckpt on n2041
[2017-06-06T11:14:17.835] preempted job 21735 has been requeued to reclaim resources for job 21687
[2017-06-06T11:14:17.839] backfill: Started JobId=21739 in ckpt on n2021
[2017-06-06T11:14:17.840] backfill: Started JobId=21740 in ckpt on n2022
[2017-06-06T11:14:19.894] Requeuing JobID=21735 State=0x0 NodeCnt=0
[2017-06-06T11:14:47.853] preempted job 21736 has been requeued to reclaim resources for job 21687
[2017-06-06T11:14:47.853] preempted job 21737 has been requeued to reclaim resources for job 21687
[2017-06-06T11:14:47.857] backfill: Started JobId=21741 in ckpt on n2128
[2017-06-06T11:14:49.903] Requeuing JobID=21736 State=0x0 NodeCnt=0
[2017-06-06T11:14:49.904] Requeuing JobID=21737 State=0x0 NodeCnt=0
[2017-06-06T11:15:17.871] preempted job 21738 has been requeued to reclaim resources for job 21687
[2017-06-06T11:15:19.926] Requeuing JobID=21738 State=0x0 NodeCnt=0

 JOBID PARTITION     NAME     USER ST      TIME  NODES NODELIST(REASON)
 21736      ckpt test-sle     sjf4 PD      0:00      1 (Priority)
 21737      ckpt test-sle     sjf4 PD      0:00      1 (Priority)
 21738      ckpt test-sle     sjf4 PD      0:00      1 (Priority)
 21742      ckpt test-sle     sjf4 PD      0:00      1 (Priority)
 21743      ckpt test-sle     sjf4 PD      0:00      1 (Priority)
 21744      ckpt test-sle     sjf4 PD      0:00      1 (Priority)
 21735      ckpt test-sle     sjf4 PD      0:00      1 (Resources)
 21741      ckpt test-sle     sjf4  R      5:04      1 n2128
 21739      ckpt test-sle     sjf4  R      5:34      1 n2021
 21740      ckpt test-sle     sjf4  R      5:34      1 n2022
 21734      ckpt test-sle     sjf4  R     35:36      1 n2188
 21728      ckpt test-sle     sjf4  R     37:36      1 n2185
 21729      ckpt test-sle     sjf4  R     37:36      1 n2186
 21730      ckpt test-sle     sjf4  R     37:36      1 n2187
 21731      ckpt test-sle     sjf4  R     37:36      1 n2189
 21732      ckpt test-sle     sjf4  R     37:36      1 n2190
 21733      ckpt test-sle     sjf4  R     37:36      1 n2191
 21727      ckpt test-sle     sjf4  R     38:36      1 n2194
 21247      ckpt RT_YNVC1 apetrone  R  17:51:21      1 n2204
 21246      ckpt RTzPure_ apetrone  R  17:56:07      1 n2203
 21248      ckpt RT_XNVC1 apetrone  R  17:48:34      1 n2205
 21242      ckpt Opt_NeTr apetrone  R  18:01:43      1 n2193

PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
ckpt         up   infinite     10  drain*  n[2011,2023,2046,2073,2090,2101,2112,2122,2140,2192]
ckpt         up   infinite      3  drain   n[2174,2177,2183]
ckpt         up   infinite    158  alloc   n[2001,2004-2005,2008,2010,2012-2014,2021-2022,2024-2034,2040-2045,2047-2059,2063-2072,2074-2081,2091-2100,2102-2111,2113-2115,2118-2121,2128-2139,2141-2173,2175-2176,2184-2191,2193-2210]
ckpt         up   infinite     39  idle    n[2002-2003,2006-2007,2009,2015-2020,2035-2039,2060-2062,2082-2089,2116-2117,2123-2127,2178-2182]
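For what it's worth, here is roughly how I would double-check where 21687 is being held; this is just a sketch, since the exact Reason strings and QOS names depend on our configuration (attached):

scontrol show job 21687 | grep -i -e Reason -e QOS      # pending reason and the QOS the job runs under
sacctmgr show qos format=Name,GrpTRES,Priority,Preempt  # group TRES caps on each QOS
squeue -p ckpt --state=PD -o "%.8i %.10Q %.20r"         # job id, priority, pending reason for the ckpt backlog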
Created attachment 4703 [details] Slurm configuration
Created attachment 4704 [details] scontrol show job output
Created attachment 4705 [details] sacctmgr show qos output
Created attachment 4706 [details] Debug slurmctld log
I guess I should mention we're running a patched version of the slurm-17.02 branch from bug 3859.
I just added the 17.02.4 version and am switching this ticket over to it; what you're running should be almost identical to that maintenance release, which is due out shortly.

To unpack your concerns a bit:

"Despite there being many nodes free on our cluster, Slurm is not starting preemptable jobs in our excess capacity"

This is by design. If there are _any_ jobs pending (regardless of the reason they are still pending) in a partition with a higher Priority, no jobs from a lower-Priority partition will be launched on nodes the two partitions share. This is not likely to change soon.

"and in some cases is incorrectly preempting jobs for preemptors (21687) that can't run because they are blocked by a partition qos GrpCpuLimit."

This is a bit trickier: the PartitionQOS limits (and other limits) are meant as last-chance restrictions, and are only enforced after nodes for a job have become available.

I'm sending this ticket over to Dominik for his input; I know he's been working on some related issues in the backfill scheduler and can provide some current advice on that.

- Tim
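To make the shape of this concrete, here is a minimal sketch of the kind of setup involved. The partition names, node range, and CPU cap below are illustrative stand-ins rather than your actual values (those are in the attached slurm.conf), and PriorityTier is used here as the partition priority knob:

# slurm.conf (illustrative)
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=batch Nodes=n[2001-2210] PriorityTier=2 QOS=part_batch
PartitionName=ckpt  Nodes=n[2001-2210] PriorityTier=1 PreemptMode=REQUEUE

# Partition QOS with a group CPU cap, created via sacctmgr (illustrative)
sacctmgr add qos part_batch
sacctmgr modify qos part_batch set GrpTRES=cpu=4096

In a layout like this, any pending job in the higher-tier partition keeps lower-tier jobs off the nodes the two partitions share, and the GrpTRES=cpu cap on the partition QOS is only checked once nodes have been selected for the higher-tier job, so a preemption can fire even though the limit will ultimately block the preemptor.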
I agree with Tim about the first issue. I have some idea of what happened in the second case, but to be sure I need to see the full history of job 21687 (or of another job with this problem). Does this problem still occur after applying the patch from bug 3859?

Dominik
Hi, I tried to recreate this without success. Can you confirm whether this still occurs after applying the patch from bug 3859?

Dominik
I was out on vacation. Yes, this happens with the patch for bug 3859. Please explain in more detail what you mean by, "I have some idea of what happened in the second case, but to be sure I need to see the full history of job 21687 (or of another job with this problem)."
Hi, I need you to send me the slurmctld log that contains the history of job 21687 from submission to allocation. Is there any chance that some of these jobs were submitted before the patch from bug 3859 was applied?

Dominik
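Something like this should be enough to pull that history together (the log path here is only an example; use whatever SlurmctldLogFile points to on your controller):

grep -h -e 'JobId=21687' -e 'JobID=21687' -e 'job 21687' /var/log/slurm/slurmctld.log*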
Yes, they were most certainly submitted before I installed the patch. You can go ahead and close this since we're not going forward with this scheduler configuration anyway.