Summary: | Jobs Not Starting | |
---|---|---|---
Product: | Slurm | Reporter: | Kaizaad <kaizaad> |
Component: | Scheduling | Assignee: | Chad Vizino <chad> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | |
Priority: | --- | CC: | nathan.wielenga
Version: | - Unsupported Older Versions | |
Hardware: | Linux | |
OS: | Linux | |
Site: | SharcNet | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | ---
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | | CLE Version: |
Version Fixed: | | Target Release: | ---
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments:
- slurm.conf
- gres.conf
- Waiting jobs for t4 GPUs
- Waiting jobs for t4 GPUs
- sdiag output
Description
Kaizaad
2020-06-18 13:27:44 MDT
Created attachment 14729 [details]
slurm.conf
Created attachment 14730 [details]
gres.conf
Created attachment 14731 [details]
Waiting jobs for t4 GPUs
I don't think I have a kaizaad@sharcnet.ca account on this bug site, so I logged in and opened the ticket with my computecanada.ca e-mail. The site should be "SHARCNET" and we do have a valid support contract.

thanks
-k

Chad Vizino
I'll take a look at what you've supplied. Would you provide your slurmctld.log covering the jobs you list? It looks like you have info (the default) level set on slurmctld, but we might need a more detailed version with debug level or higher. Let's just start with what you have for now.

Kaizaad
Hi Chad,

There is nothing in there for those jobs except when they were accepted and submitted, and now all those jobs have either been cancelled or completed. I have asked a user to submit more jobs to see if there is an issue. What/how should I set the slurmctld log level to temporarily increase the logging?

thanks
-k

Chad Vizino (comment #10)
Let's have you try "debug" level with backfill added to DebugFlags. Note that this may slow down scheduling if you have lots of jobs, so you may want to capture only a few scheduling iterations, or maybe enable it for just 10-15 minutes.

Also, it looks like you have the default (1 day) bf_window implicitly set. The docs (https://slurm.schedmd.com/slurm.conf.html) advise:

>A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation.

How long is your longest job?

Kaizaad
(In reply to Chad Vizino from comment #10)
> Let's have you try "debug" level with backfill added to DebugFlags. Note
> that this may slow down scheduling if you have lots of jobs so you may want
> to capture only a few scheduling iterations or maybe enable for just 10-15
> minutes.

Sounds good. I'll do that the next time I notice the issue.

> Also, it looks like you have a default (1 day) bf_window implicitly set. The
> docs (https://slurm.schedmd.com/slurm.conf.html) advise:
>
> >A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation.
>
> How long is your longest job?

Unfortunately it is 28 days. Most of the jobs fit in our 7-day max walltime window. When looking at this last week, it did seem like the pending jobs in the _b6 (28-day) partition asking for the "t4" GPU were causing an issue, but they are all gone now.

thanks
-k

Kaizaad
Hello Chad,

I will attach a current .tgz of pending jobs and their details along with a slurmctld.log that has debug log level. I spot checked some of the pending jobs in the logs and the ones I found didn't say much.

thanks
-k

Created attachment 14769 [details]
Waiting jobs for t4 GPUs
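For reference, the logging change discussed above can also be applied on the fly with scontrol rather than by editing slurm.conf and restarting the controller. A minimal sketch, assuming the setdebug/setdebugflags subcommands are available in the installed Slurm release and that the settings are reverted once a few scheduling iterations have been captured (debug level plus the Backfill flag is verbose):

# temporarily raise slurmctld verbosity and enable backfill debug output
scontrol setdebug debug
scontrol setdebugflags +backfill

# ...reproduce the problem for 10-15 minutes, then revert...
scontrol setdebug info
scontrol setdebugflags -backfill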
Chad Vizino (comment #14)
Thanks for supplying this. From your slurmctld.log I see there are over 9500 jobs to be considered by backfill:

>[2020-06-24T14:22:29.665] backfill: beginning
>[2020-06-24T14:22:29.712] debug: backfill: 9504 jobs to backfill

Locks are being yielded every approx. 2 seconds, which is expected since you have the default bf_yield_interval (2 sec):

>[2020-06-24T14:22:31.667] backfill: yielding locks after testing 965(965) jobs tested, 409 time slots, usec=2002179
>...
>[2020-06-24T14:22:34.180] backfill: yielding locks after testing 1053(89) jobs tested, 550 time slots, usec=2013256
>...

I would try setting your bf_window to something in the 10000-40000 minute (roughly 6-30 days) range, which is what some of our larger system customers use. Again, it is advisable to set bf_window to at least as long as the highest allowed time limit to prevent starvation.

Kaizaad
(In reply to Chad Vizino from comment #14)
> Thanks for supplying this. From your slurmctld.log I see there are over 9500
> jobs to be considered by backfill:
>
> >[2020-06-24T14:22:29.665] backfill: beginning
> >[2020-06-24T14:22:29.712] debug: backfill: 9504 jobs to backfill
>
> Locks are being yielded every approx. 2 seconds which is expected since you
> have the default bf_yield_interval (2 sec):
>
> >[2020-06-24T14:22:31.667] backfill: yielding locks after testing 965(965) jobs tested, 409 time slots, usec=2002179
> >...
> >[2020-06-24T14:22:34.180] backfill: yielding locks after testing 1053(89) jobs tested, 550 time slots, usec=2013256
> >...
>
> I would try setting your bf_window to something in the 10000-40000 minute
> (roughly 6-30 days) range which is what some of our larger system customers
> use. Again, it is advisable to set bf_window to at least as long as the
> highest allowed time limit to prevent starvation.

Hi,

We do have "bf_continue", so shouldn't it still evaluate the jobs? And I do see "LastSchedEval" (scontrol -ddd show job #) for these jobs being relatively current. I will look at setting the "bf_window" as you suggest. Though shouldn't these jobs be evaluated by the main scheduling loop? We have a "default_queue_depth" of 20k.

thanks
-k

Chad Vizino
Would you provide output from "sdiag"? I'd like to see some of your scheduling stats. Thanks.

Created attachment 14800 [details]
sdiag output
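To put the bf_window recommendation in concrete terms: the longest allowed job here is 28 days, i.e. 28 x 24 x 60 = 40320 minutes, so a window of at least 40320 would cover the longest time limit. A hypothetical slurm.conf fragment is sketched below; bf_continue, bf_interval=180 and the 20k default_queue_depth are taken from this ticket, while bf_window=40320 is only an example value and any other SchedulerParameters already in use at the site would need to be kept:

# bf_window is in minutes; 40320 = 28 days, matching the longest allowed time limit
SchedulerParameters=bf_continue,bf_window=40320,bf_interval=180,default_queue_depth=20000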
Chad Vizino
Thanks. I am looking it over and will get back to you with some observations.

Chad Vizino
A couple of things we've noticed from the sdiag output:

>REQUEST_PARTITION_INFO ( 2009) count:2810325 ave_time:1736 total_time:4879675895
>REQUEST_NODE_INFO_SINGLE ( 2040) count:2569466 ave_time:233466 total_time:599884804831

This is about 100 RPCs/sec, which seems high.

Also, there is a big difference between "Depth Mean" (mean count of jobs processed during all backfilling scheduling cycles since last reset) and "Depth Mean (try depth)" (the subset of Depth Mean that the backfill scheduler actually attempted to schedule):

>Depth Mean: 2127
>Depth Mean (try depth): 452

This may be due to bf_max_job_user=10.

One last thing: you have bf_interval=180, and bf_max_time defaults to the value of bf_interval. This may not be enough time, and you may want to consider increasing bf_max_time.

I think you mentioned that you are planning to upgrade soon, and we strongly encourage you to do so since these issues with 17.11.7 may be fixed in a later release. I will close this ticket for now, but feel free to submit a new one once you upgrade if you continue to experience these issues and we'll look into them.

-Chad
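A sketch of how the bf_max_time suggestion could look alongside the earlier bf_window advice. The value bf_max_time=600 is purely illustrative (the ticket only notes that the default, equal to bf_interval=180 seconds, may be too small), bf_max_job_user=10 is shown unchanged as reported, and bf_max_time should be confirmed as supported by the installed release before use:

# illustrative SchedulerParameters line combining the settings discussed in this ticket
SchedulerParameters=bf_continue,bf_window=40320,bf_interval=180,bf_max_time=600,bf_max_job_user=10,default_queue_depth=20000

# apply the updated slurm.conf without restarting slurmctld
scontrol reconfigure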