Hello,

We are running an older version, Slurm 17.11.7, and will be upgrading soon, but I would still like to know whether this is an issue with our config or something else. I can find no reason why a specific set of jobs is not starting. The group(s) have "used a lot", so their RawUsage is high compared to their FairShare target; still, the jobs do seem to be evaluated according to "LastSchedEval" (from scontrol -ddd show job #), and the resources they are asking for are idle.

All pending jobs, sorted by priority, asking for t4 GPUs:

# squeue -t pd -S -p | grep -i 'gpu:t4'
JOBID USER ACCOUNT PARTITION NAME ST START_TIME TIME_LEFT NODES CPUS GRES MIN_MEM NODELIST (REASON)
33134917 gp104 def-josedolz_gpu gpubase_bygpu_b1,gpubackfill segment-job.sh PD N/A 2:00:00 1 6 gpu:t4:1 16G (JobArrayTaskLimit)
33135227 gp104 def-josedolz_gpu gpubase_bygpu_b1,gpubackfill segment-job.sh PD N/A 2:00:00 1 6 gpu:t4:1 16G (JobArrayTaskLimit)
33147701 adash def-branzana_gpu gpubase_bygpu_b2,gpubackfill 1-unet_code_te PD N/A 5:00:00 1 3 gpu:t4:1 62.50G (Resources)
33147704 adash def-branzana_gpu gpubase_bygpu_b2,gpubackfill 2-unet_code_te PD N/A 5:00:00 1 3 gpu:t4:1 62.50G (Resources)
33148179 jrajaman def-erangauk-ab_gpu gpubase_bygpu_b3,gpubackfill test_run PD N/A 15:00:00 1 16 gpu:t4:1 25000M (Resources)
33148227 kfountou def-kfountou_gpu gpubase_bygpu_b5 job.sh PD N/A 6-06:00:00 1 1 gpu:t4:1 16000M (Resources)
33113184 the def-dylan_gpu gpubase_bygpu_b4 1.5x1.5_flippe PD N/A 1-12:00:00 1 16 gpu:t4:1 14G (Resources)
33115233 the def-dylan_gpu gpubase_bygpu_b4 1.5x1.5_flippe PD N/A 1-12:00:00 1 16 gpu:t4:1 14G (Resources)
33115283 the def-dylan_gpu gpubase_bygpu_b4 gpujob_origina PD N/A 1-12:00:00 1 16 gpu:t4:1 14G (Resources)
33061403 tristan4 def-fouchers_gpu gpubase_bynode_b3,gpubackfill cv3_0_gr.sh PD N/A 1-00:00:00 1 2 gpu:t4:4 178G (Resources)
33094266 mitch12 def-nghugre_gpu gpubase_bynode_b2,gpubackfill efb7-1-224 PD N/A 12:00:00 1 12 gpu:t4:4 64G (Resources)
33094269 mitch12 def-nghugre_gpu gpubase_bynode_b2,gpubackfill efb7-2-224 PD N/A 12:00:00 1 12 gpu:t4:4 64G (Resources)
33094273 mitch12 def-nghugre_gpu gpubase_bynode_b2,gpubackfill efb7-5-224 PD N/A 12:00:00 1 12 gpu:t4:4 64G (Resources)
33094276 mitch12 def-nghugre_gpu gpubase_bynode_b2,gpubackfill efb7-10-224 PD N/A 12:00:00 1 12 gpu:t4:4 64G (Resources)
33094235 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb4-1-224 PD N/A 12:00:00 1 9 gpu:t4:3 48G (Resources)
33094236 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb4-2-224 PD N/A 12:00:00 1 9 gpu:t4:3 48G (Resources)
33094239 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb4-5-224 PD N/A 12:00:00 1 9 gpu:t4:3 48G (Resources)
33094240 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb4-10-224 PD N/A 12:00:00 1 9 gpu:t4:3 48G (Resources)
33094189 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-1-112 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094207 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-2-112 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094214 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-5-112 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094221 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-10-112 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094224 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-1-224 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094226 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-2-224 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094228 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-5-224 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33094230 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-10-224 PD N/A 12:00:00 1 6 gpu:t4:2 32G (Resources)
33141351 mitch12 def-nghugre_gpu gpubase_bygpu_b2,gpubackfill efb0-1-112-SGD PD N/A 12:00:00 1 3 gpu:t4:1 16G (Resources)
33102541 andymtng def-jtus_gpu gpubase_bygpu_b3,gpubackfill mec12_mec7 PD N/A 1-00:00:00 1 16 gpu:t4:1 30000M (Resources)
33079554 m7adnan def-tizhoosh_gpu gpubase_bynode_b4 test.sh PD N/A 2-00:00:00 1 3 gpu:t4:4 24G (ReqNodeNotAvail, UnavailableNodes:gra[1-795,801-826,828-880,932-937,944-957,972-977,984-1153,1160-1189,1191-1261])
32406434 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 1-10:00:00 1 16 gpu:t4:1 2G (Resources)
32648262 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648263 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648264 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648265 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648266 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648267 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648268 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648269 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648271 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648272 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648273 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648274 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648275 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32648276 srnaik def-aganesan_gpu gpubase_bygpu_b4 Graham_ambersc PD N/A 2-00:00:00 1 16 gpu:t4:1 2G (Resources)
32408846 srnaik def-aganesan_gpu gpubase_bygpu_b4 script_2C10ent PD N/A 2-00:00:00 1 4 gpu:t4:1 2G (Resources)
32408864 srnaik def-aganesan_gpu gpubase_bygpu_b4 script_210Entr PD N/A 2-00:00:00 1 4 gpu:t4:1 2G (Resources)
33094029 xuleon1 def-hroest_gpu gpubase_bynode_b6 run_experiment PD N/A 14-00:00:00 1 3 gpu:t4:1 178G (Resources)
33094030 xuleon1 def-hroest_gpu gpubase_bynode_b6 run_experiment PD N/A 14-00:00:00 1 3 gpu:t4:1 178G (Resources)
33094031 xuleon1 def-hroest_gpu gpubase_bynode_b6 run_experiment PD N/A 14-00:00:00 1 3 gpu:t4:1 178G (Resources)
33094032 xuleon1 def-hroest_gpu gpubase_bynode_b6 run_experiment PD N/A 14-00:00:00 1 3 gpu:t4:1 178G (Resources)
33094033 xuleon1 def-hroest_gpu gpubase_bynode_b6 run_experiment PD N/A 14-00:00:00 1 3 gpu:t4:1 178G (Resources)
33094034 xuleon1 def-hroest_gpu gpubase_bynode_b6 run_experiment PD N/A 14-00:00:00 1 3 gpu:t4:1 178G (Resources)

The idle resources in question:

# sinfo --node gra[1160-1189] | grep -i idle
gpubase_bynode_b1 up 1-infini 3:00:00 44 2:22:1 21 idle gra[1164-1167,1173-1189] none
gpubase_bynode_b2 up 1-infini 12:00:00 44 2:22:1 21 idle gra[1164-1167,1173-1189] none
gpubase_bynode_b3 up 1-infini 1-00:00:00 44 2:22:1 21 idle gra[1164-1167,1173-1189] none
gpubase_bynode_b4 up 1-infini 3-00:00:00 44 2:22:1 18 idle gra[1164-1167,1173-1186] none
gpubase_bynode_b5 up 1-infini 7-00:00:00 44 2:22:1 12 idle gra[1164-1167,1173-1180] none
gpubase_bynode_b6 up 1-infini 28-00:00:0 44 2:22:1 8 idle gra[1164-1167,1173-1176] none
gpubase_bygpu_b1 up 1-infini 3:00:00 44 2:22:1 21 idle gra[1164-1167,1173-1189] none
gpubase_bygpu_b2 up 1-infini 12:00:00 44 2:22:1 21 idle gra[1164-1167,1173-1189] none
gpubase_bygpu_b3 up 1-infini 1-00:00:00 44 2:22:1 21 idle gra[1164-1167,1173-1189] none
gpubase_bygpu_b4 up 1-infini 3-00:00:00 44 2:22:1 12 idle gra[1164-1167,1173-1180] none
gpubase_bygpu_b5 up 1-infini 7-00:00:00 44 2:22:1 10 idle gra[1164-1167,1173-1178] none
gpubase_bygpu_b6 up 1-infini 28-00:00:0 44 2:22:1 7 idle gra[1164-1167,1173-1175] none

Maybe I'm missing something obvious? If I boost the priority of these jobs, they do start. I'll attach our config and some job outputs.

thanks
-k
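PS: By "boost the priority" I mean something along these lines (value and job ID illustrative, taken from the list above):

# scontrol update jobid=33115233 priority=1000000

After that, the job starts.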
Created attachment 14729 [details] slurm.conf
Created attachment 14730 [details] gres.conf
Created attachment 14731 [details] Waiting jobs for t4 GPUs
I don't think I have a kaizaad@sharcnet.ca account on this bug site, so I logged in and opened the ticket with my computecanada.ca e-mail. The site should be "SHARCNET", and we do have a valid support contract.

thanks
-k
I'll take a look at what you've supplied.
Would you provide your slurmctld.log covering the jobs you list? It looks like you have the info level (the default) set on slurmctld, but we might need more detail at debug level or higher. Let's just start with what you have for now.
Hi Chad,

There is nothing in there for those jobs except when they were accepted and submitted, and by now all of those jobs have either been cancelled or completed. I have asked a user to submit more jobs to see if the issue recurs. What/how should I set the slurmctld log level to temporarily increase the logging?

thanks
-k
Let's have you try "debug" level with backfill added to DebugFlags. Note that this may slow down scheduling if you have lots of jobs, so you may want to capture only a few scheduling iterations, or maybe enable it for just 10-15 minutes.

Also, it looks like you have the default (1 day) bf_window implicitly set. The docs (https://slurm.schedmd.com/slurm.conf.html) advise:

> A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation.

How long is your longest job?
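For reference, one way to toggle this temporarily without editing slurm.conf and restarting (a sketch; revert when you're done capturing):

# scontrol setdebug debug
# scontrol setdebugflags +backfill
  (let a few backfill cycles run, say 10-15 minutes)
# scontrol setdebugflags -backfill
# scontrol setdebug info

The equivalent persistent settings would be SlurmctldDebug=debug and DebugFlags=Backfill in slurm.conf.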
(In reply to Chad Vizino from comment #10)
> Let's have you try "debug" level with backfill added to DebugFlags. Note
> that this may slow down scheduling if you have lots of jobs, so you may want
> to capture only a few scheduling iterations, or maybe enable it for just
> 10-15 minutes.

Sounds good. I'll do that the next time I notice the issue.

> Also, it looks like you have the default (1 day) bf_window implicitly set.
> The docs (https://slurm.schedmd.com/slurm.conf.html) advise:
>
> > A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation.
>
> How long is your longest job?

Unfortunately it is 28 days. Most of the jobs fit in our 7-day max walltime window. When I looked at this last week, it did seem like the pending jobs in the _b6 (28-day) partition asking for the t4 GPU were causing an issue, but they are all gone now.

thanks
-k
Hello Chad,

I will attach a current .tgz of pending jobs and their details, along with a slurmctld.log at debug log level. I spot-checked some of the pending jobs in the logs, and the entries I found didn't say much.

thanks
-k
Created attachment 14769 [details] Waiting jobs for t4 GPUs
Thanks for supplying this. From your slurmctld.log I see there are over 9500 jobs to be considered by backfill:

> [2020-06-24T14:22:29.665] backfill: beginning
> [2020-06-24T14:22:29.712] debug: backfill: 9504 jobs to backfill

Locks are being yielded approximately every 2 seconds, which is expected since you have the default bf_yield_interval (2 sec):

> [2020-06-24T14:22:31.667] backfill: yielding locks after testing 965(965) jobs tested, 409 time slots, usec=2002179
> ...
> [2020-06-24T14:22:34.180] backfill: yielding locks after testing 1053(89) jobs tested, 550 time slots, usec=2013256
> ...

I would try setting your bf_window to something in the 10000-40000 minute (roughly 6-30 days) range, which is what some of our larger-system customers use. Again, it is advisable to set bf_window to at least as long as the highest allowed time limit to prevent starvation.
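As a sketch: bf_window is expressed in minutes, so covering your 28-day maximum would mean a slurm.conf entry along these lines (28 days x 1440 min/day = 40320; keep whatever other SchedulerParameters you already have in place of the "..."):

SchedulerParameters=...,bf_window=40320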
(In reply to Chad Vizino from comment #14)
> Thanks for supplying this. From your slurmctld.log I see there are over 9500
> jobs to be considered by backfill:
>
> > [2020-06-24T14:22:29.665] backfill: beginning
> > [2020-06-24T14:22:29.712] debug: backfill: 9504 jobs to backfill
>
> Locks are being yielded approximately every 2 seconds, which is expected
> since you have the default bf_yield_interval (2 sec):
>
> > [2020-06-24T14:22:31.667] backfill: yielding locks after testing 965(965) jobs tested, 409 time slots, usec=2002179
> > ...
> > [2020-06-24T14:22:34.180] backfill: yielding locks after testing 1053(89) jobs tested, 550 time slots, usec=2013256
> > ...
>
> I would try setting your bf_window to something in the 10000-40000 minute
> (roughly 6-30 days) range, which is what some of our larger-system customers
> use. Again, it is advisable to set bf_window to at least as long as the
> highest allowed time limit to prevent starvation.

Hi,

We do have "bf_continue" set, so shouldn't it still evaluate the jobs? And I do see a relatively current "LastSchedEval" (from scontrol -ddd show job #) for these jobs. I will look at setting "bf_window" as you suggest. Though shouldn't these jobs also be evaluated by the main scheduling loop? We have a "default_queue_depth" of 20k.

thanks
-k
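For what it's worth, this is the kind of recency check I'm doing (job ID illustrative, taken from the earlier list):

# scontrol -ddd show job 33148227 | grep -i LastSchedEval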
Would you provide output from "sdiag"? I'd like to see some of your scheduling stats. Thanks.
Created attachment 14800 [details] sdiag output
Thanks. I'm looking it over and will get back to you with some observations.
A couple of things we've noticed from the sdiag output:

> REQUEST_PARTITION_INFO ( 2009) count:2810325 ave_time:1736 total_time:4879675895
> REQUEST_NODE_INFO_SINGLE ( 2040) count:2569466 ave_time:233466 total_time:599884804831

This is about 100 RPCs/sec, which seems high.

Also, there is a big difference between "Depth Mean" (mean count of jobs processed during all backfill scheduling cycles since the last reset) and "Depth Mean (try depth)" (the subset of Depth Mean that the backfill scheduler actually attempted to schedule):

> Depth Mean: 2127
> Depth Mean (try depth): 452

This may be due to bf_max_job_user=10.

One last thing: you have bf_interval=180, and bf_max_time defaults to the value of bf_interval. This may not be enough time, so you may want to consider increasing bf_max_time.
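Putting those pieces together, the kind of SchedulerParameters line being discussed would look something like this (a sketch with illustrative values only, not tuned recommendations; bf_max_time=600 is just an example of something larger than your current bf_interval, and you should keep any other options you already have):

SchedulerParameters=bf_continue,bf_interval=180,bf_max_time=600,bf_window=40320,bf_max_job_user=10,default_queue_depth=20000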
You mentioned that you are planning to upgrade soon, and we strongly encourage you to do so, since these issues with 17.11.7 may be fixed in a later release. I will close this ticket for now, but feel free to open a new one after you upgrade if you continue to experience these issues, and we'll look into them.

-Chad