Summary: | Jobs Not Starting | |
---|---|---|---
Product: | Slurm | Reporter: | Kaizaad <kaizaad> |
Component: | Scheduling | Assignee: | Chad Vizino <chad> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | |
Priority: | --- | CC: | nathan.wielenga
Version: | - Unsupported Older Versions | |
Hardware: | Linux | |
OS: | Linux | |
Site: | SharcNet | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | ---
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | | CLE Version: |
Version Fixed: | | Target Release: | ---
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments:
- slurm.conf
- gres.conf
- Waiting jobs for t4 GPUs
- Waiting jobs for t4 GPUs
- sdiag output
Description
Kaizaad
2020-06-18 13:27:44 MDT
Created attachment 14729 [details]
slurm.conf
Created attachment 14730 [details]
gres.conf
Created attachment 14731 [details]
Waiting jobs for t4 GPUs
I don't think I have a kaizaad@sharcnet.ca account on this bug site, so I logged in and opened the ticket with my computecanada.ca e-mail. The site should be "SHARCNET" and we do have a valid support contract.

thanks
-k

Chad Vizino
I'll take a look at what you've supplied. Would you provide your slurmctld.log covering the jobs you list? It looks like you have info (the default) level set on slurmctld, but we might need a more detailed version with debug level or higher. Let's just start with what you have for now.

Kaizaad
Hi Chad,

There is nothing in there for those jobs except when they were accepted and submitted, and now all those jobs have either been cancelled or completed. I have asked a user to submit more jobs to see if there is an issue. What/how should I set the slurmctld log level to temporarily increase the logging?

thanks
-k

Chad Vizino (comment #10)
Let's have you try "debug" level with backfill added to DebugFlags. Note that this may slow down scheduling if you have lots of jobs, so you may want to capture only a few scheduling iterations, or maybe enable it for just 10-15 minutes.

Also, it looks like you have the default (1 day) bf_window implicitly set. The docs (https://slurm.schedmd.com/slurm.conf.html) advise:

>A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation.

How long is your longest job?

Kaizaad
(In reply to Chad Vizino from comment #10)
> Let's have you try "debug" level with backfill added to DebugFlags. Note
> that this may slow down scheduling if you have lots of jobs so you may want
> to capture only a few scheduling iterations or maybe enable for just 10-15
> minutes.

Sounds good. I'll do that the next time I notice the issue.

> Also, it looks like you have a default (1 day) bf_window implicitly set. The
> docs (https://slurm.schedmd.com/slurm.conf.html) advise:
>
> >A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation.
>
> How long is your longest job?

Unfortunately it is 28 days. Most of the jobs fit in our 7-day max walltime window. When looking at this last week, it did seem like the pending jobs in the _b6 (28-day) partition asking for the "t4" GPU were causing an issue, but they are all gone now.

thanks
-k

Kaizaad
Hello Chad,

I will attach a current .tgz of pending jobs and their details along with a slurmctld.log that has debug log level. I spot checked some of the pending jobs in the logs and the ones I found didn't say much.

thanks
-k

Created attachment 14769 [details]
Waiting jobs for t4 GPUs
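For reference, the logging change discussed above can also be applied on the fly with scontrol rather than by editing slurm.conf and restarting the controller. A minimal sketch, assuming the setdebug/setdebugflags subcommands are available in the installed Slurm release and that the settings are reverted once a few scheduling iterations have been captured (debug level plus the Backfill flag is verbose):

# temporarily raise slurmctld verbosity and enable backfill debug output
scontrol setdebug debug
scontrol setdebugflags +backfill

# ...reproduce the problem for 10-15 minutes, then revert...
scontrol setdebug info
scontrol setdebugflags -backfill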
Chad Vizino (comment #14)
Thanks for supplying this. From your slurmctld.log I see there are over 9500 jobs to be considered by backfill:

>[2020-06-24T14:22:29.665] backfill: beginning
>[2020-06-24T14:22:29.712] debug: backfill: 9504 jobs to backfill

Locks are being yielded every approx. 2 seconds, which is expected since you have the default bf_yield_interval (2 sec):

>[2020-06-24T14:22:31.667] backfill: yielding locks after testing 965(965) jobs tested, 409 time slots, usec=2002179
>...
>[2020-06-24T14:22:34.180] backfill: yielding locks after testing 1053(89) jobs tested, 550 time slots, usec=2013256
>...

I would try setting your bf_window to something in the 10000-40000 minute (roughly 6-30 days) range, which is what some of our larger system customers use. Again, it is advisable to set bf_window to at least as long as the highest allowed time limit to prevent starvation.

Kaizaad
(In reply to Chad Vizino from comment #14)
> Thanks for supplying this. From your slurmctld.log I see there are over 9500
> jobs to be considered by backfill:
>
> >[2020-06-24T14:22:29.665] backfill: beginning
> >[2020-06-24T14:22:29.712] debug: backfill: 9504 jobs to backfill
>
> Locks are being yielded every approx. 2 seconds which is expected since you
> have the default bf_yield_interval (2 sec):
>
> >[2020-06-24T14:22:31.667] backfill: yielding locks after testing 965(965) jobs tested, 409 time slots, usec=2002179
> >...
> >[2020-06-24T14:22:34.180] backfill: yielding locks after testing 1053(89) jobs tested, 550 time slots, usec=2013256
> >...
>
> I would try setting your bf_window to something in the 10000-40000 minute
> (roughly 6-30 days) range which is what some of our larger system customers
> use. Again, it is advisable to set bf_window to at least as long as the
> highest allowed time limit to prevent starvation.

Hi,

We do have "bf_continue", so shouldn't it still evaluate the jobs? And I do see "LastSchedEval" (scontrol -ddd show job #) for these jobs being relatively current. I will look at setting the "bf_window" as you suggest. Though shouldn't these jobs be evaluated by the main scheduling loop? We have a "default_queue_depth" of 20k.

thanks
-k

Chad Vizino
Would you provide output from "sdiag"? I'd like to see some of your scheduling stats. Thanks.

Created attachment 14800 [details]
sdiag output
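To put the bf_window recommendation in concrete terms: the longest allowed job here is 28 days, i.e. 28 x 24 x 60 = 40320 minutes, so a window of at least 40320 would cover the longest time limit. A hypothetical slurm.conf fragment is sketched below; bf_continue, bf_interval=180 and the 20k default_queue_depth are taken from this ticket, while bf_window=40320 is only an example value and any other SchedulerParameters already in use at the site would need to be kept:

# bf_window is in minutes; 40320 = 28 days, matching the longest allowed time limit
SchedulerParameters=bf_continue,bf_window=40320,bf_interval=180,default_queue_depth=20000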
Chad Vizino
Thanks. I am looking it over and will get back to you with some observations.

Chad Vizino
A couple of things we've noticed from the sdiag output:

>REQUEST_PARTITION_INFO ( 2009) count:2810325 ave_time:1736 total_time:4879675895
>REQUEST_NODE_INFO_SINGLE ( 2040) count:2569466 ave_time:233466 total_time:599884804831

This is about 100 RPCs/sec, which seems high.

Also, there is a big difference between "Depth Mean" (mean count of jobs processed during all backfilling scheduling cycles since last reset) and "Depth Mean (try depth)" (the subset of Depth Mean that the backfill scheduler actually attempted to schedule):

>Depth Mean: 2127
>Depth Mean (try depth): 452

This may be due to bf_max_job_user=10.

One last thing: you have bf_interval=180, and bf_max_time defaults to the value of bf_interval. This may not be enough time, and you may want to consider increasing bf_max_time.

I think you mentioned that you are planning to upgrade soon, and we strongly encourage you to do so since these issues with 17.11.7 may be fixed in a later release. I will close this ticket for now, but feel free to submit a new one once you upgrade if you continue to experience these issues and we'll look into them.

-Chad
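A sketch of how the bf_max_time suggestion could look alongside the earlier bf_window advice. The value bf_max_time=600 is purely illustrative (the ticket only notes that the default, equal to bf_interval=180 seconds, may be too small), bf_max_job_user=10 is shown unchanged as reported, and bf_max_time should be confirmed as supported by the installed release before use:

# illustrative SchedulerParameters line combining the settings discussed in this ticket
SchedulerParameters=bf_continue,bf_window=40320,bf_interval=180,bf_max_time=600,bf_max_job_user=10,default_queue_depth=20000

# apply the updated slurm.conf without restarting slurmctld
scontrol reconfigure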