sacct state is not updating for many jobs that have dispatched, even after waiting some period of time. sacct still lists jobs as PENDING after the job has dispatched, or even completed.

# date; squeue -j 25845876; sacct -j 25845876
Tue Jun 25 12:21:21 EDT 2019
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  25845876   general myjob.sl dbcortes  R   1:08      1 c0828
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25845876       myjob.sl    general rc_akshau+          1    PENDING      0:0

# date; squeue -j 25845876; sleep 3; sacct -j 25845876
Tue Jun 25 12:21:38 EDT 2019
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  25845876   general myjob.sl dbcortes  R   1:24      1 c0828
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25845876       myjob.sl    general rc_akshau+          1    PENDING      0:0
After 15 minutes sacct has still not updated:

# date; sacct -j 25845876; squeue -j 25845876
Tue Jun 25 12:39:47 EDT 2019
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25845876       myjob.sl    general rc_akshau+          1    PENDING      0:0
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  25845876   general myjob.sl dbcortes  R  19:33      1 c0828
(I wrote this before your comment 1 update.)

Can you upload the output of sdiag? Specifically, we're looking for the value of "DBD Agent queue size". If this is growing and not decreasing, that is a sign of a bigger problem.

If the DBD Agent queue size is zero or close to zero, then check the output of sacct again for jobs that have started. It takes some non-zero amount of time for the information to get to the database. How long it takes varies widely depending on hardware setup, InnoDB configuration, Slurm configuration, how busy the system is, etc.

(Post comment 1)

This is a bit more concerning. Please upload the output of sdiag and upload your slurmdbd and slurmctld log files for the last few hours.
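For reference, a quick way to watch just that counter (a minimal sketch; the two-second interval is arbitrary):

# Print only the DBD agent queue depth from sdiag
sdiag | grep -i 'DBD Agent queue size'

# Re-check every 2 seconds to see whether it is growing or draining
watch -n 2 "sdiag | grep -i 'DBD Agent queue size'"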
Should I close the partitions?

I am also seeing jobs terminated for elapsed time that seemingly only ran for moments:

# sacct -j 25801569 --format=timelimit,elapsed,nodelist
 Timelimit    Elapsed        NodeList
---------- ---------- ---------------
1-00:00:00 1-00:00:10           t0601
           1-00:00:12           t0601
           1-00:00:12           t0601

[2019-06-24T10:56:02.206] _run_prolog: prolog with lock for job 25801569 ran for 0 seconds
[2019-06-24T10:56:02.237] [25801569.extern] task/cgroup: /slurm/uid_275171/job_25801569: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:02.249] [25801569.extern] task/cgroup: /slurm/uid_275171/job_25801569/step_extern: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:04.664] launch task 25801569.0 request from UID:275171 GID:100 HOST:172.26.113.2 PORT:54925
[2019-06-24T10:56:04.686] [25801569.0] task/cgroup: /slurm/uid_275171/job_25801569: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:04.700] [25801569.0] task/cgroup: /slurm/uid_275171/job_25801569/step_0: alloc=307200MB mem.limit=307200MB memsw.limit=unlimited
[2019-06-24T10:56:04.729] [25801569.0] in _window_manager
[2019-06-25T10:56:12.967] [25801569.0] error: *** STEP 25801569.0 ON t0601 CANCELLED AT 2019-06-25T10:56:12 DUE TO TIME LIMIT ***
[2019-06-25T10:56:13.132] [25801569.extern] _oom_event_monitor: oom-kill event count: 1
[2019-06-25T10:56:14.334] [25801569.0] done with job
[2019-06-25T10:56:14.334] [25801569.extern] done with job
First upload the sdiag output and the slurmctld and slurmdbd log files, and a current slurm.conf. I'd like to focus on the database issue for the moment, but I'll also look for potential other problems in the logs.
Created attachment 10700 [details] slurmctld log
Created attachment 10701 [details] slurmdbd log
slurmdbd.log has:

[2019-06-25T13:14:15.353] error: There is no reservation by id 150, time_start 1561474890, and cluster 'longleaf'

We've seen this issue in the past. It can cause lost database records if not fixed. I'll look for the other bugs with this issue and come back with a way to get around it. Please upload the output of sdiag.
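To see how often this is happening, you can grep the slurmdbd log for that error (a sketch; the log path is an assumption, use whatever LogFile is set to in slurmdbd.conf):

# Count and list occurrences of the missing-reservation error
grep -c "There is no reservation by id" /var/log/slurm/slurmdbd.log
grep "There is no reservation by id" /var/log/slurm/slurmdbd.log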
# sdiag
*******************************************************
sdiag output at Tue Jun 25 13:10:36 2019 (1561482636)
Data since      Tue Jun 25 11:13:09 2019 (1561475589)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 74785

Jobs submitted: 15654
Jobs started:   11298
Jobs completed: 10295
Jobs canceled:  356
Jobs failed:    0

Job states ts:  Tue Jun 25 13:10:20 2019 (1561482620)
Jobs pending:   12078
Jobs running:   7298

Main schedule statistics (microseconds):
    Last cycle:   32184
    Max cycle:    305196
    Total cycles: 420
    Mean cycle:   65003
    Mean depth cycle:  1048
    Cycles per minute: 3
    Last queue length: 9305

Backfilling stats
    Total backfilled jobs (since last slurm start): 5568
    Total backfilled jobs (since last stats cycle start): 5568
    Total backfilled heterogeneous job components: 0
    Total cycles: 115
    Last cycle when: Tue Jun 25 13:10:02 2019 (1561482602)
    Last cycle: 20061760
    Max cycle:  20210859
    Mean cycle: 19558606
    Last depth cycle: 368
    Last depth cycle (try sched): 368
    Depth Mean: 2502
    Depth Mean (try depth): 475
    Last queue length: 9349
    Queue length mean: 8770

Latency for gettimeofday() (x1000): 15 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO           ( 2009) count:15738 ave_time:125    total_time:1969355
    REQUEST_UPDATE_JOB               ( 3001) count:14159 ave_time:4058   total_time:57459883
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:13459 ave_time:18931  total_time:254800919
    REQUEST_STEP_COMPLETE            ( 5016) count:11355 ave_time:50720  total_time:575932246
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:11334 ave_time:108111 total_time:1225333677
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:11302 ave_time:172254 total_time:1946822114
    REQUEST_COMPLETE_PROLOG          ( 6018) count:11298 ave_time:118845 total_time:1342720014
    REQUEST_FED_INFO                 ( 2049) count:7713  ave_time:92     total_time:713246
    REQUEST_JOB_USER_INFO            ( 2039) count:7568  ave_time:179424 total_time:1357886133
    REQUEST_NODE_INFO                ( 2007) count:7482  ave_time:51497  total_time:385303984
    REQUEST_JOB_INFO                 ( 2003) count:639   ave_time:92941  total_time:59389860
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:394   ave_time:81152  total_time:31974181
    REQUEST_KILL_JOB                 ( 5032) count:371   ave_time:10293  total_time:3818810
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:144   ave_time:96857  total_time:13947501
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:84    ave_time:109883 total_time:9230173
    REQUEST_STATS_INFO               ( 2035) count:63    ave_time:97     total_time:6136
    REQUEST_JOB_READY                ( 4019) count:54    ave_time:18593  total_time:1004062
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:48    ave_time:222277 total_time:10669328
    REQUEST_RESERVATION_INFO         ( 2024) count:31    ave_time:132192 total_time:4097961
    REQUEST_UPDATE_PARTITION         ( 3005) count:31    ave_time:26442  total_time:819720
    REQUEST_JOB_ALLOCATION_INFO      ( 4014) count:25    ave_time:104509 total_time:2612727
    REQUEST_JOB_STEP_CREATE          ( 5001) count:24    ave_time:283    total_time:6801
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:16    ave_time:57433  total_time:918942
    REQUEST_SHARE_INFO               ( 2022) count:12    ave_time:1453   total_time:17445
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:7     ave_time:97634  total_time:683444
    REQUEST_JOB_PACK_ALLOC_INFO      ( 4027) count:2     ave_time:92     total_time:185
    REQUEST_UPDATE_NODE              ( 3002) count:2     ave_time:516    total_time:1032
    REQUEST_PING                     ( 1008) count:1     ave_time:57     total_time:57
    REQUEST_BUILD_INFO               ( 2001) count:1     ave_time:385    total_time:385
    REQUEST_CREATE_RESERVATION       ( 3006) count:1     ave_time:48550  total_time:48550

Remote Procedure Call statistics by user
    root            (       0) count:63687 ave_time:82333  total_time:5243577792
    hepperla        (  214234) count:9676  ave_time:17485  total_time:169190111
    rujin           (  236960) count:7039  ave_time:58996  total_time:415273906
    zabbix          (   57384) count:6310  ave_time:26776  total_time:168962503
    jitong          (  236926) count:6203  ave_time:22984  total_time:142571774
    kamoser         (  302527) count:5971  ave_time:110991 total_time:662728612
    markreed        (   84588) count:3910  ave_time:23155  total_time:90537503
    slurmmon        (     979) count:3525  ave_time:42683  total_time:150458575
    nhanph          (  282984) count:1450  ave_time:13104  total_time:19000953
    erovenko        (  290553) count:1118  ave_time:1251   total_time:1399509
    bpk             (  238366) count:1087  ave_time:33276  total_time:36171979
    pllittle        (  139508) count:563   ave_time:41185  total_time:23187538
    fenqiang        (  303630) count:411   ave_time:9388   total_time:3858493
    vicbucha        (  244261) count:215   ave_time:18056  total_time:3882152
    jackmag         (  238144) count:212   ave_time:47815  total_time:10136853
    nil1            (  297111) count:142   ave_time:49405  total_time:7015605
    danliang        (  238535) count:116   ave_time:57715  total_time:6694959
    jraab           (  209642) count:96    ave_time:64674  total_time:6208706
    mizhou          (  239103) count:95    ave_time:28214  total_time:2680387
    huihanc         (  288136) count:89    ave_time:122310 total_time:10885618
    jmsimon         (  158261) count:70    ave_time:19912  total_time:1393905
    mashner         (  295352) count:70    ave_time:70842  total_time:4958944
    rc_swans.svc    (  238383) count:57    ave_time:180238 total_time:10273578
    zmozent         (  256972) count:56    ave_time:75511  total_time:4228658
    wuys            (  301497) count:54    ave_time:122758 total_time:6628982
    jxenakis        (  189218) count:53    ave_time:41432  total_time:2195913
    weitangc        (  275171) count:51    ave_time:56547  total_time:2883909
    swijesi         (  302521) count:50    ave_time:188310 total_time:9415526
    tongqiu         (  259043) count:47    ave_time:71161  total_time:3344576
    dyang           (  145983) count:43    ave_time:156766 total_time:6740943
    weifangl        (  296908) count:42    ave_time:137466 total_time:5773607
    kalaineb        (  302600) count:42    ave_time:57631  total_time:2420510
    dbcortes        (  259981) count:42    ave_time:40417  total_time:1697537
    ble             (  190370) count:38    ave_time:82527  total_time:3136041
    chanijo         (  236728) count:35    ave_time:89127  total_time:3119448
    zubatyuk        (  305224) count:34    ave_time:19122  total_time:650151
    jiawei          (  238237) count:33    ave_time:1440   total_time:47528
    hfalan          (  256369) count:31    ave_time:73944  total_time:2292271
    esdavis         (  217058) count:30    ave_time:35248  total_time:1057447
    cseppan         (   83739) count:24    ave_time:131726 total_time:3161431
    sahoon          (  285205) count:24    ave_time:61618  total_time:1478833
    arterca         (  239277) count:23    ave_time:77949  total_time:1792836
    kyoung13        (  208524) count:23    ave_time:83041  total_time:1909965
    noahs           (  108340) count:23    ave_time:56853  total_time:1307626
    roachjm         (   84937) count:22    ave_time:33241  total_time:731312
    migraff         (  173391) count:22    ave_time:3186   total_time:70092
    milove          (  272499) count:21    ave_time:50760  total_time:1065969
    jcwong          (  238779) count:21    ave_time:162326 total_time:3408848
    jblocher        (   32545) count:20    ave_time:82787  total_time:1655754
    bekahb          (  244677) count:20    ave_time:54204  total_time:1084097
    zln3            (  247310) count:19    ave_time:55693  total_time:1058177
    hannaho         (  312325) count:19    ave_time:54108  total_time:1028065
    ranekj          (  297151) count:18    ave_time:65697  total_time:1182555
    pzivich         (  262033) count:17    ave_time:108814 total_time:1849839
    meichen         (  236962) count:17    ave_time:137117 total_time:2331004
    anju            (  302408) count:16    ave_time:11261  total_time:180182
    btemple         (   45943) count:14    ave_time:170885 total_time:2392400
    awella          (  247585) count:12    ave_time:77689  total_time:932268
    psioda          (  199710) count:11    ave_time:8724   total_time:95964
    schauer         (   50562) count:11    ave_time:40051  total_time:440565
    hubenxia        (  315580) count:10    ave_time:61669  total_time:616696
    wancen          (  294745) count:10    ave_time:70027  total_time:700272
    mshvets         (  282607) count:9     ave_time:96341  total_time:867073
    guoshili        (  238620) count:9     ave_time:11706  total_time:105361
    benb3           (  260674) count:8     ave_time:4525   total_time:36206
    liuwei1         (  282044) count:8     ave_time:277614 total_time:2220912
    tptacek         (  287668) count:7     ave_time:3493   total_time:24455
    bsmelton        (  238256) count:7     ave_time:2622   total_time:18358
    evanbure        (  227460) count:7     ave_time:128816 total_time:901715
    halina          (  309107) count:6     ave_time:206    total_time:1237
    zengsi          (  297128) count:6     ave_time:1712   total_time:10275
    ccornea         (  253025) count:6     ave_time:122958 total_time:737752
    rc_htsfdata.svc (  238384) count:6     ave_time:1657   total_time:9947
    esh             (  238456) count:6     ave_time:133    total_time:802
    nanam           (  310488) count:5     ave_time:27283  total_time:136415
    znasipak        (  238001) count:5     ave_time:13287  total_time:66439
    anqizhu         (  227868) count:4     ave_time:252069 total_time:1008279
    pfsulliv        (   84298) count:4     ave_time:133408 total_time:533634
    angelvv         (  252144) count:4     ave_time:153390 total_time:613561
    deep            (   22492) count:3     ave_time:216    total_time:648
    cmaillie        (  317505) count:3     ave_time:643    total_time:1929
    anessa1         (  240331) count:3     ave_time:325    total_time:977
    trhenry         (  213418) count:3     ave_time:809    total_time:2428
    hunycho         (  258380) count:3     ave_time:327    total_time:983
    methan          (  257945) count:3     ave_time:61322  total_time:183967
    mikepab         (  238053) count:3     ave_time:577    total_time:1732
    tigran          (  258377) count:3     ave_time:331592 total_time:994776
    beibo           (  238167) count:3     ave_time:336874 total_time:1010624
    mwaldron        (  151063) count:3     ave_time:198100 total_time:594301
    baldoni         (  236917) count:3     ave_time:247    total_time:743
    bryce38         (  278179) count:2     ave_time:219016 total_time:438033
    nfb             (  237549) count:2     ave_time:194    total_time:389
    stonerm         (  227901) count:1     ave_time:19003  total_time:19003
    mwfoster        (  135399) count:1     ave_time:36424  total_time:36424
    sgonell         (  285710) count:1     ave_time:15629  total_time:15629
    kojiya          (  301812) count:1     ave_time:436112 total_time:436112

Pending RPC statistics
    No pending RPCs
I had already restarted each of slurmctld and slurmdbd in that order with about 10 min in between before sending the bug report.
Restarting the daemons doesn't help. You'll need to manually modify the database. We will have a way to automatically fix this issue in Slurm 20.02. Instructions to manually modify the database are in bug 2741 comment 11.

First, please down the partitions (scontrol update partitionname=<name> state=down, or set the states in slurm.conf and reconfigure) so that no more jobs will run and so you don't lose accounting records. Then, back up your database.

I'll give you updated instructions for 18.08 in a bit.
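For reference, a minimal sketch of those two steps (the partition names, MySQL credentials, and the database name slurm_acct_db are assumptions; use whatever StorageLoc is set to in slurmdbd.conf):

# Stop new jobs from starting by downing each partition
scontrol update partitionname=general state=down
scontrol update partitionname=bigmem state=down

# Back up the accounting database before any manual changes
# (assumes the default database name slurm_acct_db)
mysqldump --single-transaction slurm_acct_db > slurm_acct_db.backup.sql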
Partitions are down. Backing things up now.
Created attachment 10705 [details] slurmdbd - add invalid reservation instead of giving up

Jenny,

Can you please apply this patch, rebuild, and restart slurmdbd? Rather than you having to manually modify the database, this patch will insert the invalid reservation for you, and then the DBD Agent queue size should begin to go down.

Can you let us know if the DBD Agent queue size is decreasing? Also, can you watch the slurmdbd log for messages similar to "There is no reservation by id"? This patch should fix those.

I just applied this patch to 18.08.6, so it should build for you. Let me know if there's a problem building.

Thanks,
Marshall
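A rough sketch of applying it (the source path, patch filename, and service name are assumptions; adjust to however you normally build and deploy Slurm):

# From the top of the 18.08.6 source tree used to build your installation
cd /usr/local/src/slurm-18.08.6
patch -p1 < /tmp/slurmdbd-add-invalid-reservation.patch

# Rebuild and reinstall, then restart the database daemon
make && make install
systemctl restart slurmdbd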
Things appear much improved. Is the other issue I reported, where the time limit was given as the exit reason, also related?
Okay, good. If the DBD agent queue is going down, please bring your partitions back up. Can you let me know when your partitions are running again and the DBD agent queue is down to 0 (or close to 0) and stays there? I'll change the severity back to a sev-4 when that happens, but want to keep it at sev-2 until we know it's running.

About the timelimit issue - I think the information is accurate. Looking closely, I see the job started on June 24th and was killed almost exactly one day later on June 25th. No worries if you missed that detail - I missed it the first time, too.

slurmctld logs:
[2019-06-24T10:56:02.111] sched: Allocate JobId=25801569 NodeList=t0601 #CPUs=1 Partition=bigmem
[2019-06-25T10:56:12.835] Time limit exhausted for JobId=25801569

the slurmd/slurmstepd logs you uploaded:
[2019-06-24T10:56:02.206] _run_prolog: prolog with lock for job 25801569 ran for 0 seconds
[2019-06-25T10:56:12.967] [25801569.0] error: *** STEP 25801569.0 ON t0601 CANCELLED AT 2019-06-25T10:56:12 DUE TO TIME LIMIT ***

I assume that was just an oversight (not seeing the change in date), and it was correctly killed for exceeding its time limit of one day. Do you agree?
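For reference, once records are flowing again, the accounting itself will show the one-day runtime, and you can watch the queue drain after reopening the partitions (a sketch; Start, End, Timelimit, and Elapsed are standard sacct fields, and the partition name is an example):

# Show when the job actually started and ended alongside its limit
sacct -j 25801569 --format=JobID,Start,End,Timelimit,Elapsed,State

# Bring a partition back online and keep an eye on the DBD queue
scontrol update partitionname=general state=up
sdiag | grep -i 'DBD Agent queue size'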
DBD Agent queue size: 0

Queues are back up and jobs are dispatching.
You are correct! I didn't see the date stamp. Thanks!
Awesome. In that case, I'm changing this bug back to a sev-4.

The patch I provided you was taken from a SchedMD internal bug. It is currently pending review for QA, so it may not be the final fix. It probably won't go into 18.08; I'm not sure about 19.05. Since it is working, please keep this patch in place for now. It's somewhat of a band-aid patch in that it doesn't fix the underlying cause, but it does repair the database by replacing the missing reservation record. We're also investigating what might cause this situation in the first place.

Can you let me know if you see logs in slurmdbd like this?

"Creating reservation by id #, time_start #, and cluster '<name>' instead of modifying."

(This log is in the patch I gave you.)

info("%s: Creating reservation by id %u, time_start %ld, and cluster '%s' instead of modifying.",

Also, thanks for the prompt uploads of logs, sdiag, etc. I'm glad we got this fixed before you lost database records.
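A sketch of one way to watch for it (the log path is an assumption; use your slurmdbd LogFile setting):

# Follow the slurmdbd log for the new message added by the patch
tail -F /var/log/slurm/slurmdbd.log | grep --line-buffered "Creating reservation by id"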
[2019-06-25T14:51:40.281] as_mysql_modify_resv: Creating reservation by id 150, time_start 1561474890, and cluster 'longleaf' instead of modifying.
How many times did you see that log message? Is it still happening?
One time.
Okay, I hoped to see it one time - the time it got fixed. If it happens again with the same or another reservation id, please let me know.
I'm investigating what might have caused this bug. I noticed:

> slurmctld restart:
>
> [2019-06-25T11:12:19.760] slurmctld version 18.08.6-2 started on cluster longleaf

> slurmdbd errors right after:
>
> [2019-06-25T11:13:39.069] error: There is no reservation by id 150, time_start 1561474890, and cluster 'longleaf'

In comment 9 you said:

> I had already restarted each of slurmctld and slurmdbd in that order with about 10 min in between before sending the bug report.

Do you recall if this was happening before you restarted slurmctld, or only after? I'm wondering if a restart of slurmctld may be part of the problem. The slurmctld has these logs:

[2019-06-25T11:10:49.331] Terminate signal (SIGINT or SIGTERM) received
[2019-06-25T11:12:19.760] error: chdir(/pine/EX/root/slurm-log): Permission denied
[2019-06-25T11:12:19.760] slurmctld version 18.08.6-2 started on cluster longleaf

I suspect an unclean shutdown of slurmctld may be a problem, and we have a couple of internal bugs about that.

What is StateSaveLocation in slurm.conf? In our records for UNC, it is /var/spool/slurmd. Is that still correct? What kind of filesystem (network, local?) and storage (SSD, HDD?) is that?

I also noticed that the DBD doesn't have any messages between June 21 and June 25:

[2019-06-21T08:46:27.095] error: We have more time than is possible (46483200+748800+0)(47232000) > 46483200 for cluster longleaf(12912) from 2019-06-10T08:00:00 - 2019-06-10T09:00:00 tres 5
[2019-06-25T11:13:39.069] error: There is no reservation by id 150, time_start 1561474890, and cluster 'longleaf'

Were there perhaps missing messages that didn't make it into the slurmdbd log that you uploaded? Or is that it?
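For reference, the running controller will report the current settings, and df can show what is backing that path (a sketch; the path here is taken from the chdir error above and may not be the actual StateSaveLocation):

# Ask the running slurmctld what it is actually configured with
scontrol show config | grep -Ei 'StateSaveLocation|SlurmctldLogFile'

# Check the filesystem type and free space behind that location
df -hT /pine/EX/root/slurm-log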
I'd just like to follow up on my last comment - when you have a chance, can you respond to those questions?
Q: Do you recall if this was happening before you restarted slurmctld, or only after? I'm wondering if a restart of slurmctld may be part of the problem.

A: The restarts were an attempt to resolve the issue, thinking the processes might have been in an awkward state - the restarts were after the problem was spotted.

Q: I suspect an unclean shutdown of slurmctld may be a problem, and we have a couple of internal bugs about that. What is StateSaveLocation in slurm.conf? In our records for UNC, it is /var/spool/slurmd. Is that still correct? What kind of filesystem (network, local?) and storage (SSD, HDD?) is that?

A: StateSaveLocation = /pine/EX/root/slurm-log/slurmctld
This was done because the local /var was too small to ever allow debugging to be used, given our volume of job throughput at times. The chdir error is expected - we have the settings for Slurm controlled by salt: there is a base salt formula for slurm, and exceptions are in a salt pillar. That message only occurs when slurm.conf is edited via salt and the changes are pushed out, which then triggers a slurmctld restart. This has been the case since that StateSaveLocation was changed, which was many months ago.

Q: I also noticed that the DBD doesn't have any messages between June 21 and June 25.

A: Well, that's disappointing.
Thanks for the additional information. Since we already have other internal tickets tracking this bug, I'm going to close this bug as resolved/infogiven. The patch I gave you is waiting to be reviewed and hasn't gone upstream yet - please see bug 7155 for updates on that. I don't know what version of Slurm it will be in, but am confident that it won't be in 18.08, so please continue running with this local patch as long as you're running 18.08.