Good Afternoon Slurm Support

We are getting these errors in our slurmctld.log; see the following:

[2019-11-01T11:05:11.960] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-01T11:06:49.397] error: slurmdbd: agent queue filling (55619), RESTART SLURMDBD NOW
[2019-11-01T11:08:54.015] error: slurmdbd: agent queue filling (55659), RESTART SLURMDBD NOW
[2019-11-01T11:10:55.293] error: slurmdbd: agent queue filling (55697), RESTART SLURMDBD NOW
[2019-11-01T11:13:03.528] error: slurmdbd: agent queue filling (55721), RESTART SLURMDBD NOW
[2019-11-01T11:15:13.307] error: slurmdbd: agent queue filling (55771), RESTART SLURMDBD NOW
[2019-11-01T11:17:40.362] error: slurmdbd: agent queue filling (55818), RESTART SLURMDBD NOW
[2019-11-01T11:19:42.175] error: slurmdbd: agent queue filling (55852), RESTART SLURMDBD NOW
[2019-11-01T11:21:53.956] error: slurmdbd: agent queue filling (55908), RESTART SLURMDBD NOW
[2019-11-01T11:23:55.364] error: slurmdbd: agent queue filling (55936), RESTART SLURMDBD NOW
[2019-11-01T11:26:01.045] error: slurmdbd: agent queue filling (55992), RESTART SLURMDBD NOW
[2019-11-01T11:28:09.010] error: slurmdbd: agent queue filling (56016), RESTART SLURMDBD NOW
[2019-11-01T11:30:11.090] error: slurmdbd: agent queue filling (56052), RESTART SLURMDBD NOW
[2019-11-01T11:31:52.000] slurmdbd: agent queue size 56100
[2019-11-01T11:32:35.690] error: slurmdbd: agent queue filling (56108), RESTART SLURMDBD NOW
[2019-11-01T11:34:43.201] error: slurmdbd: agent queue filling (56128), RESTART SLURMDBD NOW
[2019-11-01T11:36:50.111] error: slurmdbd: agent queue filling (56193), RESTART SLURMDBD NOW
[2019-11-01T11:45:15.943] Registering slurmctld at port 6817 with slurmdbd.

This is making us very nervous. These circumstances might have contributed to the errors:

- On Monday (28th Oct 2019), we suffered from https://bugs.schedmd.com/show_bug.cgi?id=8006
- Two days ago, we rebuilt the VM that runs the MariaDB instance, so the database was down for at least 12 hours

I will be sending more logs into this ticket.

Kindly advise

Many Thanks

Damien
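(A quick way to watch the queue depth on the controller is a loop like the one below; a minimal sketch only, the 60-second interval is an arbitrary choice:

# Print the slurmdbd agent queue depth once a minute; a steadily growing
# "DBD Agent queue size" means slurmctld cannot flush accounting records
# to slurmdbd.
while true; do
    date
    sdiag | grep -i 'agent queue'
    sleep 60
done
)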
Created attachment 12192 [details]
slurmctld log

slurmctld log showing the slurmdbd errors
Created attachment 12193 [details]
slurmdbd logs
Created attachment 12194 [details]
patch proposal

As in bug 8006, you have hit the known bug. This patch drops the multiple irrelevant messages that spam slurmdbd. Please recompile slurmdbd and restart it.
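Roughly, the rebuild looks like this; a sketch only, the source-tree path, patch file name, and install prefix below are placeholders for your environment:

# Apply the attached patch in your Slurm 18.08 source tree and rebuild.
cd /usr/local/src/slurm-18.08.6                 # placeholder path
patch -p1 < /tmp/attachment-12194.patch         # the patch from this ticket
./configure --prefix=/opt/slurm-18.08.6 && make -j "$(nproc)"
# Only slurmdbd needs to be reinstalled and restarted:
sudo make -C src/slurmdbd install
sudo systemctl restart slurmdbd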
Hi

If you aren't running slurmdbd with the above patch, you can also cherry-pick this patch from 19.05:
https://github.com/SchedMD/slurm/commit/89910fd50a3120bd1d8b29fcf6d543e53dbfe732

Could you send me the outputs from:
"scontrol show res"
"sacctmgr show reservation"

Dominik
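If you build from a git checkout, the cherry-pick itself is a one-liner; a sketch, assuming a clone of the SchedMD repository on the slurm-18.08 branch:

# Fetch the source and apply the 19.05 fix onto the 18.08 branch.
git clone https://github.com/SchedMD/slurm.git && cd slurm
git checkout slurm-18.08
git cherry-pick 89910fd50a3120bd1d8b29fcf6d543e53dbfe732
# then rebuild and restart slurmdbd as described above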
$ scontrol show res
ReservationName=mxbeampostproc StartTime=2017-12-18T12:20:35 EndTime=2025-01-08T11:38:57 Duration=2577-23:18:22
   Nodes=m3d[006-007] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=48
   Users=(null) Accounts=-training Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=cryosparc StartTime=2018-11-09T11:33:37 EndTime=2020-11-08T11:33:37 Duration=730-00:00:00
   Nodes=m3h015 NodeCnt=1 CoreCnt=28 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=28
   Users=jafarl,lancew Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=highmem StartTime=2018-11-30T11:34:35 EndTime=2019-11-30T11:34:35 Duration=365-00:00:00
   Nodes=m3m000 NodeCnt=1 CoreCnt=36 Features=(null) PartitionName=m3m Flags=SPEC_NODES
   TRES=cpu=36
   Users=trungn,ctan,damienl,chines,lancew,jafarl,nwong,kerriw,philipc,smichnow,jchang,mmcg0002,angu0022,ska565,ozgeo,rbeare,zseeger,keving,swil0005,lawyl1,nlkar1,stya0001 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=sexton StartTime=2019-02-07T11:36:44 EndTime=2019-12-31T12:00:00 Duration=327-00:23:16
   Nodes=m3g[000,006] NodeCnt=2 CoreCnt=72 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=72
   Users=galisa,lynnliang,mbelouso,xzha0043,ctan,spiper Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=M3-backup StartTime=2019-03-15T08:43:14 EndTime=2020-03-14T08:43:14 Duration=365-00:00:00
   Nodes=m3d[002,008] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=48
   Users=ctan,trungn Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=CryoemFacility StartTime=2019-05-28T10:12:23 EndTime=2020-05-27T10:12:23 Duration=365-00:00:00
   Nodes=m3c004,m3g014 NodeCnt=2 CoreCnt=60 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=60
   Users=lancew,hven0001,jafarl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=AWX StartTime=2019-06-07T14:48:59 EndTime=2019-12-31T12:00:00 Duration=206-20:11:01
   Nodes=m3a[009-010] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=48
   Users=trungn,ctan,damienl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=simple StartTime=2019-06-30T10:41:27 EndTime=2020-06-29T10:41:27 Duration=365-00:00:00
   Nodes=m3a[016-020] NodeCnt=5 CoreCnt=120 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=120
   Users=hael,creboul,zqi1,sarahle,lancew,marionb,ctan,skie0002,mcaggian Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=uow StartTime=2019-09-13T11:06:48 EndTime=2025-01-08T10:00:00 Duration=1943-21:53:12
   Nodes=m3c011,m3h011 NodeCnt=2 CoreCnt=52 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=52
   Users=lancew,simonb,jafarl,jbouwer Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=nfs StartTime=2019-09-23T18:17:40 EndTime=2020-09-22T18:17:40 Duration=365-00:00:00
   Nodes=m3f[027-028] NodeCnt=2 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=6
   Users=chines Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=p100testing StartTime=2019-10-28T09:00:00 EndTime=2019-11-11T09:00:00 Duration=14-00:00:00
   Nodes=m3h[001,003] NodeCnt=2 CoreCnt=56 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=56
   Users=philipc,mbelouso Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
Hi

Have you managed to recompile slurmdbd with one of those patches? Did this help?

Dominik
Hi Dominik

Does v18.08.08 have the fixes for these too?

Cheers

Damien
Hi

89910fd50a is only available in 19.05. 18.08 contains a few fixes which prevent such a situation (inconsistent reservation data between slurmctld and slurmdbd) from arising in the first place. I recommend applying 89910fd50a before you start losing accounting data.

Please send me the output from "sacctmgr show reservation".

Dominik
$ sacctmgr show reservation
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
        m3             AWX                         cpu=48 2019-10-28T20:16:55 2019-12-31T12:00:00 3.4818e+05
        m3  CryoemFacility                         cpu=60 2019-10-28T20:16:55 2020-05-27T10:12:23 4.4692e+04
        m3       cryosparc                         cpu=28 2019-10-28T20:16:55 2020-11-08T11:33:37 3.4818e+05
        m3         highmem                         cpu=36 2019-10-28T20:16:55 2019-11-30T11:34:35 1.5440e+05
        m3       M3-backup                         cpu=48 2019-10-28T20:16:55 2020-03-14T08:43:14 3.4818e+05
        m3  mxbeampostproc                         cpu=48 2019-10-28T20:16:55 2025-01-08T11:38:57 3.4818e+05
        m3             nfs                          cpu=6 2019-10-28T20:16:55 2020-09-22T18:17:40 3.4818e+05
        m3     p100testing                         cpu=56 2019-10-28T20:16:55 2019-11-11T09:00:00 2.6519e+05
        m3          sexton                         cpu=72 2019-10-28T20:16:55 2019-12-31T12:00:00 8.7279e+04
        m3          simple                        cpu=120 2019-10-28T20:16:55 2020-06-29T10:41:27   0.000000
        m3             uow                         cpu=52 2019-10-28T20:16:55 2025-01-08T10:00:00 3.4818e+05
sacctmgr show reservation -p
Cluster|Name|TRES|TimeStart|TimeEnd|UnusedWall|
m3|AWX|cpu=48|2019-10-28T20:16:55|2019-12-31T12:00:00|348185.000000|
m3|CryoemFacility|cpu=60|2019-10-28T20:16:55|2020-05-27T10:12:23|44692.000000|
m3|cryosparc|cpu=28|2019-10-28T20:16:55|2020-11-08T11:33:37|348185.000000|
m3|highmem|cpu=36|2019-10-28T20:16:55|2019-11-30T11:34:35|154400.000000|
m3|M3-backup|cpu=48|2019-10-28T20:16:55|2020-03-14T08:43:14|348185.000000|
m3|mxbeampostproc|cpu=48|2019-10-28T20:16:55|2025-01-08T11:38:57|348185.000000|
m3|nfs|cpu=6|2019-10-28T20:16:55|2020-09-22T18:17:40|348185.000000|
m3|p100testing|cpu=56|2019-10-28T20:16:55|2019-11-11T09:00:00|265186.571429|
m3|sexton|cpu=72|2019-10-28T20:16:55|2019-12-31T12:00:00|87279.000000|
m3|simple|cpu=120|2019-10-28T20:16:55|2020-06-29T10:41:27|0.000000|
m3|uow|cpu=52|2019-10-28T20:16:55|2025-01-08T10:00:00|348185.000000|
Hi Dominik

Looks like the direction is to upgrade to the latest stable v19.05.x; we just have to sort out any deprecated features.

Can I conclude that the root cause of this whole episode is that some jobs were submitted to our production cluster with GPU requirements but without GRES values, which caused this bug to surface in both slurmctld and slurmdbd?

Cheers

Damien
Hi

I assume that was the root cause of the segfault in bug 8006. Multiple restarts/crashes of slurmctld and slurmdbd can cause a situation like the one in this bug (slurmctld spamming slurmdbd with incorrect reservation update requests). Updating to 19.05 will solve these issues, but it must be done quickly, otherwise slurmctld will start dropping accounting information. In my opinion, the easiest solution for now is to apply https://github.com/SchedMD/slurm/commit/89910fd50a

Dominik
Hi

Any news?

Dominik
Hi Dominik

Thanks for the follow-up. I am preparing to roll out v19.05.03. Once the CR is tested and approved, I will do a rolling upgrade.

Cheers

Damien
Hi Dominik

We are planning to upgrade to v19.05.03. One of the first tasks (as with previous updates) is to upgrade slurmdbd first and convert the database (18.08.06 -> 19.05.03). As we are having this problem right now:

1) Do we need to fix slurmdbd/MySQL first, before this upgrade?

or

2) Can we just go ahead and upgrade to v19.05.03, and it will fix this issue and the existing database?

Kindly advise

Thanks

Damien
Hi

1) no
2) yes

You can just update slurmdbd without any extra / non-standard steps.

Note: per bug 8029, all 19.05 releases have a known problem which is not yet fixed; old sacct (18.08) has some problems displaying new job records, but all the data is correctly stored in the database.

Dominik
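For the slurmdbd step itself, the usual sequence is sketched below; the database name, credentials, and install prefix are placeholders, and the dump is there so you can roll back to the 18.08 schema if the conversion fails:

# Back up the accounting database before converting it.
mysqldump --single-transaction -u slurm -p slurm_acct_db > slurm_acct_db-18.08.sql
sudo systemctl stop slurmdbd
# Install the 19.05 binaries, then run the new slurmdbd in the
# foreground first so the schema conversion can be watched to completion:
sudo /opt/slurm-19.05.3/sbin/slurmdbd -D -vvv
# Once the conversion has finished, Ctrl-C it and start the service:
sudo systemctl start slurmdbd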
Hi Dominik

Thanks for your info. So I can just upgrade directly to v19.05.03 and it will solve our current issues.

To clarify, the plan is:

1) Upgrade slurmdbd and convert/repair the database - an hour
2) Upgrade the primary and secondary slurmctld - 30 minutes
3) Upgrade all compute nodes and restart the slurmds - an hour

With this proposed timeline, I should not face this problem: https://bugs.schedmd.com/show_bug.cgi?id=8029 Right?

Cheers

Damien
Hi

Yes, I don't see any problem.

Dominik
Hi

Could you tell me if the upgrade went well?

Dominik
Hi Dominik

Thanks for this follow-up. I finally have the CR to upgrade to v19.05.04; I am doing it now.

Cheers

Damien
Hi Dominik

I am seeing this error after the upgrade to v19.05.04:

# sacctmgr show runaway > runaway.txt
sacctmgr: error: slurmdbd: Sending message type MsgType=1488: 11: Resource temporarily unavailable
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

See attached (runaway.txt)

[ec2-user@m3-mgmt1 sbin]$ ./slurmctld -V
slurm 19.05.4
[ec2-user@m3-mgmt1 sbin]$ sudo ./slurmdbd -V
slurm 19.05.4

Kindly advise

Thanks

Damien
Created attachment 12361 [details]
runaway jobs

Kindly review
Hi

The first error, "Resource temporarily unavailable", occurred because the database schema was being updated; while that is in progress, slurmdbd responds to requests this way. I think all these runaway jobs came from the lost accounting data, and unfortunately some of them may be lost permanently.

Could you send me the output from sdiag?

Dominik
[damienl@m3-login1 ~]$ sdiag
*******************************************************
sdiag output at Thu Nov 21 19:51:43 2019 (1574326303)
Data since      Thu Nov 21 11:34:22 2019 (1574296462)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 0
Jobs submitted: 1576
Jobs started:   1244
Jobs completed: 1298
Jobs canceled:  654
Jobs failed:    107

Job states ts:  Thu Nov 21 19:51:17 2019 (1574326277)
Jobs pending:   1790
Jobs running:   366

Main schedule statistics (microseconds):
    Last cycle:   6617
    Max cycle:    689757
    Total cycles: 1963
    Mean cycle:   8939
    Mean depth cycle:  301
    Cycles per minute: 3
    Last queue length: 1080

Backfilling stats
    Total backfilled jobs (since last slurm start): 850
    Total backfilled jobs (since last stats cycle start): 850
    Total backfilled heterogeneous job components: 0
    Total cycles: 186
    Last cycle when: Thu Nov 21 19:49:59 2019 (1574326199)
    Last cycle: 50962799
    Max cycle:  139778412
    Mean cycle: 88238873
    Last depth cycle: 1122
    Last depth cycle (try sched): 334
    Depth Mean: 974
    Depth Mean (try depth): 269
    Last queue length: 1084
    Queue length mean: 1013

Latency for gettimeofday() (x1000): 27 microseconds

Remote Procedure Call statistics by message type
    REQUEST_FED_INFO ( 2049) count:261407 ave_time:111 total_time:29191258
    REQUEST_PARTITION_INFO ( 2009) count:222390 ave_time:212 total_time:47164334
    REQUEST_JOB_INFO_SINGLE ( 2021) count:197733 ave_time:469336 total_time:92803388501
    REQUEST_JOB_USER_INFO ( 2039) count:63669 ave_time:177437 total_time:11297276984
    REQUEST_NODE_INFO_SINGLE ( 2040) count:17828 ave_time:371567 total_time:6624311001
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2783 ave_time:45263 total_time:125969702
    REQUEST_UPDATE_NODE ( 3002) count:2387 ave_time:13073 total_time:31206991
    REQUEST_SUBMIT_BATCH_JOB ( 4003) count:1694 ave_time:168319 total_time:285133339
    MESSAGE_EPILOG_COMPLETE ( 6012) count:1577 ave_time:298427 total_time:470620147
    REQUEST_STEP_COMPLETE ( 5016) count:1564 ave_time:64703 total_time:101195877
    REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:1547 ave_time:292122 total_time:451913175
    REQUEST_COMPLETE_PROLOG ( 6018) count:1276 ave_time:424674 total_time:541884666
    REQUEST_NODE_INFO ( 2007) count:961 ave_time:17452 total_time:16771928
    REQUEST_KILL_JOB ( 5032) count:670 ave_time:43586 total_time:29202883
    REQUEST_CONTROL_STATUS ( 2053) count:294 ave_time:109 total_time:32107
    REQUEST_SHARE_INFO ( 2022) count:142 ave_time:4335 total_time:615659
    REQUEST_JOB_INFO ( 2003) count:141 ave_time:274539 total_time:38710057
    REQUEST_RESERVATION_INFO ( 2024) count:75 ave_time:34618 total_time:2596380
    REQUEST_JOB_STEP_CREATE ( 5001) count:14 ave_time:6621 total_time:92706
    REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:13 ave_time:294840 total_time:3832931
    REQUEST_CANCEL_JOB_STEP ( 5005) count:9 ave_time:79528 total_time:715755
    REQUEST_JOB_READY ( 4019) count:7 ave_time:333120 total_time:2331845
    REQUEST_PING ( 1008) count:6 ave_time:121 total_time:727
    REQUEST_UPDATE_JOB ( 3001) count:3 ave_time:249064 total_time:747192
    REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:2 ave_time:187 total_time:374
    REQUEST_TRIGGER_PULL ( 2030) count:1 ave_time:125 total_time:125
    REQUEST_RESOURCE_ALLOCATION ( 4001) count:1 ave_time:4500 total_time:4500
    REQUEST_STATS_INFO ( 2035) count:0 ave_time:0 total_time:0

Remote Procedure Call statistics by user
    spiper ( 12815) count:157833 ave_time:45376 total_time:7161941665
    swil0005 ( 10264) count:129331 ave_time:231910 total_time:29993234988
    root ( 0) count:46844 ave_time:178468 total_time:8360168179
    a1058400 ( 11860) count:20405 ave_time:160558 total_time:3276190690
    vpam0001 ( 11208) count:16821 ave_time:158968 total_time:2674014874
    rukshana ( 10942) count:12049 ave_time:152603 total_time:1838720823
    hfettke ( 11128) count:11799 ave_time:149586 total_time:1764969090
    bgully ( 10652) count:10817 ave_time:151090 total_time:1634344561
    tday0002 ( 12019) count:10533 ave_time:150989 total_time:1590377477
    awri0001 ( 11597) count:10363 ave_time:148071 total_time:1534460168
    nsingh ( 12834) count:9908 ave_time:153935 total_time:1525191685
    asegal ( 11199) count:9452 ave_time:154914 total_time:1464249055
    aafo0001 ( 11936) count:9107 ave_time:153237 total_time:1395533288
    robe0002 ( 12508) count:9007 ave_time:154414 total_time:1390812825
    mahmoudm ( 11193) count:8882 ave_time:155416 total_time:1380408567
    hbhukya1 ( 10583) count:8579 ave_time:154224 total_time:1323089062
    ksabaroe ( 10365) count:8556 ave_time:156796 total_time:1341550759
    msam0006 ( 11491) count:8554 ave_time:155057 total_time:1326363120
    dkor0003 ( 10806) count:8551 ave_time:156050 total_time:1334391860
    ttha0011 ( 11105) count:8549 ave_time:156814 total_time:1340606888
    vieth ( 11159) count:8548 ave_time:155134 total_time:1326093038
    ska565 ( 10255) count:8547 ave_time:157061 total_time:1342406227
    elia0001 ( 10680) count:8545 ave_time:155794 total_time:1331265156
    emct0001 ( 11214) count:8544 ave_time:154893 total_time:1323410598
    chaos ( 10272) count:8541 ave_time:155929 total_time:1331790547
    hven0001 ( 10350) count:8541 ave_time:155931 total_time:1331814677
    aazhar ( 11233) count:8541 ave_time:156106 total_time:1333303787
    scle0003 ( 11552) count:8538 ave_time:157057 total_time:1340953256
    aper0034 ( 11507) count:8538 ave_time:156919 total_time:1339781446
    yche0241 ( 11880) count:8538 ave_time:157089 total_time:1341228250
    csen0005 ( 10998) count:8538 ave_time:156955 total_time:1340084586
    joshuamhardy ( 10156) count:8538 ave_time:156934 total_time:1339908677
    strikm ( 12849) count:8515 ave_time:143787 total_time:1224347318
    gmaj0001 ( 10708) count:8297 ave_time:156558 total_time:1298967447
    sarahsyh ( 11365) count:8088 ave_time:158678 total_time:1283390121
    xiaoliuz ( 10462) count:7794 ave_time:156480 total_time:1219608911
    huiw ( 11363) count:7590 ave_time:160663 total_time:1219435862
    stuarto ( 10273) count:7050 ave_time:156394 total_time:1102580241
    wparker ( 11551) count:6003 ave_time:168641 total_time:1012357688
    tallison ( 10896) count:5663 ave_time:156993 total_time:889051643
    ychen ( 11648) count:5328 ave_time:160490 total_time:855093028
    mfarrell ( 10679) count:5170 ave_time:171170 total_time:884949319
    scottk ( 11656) count:5151 ave_time:148395 total_time:764384252
    bsun ( 11451) count:4896 ave_time:148291 total_time:726033780
    jawon5 ( 11240) count:4781 ave_time:142611 total_time:681826696
    jcoxon ( 10835) count:4687 ave_time:166827 total_time:781921815
    ademarco ( 10256) count:4542 ave_time:170087 total_time:772538509
    lancew ( 10213) count:4532 ave_time:150366 total_time:681462011
    rshi0007 ( 10423) count:4261 ave_time:145292 total_time:619091689
    clam0015 ( 11607) count:4234 ave_time:172173 total_time:728983097
    mbernhar ( 10398) count:4232 ave_time:143730 total_time:608267818
    bktan12 ( 11006) count:4170 ave_time:164563 total_time:686230129
    anirka ( 12061) count:3588 ave_time:129919 total_time:466151661
    hwan0088 ( 11628) count:2787 ave_time:155925 total_time:434565615
    wpit0001 ( 11017) count:2782 ave_time:119519 total_time:332503014
    jinmingz ( 12653) count:2760 ave_time:126613 total_time:349454482
    uqpmart6 ( 11694) count:2634 ave_time:122579 total_time:322875008
    tnewing ( 10925) count:2533 ave_time:125210 total_time:317157403
    melvinl ( 12907) count:2502 ave_time:129006 total_time:322773722
    esomayeh ( 10379) count:2193 ave_time:100939 total_time:221359348
    tsiejka ( 12091) count:2142 ave_time:118669 total_time:254190436
    phoebeimms ( 10690) count:1919 ave_time:159973 total_time:306989524
    arevans ( 11437) count:1878 ave_time:115411 total_time:216742760
    mom008 ( 11465) count:1779 ave_time:40346 total_time:71777277
    ctav0001 ( 12340) count:1779 ave_time:128953 total_time:229407799
    sgujjari ( 12661) count:1671 ave_time:108665 total_time:181579846
    khsmsampath ( 11028) count:1520 ave_time:174775 total_time:265659500
    mju2 ( 11470) count:1232 ave_time:72626 total_time:89475773
    titust ( 12865) count:1016 ave_time:116938 total_time:118809711
    suzanm ( 11205) count:853 ave_time:137001 total_time:116862685
    mkam0016 ( 12925) count:756 ave_time:77172 total_time:58342773
    yaghmaien ( 12664) count:719 ave_time:152840 total_time:109892012
    jpollock ( 12914) count:701 ave_time:31738 total_time:22248743
    ctiw0001 ( 12015) count:513 ave_time:141064 total_time:72365989
    wtsch1 ( 11038) count:415 ave_time:31441 total_time:13048043
    land ( 10877) count:368 ave_time:41022 total_time:15096298
    mbos0003 ( 11595) count:343 ave_time:57252 total_time:19637660
    slurm ( 497) count:295 ave_time:109 total_time:32232
    aklaus ( 10976) count:285 ave_time:117614 total_time:33520249
    mziemann ( 10442) count:240 ave_time:16777 total_time:4026686
    flii0002 ( 11596) count:230 ave_time:58724 total_time:13506725
    sgwe0001 ( 10173) count:224 ave_time:63965 total_time:14328207
    xuanlih ( 10644) count:182 ave_time:84236 total_time:15331060
    lindacac ( 11464) count:170 ave_time:112589 total_time:19140257
    taoh ( 11834) count:122 ave_time:40877 total_time:4987089
    scohen1 ( 11404) count:121 ave_time:140231 total_time:16967970
    jrigby ( 10065) count:114 ave_time:93654 total_time:10676590
    shaunc ( 10436) count:113 ave_time:122155 total_time:13803593
    lwan0025 ( 10803) count:102 ave_time:62875 total_time:6413274
    hhew0002 ( 11689) count:76 ave_time:145397 total_time:11050182
    jueqingl ( 12339) count:70 ave_time:16080 total_time:1125601
    gholamrh ( 10097) count:66 ave_time:62349 total_time:4115065
    cristobg ( 12021) count:63 ave_time:49781 total_time:3136230
    amaligg ( 10873) count:55 ave_time:107446 total_time:5909576
    wcao0006 ( 10768) count:54 ave_time:132485 total_time:7154223
    xvuthith ( 10699) count:49 ave_time:54393 total_time:2665286
    aher0013 ( 11326) count:45 ave_time:14942 total_time:672420
    rkai0001 ( 10719) count:45 ave_time:25170 total_time:1132662
    pt5684 ( 11786) count:45 ave_time:14605 total_time:657243
    kwyr0001 ( 11370) count:44 ave_time:310054 total_time:13642378
    damienl ( 10005) count:44 ave_time:93348 total_time:4107332
    ec2-user ( 1000) count:41 ave_time:179807 total_time:7372101
    trungn ( 10419) count:40 ave_time:114830 total_time:4593212
    eyk ( 10865) count:30 ave_time:35397 total_time:1061918
    benfulcher ( 10424) count:30 ave_time:215895 total_time:6476872
    pzaremoo ( 10041) count:29 ave_time:83384 total_time:2418159
    keving ( 10454) count:28 ave_time:48410 total_time:1355481
    tmason1 ( 11592) count:21 ave_time:104538 total_time:2195314
    andrewc ( 10452) count:20 ave_time:113029 total_time:2260589
    dmpar7 ( 11685) count:18 ave_time:53242 total_time:958358
    billyjow ( 10312) count:18 ave_time:166579 total_time:2998431
    jafarl ( 10347) count:18 ave_time:94010 total_time:1692197
    twang ( 11986) count:17 ave_time:157256 total_time:2673357
    rgomi ( 11448) count:17 ave_time:156806 total_time:2665714
    zhuy ( 11494) count:15 ave_time:114 total_time:1723
    vchr0002 ( 11487) count:15 ave_time:118 total_time:1782
    gahun3 ( 12820) count:15 ave_time:124 total_time:1873
    tandrill ( 11162) count:15 ave_time:140 total_time:2113
    debopris ( 11349) count:15 ave_time:56240 total_time:843607
    narjesa ( 10666) count:15 ave_time:124 total_time:1871
    dnguyen ( 11186) count:15 ave_time:10844 total_time:162674
    jste0021 ( 10774) count:12 ave_time:70416 total_time:844993
    andrewpe ( 10152) count:11 ave_time:212765 total_time:2340424
    philipc ( 10003) count:11 ave_time:184018 total_time:2024204
    kwong ( 12517) count:8 ave_time:1158 total_time:9264
    hhartono ( 11775) count:6 ave_time:239071 total_time:1434428
    dbw ( 10800) count:6 ave_time:122027 total_time:732163
    kcorbett ( 12751) count:6 ave_time:141 total_time:850
    mbelouso ( 10091) count:6 ave_time:588688 total_time:3532130
    fruzicka ( 12906) count:4 ave_time:408643 total_time:1634572
    leei ( 10874) count:4 ave_time:3808 total_time:15235
    szhang ( 12694) count:4 ave_time:567 total_time:2268
    xjia0032 ( 11502) count:4 ave_time:111882 total_time:447528
    ctan ( 10189) count:4 ave_time:550 total_time:2200
    zwan394 ( 11209) count:3 ave_time:126 total_time:378
    rbeare ( 10299) count:3 ave_time:75592 total_time:226777
    charlesbj ( 10297) count:3 ave_time:214 total_time:643
    miguelg ( 10815) count:3 ave_time:232890 total_time:698671
    khodgins ( 11538) count:2 ave_time:486800 total_time:973600
    fairuzhh ( 10228) count:2 ave_time:827 total_time:1655
    ywan0280 ( 11526) count:2 ave_time:680 total_time:1360

Pending RPC statistics
    No pending RPCs

Is this good or bad?
[ec2-user@m3-mgmt1 ~]$ sdiag
*******************************************************
sdiag output at Thu Nov 21 19:55:12 2019 (1574326512)
Data since      Thu Nov 21 11:34:22 2019 (1574296462)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 0
Jobs submitted: 1579
Jobs started:   1249
Jobs completed: 1301
Jobs canceled:  654
Jobs failed:    107

Job states ts:  Thu Nov 21 19:54:49 2019 (1574326489)
Jobs pending:   1787
Jobs running:   367

Main schedule statistics (microseconds):
    Last cycle:   6105
    Max cycle:    689757
    Total cycles: 1972
    Mean cycle:   8989
    Mean depth cycle:  301
    Cycles per minute: 3
    Last queue length: 1079

Backfilling stats
    Total backfilled jobs (since last slurm start): 853
    Total backfilled jobs (since last stats cycle start): 853
    Total backfilled heterogeneous job components: 0
    Total cycles: 188
    Last cycle when: Thu Nov 21 19:53:33 2019 (1574326413)
    Last cycle: 48423258
    Max cycle:  139778412
    Mean cycle: 87833870
    Last depth cycle: 1118
    Last depth cycle (try sched): 330
    Depth Mean: 975
    Depth Mean (try depth): 270
    Last queue length: 1080
    Queue length mean: 1014

Latency for 1000 calls to gettimeofday(): 27 microseconds

Remote Procedure Call statistics by message type
    REQUEST_FED_INFO ( 2049) count:263391 ave_time:111 total_time:29414907
    REQUEST_PARTITION_INFO ( 2009) count:223889 ave_time:211 total_time:47380872
    REQUEST_JOB_INFO_SINGLE ( 2021) count:199240 ave_time:468220 total_time:93288178721
    REQUEST_JOB_USER_INFO ( 2039) count:64146 ave_time:176847 total_time:11344088999
    REQUEST_NODE_INFO_SINGLE ( 2040) count:18008 ave_time:367854 total_time:6624329513
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2872 ave_time:44080 total_time:126597873
    REQUEST_UPDATE_NODE ( 3002) count:2387 ave_time:13073 total_time:31206991
    REQUEST_SUBMIT_BATCH_JOB ( 4003) count:1697 ave_time:168467 total_time:285889254
    MESSAGE_EPILOG_COMPLETE ( 6012) count:1580 ave_time:297870 total_time:470634828
    REQUEST_STEP_COMPLETE ( 5016) count:1567 ave_time:65200 total_time:102169432
    REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:1550 ave_time:291558 total_time:451915394
    REQUEST_COMPLETE_PROLOG ( 6018) count:1281 ave_time:424041 total_time:543197008
    REQUEST_NODE_INFO ( 2007) count:977 ave_time:17193 total_time:16797772
    REQUEST_KILL_JOB ( 5032) count:670 ave_time:43586 total_time:29202883
    REQUEST_CONTROL_STATUS ( 2053) count:296 ave_time:109 total_time:32352
    REQUEST_JOB_INFO ( 2003) count:144 ave_time:275460 total_time:39666367
    REQUEST_SHARE_INFO ( 2022) count:142 ave_time:4335 total_time:615659
    REQUEST_RESERVATION_INFO ( 2024) count:75 ave_time:34618 total_time:2596380
    REQUEST_JOB_STEP_CREATE ( 5001) count:14 ave_time:6621 total_time:92706
    REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:13 ave_time:294840 total_time:3832931
    REQUEST_CANCEL_JOB_STEP ( 5005) count:9 ave_time:79528 total_time:715755
    REQUEST_JOB_READY ( 4019) count:7 ave_time:333120 total_time:2331845
    REQUEST_PING ( 1008) count:6 ave_time:121 total_time:727
    REQUEST_UPDATE_JOB ( 3001) count:3 ave_time:249064 total_time:747192
    REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:2 ave_time:187 total_time:374
    REQUEST_TRIGGER_PULL ( 2030) count:1 ave_time:125 total_time:125
    REQUEST_RESOURCE_ALLOCATION ( 4001) count:1 ave_time:4500 total_time:4500
    REQUEST_STATS_INFO ( 2035) count:1 ave_time:171 total_time:171

Remote Procedure Call statistics by user
    spiper ( 12815) count:159003 ave_time:45318 total_time:7205759520
    swil0005 ( 10264) count:130667 ave_time:231151 total_time:30203855369
    root ( 0) count:47307 ave_time:176784 total_time:8363141346
    a1058400 ( 11860) count:20465 ave_time:160384 total_time:3282269462
    vpam0001 ( 11208) count:16895 ave_time:158633 total_time:2680106009
    rukshana ( 10942) count:12049 ave_time:152603 total_time:1838720823
    hfettke ( 11128) count:11859 ave_time:149461 total_time:1772460975
    bgully ( 10652) count:10891 ave_time:150891 total_time:1643361843
    tday0002 ( 12019) count:10593 ave_time:150707 total_time:1596443726
    awri0001 ( 11597) count:10419 ave_time:147792 total_time:1539848281
    nsingh ( 12834) count:9908 ave_time:153935 total_time:1525191685
    asegal ( 11199) count:9512 ave_time:154723 total_time:1471728180
    aafo0001 ( 11936) count:9167 ave_time:153006 total_time:1402611180
    robe0002 ( 12508) count:9067 ave_time:154064 total_time:1396899926
    mahmoudm ( 11193) count:8942 ave_time:155053 total_time:1386484435
    hbhukya1 ( 10583) count:8639 ave_time:153849 total_time:1329105381
    ksabaroe ( 10365) count:8616 ave_time:156409 total_time:1347626254
    msam0006 ( 11491) count:8614 ave_time:154683 total_time:1332442486
    dkor0003 ( 10806) count:8611 ave_time:155670 total_time:1340481775
    ttha0011 ( 11105) count:8609 ave_time:156426 total_time:1346672171
    vieth ( 11159) count:8608 ave_time:154761 total_time:1332182944
    ska565 ( 10255) count:8607 ave_time:156672 total_time:1348481981
    elia0001 ( 10680) count:8605 ave_time:155414 total_time:1337343602
    emct0001 ( 11214) count:8604 ave_time:154520 total_time:1329495418
    chaos ( 10272) count:8601 ave_time:155548 total_time:1337870388
    hven0001 ( 10350) count:8601 ave_time:155550 total_time:1337887414
    aazhar ( 11233) count:8601 ave_time:155725 total_time:1339396221
    scle0003 ( 11552) count:8598 ave_time:156667 total_time:1347030704
    aper0034 ( 11507) count:8598 ave_time:156530 total_time:1345852305
    yche0241 ( 11880) count:8598 ave_time:156699 total_time:1347303495
    csen0005 ( 10998) count:8598 ave_time:156565 total_time:1346150112
    joshuamhardy ( 10156) count:8598 ave_time:156545 total_time:1345974817
    strikm ( 12849) count:8575 ave_time:143653 total_time:1231830826
    gmaj0001 ( 10708) count:8360 ave_time:156046 total_time:1304550653
    sarahsyh ( 11365) count:8088 ave_time:158678 total_time:1283390121
    xiaoliuz ( 10462) count:7854 ave_time:156058 total_time:1225686000
    huiw ( 11363) count:7632 ave_time:160908 total_time:1228051874
    stuarto ( 10273) count:7110 ave_time:155931 total_time:1108670010
    wparker ( 11551) count:6003 ave_time:168641 total_time:1012357688
    tallison ( 10896) count:5663 ave_time:156993 total_time:889051643
    ychen ( 11648) count:5388 ave_time:159979 total_time:861969654
    scottk ( 11656) count:5211 ave_time:147853 total_time:770463124
    mfarrell ( 10679) count:5170 ave_time:171170 total_time:884949319
    bsun ( 11451) count:4956 ave_time:147722 total_time:732110599
    jawon5 ( 11240) count:4781 ave_time:142611 total_time:681826696
    jcoxon ( 10835) count:4687 ave_time:166827 total_time:781921815
    ademarco ( 10256) count:4542 ave_time:170087 total_time:772538509
    lancew ( 10213) count:4532 ave_time:150366 total_time:681462011
    rshi0007 ( 10423) count:4261 ave_time:145292 total_time:619091689
    clam0015 ( 11607) count:4234 ave_time:172173 total_time:728983097
    mbernhar ( 10398) count:4232 ave_time:143730 total_time:608267818
    bktan12 ( 11006) count:4170 ave_time:164563 total_time:686230129
    anirka ( 12061) count:3662 ave_time:129339 total_time:473642748
    wpit0001 ( 11017) count:2842 ave_time:119629 total_time:339986124
    jinmingz ( 12653) count:2820 ave_time:126576 total_time:356945877
    hwan0088 ( 11628) count:2787 ave_time:155925 total_time:434565615
    uqpmart6 ( 11694) count:2694 ave_time:122625 total_time:330354260
    tnewing ( 10925) count:2593 ave_time:125200 total_time:324644160
    melvinl ( 12907) count:2502 ave_time:129006 total_time:322773722
    esomayeh ( 10379) count:2207 ave_time:100827 total_time:222527283
    tsiejka ( 12091) count:2202 ave_time:118838 total_time:261681573
    arevans ( 11437) count:1938 ave_time:115704 total_time:224234400
    phoebeimms ( 10690) count:1919 ave_time:159973 total_time:306989524
    mom008 ( 11465) count:1779 ave_time:40346 total_time:71777277
    ctav0001 ( 12340) count:1779 ave_time:128953 total_time:229407799
    sgujjari ( 12661) count:1671 ave_time:108665 total_time:181579846
    khsmsampath ( 11028) count:1520 ave_time:174775 total_time:265659500
    mju2 ( 11470) count:1240 ave_time:72158 total_time:89477051
    titust ( 12865) count:1016 ave_time:116938 total_time:118809711
    suzanm ( 11205) count:853 ave_time:137001 total_time:116862685
    mkam0016 ( 12925) count:756 ave_time:77172 total_time:58342773
    yaghmaien ( 12664) count:719 ave_time:152840 total_time:109892012
    jpollock ( 12914) count:701 ave_time:31738 total_time:22248743
    ctiw0001 ( 12015) count:513 ave_time:141064 total_time:72365989
    wtsch1 ( 11038) count:415 ave_time:31441 total_time:13048043
    land ( 10877) count:368 ave_time:41022 total_time:15096298
    mbos0003 ( 11595) count:343 ave_time:57252 total_time:19637660
    slurm ( 497) count:297 ave_time:109 total_time:32477
    aklaus ( 10976) count:285 ave_time:117614 total_time:33520249
    mziemann ( 10442) count:240 ave_time:16777 total_time:4026686
    flii0002 ( 11596) count:230 ave_time:58724 total_time:13506725
    sgwe0001 ( 10173) count:224 ave_time:63965 total_time:14328207
    ywan0280 ( 11526) count:214 ave_time:8211 total_time:1757341
    xuanlih ( 10644) count:192 ave_time:84847 total_time:16290776
    lindacac ( 11464) count:170 ave_time:112589 total_time:19140257
    taoh ( 11834) count:122 ave_time:40877 total_time:4987089
    scohen1 ( 11404) count:121 ave_time:140231 total_time:16967970
    jrigby ( 10065) count:114 ave_time:93654 total_time:10676590
    shaunc ( 10436) count:113 ave_time:122155 total_time:13803593
    lwan0025 ( 10803) count:102 ave_time:62875 total_time:6413274
    hhew0002 ( 11689) count:76 ave_time:145397 total_time:11050182
    jueqingl ( 12339) count:70 ave_time:16080 total_time:1125601
    gholamrh ( 10097) count:66 ave_time:62349 total_time:4115065
    cristobg ( 12021) count:63 ave_time:49781 total_time:3136230
    rkai0001 ( 10719) count:60 ave_time:18946 total_time:1136795
    amaligg ( 10873) count:55 ave_time:107446 total_time:5909576
    wcao0006 ( 10768) count:54 ave_time:132485 total_time:7154223
    xvuthith ( 10699) count:49 ave_time:54393 total_time:2665286
    damienl ( 10005) count:45 ave_time:91277 total_time:4107503
    aher0013 ( 11326) count:45 ave_time:14942 total_time:672420
    pt5684 ( 11786) count:45 ave_time:14605 total_time:657243
    kwyr0001 ( 11370) count:44 ave_time:310054 total_time:13642378
    ec2-user ( 1000) count:41 ave_time:179807 total_time:7372101
    trungn ( 10419) count:40 ave_time:114830 total_time:4593212
    benfulcher ( 10424) count:30 ave_time:215895 total_time:6476872
    eyk ( 10865) count:30 ave_time:35397 total_time:1061918
    pzaremoo ( 10041) count:29 ave_time:83384 total_time:2418159
    keving ( 10454) count:28 ave_time:48410 total_time:1355481
    tmason1 ( 11592) count:21 ave_time:104538 total_time:2195314
    andrewc ( 10452) count:20 ave_time:113029 total_time:2260589
    billyjow ( 10312) count:18 ave_time:166579 total_time:2998431
    jafarl ( 10347) count:18 ave_time:94010 total_time:1692197
    dmpar7 ( 11685) count:18 ave_time:53242 total_time:958358
    rgomi ( 11448) count:17 ave_time:156806 total_time:2665714
    twang ( 11986) count:17 ave_time:157256 total_time:2673357
    vchr0002 ( 11487) count:15 ave_time:118 total_time:1782
    gahun3 ( 12820) count:15 ave_time:124 total_time:1873
    tandrill ( 11162) count:15 ave_time:140 total_time:2113
    debopris ( 11349) count:15 ave_time:56240 total_time:843607
    narjesa ( 10666) count:15 ave_time:124 total_time:1871
    dnguyen ( 11186) count:15 ave_time:10844 total_time:162674
    zhuy ( 11494) count:15 ave_time:114 total_time:1723
    jste0021 ( 10774) count:12 ave_time:70416 total_time:844993
    philipc ( 10003) count:11 ave_time:184018 total_time:2024204
    andrewpe ( 10152) count:11 ave_time:212765 total_time:2340424
    kwong ( 12517) count:8 ave_time:1158 total_time:9264
    dbw ( 10800) count:6 ave_time:122027 total_time:732163
    kcorbett ( 12751) count:6 ave_time:141 total_time:850
    mbelouso ( 10091) count:6 ave_time:588688 total_time:3532130
    hhartono ( 11775) count:6 ave_time:239071 total_time:1434428
    leei ( 10874) count:4 ave_time:3808 total_time:15235
    szhang ( 12694) count:4 ave_time:567 total_time:2268
    xjia0032 ( 11502) count:4 ave_time:111882 total_time:447528
    ctan ( 10189) count:4 ave_time:550 total_time:2200
    fruzicka ( 12906) count:4 ave_time:408643 total_time:1634572
    rbeare ( 10299) count:3 ave_time:75592 total_time:226777
    charlesbj ( 10297) count:3 ave_time:214 total_time:643
    miguelg ( 10815) count:3 ave_time:232890 total_time:698671
    zwan394 ( 11209) count:3 ave_time:126 total_time:378
    fairuzhh ( 10228) count:2 ave_time:827 total_time:1655
    khodgins ( 11538) count:2 ave_time:486800 total_time:973600
    zdyson ( 11463) count:1 ave_time:24290 total_time:24290

Pending RPC statistics
    No pending RPCs
Hi Dominik

If this is the case, how can we clear these runaway jobs now?

Cheers

Damien
Hi

Could you send me slurmdbd.log?

Dominik
Created attachment 12363 [details]
Current slurmdbd logs

Kindly review this
Hi

Which version of sacctmgr are you using?

Dominik
[ec2-user@m3-mgmt1 ~]$ which sacctmgr
/opt/slurm-19.05.4/bin/sacctmgr
[ec2-user@m3-mgmt1 ~]$ sacctmgr -V
slurm 19.05.4
[root@m3-mgmt1 ~]# sacctmgr show runaway | more
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database

          ID       Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
    11231563 TEMPOcatm+       comp         m3    RUNNING 2019-09-09T10:42:00 2019-11-11T13:11:19             Unknown
    11231565 TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown
    11231566 TEMPOcatm+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown
    11231573 TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown
    11231576 TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:31             Unknown             Unknown
    11730893         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-10T10:58:55             Unknown
    11764396         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T09:40:21             Unknown
    11764903         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T11:53:57             Unknown
    11800431         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T07:28:27             Unknown
.....
.....
.....
    11870196 L037.slurm        m3g         m3    PENDING 2019-11-21T10:39:56             Unknown             Unknown
    11870220 Jupyter L+       comp         m3    RUNNING 2019-11-21T10:52:57 2019-11-21T10:52:57             Unknown
    11870230    desktop    desktop         m3    RUNNING 2019-11-21T10:56:01 2019-11-21T10:56:02             Unknown
    11870239  interact+       comp         m3    PENDING 2019-11-21T11:04:22             Unknown             Unknown
    11870251    desktop    desktop         m3    PENDING 2019-11-21T11:07:57             Unknown             Unknown
    11870262    desktop    desktop         m3    PENDING 2019-11-21T11:11:26             Unknown             Unknown
sacctmgr: error: slurmdbd: Sending message type MsgType=1488: 11: Resource temporarily unavailable
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
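(The NOTE above translates almost directly into a query against the accounting database; a sketch only, assuming the default slurm_acct_db database and the per-cluster job table naming, m3_job_table here - adjust to your site:

# Jobs the database thinks started but never ended.
mysql -u slurm -p slurm_acct_db -e "
    SELECT id_job, job_name, state, FROM_UNIXTIME(time_start) AS started
    FROM m3_job_table
    WHERE time_start > 0 AND time_end = 0;"
)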
Hi

My hypothesis from comment 23 was false; the reason for this error must be different. Could you try one more time, with debug output?

sacctmgr -vvvv list runawayjobs

Dominik
[root@m3-mgmt1 ~]# sacctmgr -vvvv list runawayjobs | more
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/accounting_storage_slurmdbd.so
sacctmgr: Accounting storage SLURMDBD plugin loaded
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/auth_munge.so
sacctmgr: debug: Munge authentication plugin loaded
sacctmgr: debug3: Success.
sacctmgr: debug: slurmdbd: Sent PersistInit msg
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_cons_res.so
sacctmgr: Consumable Resources (CR) Node Selection plugin loaded with argument 20
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_cray_aries.so
sacctmgr: Cray/Aries node selection plugin loaded
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_cons_tres.so
sacctmgr: select/cons_tres loaded with argument 20
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_linear.so
sacctmgr: Linear node selection plugin loaded with argument 20
sacctmgr: debug3: Success.
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database

          ID       Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
    11231563 TEMPOcatm+       comp         m3    RUNNING 2019-09-09T10:42:00 2019-11-11T13:11:19             Unknown
    11231565 TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown
    11231566 TEMPOcatm+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown
    11231573 TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown
    11231576 TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:31             Unknown             Unknown
    11730893         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-10T10:58:55             Unknown
    11764396         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T09:40:21             Unknown
    11764903         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T11:53:57             Unknown
    11800431         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T07:28:27             Unknown
    11800732         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T07:36:07             Unknown
    11802587         mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T13:32:26             Unknown
.....
.....
.....
    11870196 L037.slurm        m3g         m3    PENDING 2019-11-21T10:39:56             Unknown             Unknown
    11870220 Jupyter L+       comp         m3    RUNNING 2019-11-21T10:52:57 2019-11-21T10:52:57             Unknown
    11870230    desktop    desktop         m3    RUNNING 2019-11-21T10:56:01 2019-11-21T10:56:02             Unknown
    11870239  interact+       comp         m3    PENDING 2019-11-21T11:04:22             Unknown             Unknown
    11870251    desktop    desktop         m3    PENDING 2019-11-21T11:07:57             Unknown             Unknown
    11870262    desktop    desktop         m3    PENDING 2019-11-21T11:11:26             Unknown             Unknown
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: error: slurmdbd: Sending message type MsgType=1488: 11: Resource temporarily unavailable
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug: slurmdbd: Sent fini msg
[root@m3-mgmt1 ~]#
Hi

Can you try this patch? Recompile and restart slurmdbd, then try "sacctmgr -vvvv list runawayjobs" one more time. It raises the message size limit (it looks like we didn't expect this many runaway jobs).

diff --git a/src/common/slurm_persist_conn.c b/src/common/slurm_persist_conn.c
index c134c88a03..8f34a0ecfe 100644
--- a/src/common/slurm_persist_conn.c
+++ b/src/common/slurm_persist_conn.c
@@ -60,7 +60,7 @@
  * Maximum message size. Messages larger than this value (in bytes)
  * will not be received.
  */
-#define MAX_MSG_SIZE     (16*1024*1024)
+#define MAX_MSG_SIZE     (50*1024*1024)
 
 typedef struct {
 	void *arg;

Dominik
Hi Dominik

Thanks for this. I will use the backup slurmdbd host for this purpose. The patch seems to be missing from this thread, though.

Cheers

Damien
Created attachment 12367 [details]
workaround
    11870230    desktop    desktop         m3    RUNNING 2019-11-21T10:56:01 2019-11-21T10:56:02             Unknown
    11870239  interact+       comp         m3    PENDING 2019-11-21T11:04:22             Unknown             Unknown
    11870251    desktop    desktop         m3    PENDING 2019-11-21T11:07:57             Unknown             Unknown
    11870262    desktop    desktop         m3    PENDING 2019-11-21T11:11:26             Unknown             Unknown

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the earliest submit time of all the runaway jobs.)
(You have 30 seconds to decide)
(N/y): y
[root@m3-mgmt2 slurm-19.05.4]#
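(To confirm the cleanup took, the list can be re-checked and the rerolled usage spot-checked afterwards; a sketch only, the sreport window is an arbitrary example spanning the affected period:

# Should now report no runaway jobs:
sacctmgr list runawayjobs
# Spot-check that usage was rerolled for the affected months:
sreport cluster utilization start=2019-09-01 end=2019-11-22
)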
Hi Dominik

Thanks, the patch works.

Cheers

Damien
Hi

Great, good to hear this.

Dominik
Hi

The last issue with sacctmgr is already solved by:
https://github.com/SchedMD/slurm/commit/41b5f95c936fc4931639697f47209ebe0dc63676
https://github.com/SchedMD/slurm/commit/5a1c0acc581ac9ae6886607c41fe32d8c353c631

Those patches will be included in 20.02. Let me know if we can close this ticket.

Dominik
Thanks

Our slurmdbd is alright now. Kindly close this ticket.

Damien
I'm closing this ticket as infogiven.

Dominik