Hello, For the last few days we kept noticing such messages in our slurmdbd.log file "[2019-04-28T14:06:10.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon' [2019-04-28T14:06:29.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon' [2019-04-28T14:06:48.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon' [2019-04-28T14:07:07.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon'" When I tried to check that reservation I found it's ended in 2017 # sacctmgr show Reservation format=cluster,ID,name%50,TRES,TimeStart,TimeEnd,assoc%100 where id=80 Cluster ID Name TRES TimeStart TimeEnd Assocs ---------- ------ -------------------------------------------------- -------------------- ------------------- ------------------- ----------------------------------------------------- dragon 80 GPU_BENCHMARKING cpu=16 2017-12-10T12:55:09 2017-12-10T16:55:09 2870,1607,2315 Best regards, Ahmed
Please send the output of sdiag. This error message could indicate a blocked DBD agent queue, which would lead to lost job accounting records after a while. How far back do these messages go? Could you send the slurmdbd log from a bit before these error messages started up until now? This is caused by an RPC referencing an invalid reservation getting stuck in the dbd agent queue. Fixing it requires adding a temporary reservation record to the database. Please send the output of `scontrol show reservations` and `sacctmgr show reservations` and from mysql mysql> select * from dragon_resv_table\G (\G prints vertical) Assuming this is blocking your dbd agent queue, I will send the mysql commands to resolve this quickly. Thanks
We need to get this fixed asap; basically, we need to give the failed RPC a reservation record so it can complete. There is currently a record, as you saw with `sacctmgr`, but the start time is different than the RPC is expecting. Run this query: mysql> update dragon_resv_table set time_start=1556370946 where id_resv=80; After running the query, make sure the DBD Agent queue size decreases. That time_start value comes from the log message you sent and is 2019-04-27 01:15:46 UTC/GMT. I would like to figure out where that time came from, since this reservation started and ended in 2017. I have seen this error before, but clearly there is still a bug that caused this. Do you have any information about reservation modifications on 27 April 2019? slurmctld logs from 26 & 27 April could help me see what happened. Thanks
MariaDB [slurm_acct_db]> select * from dragon_resv_table where id_resv= 80; +---------+---------+----------------+-------+--------------+----------+------------------+------------+------------+------+-------------+ | id_resv | deleted | assoclist | flags | nodelist | node_inx | resv_name | time_start | time_end | tres | unused_wall | +---------+---------+----------------+-------+--------------+----------+------------------+------------+------------+------+-------------+ | 80 | 0 | 2870,1607,2315 | 32768 | dgpu502-13-l | 656 | GPU_BENCHMARKING | 1512899709 | 1512914109 | 1=16 | 14400 | +---------+---------+----------------+-------+--------------+----------+------------------+------------+------------+------+-------------+ 1 row in set (0.00 sec) MariaDB [slurm_acct_db]> update dragon_resv_table set time_start=1556370946 where id_resv=80; Query OK, 1 row affected (0.04 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [slurm_acct_db]>
This there appeared in the slurmdbd log file: [2019-04-30T08:52:01.003] 12(as_mysql_resv.c:435) query update "dragon_resv_table" set time_end=1556370946 where deleted=0 && id_resv=80 and time_start=1556370946;insert into "dragon_resv_table" (id_resv, assoclist, flags, resv_name, nodelist, node_inx, time_end, time_start, tres) values (80, '5362,4787,4788', 49217, 'GPU_BENCHMARKING', 'gpu601-03', '642', 1587906946, 1556370946, '1=32') on duplicate key update deleted=0, assoclist='5362,4787,4788', flags=49217, resv_name='GPU_BENCHMARKING', nodelist='gpu601-03', node_inx='642', time_end=1587906946, time_start=1556370946, tres='1=32'; . . . It took a couple of minutes, but "DBD Agent queue size" is now zero (0). -greg
Ever since this event was emitted by slurmdbd things have been "funny". [2019-04-24T17:39:42.692] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction update cluster_table set control_host='10.109.23.7', control_port=6917 where name='dragon'; [2019-04-24T17:39:42.698] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction insert into "dragon_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_ alloc, mem_req, account, `partition`, node_inx, gres_req, gres_alloc, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir ) values (2995951, UNIX_TIMESTAMP(), 2995949, 2, 0, 4294967294, 5374, 1, 157920, 1157920, 'db208-12-1', 0, 5, 1556113192, 1556113191, 15 56113192, 'Adjoint.sh', 0, 1, 338, 64, 1, 9223372036854777856, 'default', 'amd', '176', '', '', NULL, 0, '1=64,2=131072,3=18446744073709 551614,4=1,5=64', '1=64,2=131072,4=1,5=64', '/scratch/dragon/amd/espinda/TestFWI') on duplicate key update job_db_inx=LAST_INSERT_ID(job _db_inx), id_assoc=5374, id_user=157920, id_group=1157920, nodelist='db208-12-1', id_resv=0, timelimit=5, time_submit=1556113191, time_e ligible=1556113192, time_start=1556113192, mod_time=UNIX_TIMESTAMP(), job_name='Adjoint.sh', track_steps=0, id_qos=1, state=greatest(sta te, 1), priority=338, cpus_req=64, nodes_alloc=1, mem_req=9223372036854777856, id_array_job=2995949, id_array_task=2, pack_job_id=0, pac k_job_offset=4294967294, account='default', `partition`='amd', node_inx='176', gres_req='', gres_alloc='', array_task_str=NULL, array_ta sk_pending=0, tres_alloc='1=64,2=131072,3=18446744073709551614,4=1,5=64', tres_req='1=64,2=131072,4=1,5=64', work_dir='/scratch/dragon/a md/espinda/TestFWI' [2019-04-24T17:39:42.700] error: It looks like the storage has gone away trying to reconnect [2019-04-24T17:39:47.737] 17(as_mysql_resv.c:243) query insert into "dragon_resv_table" (id_resv, assoclist, flags, resv_name, nodelist, node_inx, time_end, time_start, tres) values (80, '5362,4787,4788', 49217, 'Rack_6.01_relocation-gpu601-03-2', 'gpu601-03', '642', 1587906946, 1556370946, '1=32') on duplicate key update deleted=0, assoclist='5362,4787,4788', flags=49217, resv_name='Rack_6.01_relocation-gpu601-03-2', nodelist='gpu601-03', node_inx='642', time_end=1587906946, time_start=1556370946, tres='1=32'; [2019-04-24T17:39:47.738] 17(as_mysql_resv.c:493) query delete from "dragon_resv_table" where time_start > 1556115676 and id_resv=73 and time_start=1556111916;update "dragon_resv_table" set time_end=1556115676, deleted=1 where deleted=0 and id_resv=73 and time_start=1556111916; [2019-04-24T19:39:43.542] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction insert into "dragon_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, gres_req, gres_alloc, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2995950, UNIX_TIMESTAMP(), 2995949, 1, 0, 4294967294, 5374, 1, 157920, 1157920, 'db203-12-3', 0, 5, 1556113192, 1556113191, 1556113192, 'Adjoint.sh', 0, 3, 338, 64, 1, 9223372036854777856, 'default', 'amd', '162', '', '', NULL, 0, '1=64,2=131072,4=1,5=64', '1=64,2=131072,4=1,5=64', '/scratch/dragon/amd/espinda/TestFWI') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=5374, id_user=157920, id_group=1157920, nodelist='db203-12-3', id_resv=0, timelimit=5, time_submit=1556113191, time_eligible=1556113192, time_start=1556113192, mod_time=UNIX_TIMESTAMP(), job_name='Adjoint.sh', track_steps=0, id_qos=1, state=greatest(state, 3), priority=338, cpus_req=64, nodes_alloc=1, mem_req=9223372036854777856, id_array_job=2995949, id_array_task=1, pack_job_id=0, pack_job_offset=4294967294, account='default', `partition`='amd', node_inx='162', gres_req='', gres_alloc='', array_task_str=NULL, array_task_pending=0, tres_alloc='1=64,2=131072,4=1,5=64', tres_req='1=64,2=131072,4=1,5=64', work_dir='/scratch/dragon/amd/espinda/TestFWI' [2019-04-24T19:39:43.542] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program slurmdbd stopped running at this time. It was restarted 20 minutes later: [2019-04-24T20:17:03.888] slurmdbd version 18.08.6-2 started In slurmctld, the following slurmdbd errors were reported: [2019-04-24T16:18:35.849] Registering slurmctld at port 6917 with slurmdbd. [2019-04-24T16:54:54.818] error: slurmdbd: Getting response to message type 1472 [2019-04-24T16:54:54.819] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error [2019-04-24T17:09:54.823] error: slurmdbd: Getting response to message type 1407 [2019-04-24T17:24:54.855] error: slurmdbd: Getting response to message type 1472 [2019-04-24T17:24:54.857] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error [2019-04-24T17:39:48.419] slurmdbd: agent queue size 2000 [2019-04-24T19:39:49.210] error: slurmdbd: Sending message type 1407: 11: Connection refused [2019-04-24T19:39:57.000] error: slurmdbd: Sending PersistInit msg: Connection refused [2019-04-24T19:40:06.072] error: slurmdbd: Sending PersistInit msg: Connection refused [2019-04-24T19:40:16.041] error: slurmdbd: Sending PersistInit msg: Connection refused [2019-04-24T19:40:26.278] error: slurmdbd: Sending PersistInit msg: Connection refused [2019-04-24T19:40:28.117] error: slurmdbd: Sending PersistInit msg: Connection refused [2019-04-24T19:40:28.119] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused [2019-04-24T19:40:33.121] error: slurmdbd: Sending PersistInit msg: Connection refused [2019-04-24T19:40:33.122] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused [2019-04-24T19:40:38.000] error: slurmdbd: Sending PersistInit msg: Connection refused (with the last two repeated continuously 5 seconds until slurmdbd was restarted). Since the SlurmDBD restart a few anomlous things have happened (this ticket, and ticket https://bugs.schedmd.com/show_bug.cgi?id=6921 )
Sorry for the silence on this issue. We have seen this a couple other times, and we are currently evaluating options to fix the bug that causes this. Please let me know if it happens again, including some of the same details you did for the first issue (slurmdbd.log, slurmctld.log, timeline of admin actions like restart, modifying reservations, etc) Thanks
*** Bug 7157 has been marked as a duplicate of this bug. ***
(In reply to Greg Wickham from comment #6) > Ever since this event was emitted by slurmdbd things have been "funny". > > [2019-04-24T17:39:42.692] error: mysql_query failed: 1213 Deadlock found > when trying to get lock; try restarting transaction Greg, Can you please call this in mysql: > SHOW ENGINE INNODB STATUS; --Nate
Created attachment 10612 [details] mysql-output.txt.bz2 Output attached. -greg ________________________________ From: bugs@schedmd.com <bugs@schedmd.com> Sent: Monday, 3 June 2019 7:45 PM To: Greg Wickham Subject: [Bug 6922] Records for non-existing reservation in the logs Comment # 9<https://bugs.schedmd.com/show_bug.cgi?id=6922#c9> on bug 6922<https://bugs.schedmd.com/show_bug.cgi?id=6922> from Nate Rini<mailto:nate@schedmd.com> (In reply to Greg Wickham from comment #6<show_bug.cgi?id=6922#c6>) > Ever since this event was emitted by slurmdbd things have been "funny". > > [2019-04-24T17:39:42.692] error: mysql_query failed: 1213 Deadlock found > when trying to get lock; try restarting transaction Greg, Can you please call this in mysql: > SHOW ENGINE INNODB STATUS; --Nate ________________________________ You are receiving this mail because: * You are on the CC list for the bug.
(In reply to Greg Wickham from comment #11) > Created attachment 10612 [details] > mysql-output.txt.bz2 > > Output attached. Looks like mysql has been cycled since the deadlock and the deadlock information is not there anymore. Can you please make sure to call the command in comment #9 the next time there is a deadlock error?
Will do. -Greg -- On 14 Jun 2019, at 20:34, "bugs@schedmd.com<mailto:bugs@schedmd.com>" <bugs@schedmd.com<mailto:bugs@schedmd.com>> wrote: Comment # 12<https://bugs.schedmd.com/show_bug.cgi?id=6922#c12> on bug 6922<https://bugs.schedmd.com/show_bug.cgi?id=6922> from Nate Rini<mailto:nate@schedmd.com> (In reply to Greg Wickham from comment #11<show_bug.cgi?id=6922#c11>) > Created attachment 10612 [details]<attachment.cgi?id=10612> [details]<attachment.cgi?id=10612&action=edit> > mysql-output.txt.bz2 > > Output attached. Looks like mysql has been cycled since the deadlock and the deadlock information is not there anymore. Can you please make sure to call the command in comment #9<show_bug.cgi?id=6922#c9> the next time there is a deadlock error? ________________________________ You are receiving this mail because: * You are on the CC list for the bug.
(In reply to Greg Wickham from comment #13) > Will do. Greg, I'm going to close this ticket for now. If it happens again, please provide the output request and we can continue from there. Thanks, --Nate