Bug 6922 - Records for non-existing reservation in the logs
Summary: Records for non-existing reservation in the logs
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 18.08.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-04-29 01:43 MDT by Ahmed Essam ElMazaty
Modified: 2019-10-18 11:06 MDT
CC List: 4 users

See Also:
Site: KAUST
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
mysql-output.txt.bz2 (2.03 KB, application/x-bzip2)
2019-06-14 11:15 MDT, Greg Wickham

Description Ahmed Essam ElMazaty 2019-04-29 01:43:47 MDT
Hello,

For the last few days we kept noticing such messages in our slurmdbd.log file 

"[2019-04-28T14:06:10.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon'
[2019-04-28T14:06:29.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon'
[2019-04-28T14:06:48.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon'
[2019-04-28T14:07:07.002] error: There is no reservation by id 80, time_start 1556370946, and cluster 'dragon'"

When I checked that reservation, I found it ended in 2017:

# sacctmgr show Reservation format=cluster,ID,name%50,TRES,TimeStart,TimeEnd,assoc%100 where id=80
   Cluster     ID                                               Name                 TRES           TimeStart             TimeEnd                                                                                               Assocs 
---------- ------ -------------------------------------------------- -------------------- ------------------- -------------------                                                ----------------------------------------------------- 
    dragon     80                                   GPU_BENCHMARKING               cpu=16 2017-12-10T12:55:09 2017-12-10T16:55:09                                                                                       2870,1607,2315

Best regards,
Ahmed
Comment 2 Broderick Gardner 2019-04-29 11:20:36 MDT
Please send the output of sdiag. This error message could indicate a blocked DBD agent queue, which would lead to lost job accounting records after a while.

How far back do these messages go? Could you send the slurmdbd log from a bit before these error messages started up until now?


This is caused by an RPC referencing an invalid reservation getting stuck in the dbd agent queue. Fixing it requires adding a temporary reservation record to the database.

Please send the output of
`scontrol show reservations`

and

`sacctmgr show reservations`

and from mysql
mysql> select * from dragon_resv_table\G

(\G prints vertical)

Assuming this is blocking your dbd agent queue, I will send the mysql commands to resolve this quickly.
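
For example, a quick way to keep an eye on the queue while you gather the above (a minimal sketch; the value is the "DBD Agent queue size" line that sdiag prints):

# re-check the slurmctld-side DBD agent queue every 30 seconds;
# a steadily growing number means accounting records are backing up behind the stuck RPC
watch -n 30 'sdiag | grep -i "DBD Agent queue size"'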

Thanks
Comment 3 Broderick Gardner 2019-04-29 15:34:18 MDT
We need to get this fixed asap; basically, we need to give the failed RPC a reservation record so it can complete. There is currently a record, as you saw with `sacctmgr`, but the start time is different from what the RPC is expecting. Run this query:
mysql> update dragon_resv_table set time_start=1556370946 where id_resv=80;

After running the query, make sure the DBD Agent queue size decreases.
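
To double-check the record afterwards, you can read the row back (a minimal sketch, using the same table and columns as in the query above):

mysql> -- confirm the start time now matches what the stuck RPC expects
mysql> select id_resv, time_start, time_end, deleted from dragon_resv_table where id_resv=80\G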


That time_start value comes from the log message you sent and corresponds to 2019-04-27 13:15:46 UTC. I would like to figure out where that time came from, since this reservation started and ended in 2017. I have seen this error before, but clearly the bug that causes it is still present. Do you have any information about reservation modifications on 27 April 2019?
The slurmctld logs from 26 and 27 April would help me see what happened.
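
If it helps while digging through the logs, the epoch value translates like this (a minimal sketch, GNU date syntax assumed):

# convert the time_start from the error message to a readable date
date -u -d @1556370946    # Sat Apr 27 13:15:46 UTC 2019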

Thanks
Comment 4 Greg Wickham 2019-04-29 23:52:17 MDT
MariaDB [slurm_acct_db]> select * from dragon_resv_table where id_resv= 80;
+---------+---------+----------------+-------+--------------+----------+------------------+------------+------------+------+-------------+
| id_resv | deleted | assoclist      | flags | nodelist     | node_inx | resv_name        | time_start | time_end   | tres | unused_wall |
+---------+---------+----------------+-------+--------------+----------+------------------+------------+------------+------+-------------+
|      80 |       0 | 2870,1607,2315 | 32768 | dgpu502-13-l | 656      | GPU_BENCHMARKING | 1512899709 | 1512914109 | 1=16 |       14400 |
+---------+---------+----------------+-------+--------------+----------+------------------+------------+------------+------+-------------+
1 row in set (0.00 sec)



MariaDB [slurm_acct_db]> update dragon_resv_table set time_start=1556370946 where id_resv=80;
Query OK, 1 row affected (0.04 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [slurm_acct_db]>
Comment 5 Greg Wickham 2019-04-29 23:55:39 MDT
Then this appeared in the slurmdbd log file:

[2019-04-30T08:52:01.003] 12(as_mysql_resv.c:435) query
update "dragon_resv_table" set time_end=1556370946 where deleted=0 && id_resv=80 and time_start=1556370946;insert into "dragon_resv_table" (id_resv, assoclist, flags, resv_name, nodelist, node_inx, time_end, time_start, tres) values (80, '5362,4787,4788', 49217, 'GPU_BENCHMARKING', 'gpu601-03', '642', 1587906946, 1556370946, '1=32') on duplicate key update deleted=0, assoclist='5362,4787,4788', flags=49217, resv_name='GPU_BENCHMARKING', nodelist='gpu601-03', node_inx='642', time_end=1587906946, time_start=1556370946, tres='1=32';

. . .

It took a couple of minutes, but "DBD Agent queue size" is now zero (0).

   -greg
Comment 6 Greg Wickham 2019-04-30 00:02:34 MDT
Ever since this event was emitted by slurmdbd things have been "funny".

[2019-04-24T17:39:42.692] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
update cluster_table set control_host='10.109.23.7', control_port=6917 where name='dragon';
[2019-04-24T17:39:42.698] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
insert into "dragon_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, gres_req, gres_alloc, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2995951, UNIX_TIMESTAMP(), 2995949, 2, 0, 4294967294, 5374, 1, 157920, 1157920, 'db208-12-1', 0, 5, 1556113192, 1556113191, 1556113192, 'Adjoint.sh', 0, 1, 338, 64, 1, 9223372036854777856, 'default', 'amd', '176', '', '', NULL, 0, '1=64,2=131072,3=18446744073709551614,4=1,5=64', '1=64,2=131072,4=1,5=64', '/scratch/dragon/amd/espinda/TestFWI') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=5374, id_user=157920, id_group=1157920, nodelist='db208-12-1', id_resv=0, timelimit=5, time_submit=1556113191, time_eligible=1556113192, time_start=1556113192, mod_time=UNIX_TIMESTAMP(), job_name='Adjoint.sh', track_steps=0, id_qos=1, state=greatest(state, 1), priority=338, cpus_req=64, nodes_alloc=1, mem_req=9223372036854777856, id_array_job=2995949, id_array_task=2, pack_job_id=0, pack_job_offset=4294967294, account='default', `partition`='amd', node_inx='176', gres_req='', gres_alloc='', array_task_str=NULL, array_task_pending=0, tres_alloc='1=64,2=131072,3=18446744073709551614,4=1,5=64', tres_req='1=64,2=131072,4=1,5=64', work_dir='/scratch/dragon/amd/espinda/TestFWI'
[2019-04-24T17:39:42.700] error: It looks like the storage has gone away trying to reconnect
[2019-04-24T17:39:47.737] 17(as_mysql_resv.c:243) query
insert into "dragon_resv_table" (id_resv, assoclist, flags, resv_name, nodelist, node_inx, time_end, time_start, tres) values (80, '5362,4787,4788', 49217, 'Rack_6.01_relocation-gpu601-03-2', 'gpu601-03', '642', 1587906946, 1556370946, '1=32') on duplicate key update deleted=0, assoclist='5362,4787,4788', flags=49217, resv_name='Rack_6.01_relocation-gpu601-03-2', nodelist='gpu601-03', node_inx='642', time_end=1587906946, time_start=1556370946, tres='1=32';
[2019-04-24T17:39:47.738] 17(as_mysql_resv.c:493) query
delete from "dragon_resv_table" where time_start > 1556115676 and id_resv=73 and time_start=1556111916;update "dragon_resv_table" set time_end=1556115676, deleted=1 where deleted=0 and id_resv=73 and time_start=1556111916;
[2019-04-24T19:39:43.542] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
insert into "dragon_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, gres_req, gres_alloc, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2995950, UNIX_TIMESTAMP(), 2995949, 1, 0, 4294967294, 5374, 1, 157920, 1157920, 'db203-12-3', 0, 5, 1556113192, 1556113191, 1556113192, 'Adjoint.sh', 0, 3, 338, 64, 1, 9223372036854777856, 'default', 'amd', '162', '', '', NULL, 0, '1=64,2=131072,4=1,5=64', '1=64,2=131072,4=1,5=64', '/scratch/dragon/amd/espinda/TestFWI') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=5374, id_user=157920, id_group=1157920, nodelist='db203-12-3', id_resv=0, timelimit=5, time_submit=1556113191, time_eligible=1556113192, time_start=1556113192, mod_time=UNIX_TIMESTAMP(), job_name='Adjoint.sh', track_steps=0, id_qos=1, state=greatest(state, 3), priority=338, cpus_req=64, nodes_alloc=1, mem_req=9223372036854777856, id_array_job=2995949, id_array_task=1, pack_job_id=0, pack_job_offset=4294967294, account='default', `partition`='amd', node_inx='162', gres_req='', gres_alloc='', array_task_str=NULL, array_task_pending=0, tres_alloc='1=64,2=131072,4=1,5=64', tres_req='1=64,2=131072,4=1,5=64', work_dir='/scratch/dragon/amd/espinda/TestFWI'
[2019-04-24T19:39:43.542] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program




slurmdbd stopped running at this time. It was restarted 20 minutes later:

[2019-04-24T20:17:03.888] slurmdbd version 18.08.6-2 started

In slurmctld, the following slurmdbd errors were reported:

[2019-04-24T16:18:35.849] Registering slurmctld at port 6917 with slurmdbd.
[2019-04-24T16:54:54.818] error: slurmdbd: Getting response to message type 1472
[2019-04-24T16:54:54.819] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
[2019-04-24T17:09:54.823] error: slurmdbd: Getting response to message type 1407
[2019-04-24T17:24:54.855] error: slurmdbd: Getting response to message type 1472
[2019-04-24T17:24:54.857] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
[2019-04-24T17:39:48.419] slurmdbd: agent queue size 2000
[2019-04-24T19:39:49.210] error: slurmdbd: Sending message type 1407: 11: Connection refused
[2019-04-24T19:39:57.000] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-04-24T19:40:06.072] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-04-24T19:40:16.041] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-04-24T19:40:26.278] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-04-24T19:40:28.117] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-04-24T19:40:28.119] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2019-04-24T19:40:33.121] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-04-24T19:40:33.122] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2019-04-24T19:40:38.000] error: slurmdbd: Sending PersistInit msg: Connection refused


(with the last two repeated every 5 seconds until slurmdbd was restarted).
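
A quick way to see how the controller-side queue grew over time is to grep the controller log for the queue-size messages (a sketch; the path below is an assumption, substitute your SlurmctldLogFile):

# show every "agent queue size" report, in order, to see when the backlog started
grep -i 'agent queue size' /var/log/slurm/slurmctld.log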



Since the SlurmDBD restart, a few anomalous things have happened (this ticket, and ticket https://bugs.schedmd.com/show_bug.cgi?id=6921).
Comment 7 Broderick Gardner 2019-05-31 09:39:03 MDT
Sorry for the silence on this issue. We have seen this a couple of other times, and we are currently evaluating options to fix the underlying bug.

Please let me know if it happens again, including the same kind of details you provided for the first occurrence (slurmdbd.log, slurmctld.log, a timeline of admin actions such as restarts, reservation modifications, etc.).
Thanks
Comment 8 Jason Booth 2019-05-31 10:56:45 MDT
*** Bug 7157 has been marked as a duplicate of this bug. ***
Comment 9 Nate Rini 2019-06-03 10:45:23 MDT
(In reply to Greg Wickham from comment #6)
> Ever since this event was emitted by slurmdbd things have been "funny".
> 
> [2019-04-24T17:39:42.692] error: mysql_query failed: 1213 Deadlock found
> when trying to get lock; try restarting transaction

Greg,

Can you please run this in mysql:
> SHOW ENGINE INNODB STATUS;

--Nate
Comment 11 Greg Wickham 2019-06-14 11:15:01 MDT
Created attachment 10612 [details]
mysql-output.txt.bz2

Output attached.


   -greg



Comment 12 Nate Rini 2019-06-14 11:34:03 MDT
(In reply to Greg Wickham from comment #11)
> Created attachment 10612 [details]
> mysql-output.txt.bz2
> 
> Output attached.

It looks like mysql has been restarted since the deadlock, so the deadlock information is no longer available. Can you please make sure to run the command in comment #9 the next time there is a deadlock error?
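
If it is easier to script, something like this should capture just the deadlock section to attach here the next time it happens (a sketch; it assumes the mysql client has the privileges for SHOW ENGINE INNODB STATUS and relies on the section headers in the InnoDB monitor output):

# save the "LATEST DETECTED DEADLOCK" section for this ticket
mysql -e 'SHOW ENGINE INNODB STATUS\G' | awk '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS/' > innodb-deadlock.txt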
Comment 13 Greg Wickham 2019-06-14 11:42:49 MDT
Will do.

   -Greg

Comment 14 Nate Rini 2019-10-18 11:06:56 MDT
(In reply to Greg Wickham from comment #13)
> Will do.

Greg,

I'm going to close this ticket for now. If it happens again, please provide the requested output and we can continue from there.

Thanks,
--Nate