Ticket 8321 - Error appears in slurmdbd.log every minute, "There is no reservation by id 14"
Summary: Error appears in slurmdbd.log every minute, "There is no reservation by id 14"
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Broderick Gardner
Reported: 2020-01-12 12:04 MST by Ryan Novosielski
Modified: 2020-05-18 13:08 MDT
Site: Rutgers
Machine Name: amarel, amareln


Attachments
sacctmgr show reservations format=cluster,name,id,start,end,tres (82.56 KB, text/plain)
2020-01-13 11:33 MST, Ryan Novosielski
Details
scontrol show reservations (30 bytes, text/plain)
2020-01-13 11:34 MST, Ryan Novosielski
Details
sdiag (4.16 KB, text/plain)
2020-01-13 11:36 MST, Ryan Novosielski
Details
sacctmgr show runawayjobs (32.48 KB, text/plain)
2020-01-13 15:33 MST, Ryan Novosielski
Details

Description Ryan Novosielski 2020-01-12 12:04:29 MST
We have not been able to figure out what's going wrong here. We are seeing repeated error messages in slurmdbd.log:

[2020-01-12T13:52:05.312] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
[2020-01-12T13:52:15.135] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
[2020-01-12T13:52:25.950] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
[... the same error repeats every 10-20 seconds ...]
[2020-01-12T13:59:07.005] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
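
For reference, the time_start value in these messages is a Unix epoch timestamp and can be decoded with, e.g.:

$ date -u -d @1578385261
Tue Jan  7 08:21:01 UTC 2020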

It seems like it might relate to these errors on the slurmctld associated with the amareln cluster:

[2020-01-08T04:14:47.446] error: slurmdbd: agent queue is full (20408), discarding DBD_NODE_STATE:1432 request
[... the same error repeated several times in the same second ...]

Any advice? Let me know what information you'd want me to collect.
Comment 5 Broderick Gardner 2020-01-13 09:40:00 MST
Capture the output of these commands to send to me:
$ sacctmgr show reservations format=cluster,name,id,start,end,tres
and
$ scontrol show reservations
and
$ sdiag

Then, immediately after running those, run this in mysql/mariadb (replacing <cluster> with the cluster name):
mysql> insert into <cluster>_resv_table (id_resv, deleted, time_start, resv_name) values (14, 0, 1578385261, 'xxbadresv14xx');

Then watch sdiag to see if the dbd agent queue begins dropping, eventually to 0. If it doesn't, check the logs for another variant of the "There is no reservation by id" error; you will do the same thing to get that one to clear.
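
A convenient way to watch that (a sketch; the exact label in the sdiag output may vary slightly by version) is:

$ watch -n 10 'sdiag | grep -i "queue size"'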


This is a known issue with mitigation code in 19.05. There is a reservation-related RPC in the queue to the slurmdbd that references a reservation that doesn't exist. There is likely a reservation with the correct ID, but reservations are identified by their id and start time together.

This is a serious problem because with that RPC stuck in the queue, no further RPCs can get through. The queue fills, and the slurmctld is forced to start rejecting new RPCs. Accounting data is lost, and jobs become runaway. 

To fix this without the built-in mitigation, we have to manually add a reservation entry to the database with the correct id and start time. Note that there could be more than one bad RPC in the queue.
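
If you want to confirm the ID/start-time mismatch before inserting the stub, a query like this (illustrative, using only the columns from the insert above) will show what the database currently has for that ID:

mysql> select id_resv, time_start, resv_name, deleted from <cluster>_resv_table where id_resv = 14;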
Comment 6 Broderick Gardner 2020-01-13 09:42:25 MST
I just realized that part of what I wrote is confusing. 
>This is a known issue with mitigation code...

The known issue is that the RPC queue can get into this state. We have mitigation code for this issue in 19.05 so that the RPC queue does not get blocked.
Comment 7 Ryan Novosielski 2020-01-13 10:07:14 MST
Is this bug new in 18.08, or did it exist in 17.11 as well?

Thanks!

Comment 8 Broderick Gardner 2020-01-13 10:09:00 MST
It was in 17.11 as well. It is just fairly rare; rare enough that we have not yet tracked down the cause.
Comment 9 Broderick Gardner 2020-01-13 11:07:35 MST
Let me know when you have cleared the errors; you will continue to lose data until it is fixed. 

Once this is fixed, we will also remove any bad stub reservation records that were added to the database and fix runaway jobs. You can also increase the size of the DBD agent queue by increasing MaxJobCount in slurm.conf. This gives more time before accounting data is lost, should this happen again.
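
For the cleanup, removing the stub is a matter of matching on the sentinel name used earlier, e.g. (illustrative):

mysql> delete from <cluster>_resv_table where resv_name = 'xxbadresv14xx';

and the queue-size change is a single slurm.conf setting (the value below is only an example; pick one appropriate for your workload):

MaxJobCount=50000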

Thanks
Comment 10 Ryan Novosielski 2020-01-13 11:32:38 MST
Inserted that record about 15 minutes ago and the queue dropped to 0 in less than 10 seconds.

Uploading the files now.
Comment 11 Ryan Novosielski 2020-01-13 11:33:36 MST
Created attachment 12720 [details]
sacctmgr show reservations format=cluster,name,id,start,end,tres
Comment 12 Ryan Novosielski 2020-01-13 11:34:45 MST
Created attachment 12721 [details]
scontrol show reservations
Comment 13 Ryan Novosielski 2020-01-13 11:36:43 MST
Created attachment 12722 [details]
sdiag
Comment 14 Ryan Novosielski 2020-01-13 11:49:52 MST
Can I assume this also relates to *** RESTART SLURMDBD NOW *** or similar that I sometimes see?
Comment 15 Broderick Gardner 2020-01-13 13:31:08 MST
"*** RESTART SLURMDBD NOW ***" is logged when the slurmdbd agent queue is filling up, so yes. It starts printing this when the agent queue size reaches half the max. When it hits the max, it starts discarding step records and job start records from the queue because they aren't as important (job start records are partially redundant).

It says to restart the slurmdbd because that would fix it if the problem were that the slurmdbd was hung. In the case at hand, the only fix is to add the reservation record.
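
If you want to gauge how often this has been firing on your side, something like the following works (the log path is site-specific):

$ grep -c 'RESTART SLURMDBD NOW' /var/log/slurm/slurmctld.log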
Comment 16 Broderick Gardner 2020-01-13 13:40:29 MST
Now that the queue is no longer blocked, you can fix runaway jobs; please let me know how many runaway jobs are shown.
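
As a reminder, the same command that lists them will offer to fix them; it should prompt before making changes:

$ sacctmgr show runawayjobs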

Do you have any information on reservation operations being done recently? Please attach the slurmctld.log. I'm hoping to catch when the reservation RPCs were sent so I can determine what sequence caused this.
Comment 17 Ryan Novosielski 2020-01-13 15:32:48 MST
There were 333 runaway jobs. I'll attach the output.
Comment 18 Ryan Novosielski 2020-01-13 15:33:32 MST
Created attachment 12726 [details]
sacctmgr show runawayjobs
Comment 19 Ryan Novosielski 2020-01-13 15:41:47 MST
We created a handful of reservations for our maintenance window, one of which appears to be the reservation in question.

Where would the log for that be? In the slurmctld log on the affected cluster? If so, unfortunately we only have logs going back as far as 2019-12-31, and we know that the reservations were created on 2019-12-23. We appear to have logs going back much farther on all of the other clusters (of course).

During this maintenance window, we upgraded from SLURM 17.11.x to 18.08.x, if that's somehow helpful information.
Comment 20 Broderick Gardner 2020-01-27 10:08:26 MST
Okay, the deleted logs are the ones I am looking for. Do you have a database backup from before the upgrade? I may be able to figure out what happened from there. If not, there isn't a lot more I can do, so if there are no more problems, I will close this ticket.

Thanks
Comment 21 Ryan Novosielski 2020-01-27 10:16:15 MST
Yes, I believe we do! Let me look into that and I’ll get back to you.

Comment 22 Broderick Gardner 2020-03-31 09:28:47 MDT
Closing this ticket. 

If you ever do get around to locating a backup of the database, feel free to post it here anyway. I can make it private as needed.

Thanks
Comment 23 Ryan Novosielski 2020-03-31 10:11:46 MDT
I forgot about this and didn't realize there was something waiting on me. I'll have a look right now.