Have not been able to figure out what's going wrong here. We are seeing repeated error messages in the slurmdbd.log:

[2020-01-12T13:52:05.312] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
[2020-01-12T13:52:15.135] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
[2020-01-12T13:52:25.950] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'
...
[2020-01-12T13:59:07.005] error: There is no reservation by id 14, time_start 1578385261, and cluster 'amareln'

(the same error repeats continuously, roughly every 10-20 seconds)

Seems like it might relate to these errors on the slurmctld associated with the amareln cluster:

[2020-01-08T04:14:47.446] error: slurmdbd: agent queue is full (20408), discarding DBD_NODE_STATE:1432 request

(this line appears five times with the same timestamp)

Any advice? Let me know what information you'd want me to collect.
Capture the output of these to send to me:

$ sacctmgr show reservations format=cluster,name,id,start,end,tres
$ scontrol show reservations
$ sdiag

Immediately after running those, run this in mysql/mariadb (replacing <cluster> with the cluster name):

mysql> insert into <cluster>_resv_table (id_resv, deleted, time_start, resv_name) values (14, 0, 1578385261, 'xxbadresv14xx');

Then watch sdiag to see if the dbd agent queue begins dropping, eventually to 0. If it doesn't, check the logs for another instance of the "There is no reservation by id" error; you will do the same thing to get that one to clear.

This is a known issue, with mitigation code in 19.05. There is a reservation-related RPC in the queue to the slurmdbd that references a reservation that doesn't exist. There is likely a reservation with the correct id, but reservations are identified by id and start time together. This is a serious problem because with that RPC stuck in the queue, no further RPCs can get through. The queue fills, and the slurmctld is forced to start rejecting new RPCs. Accounting data is lost, and jobs become runaway. To fix this without the built-in mitigation, we have to manually add a reservation entry to the database with the correct id and start time. Note that there could be more than one bad RPC in the queue.
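Incidentally, the time_start value in those errors is a Unix epoch timestamp, so it can be decoded to see which reservation the stuck RPC refers to and compared against the start times in the scontrol show reservations output. A quick sketch (any language works):

```python
from datetime import datetime, timezone

# time_start from the slurmdbd error messages (Unix epoch seconds)
time_start = 1578385261

# Decode to UTC to identify which reservation the stuck RPC refers to
print(datetime.fromtimestamp(time_start, tz=timezone.utc))
# 2020-01-07 08:21:01+00:00
```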
I just realized that part of what I wrote is confusing.

> This is a known issue with mitigation code...

The known issue is that the RPC queue can get into this state. We have mitigation code for this issue in 19.05 so that the RPC queue does not get blocked.
Is this bug new in 18.08, or did it exist in 17.11 as well? Thanks!

(replying by email to comment #5 on bug 8321 from Broderick Gardner <broderick@schedmd.com>)
It was in 17.11 as well. It is just fairly rare; rare enough that we have not yet tracked down the cause.
Let me know when you have cleared the errors; you will continue to lose data until they are fixed. Once this is fixed, we will also remove any bad stub reservation records that were added to the database and fix the runaway jobs. You can also increase the size of the DBD agent queue by increasing MaxJobCount in the slurm.conf. This gives more time before accounting data is lost should this happen again. Thanks
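For reference, that is a one-line slurm.conf change; the value below is only illustrative (the default MaxJobCount is 10000), so size it for your workload:

```
# slurm.conf fragment (illustrative value, not a recommendation)
# A larger MaxJobCount gives the slurmdbd agent queue more headroom
# before the slurmctld starts discarding accounting records.
MaxJobCount=50000
```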
Inserted that record about 15 minutes ago and the queue dropped to 0 in less than 10 seconds. Uploading the files now.
Created attachment 12720 [details] sacctmgr show reservations format=cluster,name,id,start,end,tres
Created attachment 12721 [details] scontrol show reservations
Created attachment 12722 [details] sdiag
Can I assume this also relates to the "*** RESTART SLURMDBD NOW ***" (or similar) messages that I sometimes see?
"*** RESTART SLURMDBD NOW ***" is logged when the slurmdbd agent queue is filling up, so yes. It starts printing this when the agent queue size reaches half the max. When it hits the max, it starts discarding step records and job start records from the queue because they aren't as important (job start records are partially redundant). It says to restart the slurmdbd because that would fix it if the problem were that the slurmdbd was hung. In the case at hand, the only fix is to add the reservation record.
Now that the queue is no longer blocked, you can fix the runaway jobs; please let me know how many runaway jobs are shown. Do you have any information on reservation operations performed recently? Please attach the slurmctld.log. I'm hoping to catch when the reservation RPCs were sent so I can determine what sequence caused this.
There were 333 runaway jobs. I'll attach the output.
Created attachment 12726 [details] sacctmgr show runawayjobs
We created a handful of reservations for our maintenance window, one of which appears to be the reservation in question. The logs for that would be in the slurmctld log on the affected cluster, right? If so, unfortunately we only have logs going back as far as 2019-12-31, and we know that the reservations were created on 2019-12-23. We have logs going back much farther on all of the other clusters, of course. During this maintenance window, we upgraded from Slurm 17.11.x to 18.08.x, if that's somehow helpful information.
Okay, the deleted logs are the ones I am looking for. Do you have a database backup from before the upgrade? I may be able to figure out what happened from there. If not, there isn't a lot more I can do, so if there are no more problems, I will close this ticket. Thanks
Yes, I believe we do! Let me look into that and I’ll get back to you.
Closing this ticket. If you ever do get around to locating a backup of the database, feel free to post it here anyway. I can make it private as needed. Thanks
I forgot about this and didn't realize there was something waiting on me. I'll have a look right now.