I adjusted the EndTime on a reservation on a single compute node yesterday afternoon, and we've just noticed that no jobs have been going to the SLURM database since then, i.e. we have an accumulating list of runaway jobs. In the slurmdbd logs we see messages like the following:

[2017-05-17T14:54:01.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'
[2017-05-17T14:54:01.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'
[2017-05-17T14:54:11.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'

Here is info about the reservations:

$ scontrol show res
ReservationName=g_bordenstein_lab-RT37379 StartTime=2017-05-01T00:00:00 EndTime=2017-07-01T00:00:00 Duration=61-00:00:00
   Nodes=vmp[202-203] NodeCnt=2 CoreCnt=24 Features=(null) PartitionName=production Flags=
   TRES=cpu=48
   Users=(null) Accounts=g_bordenstein_lab Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=samtest StartTime=2017-05-17T14:43:24 EndTime=2017-08-25T14:43:24 Duration=100-00:00:00
   Nodes=vmp1242 NodeCnt=1 CoreCnt=12 Features=(null) PartitionName=production Flags=SPEC_NODES
   TRES=cpu=12
   Users=(null) Accounts=cms_samtest Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

It's the second one that we modified yesterday, and it seems to be creating the problem. I initially tried deleting the reservation and rapidly recreating it (no change). I then deleted the reservation and waited... no change. I also restarted slurmdbd, and still we see the same complaint about the reservation with id 377. Help?
> [2017-05-17T14:54:01.003] error: There is no reservation by id 377,
> time_start 1479843561, and cluster 'accre'
> [2017-05-17T14:54:01.003] error: There is no reservation by id 377,
> time_start 1479843561, and cluster 'accre'
> [2017-05-17T14:54:11.003] error: There is no reservation by id 377,
> time_start 1479843561, and cluster 'accre'

> ReservationName=samtest StartTime=2017-05-17T14:43:24
> EndTime=2017-08-25T14:43:24 Duration=100-00:00:00
> Nodes=vmp1242 NodeCnt=1 CoreCnt=12 Features=(null)
> PartitionName=production Flags=SPEC_NODES
> TRES=cpu=12
> Users=(null) Accounts=cms_samtest Licenses=(null) State=ACTIVE
> BurstBuffer=(null) Watts=n/a

Just to clarify... the reservation as shown above was a modified version from this afternoon. The original start time of the samtest reservation matched the epoch start time shown in the error message from the logs.
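As a sanity check, the time_start epoch from the error messages can be decoded directly; a small Python sketch (the timestamp is the one from the logs above, and note that scontrol prints local time, so the wall-clock reading may differ by the site's UTC offset):

```python
from datetime import datetime, timezone

# time_start reported by the slurmdbd error messages
ts = 1479843561

# SLURM stores reservation times as Unix epoch seconds; decode to UTC
start = datetime.fromtimestamp(ts, tz=timezone.utc)
print(start.isoformat())  # → 2016-11-22T19:39:21+00:00
```

A November 2016 start is at least consistent with the reservation having been created "over six months ago," as mentioned later in this ticket.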
The issue is that the (apparently invalid) update message from the slurmctld to the slurmdbd will be replayed continually until it succeeds. Until that happens, all job history is being queued by the slurmctld, which can lead to increased memory consumption there until this gets resolved. If not fixed soon, this can eventually lead to the loss of some accounting info.

I'm bumping the priority up a notch, and will get you some additional info shortly.

- Tim
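This is not SLURM code, but the behavior described above — the DBD agent retrying the failing head-of-queue message while everything behind it accumulates — can be sketched as a toy FIFO with head-of-line blocking (the message names are made up for illustration):

```python
from collections import deque

def drain(queue, send):
    """Send messages in order; retry the head until it succeeds.

    Nothing behind a failing head message can be sent, which is roughly
    why job records pile up behind the bad reservation update.
    """
    while queue:
        if send(queue[0]):
            queue.popleft()
        else:
            break  # head-of-line blocking: same message is retried later
    return queue

# The invalid reservation update sits at the head of the agent queue;
# job records queued after it cannot be sent until it succeeds.
q = deque(["resv_update_377", "job_start_A", "job_start_B"])
failing = {"resv_update_377"}

drain(q, lambda msg: msg not in failing)
print(list(q))  # → ['resv_update_377', 'job_start_A', 'job_start_B']

# Once the missing row exists in the database, the head message succeeds
# and the whole backlog flushes.
drain(q, lambda msg: True)
print(list(q))  # → []
```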
... this has bitten you before. I think we've fixed a few more potential paths into this issue on 17.02, but I don't want you to upgrade the slurmdbd in haste.

Comment 11 on bug 2741 walks you through creating the missing entry in the MySQL database directly - if you fix the values to match the ones at fault here, that should get you past this issue and get accounting values moving forward again, which I think is the most important bit.

Troubleshooting how this happened, and/or verifying that some fixes on 17.02 would prevent this longer term, can wait until your accounting data is safe.

https://bugs.schedmd.com/show_bug.cgi?id=2741#c11

If you can create that entry things should start updating immediately - please let me know if that's successful, or if this shifts to a new error.

- Tim
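A hedged sketch of what that entry might look like for this particular case, following the pattern used later in this ticket. The table name (assumed to be accre_resv_table, following SLURM's <cluster>_resv_table convention) and the column list are assumptions that should be checked against the actual schema before running anything:

```sql
-- Hypothetical: recreate the reservation row slurmdbd is looking for.
-- id_resv and time_start come from the error message; the other values
-- mirror the samtest reservation. Verify the table and column names
-- (e.g. via DESCRIBE accre_resv_table) before executing.
INSERT INTO accre_resv_table
    (id_resv, deleted, time_start, resv_name, assoclist, nodelist, node_inx, tres)
VALUES
    (377, 0, 1479843561, 'samtest', '', 'vmp1242', 0, 0);
```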
(In reply to Tim Wickberg from comment #3)
> ... this has bit you before. I think we've fixed a few more potential paths
> into this issue on 17.02, but I don't want you to upgrade the slurmdbd in
> haste.
>
> Comment 11 on bug 2741 walks you through creating the missing entry in the
> MySQL database directly - if you fix the values to match the ones at fault
> here that should get you past this issue and get accounting values moving
> forward again which I think is the most important bit.
>
> Troubleshooting how this happened, and/or verifying some fixes on 17.02
> would prevent this longer term, can wait until your accounting data is safe.
>
> https://bugs.schedmd.com/show_bug.cgi?id=2741#c11
>
> If you can create that entry things should start updating immediately -
> please let me know if that's successful, or if this shifts to a new error.
>
> - Tim

Hey Tim,

I had some vague recollection of this happening before, thanks for the quick fix. After inserting the dummy reservation in the database we immediately saw messages like these in the slurmdbd logs:

[2017-05-17T16:28:49.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'
[2017-05-17T16:29:01.633] debug: Need to reroll usage from Tue May 16 15:24:59 2017 Job 14648594 from accre started then and we are just now hearing about it.
[2017-05-17T16:29:48.893] debug: as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue
[2017-05-17T16:29:48.894] debug: as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue
[2017-05-17T16:29:48.894] debug: as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue
[2017-05-17T16:29:49.352] debug: as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue

which I think is a good sign, and running "sacctmgr show runawayjobs" confirms that all the runaways since yesterday have made it to the database.
Glad that worked. Moving back down to Sev3 - I'll try to sort out what could have caused this, but I do recall we'd fixed a few related issues in 17.02 that may prevent this in the future.

I don't know if you have a plan to upgrade at some point soon, but the 17.02 release has proven to be quite stable thus far.
(In reply to Tim Wickberg from comment #5)
> Glad that worked. Moving back down to Sev3 - I'll try to sort out what could
> have caused this, but I do recall we'd fixed a few related issues in 17.02
> that may prevent this in the future.
>
> I don't know if you have a plan to upgrade at some point soon, but the 17.02
> release has proven to be quite stable thus far.

Our current plan is to keep our CentOS 6 hardware on 16.05.8 and stand up a separate instance of SLURM 17.02 for hardware that is upgraded to CentOS 7. We may reconsider, but for now that's the plan.
Do you happen to have the slurmctld and slurmdbd logs from the time that reservation was created? There might be a hint as to what happened there that would narrow down the cause.
(In reply to Tim Wickberg from comment #8)
> Do you happen to have the slurmctld and slurmdbd logs from the time that
> reservation was created? There might be a hint as to what happened there
> that would narrow down the cause.

Originally created? Unfortunately we do not, since it was over six months ago.
Do you know roughly what version of the slurmctld you would have been running at that point?

I'm guessing this may be related to a fix made in 15.08.9 by commit 5c43d754fb016a. If the reservation was made by a slurmctld older than that, it would match up with that fix and explain how it got into this state.
(In reply to Tim Wickberg from comment #10)
> Do you know roughly what version of the slurmctld you would have been
> running at that point?
>
> I'm guessing this may be related to a fix made in 15.08.9 by commit
> 5c43d754fb016a. If the reservation was made by a slurmctld older than that
> it would match up to that fix, and explain how it got into this state.

We upgraded to 16.05.4 on Sept 1, 2016. Prior to that we were running 15.08.11 for a few months, so I'm fairly certain that the reservation was created after the fix in 15.08.9.
(In reply to Will French from comment #11)
> (In reply to Tim Wickberg from comment #10)
> > Do you know roughly what version of the slurmctld you would have been
> > running at that point?
> >
> > I'm guessing this may be related to a fix made in 15.08.9 by commit
> > 5c43d754fb016a. If the reservation was made by a slurmctld older than that
> > it would match up to that fix, and explain how it got into this state.
>
> We upgraded to 16.05.4 on Sept 1, 2016. Prior to that we were running
> 15.08.11 for a few months, so I'm fairly certain that the reservation was
> created after the fix in 15.08.9.

Hmm. Without the logs from that time period I don't have much to go on here, and I doubt I'd be able to reproduce this easily.

If it's okay with you I'll mark this as resolved/infogiven, as I don't think it's likely to recur.

- Tim
(In reply to Tim Wickberg from comment #12)
> (In reply to Will French from comment #11)
> > (In reply to Tim Wickberg from comment #10)
> > > Do you know roughly what version of the slurmctld you would have been
> > > running at that point?
> > >
> > > I'm guessing this may be related to a fix made in 15.08.9 by commit
> > > 5c43d754fb016a. If the reservation was made by a slurmctld older than
> > > that it would match up to that fix, and explain how it got into this
> > > state.
> >
> > We upgraded to 16.05.4 on Sept 1, 2016. Prior to that we were running
> > 15.08.11 for a few months, so I'm fairly certain that the reservation was
> > created after the fix in 15.08.9.
>
> Hmm. Without the logs from that time period I don't have much to go on here,
> and doubt I'd be able to reproduce this easily.
>
> If it's okay with you I'll mark this as resolved/infogiven, as I don't think
> it's likely to recur.
>
> - Tim

That's fine, Tim. I understand there's only so much you can do with the info I've provided. Thanks for your help!
> That's fine, Tim. I understand there's only so much you can do with the info
> I've provided. Thanks for your help!

Certainly. While I'd much rather have a conclusive answer, I don't want to waste your time chasing after a phantom issue like this.

Please re-open if this does recur.

- Tim
We called:

insert into geyser_caldera_resv_table
    (id_resv, deleted, time_start, resv_name, assoclist, nodelist, node_inx, tres)
values
    (30, 0, 1529964274, 'xxxbugxxx', '', 'anemone', 0, 0);
The suggested fix did not work:

DBD Agent queue size: 87183
Updated wrong ticket.
Putting back to resolved/infogiven. Apologies for the noise.