Bug 3819 - There is no reservation by id XXX, job info not making it to database
Summary: There is no reservation by id XXX, job info not making it to database
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 16.05.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
Reported: 2017-05-17 14:02 MDT by Will French
Modified: 2020-01-13 10:11 MST

Site: Vanderbilt


Description Will French 2017-05-17 14:02:05 MDT
I adjusted the EndTime on a reservation on a single compute node yesterday afternoon, and we've just noticed that no jobs have been going to the SLURM database since then, i.e. we have an accumulating list of runaway jobs.

In the SLURM database logs we see messages like the following:

[2017-05-17T14:54:01.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'
[2017-05-17T14:54:01.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'
[2017-05-17T14:54:11.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'

Here is info about the reservations:

$ scontrol show res
ReservationName=g_bordenstein_lab-RT37379 StartTime=2017-05-01T00:00:00 EndTime=2017-07-01T00:00:00 Duration=61-00:00:00
   Nodes=vmp[202-203] NodeCnt=2 CoreCnt=24 Features=(null) PartitionName=production Flags=
   TRES=cpu=48
   Users=(null) Accounts=g_bordenstein_lab Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=samtest StartTime=2017-05-17T14:43:24 EndTime=2017-08-25T14:43:24 Duration=100-00:00:00
   Nodes=vmp1242 NodeCnt=1 CoreCnt=12 Features=(null) PartitionName=production Flags=SPEC_NODES
   TRES=cpu=12
   Users=(null) Accounts=cms_samtest Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

It's the second one that we modified yesterday, and it seems to be creating the problem. I initially tried deleting the reservation and quickly recreating it (no change). I then deleted the reservation and waited; still no change. I also restarted slurmdbd, and we still see the same complaint about the reservation with id 377.

Help?
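
For reference, reservation changes like the ones described above are typically made with scontrol. A minimal sketch, using the reservation name and values from the output above; these are illustrative invocations, not necessarily the exact commands that were run:

$ scontrol update ReservationName=samtest EndTime=2017-08-25T14:43:24
$ scontrol delete ReservationName=samtest
$ scontrol create reservation ReservationName=samtest StartTime=now Duration=100-00:00:00 Nodes=vmp1242 Accounts=cms_samtest Flags=SPEC_NODES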
Comment 1 Will French 2017-05-17 14:47:11 MDT
> [2017-05-17T14:54:01.003] error: There is no reservation by id 377,
> time_start 1479843561, and cluster 'accre'
> [2017-05-17T14:54:01.003] error: There is no reservation by id 377,
> time_start 1479843561, and cluster 'accre'
> [2017-05-17T14:54:11.003] error: There is no reservation by id 377,
> time_start 1479843561, and cluster 'accre'

> ReservationName=samtest StartTime=2017-05-17T14:43:24
> EndTime=2017-08-25T14:43:24 Duration=100-00:00:00
>    Nodes=vmp1242 NodeCnt=1 CoreCnt=12 Features=(null)
> PartitionName=production Flags=SPEC_NODES
>    TRES=cpu=12
>    Users=(null) Accounts=cms_samtest Licenses=(null) State=ACTIVE
> BurstBuffer=(null) Watts=n/a


Just to clarify: the reservation shown above was the modified version from this afternoon. The original start time of the samtest reservation matched the epoch start time shown in the error message from the logs.
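
For reference, the epoch value in the error message can be decoded with GNU date and compared against the reservation's original StartTime; a quick sketch using the timestamp from the log lines above:

$ date -u -d @1479843561
Tue Nov 22 19:39:21 UTC 2016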
Comment 2 Tim Wickberg 2017-05-17 15:00:45 MDT
The issue is that the (apparently invalid) update message from the slurmctld to the slurmdbd will be continually replayed until it succeeds. Until that happens, all job history is being queued by slurmctld, which can lead to increased memory consumption on that host until this gets resolved.

If not fixed soon, this can eventually lead to loss of some accounting info. I'm bumping the priority up a notch, and will get you some additional info shortly.

- Tim
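
For reference, one way to watch how much accounting traffic is piling up while the slurmdbd keeps rejecting the update is the DBD agent queue counter reported by sdiag; a minimal sketch:

$ sdiag | grep 'DBD Agent'     # reports "DBD Agent queue size"; it grows while records are queued in slurmctld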
Comment 3 Tim Wickberg 2017-05-17 15:10:44 MDT
... this has bitten you before. I think we've fixed a few more potential paths into this issue in 17.02, but I don't want you to upgrade the slurmdbd in haste.

Comment 11 on bug 2741 walks you through creating the missing entry in the MySQL database directly - if you adjust the values to match the ones at fault here, that should get you past this issue and get accounting data moving forward again, which I think is the most important bit.

Troubleshooting how this happened, and/or verifying whether some fixes in 17.02 would prevent this longer term, can wait until your accounting data is safe.

https://bugs.schedmd.com/show_bug.cgi?id=2741#c11

If you can create that entry things should start updating immediately - please let me know if that's successful, or if this shifts to a new error.

- Tim
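
For reference, the workaround in bug 2741 comment 11 boils down to inserting a placeholder row into the cluster's reservation table so the queued update can finally be applied. A minimal sketch adapted to the id, start time, and cluster from the error above; the column list is modeled on the statement NCAR posted later in this ticket (comment 15), the database name is assumed to be the default slurm_acct_db, and the schema should be verified against your own slurmdbd before running anything like this:

$ mysql slurm_acct_db -e "insert into accre_resv_table (id_resv, deleted, time_start, resv_name, assoclist, nodelist, node_inx, tres) values (377, 0, 1479843561, 'placeholder', '', '', 0, 0);"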
Comment 4 Will French 2017-05-17 15:41:46 MDT
(In reply to Tim Wickberg from comment #3)
> ... this has bitten you before. I think we've fixed a few more potential
> paths into this issue in 17.02, but I don't want you to upgrade the
> slurmdbd in haste.
> 
> Comment 11 on bug 2741 walks you through creating the missing entry in the
> MySQL database directly - if you adjust the values to match the ones at
> fault here, that should get you past this issue and get accounting data
> moving forward again, which I think is the most important bit.
> 
> Troubleshooting how this happened, and/or verifying whether some fixes in
> 17.02 would prevent this longer term, can wait until your accounting data
> is safe.
> 
> https://bugs.schedmd.com/show_bug.cgi?id=2741#c11
> 
> If you can create that entry things should start updating immediately -
> please let me know if that's successful, or if this shifts to a new error.
> 
> - Tim

Hey Tim,

I had some vague recollection of this happening before; thanks for the quick fix. After inserting the dummy reservation into the database, we immediately saw messages like these in the slurmdbd logs:

[2017-05-17T16:28:49.003] error: There is no reservation by id 377, time_start 1479843561, and cluster 'accre'
[2017-05-17T16:29:01.633] debug:  Need to reroll usage from Tue May 16 15:24:59 2017 Job 14648594 from accre started then and we are just now hearing about it.
[2017-05-17T16:29:48.893] debug:  as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue
[2017-05-17T16:29:48.894] debug:  as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue
[2017-05-17T16:29:48.894] debug:  as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue
[2017-05-17T16:29:49.352] debug:  as_mysql_node_down: no change needed 546 == 546 and fixing reservation issue == fixing reservation issue

That seems like a good sign, and running "sacctmgr show runawayjobs" confirms that all the runaways since yesterday have now made it to the database.
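
For anyone following along later, the runaway-job check mentioned above looks like this:

$ sacctmgr show runawayjobs     # lists jobs the database still thinks are running/pending but the controller no longer knows about, and offers to fix them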
Comment 5 Tim Wickberg 2017-05-17 15:58:16 MDT
Glad that worked. Moving back down to Sev3 - I'll try to sort out what could have caused this, but I do recall we'd fixed a few related issues in 17.02 that may prevent this in the future.

I don't know if you have a plan to upgrade at some point soon, but the 17.02 release has proven to be quite stable thus far.
Comment 7 Will French 2017-05-17 16:16:20 MDT
(In reply to Tim Wickberg from comment #5)
> Glad that worked. Moving back down to Sev3 - I'll try to sort out what could
> have caused this, but I do recall we'd fixed a few related issues in 17.02
> that may prevent this in the future.
> 
> I don't know if you have a plan to upgrade at some point soon, but the 17.02
> release has proven to be quite stable thus far.

Our current plan is to keep our CentOS 6 hardware on 16.05.8 and stand up a separate instance of SLURM 17.02 for hardware that is upgraded to CentOS 7. We may reconsider, but for now that's the plan.
Comment 8 Tim Wickberg 2017-05-17 16:32:52 MDT
Do you happen to have the slurmctld and slurmdbd logs from the time that reservation was created? There might be a hint as to what happened there that would narrow down the cause.
Comment 9 Will French 2017-05-17 16:47:24 MDT
(In reply to Tim Wickberg from comment #8)
> Do you happen to have the slurmctld and slurmdbd logs from the time that
> reservation was created? There might be a hint as to what happened there
> that would narrow down the cause.

Originally created? Unfortunately we do not, since it was over six months ago.
Comment 10 Tim Wickberg 2017-05-17 16:56:26 MDT
Do you know roughly what version of the slurmctld you would have been running at that point?

I'm guessing this may be related to a fix made in 15.08.9 by commit 5c43d754fb016a. If the reservation was made by a slurmctld older than that it would match up to that fix, and explain how it got into this state.
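
For reference, one way to check whether a given release contains that fix is to ask git which release tags include the commit; a small sketch against a Slurm source checkout, using the hash from the comment above:

$ git tag --contains 5c43d754fb016a     # release tags listed here include the fix; 15.08.9 and later should appear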
Comment 11 Will French 2017-05-18 15:02:57 MDT
(In reply to Tim Wickberg from comment #10)
> Do you know roughly what version of the slurmctld you would have been
> running at that point?
> 
> I'm guessing this may be related to a fix made in 15.08.9 by commit
> 5c43d754fb016a. If the reservation was made by a slurmctld older than that
> it would match up to that fix, and explain how it got into this state.

We upgraded to 16.05.4 on Sept 1, 2016. Prior to that we were running 15.08.11 for a few months, so I'm fairly certain that the reservation was created after the fix in 15.08.9.
Comment 12 Tim Wickberg 2017-05-22 15:36:55 MDT
(In reply to Will French from comment #11)
> (In reply to Tim Wickberg from comment #10)
> > Do you know roughly what version of the slurmctld you would have been
> > running at that point?
> > 
> > I'm guessing this may be related to a fix made in 15.08.9 by commit
> > 5c43d754fb016a. If the reservation was made by a slurmctld older than that
> > it would match up to that fix, and explain how it got into this state.
> 
> We upgraded to 16.05.4 on Sept 1, 2016. Prior to that we were running
> 15.08.11 for a few months, so I'm fairly certain that the reservation was
> created after the fix in 15.08.9.

Hmm. Without the logs from that time period I don't have much to go on here, and doubt I'd be able to reproduce this easily.

If it's okay with you I'll mark this as resolved/infogiven, as I don't think it's likely to recur.

- Tim
Comment 13 Will French 2017-05-22 15:41:48 MDT
(In reply to Tim Wickberg from comment #12)
> (In reply to Will French from comment #11)
> > (In reply to Tim Wickberg from comment #10)
> > > Do you know roughly what version of the slurmctld you would have been
> > > running at that point?
> > > 
> > > I'm guessing this may be related to a fix made in 15.08.9 by commit
> > > 5c43d754fb016a. If the reservation was made by a slurmctld older than that
> > > it would match up to that fix, and explain how it got into this state.
> > 
> > We upgraded to 16.05.4 on Sept 1, 2016. Prior to that we were running
> > 15.08.11 for a few months, so I'm fairly certain that the reservation was
> > created after the fix in 15.08.9.
> 
> Hmm. Without the logs from that time period I don't have much to go on here,
> and doubt I'd be able to reproduce this easily.
> 
> If it's okay with you I'll mark this as resolved/infogiven, as I don't think
> it's likely to recur.
> 
> - Tim

That's fine, Tim. I understand there's only so much you can do with the info I've provided. Thanks for your help!
Comment 14 Tim Wickberg 2017-05-22 15:44:17 MDT
> That's fine, Tim. I understand there's only so much you can do with the info
> I've provided. Thanks for your help!

Certainly. While I'd much rather have a conclusive answer, I don't want to waste your time chasing after a phantom issue like this. Please re-open if this does recur.

- Tim
Comment 15 NCAR SSG 2018-06-25 16:22:49 MDT
We called: 
insert into geyser_caldera_resv_table (id_resv, deleted, time_start, resv_name, assoclist, nodelist, node_inx, tres) values (30, 0, 1529964274, 'xxxbugxxx', '', 'anemone', 0, 0);
Comment 16 NCAR SSG 2018-06-25 16:25:40 MDT
The suggested fix did not work:

DBD Agent queue size: 87183
Comment 17 NCAR SSG 2018-06-25 16:43:01 MDT
Updated wrong ticket.
Comment 18 Tim Wickberg 2018-06-27 11:34:29 MDT
Putting back to resolved/infogiven. Apologies for the noise.