Ticket 109 - Possible enhancements to prevent accounting record loss when slurmdbd is not running.
Summary: Possible enhancements to prevent accounting record loss when slurmdbd is not running
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 2.5.x
Hardware: Linux
Priority: ---
Severity: 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2012-08-10 06:33 MDT by Don Albert
Modified: 2019-11-05 10:51 MST
CC List: 3 users

See Also:
Site: CEA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02.0-pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Don Albert 2012-08-10 06:33:50 MDT
When the slurmdbd daemon is unavailable, the slurmctld will cache job
and step accounting records. The number of records that can be cached
is the larger of MAX_AGENT_QUEUE (currently defined as 10000) and
twice the slurm.conf parameter MaxJobCount.  The code that sets this
is in module "slurmdbd_defs.c":

    /* Whatever our max job count is times that by 2 or
     * MAX_AGENT_QUEUE which ever is bigger */
    if (!max_agent_queue)
	    max_agent_queue =
		    MAX(MAX_AGENT_QUEUE, slurmctld_conf.max_job_cnt * 2);
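
For a sense of scale: assuming the default MaxJobCount of 10000, the cap
works out to 20000 queued messages, and job starts, job/step completions,
and node events all share that budget.  A small standalone illustration
(the per-job message count is a guess, not a measurement from the cluster):

    #include <stdio.h>

    #define MAX_AGENT_QUEUE 10000          /* as in slurmdbd_defs.c */
    #define MAX(a,b) (((a) > (b)) ? (a) : (b))

    int main(void)
    {
            /* Illustrative values only. */
            int max_job_cnt  = 10000;      /* slurm.conf MaxJobCount */
            int msgs_per_job = 4;          /* start + complete + steps (guess) */

            int cap = MAX(MAX_AGENT_QUEUE, max_job_cnt * 2);    /* 20000 */

            /* Rough number of jobs whose records fit before discarding
             * starts, ignoring node events sharing the same queue. */
            printf("cap = %d messages, roughly %d jobs\n",
                   cap, cap / msgs_per_job);
            return 0;
    }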

The Tera100 cluster at CEA has seen the problem that the number of
records that can be cached is too small to hold more than a few
minutes' worth of job records when slurmdbd is not running.  In their
experience, the workload is such that only about 10 minutes' worth of
records can be preserved before slurmctld starts discarding them.

They are suggesting a couple of changes:

  1. a tuning parameter, other than MaxJobCount, that could increase
     the caching of job and step records to avoid losing them when
     slurmdbd is not available (a minimal sketch of such a parameter
     follows this list),

and/or,

  2. instead of discarding records when the cache is full, providing a
     way to write the records to a flat file and retrieve them when
     slurmdbd comes back up, so they can be sent to slurmdbd.
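
As a rough illustration of suggestion 1 (the parameter name and struct
field are hypothetical, this is not actual Slurm code), the override
could sit directly on top of the existing sizing rule:

    /* Hypothetical sketch only: a dedicated slurm.conf parameter
     * (here called MaxDBDMsgCount) raises the ceiling independently
     * of MaxJobCount; otherwise fall back to the current behavior. */
    if (!max_agent_queue) {
            if (slurmctld_conf.max_dbd_msg_cnt)      /* assumed new field */
                    max_agent_queue = slurmctld_conf.max_dbd_msg_cnt;
            else
                    max_agent_queue =
                            MAX(MAX_AGENT_QUEUE,
                                slurmctld_conf.max_job_cnt * 2);
    }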

What do you think about these possible enhancements?

  -Don Albert-
Comment 1 Danny Auble 2012-08-10 07:41:28 MDT
(In reply to comment #0)
> When the slurmdbd daemon is unavailable, the slurmctld will cache job
> and step accounting records. The number of records that can be cached
> is at least MAX_AGENT_QUEUE (currently defined as 10000) or a number
> based on the slurm.conf parameter MaxJobCount.  The code to set this
> is in module "slurmdbd_defs.c":
> 
>     /* Whatever our max job count is times that by 2 or
>      * MAX_AGENT_QUEUE which ever is bigger */
>     if (!max_agent_queue)
> 	    max_agent_queue =
> 		    MAX(MAX_AGENT_QUEUE, slurmctld_conf.max_job_cnt * 2);
> 
> The Tera100 cluster at CEA has seen the problem that the number of
> records that can be cached is too small to keep more than a few
> minutes worth of job records when slurmdbd is not running.  In their
> experience, their workload is such that only about 10 minutes worth of
> records can be preserved before slurmctld starts to discard them.
> 
> They are suggesting a couple of changes:
> 
>   1. a tuning parameter, other than MaxJobCount, that could increase
>      the caching of job and step records to avoid loss when slurmdbd
>      is not available, 
> 

The main reason this limit exists is to prevent running out of memory (same with MaxJobCount).  But it isn't a horrible idea.  We felt at the time that using MaxJobCount as the factor would be sufficient, but if something else is needed to make it bigger, that would probably be fine.

> and/or,
> 
>   2. instead of discarding records when the cache is full, providing a
>      way to write the records to a flat file and retrieve them when
>      slurmdbd comes back up, so they can be sent to slurmdbd.

This may be more cumbersome, simply because by that point the file would be getting very big and would be continually changing.

Since the old data is currently just discarded, one idea would be, if this scenario happened, to flush the list to a file, appending to whatever is already there instead of replacing the existing file as happens when slurmctld ends.  Then, when slurmdbd comes back up, open the file and process it before the existing list.  It seems like a lot of overhead for something like this, but it is the safest option.
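
A minimal sketch of that append-then-replay flow (the file path, the
length-prefixed format, and both function names are invented for
illustration; the real agent queue holds packed buffers):

    #include <stdio.h>

    /* Invented spill-file location. */
    static const char *spill_path = "/var/spool/slurmctld/dbd.spill";

    /* Append one packed record; never truncate the existing file. */
    static int spill_record(const char *packed_rec, size_t len)
    {
            FILE *fp = fopen(spill_path, "ab");
            if (!fp)
                    return -1;
            /* Length-prefix each record so it can be read back intact. */
            fwrite(&len, sizeof(len), 1, fp);
            fwrite(packed_rec, 1, len, fp);
            fclose(fp);
            return 0;
    }

    /* On reconnect, drain the spill file before the in-memory list. */
    static void replay_spill(void (*send_to_dbd)(const char *, size_t))
    {
            FILE *fp = fopen(spill_path, "rb");
            size_t len;
            char buf[64 * 1024];

            if (!fp)
                    return;                 /* nothing was spilled */
            while (fread(&len, sizeof(len), 1, fp) == 1) {
                    if (len > sizeof(buf) || fread(buf, 1, len, fp) != len)
                            break;          /* truncated or corrupt tail */
                    send_to_dbd(buf, len);
            }
            fclose(fp);
    }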

I was unaware Tera100 ran so many short-lived jobs.

So with this said, I don't have strong feelings one way or the other.  The latter is the safest but much more complicated to write and to verify; the former is easy but only raises the ceiling.  What do you think?

> 
> What do you think about these possible enhancements?
> 
>   -Don Albert-
Comment 2 Danny Auble 2012-09-17 08:02:08 MDT
Don, ping.
Comment 3 Don Albert 2012-09-18 05:13:58 MDT
Danny,

Yiannis asked me to enter this enhancement request in response to a
bug report from CEA. He has been on holidays for the past couple of
weeks, and we have not had a chance to discuss this in our meetings.

I gather from your initial response that you (SchedMD) would be
amenable to some changes/enhancements in this area, but have no plans
to undertake them yourselves.

I will raise this at our (Phoenix) next meeting and see how we want to
proceed.

  -Don-
Comment 4 Puenlap Lee 2013-02-05 03:09:35 MST
Danny,
   Have you worked out what to do about this problem yet?  Are you waiting for more information from us, or ...?
   Don retired last November, and I have now been assigned to keep track of this problem.
    Puenlap.
Comment 5 Matthieu Hautreux 2013-02-05 03:40:10 MST
(In reply to comment #1)
> (In reply to comment #0)
> > When the slurmdbd daemon is unavailable, the slurmctld will cache job
> > and step accounting records. The number of records that can be cached
> > is at least MAX_AGENT_QUEUE (currently defined as 10000) or a number
> > based on the slurm.conf parameter MaxJobCount.  The code to set this
> > is in module "slurmdbd_defs.c":
> > 
> >     /* Whatever our max job count is times that by 2 or
> >      * MAX_AGENT_QUEUE which ever is bigger */
> >     if (!max_agent_queue)
> > 	    max_agent_queue =
> > 		    MAX(MAX_AGENT_QUEUE, slurmctld_conf.max_job_cnt * 2);
> > 
> > The Tera100 cluster at CEA has seen the problem that the number of
> > records that can be cached is too small to keep more than a few
> > minutes worth of job records when slurmdbd is not running.  In their
> > experience, their workload is such that only about 10 minutes worth of
> > records can be preserved before slurmctld starts to discard them.
> > 
> > They are suggesting a couple of changes:
> > 
> >   1. a tuning parameter, other than MaxJobCount, that could increase
> >      the caching of job and step records to avoid loss when slurmdbd
> >      is not available, 
> > 
> 
> The main reason this exists is to prevent running out of memory.  (Same with
> MaxJobCount).  But it isn't a horrible idea.  We felt at the time using
> MaxJobCount as the factor would be sufficient, but if something else was
> needed to make it bigger then that would probably be fine.
> 
> > and/or,
> > 
> >   2. instead of discarding records when the cache is full, providing a
> >      way to write the records to a flat file and retrieve them when
> >      slurmdbd comes back up, so they can be sent to slurmdbd.
> 
> This may be more cumbersome just because at this point the file is getting
> very big and will be continually changing.
> 
> Since the old data is removed currently. One idea would be be if this
> scenario happened to flush the list after a write to the file tacking it on
> to whatever was already there instead of replacing the existing file like
> what happens when the slurmctld ends.  Then when the slurmdbd come back up
> open the file and process that before the existing list.  It seems like a
> lot of overhead for something like this.  But the safest.
> 
> I was unaware tera100 ran so many short lived jobs.
> 
> So with this said I don't know if I have stong feelings one way or the
> other.  The latter is the safest but much more complicated to write and
> assure works correctly.  But the former is easy, but only raises the
> ceiling.  What do you think?
> 

It actually took more than 10 minutes to fill the agent queue, more like 1 to 2 hours. The problem is that the agent queue is used not only to store the job/step information but also to store events from the whole cluster.

In the case we had, we experienced the issue after a maintenance period, following the execution of the non-regression jobs. A large number of the events were due solely to nodes being added to or removed from production (state transition down->idle or the reverse) while slurmdbd was not available.

After a while, _purge_job_start_req() was called and about 1000 jobs were removed from the agent queue. Two minutes later, the agent queue was again under high pressure, and a new call to _purge_job_start_req() was not able to remove any jobs, as no more jobs had been added in the meantime. Two minutes after that, the agent queue was full and started discarding requests. (During these 4 minutes, about 500 additional nodes were put back into production.)

IMHO, in case of high pressure on the agent queue, I would rather have node events dropped from the queue than job events. Don't you think?

Concerning the two possibilities, we could certainly use MaxJobCount as a workaround, but in the end _purge_job_start_req() will remove all the entries, and that is a problem.

Having a file to store the entries would be better but can be quite difficult to manage. 

In both cases, I would try to keep as much of the job/step entry information as I can and purge the remaining events only when there is no other choice.
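
A sketch of that purge preference (the record types and function are
invented for illustration; only the ordering idea matters, not the
actual agent-list data structures):

    #include <stddef.h>

    /* Invented classification for the queued DBD messages. */
    enum rec_type { REC_NODE_EVENT, REC_JOB_START, REC_STEP, REC_JOB_COMPLETE };

    struct rec { enum rec_type type; /* packed payload omitted */ };

    /* Shrink the queue q of n records down to at most 'keep' records,
     * dropping node events first and job/step records only as a last
     * resort.  Returns the new record count. */
    static size_t purge_queue(struct rec *q, size_t n, size_t keep)
    {
            size_t excess = (n > keep) ? n - keep : 0;
            size_t out = 0, i;

            /* Pass 1: drop the oldest node events until the excess is
             * gone; copy everything else through unchanged. */
            for (i = 0; i < n; i++) {
                    if (excess && q[i].type == REC_NODE_EVENT) {
                            excess--;
                            continue;
                    }
                    q[out++] = q[i];
            }

            /* Pass 2: still over the limit, so drop the oldest
             * remaining records, keeping the most recent ones. */
            if (out > keep) {
                    size_t drop = out - keep;
                    for (i = drop; i < out; i++)
                            q[i - drop] = q[i];
                    out = keep;
            }
            return out;
    }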

HTH
matthieu

> > 
> > What do you think about these possible enhancements?
> > 
> >   -Don Albert-
Comment 6 Danny Auble 2013-02-05 04:48:39 MST
(In reply to comment #5)
> 
> It took more than 10 minutes to fill the agent queue, something more about 1
> to 2 hours. The problem is that the agent queue is not only used to store
> the job/step informations but also to store events from the whole cluster.
> 
> In the case we had, we experimented the issue after a maintenance period
> after the non regression jobs execution. A large number of the events were
> only due to nodes added to or removed from production (state transition
> down->idle or reverse) while slurmdbd was not available.
> 
> After a while, _purge_job_start_req() was called  and about 1000 jobs were
> remove from the agent queue. 2 minutes later, the agent queue was again on
> high pressure and a new call to _purge_job_start_req() was not able to
> remove jobs as no more jobs were added in the meantime. 2 minutes later,
> agent queue is full and starts discarding request. (during these 4 minutes,
> about 500 additionnal nodes are put back in production)

Thanks for the explanation; the scenario makes much more sense now.

> 
> IMHO, I would rather have node events removed from the DB instead of job
> events in case of high pressure on the agent queue. Don't you think ?

Yeah, I would rather go the other way :).  If events are thrown away, you could easily get out of sync.

But it could be a good idea to add a new parameter that lets the user decide what to keep or throw away, in a defined order.

In 2.6 there are new enforce options like nosteps and nojobs which prevent recording any of that information for the cluster.  If you think it would be helpful, I could also add a noevents option that does the same thing for events.  That would mean no events would ever be logged, though.

> 
> Concerning the 2 possibilities, we could certainly use MaxJobCount as a
> workaround but at the end, the _purge_job_start_req() will remove all the
> entries and that is a problem.

Well, once the queue is full it will start throwing things away.  Currently it only throws away somewhat redundant information, like job start and step records, I believe.

> 
> Having a file to store the entries would be better but can be quite
> difficult to manage.

The file seems like it would be extremely difficult to handle.
 
> 
> In both case, I would try to keep the job/step entries information the most
> I can and purge the remaining events when there is no choice.
> 
> HTH
> matthieu
> 
> > > 
> > > What do you think about these possible enhancements?
> > > 
> > >   -Don Albert-
Comment 7 Matthieu Hautreux 2013-02-19 20:20:30 MST
> > 
> > IMHO, I would rather have node events removed from the DB instead of job
> > events in case of high pressure on the agent queue. Don't you think ?
> 
> Yeah, I would rather go the other way :).  If throwing away events happen
> you could easily get out of sync.  
> 
> But this could be a good idea on a new parameter to set so to let the user
> decide what to keep or throw away in a defined order.
> 
> In 2.6 there are new enforce options like nosteps,nojobs which will not
> record any of that information on the cluster.  If you think it is helpful I
> could also add a noevents option which would do the same thing for events. 
> That would make it so no logging of events ever happen though.

Having an option to identify what should be removed first could be sufficient to keep most of the job information in the DB. We are more in favor of keeping that information because job info is more difficult to reconstruct after a loss than other information. I agree that some events have to be kept as well.

> 
> > 
> > Concerning the 2 possibilities, we could certainly use MaxJobCount as a
> > workaround but at the end, the _purge_job_start_req() will remove all the
> > entries and that is a problem.
> 
> Well it once the queue is full it whould start throwing things away. 
> Currently it only throws away sort of redundant information like job start
> and step records I believe.
> 
> > 
> > Having a file to store the entries would be better but can be quite
> > difficult to manage.
> 
> The file seems like it would be extremely difficult to handle.

What about a kind of configurable write-only dump file where purged records could be written instead of just being lost? The file would only be used by an administrator, if necessary, for later __manual__ insertion into the DB.
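
A sketch of that "lost+found" style dump (the path, the text format, and
the function name are invented; real records are packed binary, so an
actual implementation would need to render them in some readable form):

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical: called right before a purged record is freed.
     * Appends one human-readable line; nothing ever reads this file
     * automatically, it exists only for manual recovery by an admin. */
    static void dump_purged_record(const char *dump_path,
                                   const char *rec_type,
                                   const char *rec_as_text)
    {
            FILE *fp = fopen(dump_path, "a");     /* append only */
            if (!fp)
                    return;                       /* dumping is best effort */
            fprintf(fp, "%ld %s %s\n",
                    (long) time(NULL), rec_type, rec_as_text);
            fclose(fp);
    }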

>  
> > 
> > In both case, I would try to keep the job/step entries information the most
> > I can and purge the remaining events when there is no choice.
> > 
> > HTH
> > matthieu
> > 
> > > > 
> > > > What do you think about these possible enhancements?
> > > > 
> > > >   -Don Albert-
Comment 8 Danny Auble 2013-02-21 09:10:49 MST
(In reply to comment #7)

> Having an option to identify what should be removed first could be
> sufficient to keep most of the job information in the db. We are more in
> favor of keeping that information because it is more difficult to
> reconstruct job info after loss than other informations. I agree that some
> events have to be kept also.

OK, I think this is a good way to handle most instances. 

> What about a kind of configuratble write-only dump file where purged records
> could be written instead of just being lost ? This file would only be usable
> by administrator if necessary for later __manual__ insertion in the DB ?
 
I haven't looked at what this would involve, but one problem is that things would no longer be in chronological order, since some records are purged while others are left in the list.

The file would need to be appended to as well.  The chronological order issue seems concerning, unless we used a moving window and just dumped the first n records to the file, and then had the dbd read from the file first when it comes back.  In any case it seems scary.
Comment 9 Matthieu Hautreux 2013-02-26 22:39:14 MST
(In reply to comment #8)
> (In reply to comment #7)
> 
> > Having an option to identify what should be removed first could be
> > sufficient to keep most of the job information in the db. We are more in
> > favor of keeping that information because it is more difficult to
> > reconstruct job info after loss than other informations. I agree that some
> > events have to be kept also.
> 
> OK, I think this is a good way to handle most instances. 

We can say that such a solution would close this bug.

> 
> > What about a kind of configuratble write-only dump file where purged records
> > could be written instead of just being lost ? This file would only be usable
> > by administrator if necessary for later __manual__ insertion in the DB ?
>  
> I haven't looked at what this would mean, but one problem would be things
> would no longer be in chronological order since purged records are taken and
> some are left in the list.
> 
> The file would need to be appended to as well.  But the chronological order
> issue seems concerning unless we did a moving window and just dumped the
> first n records to the file and then when the dbd comes back it reads from
> the file first?  In any case it seems scary.

The idea was just to keep a dump of what was trashed, in case something essential was removed. With some patience and additional manual extraction from that file, an administrator could still insert the valuable lost records. There is no need to automate reading that kind of file; you could think of it as the "lost+found" directory at the root of some filesystems, where tools like fsck put inconsistent files that can potentially be reconstructed if needed. We could have an option in slurm.conf to enable the generation of this kind of dump when something goes wrong (disabled by default). It was just an idea, not something really necessary nor mandatory for the resolution of this bug.
Comment 10 Danny Auble 2019-11-05 10:51:33 MST
I am calling this bug closed based on the enhancements delivered in bug 5534.

I don't think we plan on going any further with this.