Ticket 10753 - More time than possible
Summary: More time than possible
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 20.11.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-02-01 07:10 MST by Paul Edmon
Modified: 2021-02-08 09:05 MST

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Paul Edmon 2021-02-01 07:10:26 MST
I'm seeing the following error in my slurmdbd logs after I upgraded to 20.11.3:

Feb  1 09:07:08 holy-slurm02 slurmdbd[15803]: error: We have more time than is possible (887595120000+860675998487+0)(1748271118487) > 1290775849200 for cluster odyssey(358548847) from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 2
Feb  1 09:07:08 holy-slurm02 slurmdbd[15803]: error: We have more time than is possible (291733200+333814968+0)(625548168) > 501386400 for cluster odyssey(139274) from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 5
Feb  1 09:07:08 holy-slurm02 slurmdbd[15803]: error: We have more time than is possible (1713600+1344534+0)(3058134) > 2008800 for cluster odyssey(558) from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 1003

I'm guessing a few jobs or something got screwed up.  Is there any way to resolve this?  It's too late to roll back, and slurmdbd is working properly aside from this error.  Thanks.

-Paul Edmon-
Comment 1 Albert Gil 2021-02-04 04:09:57 MST
Hi Paul,

This error means that when slurmdbd computed the usage information shown by sreport from the jobs and reservations that ran on the cluster in the last hour (the "rollup" process), it detected that the sum of all the CPU time allocated was bigger than the actual amount of CPU time available in the cluster. So, something is wrong.
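As a sanity check on those log lines, the "possible" figure appears to match the TRES count (the number in parentheses after the cluster name) multiplied by the 3600 seconds in the one-hour rollup window:

```shell
# Each log line reads "(a+b+c)(total) > possible"; the "possible" cap
# matches the TRES count after the cluster name times 3600 seconds:
echo $((558 * 3600))          # 2008800, matching "odyssey(558) ... tres 1003"
echo $((139274 * 3600))       # 501386400, matching "odyssey(139274) ... tres 5"
echo $((358548847 * 3600))    # 1290775849200, matching "odyssey(358548847) ... tres 2"
```

So the error really does mean "more TRES-seconds reported than the cluster could have provided in that hour."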

There are several reasons that could explain those error messages.
Some of them are even legitimate, if OverSubscribe is set up in certain specific ways.

But the most typical reason is runaway jobs.
That is, jobs that are no longer running on the system, but for some reason slurmdbd was not notified and still thinks they are running.
Runaway jobs mess with the accounting information, and if your system is actually quite busy, adding nonexistent/runaway jobs could trigger the "more time than possible" detection.

You can run "sacctmgr show runaway" to see if you are facing this problem, and the same command can also fix them.
Note that fixing runaways also means that slurmdbd will start a new rollup to recompute the usage/sreport info from the oldest runaway detected. If the runaway jobs are very old, the rollup could take some time to complete, and some sreport info won't be available until it finishes.

If you don't feel confident enough to say Yes when asked to fix the runaways, please say No and attach the output of the command to help you further.

If you don't have runaways, we'll look for other reasons.

Regards,
Albert
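A minimal sketch of the check Albert describes (the -i behavior is an assumption based on sacctmgr's general immediate mode; double-check it on your version before relying on it):

```shell
# List runaway jobs; if any are found, sacctmgr asks whether to fix them.
# Fixing marks them as ended and triggers a rollup from the oldest one.
sacctmgr show runaway

# Assumption: as with other sacctmgr operations, -i (immediate) commits
# without prompting. Only use it after reviewing the list above.
sacctmgr -i show runaway
```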
Comment 2 Paul Edmon 2021-02-08 07:33:33 MST
Thanks.  I suspected as much.  I'm usually pretty regular about running 
the runaway check because we do pick up errant jobs from time to time.  
I ran the check today and it found a few more from Feb 1st when we did 
the upgrade.  I will keep an eye on it and see if after the roll up this 
error goes away.

-Paul Edmon-

Comment 3 Albert Gil 2021-02-08 08:32:42 MST
Hi Paul,

> Thanks.  I suspected as much.  I'm usually pretty regular about running 
> the runaway check because we do pick up errant jobs from time to time.  

Ok.
Runaways should be pretty exceptional, usually due to the system being in some bad condition at some point.

> I ran the check today and it found a few more from Feb 1st when we did 
> the upgrade.

Thanks for the information.
I'll try to reproduce that specific case to see if we can fix/avoid those runaways in the first place.

>  I will keep an eye on it and see if after the roll up this 
> error goes away.

The rollup should not take long.
Unless you really have very high throughput, it shouldn't take much time to complete.
A simple way to check is querying sreport.
Right now it's probably empty from Feb 1st onward, and in a few hours the info will be there again.

Regards,
Albert
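A sketch of that sreport check (the cluster name comes from the log lines earlier in this ticket; this particular report and date range are one reasonable choice, not the only one):

```shell
# While the rollup is still recomputing, days from Feb 1st onward will
# show little or no usage; once it finishes they should fill back in.
sreport cluster utilization cluster=odyssey start=2021-02-01 end=2021-02-09 -t hours
```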
Comment 4 Paul Edmon 2021-02-08 09:02:54 MST
Looks like that did it. I'm not seeing any other errors.

-Paul Edmon-

Comment 5 Albert Gil 2021-02-08 09:05:50 MST
Great!
Closing as infogiven.