Ticket 6839 - slurmdbd error: We have more time than is possible
Summary: slurmdbd error: We have more time than is possible
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 18.08.6
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Director of Support
QA Contact: Danny Auble
URL:
Duplicates: 8121 9798
Depends on:
Blocks:
 
Reported: 2019-04-11 03:09 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2020-09-15 07:20 MDT
CC List: 11 users

See Also:
Site: DTU Physics
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11 pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf file (5.49 KB, text/plain) - 2019-04-12 00:41 MDT, Ole.H.Nielsen@fysik.dtu.dk
log file of sacctmgr show event (72.37 KB, text/plain) - 2019-04-14 09:55 MDT, Ole.H.Nielsen@fysik.dtu.dk
slurmdbd.log since April 8 (32.18 KB, text/plain) - 2019-04-14 09:55 MDT, Ole.H.Nielsen@fysik.dtu.dk

Description Ole.H.Nielsen@fysik.dtu.dk 2019-04-11 03:09:57 MDT
We have added 192 new nodes in two new partitions "xeon40,xeon40_768" in our cluster on Monday, April 8.  The partitions were initially defined with State=DRAIN, and subsequently I could add a 30-day reservation on the new nodes:

# scontrol show reservation Newnodes
ReservationName=Newnodes StartTime=2019-04-08T11:08:03 EndTime=2019-05-08T11:08:03 Duration=30-00:00:00
   Nodes=b[001-012],c[001-180] NodeCnt=192 CoreCnt=7680 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=7680
   Users=ohni Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

After creating the reservation I changed the partitions to State=UP so that we can begin to run test jobs for a few selected users.  The cluster seems to be running perfectly after the addition of the new partitions and nodes.

However, starting from the hour following the creation of the reservation, we get an error message in slurmdbd.log every hour:

[2019-04-08T12:00:00.648] error: We have more time than is possible (55452791+19996320+0)(75449111) > 59385600 for cluster niflheim(16496) from 2019-04-08T11:00:00 - 2019-04-08T12:00:00 tres 1

I have looked at other similar bugs in the database, so I expect you may ask for this information:

# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster niflheim

# scontrol show partition xeon40
PartitionName=xeon40
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=12:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-02:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[001-180],b[001-012]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7680 TotalNodes=192 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9000 MaxMemPerCPU=9500
   TRESBillingWeights=CPU=1.67

# scontrol show partition xeon40_768
PartitionName=xeon40_768
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=12:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-02:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=b[001-012]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=480 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=18000 MaxMemPerCPU=19000
   TRESBillingWeights=CPU=1.67

It would appear that the addition of 192 new nodes, which are all in a Reserved state, confuses slurmdbd's calculation of available CPU time.

Since we plan to release the 192 nodes for general use during next week, it would be great if we could debug the issue while I can still uphold the reservation on the nodes.

Thanks,
Ole
Comment 1 Broderick Gardner 2019-04-11 13:54:24 MDT
Thanks for the information, I'm investigating this issue for you.

I'm going to try to replicate this, so I want to clarify what you did:

Added nodes to slurm.conf
Added partition to slurm.conf with those nodes and State=DRAIN
Created 30 day reservation for those nodes
Changed partition to State=UP
The next hourly rollup produced the error, and all the following rollups

Did you do anything else? Were there any overlapping reservations or partitions, or were the reservation and partition with these new nodes completely isolated? Did you run any jobs in the reservation at any point?

I'm just looking for all the details that might help me reproduce this. Once I have, I should be able to pinpoint the problem quickly. 

Thanks
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2019-04-12 00:41:16 MDT
(In reply to Broderick Gardner from comment #1)
> I'm going to try to replicate this, so I want to clarify what you did:
> 
> Added nodes to slurm.conf
> Added partition to slurm.conf with those nodes and State=DRAIN
> Created 30 day reservation for those nodes
> Changed partition to State=UP
> The next hourly rollup produced the error, and all the following rollups

Correct. 

> Did you do anything else? Were there any overlapping reservations or
> partitions, or were the reservation and partition with these new nodes
> completely isolated? Did you run any jobs in the reservation at any point?

We have overlapping partitions due to different node memory sizes.  The 192 new nodes overlap in the two partitions as shown previously. I'll attach our slurm.conf file.  

Interestingly, when I run ~100 jobs in the reservation the error message is not printed, but it does come back on an hourly basis when these jobs have completed.

> I'm just looking for all the details that might help me reproduce this. Once
> I have, I should be able to pinpoint the problem quickly. 

I'll try to provide any additional information you need.
I'm wondering which code/calculation is behind the message "error: We have more time than is possible", since I don't really know the significance of the message.
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2019-04-12 00:41:48 MDT
Created attachment 9870 [details]
slurm.conf file
Comment 4 Broderick Gardner 2019-04-12 09:47:49 MDT
I don't know how familiar you are with Slurm administration, but I'll explain broadly. 

Context:
Every hour, the slurmdbd runs the hourly rollup to collect usage from that hour into the database. In addition to TRES (cpu, memory, billing, etc) usage, it calculates the allocated time, the down time, and the planned down time. All this information is available from the sreport client command. For longer periods of time, the hourly usage is regularly aggregated into daily usage, which is aggregated into monthly usage. Sorry if you knew that; I just wanted to make sure that is clear.

The error message shows this:
(55452791+19996320+0)(75449111) > 59385600
these are
(time_alloc + time_down + time_planned_down)(total_used) > total_time

So the sum of those fields should not be greater than the total time possible in the time period, hence the error message. The reason I'm having trouble is that the code behind the calculation of these fields is actually quite complicated - hard to follow. 
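
Roughly, the check being described amounts to something like the following hypothetical C sketch (illustrative names and structure only, not the actual Slurm source; the numbers are taken from the first error message quoted in this ticket):

#include <inttypes.h>
#include <stdio.h>

typedef struct {
    uint64_t time_alloc;        /* core-seconds allocated to jobs/reservations */
    uint64_t time_down;         /* core-seconds on down/drained nodes */
    uint64_t time_planned_down; /* core-seconds of planned downtime */
} tres_usage_t;

/* total_time is the cluster TRES count times the length of the rollup
 * period, e.g. 16496 cores * 3600 s = 59385600 core-seconds per hour. */
static void check_rollup(const tres_usage_t *u, uint64_t total_time)
{
    uint64_t total_used = u->time_alloc + u->time_down + u->time_planned_down;

    if (total_used > total_time)
        fprintf(stderr, "error: We have more time than is possible "
                "(%" PRIu64 "+%" PRIu64 "+%" PRIu64 ")(%" PRIu64 ") > %" PRIu64 "\n",
                u->time_alloc, u->time_down, u->time_planned_down,
                total_used, total_time);
}

int main(void)
{
    tres_usage_t usage = { 55452791, 19996320, 0 };
    check_rollup(&usage, 59385600);  /* one hour on 16496 cores */
    return 0;
}

Compiled and run, this prints a line with the same shape as the slurmdbd.log entry quoted above: the three addends, their sum, and the total time available in the hour.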


On that note, are these numbers the same each hour, or are they changing/growing? If they are different, could you send me a few more error messages?

Thanks
Comment 5 Broderick Gardner 2019-04-12 16:12:41 MDT
Please attach the output of this command:
sacctmgr show event start=<START> end=<END>
where START is the time just before adding the new nodes to slurm.conf,
and END is either 'now' or a time just after the first hourly rollup following the creation of the reservation/partitions for the new nodes. 

This output could be very large depending on the activity on the cluster. This shows node state changes, including down states.

Thanks
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2019-04-14 09:55:03 MDT
Created attachment 9906 [details]
log file of sacctmgr show event
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2019-04-14 09:55:47 MDT
Created attachment 9907 [details]
slurmdbd.log since April 8
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2019-04-14 10:28:21 MDT
(In reply to Broderick Gardner from comment #4)
> The error message shows this:
> (55452791+19996320+0)(75449111) > 59385600
> these are
> (time_alloc + time_down + time_planned_down)(total_used) > total_time
> 
> So the sum of those fields should not be greater than the total time
> possible in the time period, hence the error message. The reason I'm having
> trouble is that the code behind the calculation of these fields is actually
> quite complicated - hard to follow. 

Thanks for explaining the meaning of the numbers.

The time_down is mostly 1728000 while no jobs run, but other values appear while some jobs were running in the reservation, and in that case several of the hourly error messages were absent.  

The total_time=59385600 core-seconds is exactly 1 hour on 16,496 cores, which is the new size of the cluster after adding the 192 new nodes.

The time_down=1728000 core-seconds corresponds to 1 hour on 480 cores.  The 480 cores correspond to the 12 40-core nodes which are drained (due to a Node Health Check error):

# sinfo -t drain -p xeon40
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
xeon40       up 2-02:00:00     12  drain b[001-012]

For example, in the slurmdbd.log at 2019-04-14T17:00:00 the time_alloc=59105208 corresponds to 16,378.5 cores.  But the xeon40 partition with 7,200 reserved cores and 480 drained cores is totally idle (squeue -p xeon40)!  Hence the maximum possible number of allocated cores would be 8,816, and the upper bound on time_alloc must be 31,737,600.  The calculated time_alloc=59105208 is therefore definitely wrong!
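
As a quick cross-check of the core-second arithmetic above, here is an illustrative snippet (assuming the one-hour rollup window and the node counts quoted in this ticket):

#include <stdio.h>

int main(void)
{
    const long period = 3600;            /* one rollup hour, in seconds */
    const long cluster_cores = 16496;    /* whole cluster after adding the new nodes */
    const long drained_cores = 480;      /* b[001-012], 12 x 40-core nodes */
    const long idle_xeon40_cores = 7680; /* 7200 reserved idle + 480 drained */

    printf("total_time     = %ld\n", cluster_cores * period);                       /* 59385600 */
    printf("time_down      = %ld\n", drained_cores * period);                       /*  1728000 */
    printf("max time_alloc = %ld\n", (cluster_cores - idle_xeon40_cores) * period); /* 31737600 */
    return 0;
}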

Since time_planned_down=0 (always), the conclusion must be that time_alloc is calculated incorrectly.

Could you look into why time_alloc seems to incorrectly include time on idle/reserved nodes which have zero allocated time?

> On that note, are these numbers the same each hour, or are they
> changing/growing? If they are different, could you send me a few more error
> messages?

I've attached the slurmdbd.log from this week which includes all error messages.
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2019-04-16 06:04:13 MDT
(In reply to Broderick Gardner from comment #4)
> The error message shows this:
> (55452791+19996320+0)(75449111) > 59385600
> these are
> (time_alloc + time_down + time_planned_down)(total_used) > total_time

I was searching in the Slurm source code for the text strings in the message:

 error: We have more time than is possible ... for cluster 

but there are zero hits for strings such as "is possible" etc.

Could you point me to the source file which calculates and prints the error message?

Thanks,
Ole
Comment 10 Broderick Gardner 2019-04-16 09:24:35 MDT
Thanks for the information so far, particularly about the down time and the nodes that are currently drained. Allocated time is what is miscalculated, which to me indicates a reservation problem because you don't have runaway jobs. This is because nodes in an active reservation (that is not maintenance) are considered allocated in this calculation. However, overlapping reservations and running jobs in a reservation should be handled correctly. Are you able to show me all of your reservations?
$ scontrol show reservations


> I was searching in the Slurm source code for the text strings in the message:
> 
>  error: We have more time than is possible ... for cluster 
> 
> but there are zero hits for strings such as "is possible" etc.
> 
> Could you point me to the source file which calculates and prints the error
> message?

src/plugins/accounting_storage/mysql/as_mysql_rollup.c:549
(That string is split across lines. There are still some of these from older code, and it is obnoxious)

The main thing you could do that I can't from here until I replicate this is to attach with gdb and set a watchpoint on the time_alloc field during an hourly rollup.


> Since we plan to release the 192 nodes for general use during next week, it 
> would be great if we could debug the issue while I can still uphold the 
> reservation on the nodes.
I _expect_ this problem to go away when the reservation on the new nodes is removed. Let me know if/when you are going to release these nodes this week.

Thanks
Comment 11 Ole.H.Nielsen@fysik.dtu.dk 2019-04-17 01:08:47 MDT
(In reply to Broderick Gardner from comment #10)
> Thanks for the information so far, particularly about the down time and the
> nodes that are currently drained. Allocated time is what is miscalculated,
> which to me indicates a reservation problem because you don't have runaway
> jobs. This is because nodes in an active reservation (that is not
> maintenance) are considered allocated in this calculation. However,
> overlapping reservations and running jobs in a reservation should be handled
> correctly. Are you able to show me all of your reservations?
> $ scontrol show reservations

We have only one reservation, plus a Maintenance reservation which was added much later and does not seem to change the error message:

# scontrol show reservations
ReservationName=Newnodes StartTime=2019-04-08T11:08:03 EndTime=2019-05-08T11:08:03 Duration=30-00:00:00
   Nodes=b[001-012],c[001-180] NodeCnt=192 CoreCnt=7680 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=7680
 Users=ohni,alexty,schiotz,jensj,alexbag,logiar,heih,haowana,manand,zwang,heenen,kasv,chrsop Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=Maintenance StartTime=2019-04-23T08:00:00 EndTime=2019-04-24T20:00:00 Duration=1-12:00:00
   Nodes=b[001-012],c[001-180] NodeCnt=192 CoreCnt=7680 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES
   TRES=cpu=7680
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

So it seems that running jobs in a reservation are NOT accounted correctly, and perhaps such jobs are counted twice so that the computed used time exceeds the total available time.

> > I was searching in the Slurm source code for the text strings in the message:
> > 
> >  error: We have more time than is possible ... for cluster 
> > 
> > but there are zero hits for strings such as "is possible" etc.
> > 
> > Could you point me to the source file which calculates and prints the error
> > message?
> 
> src/plugins/accounting_storage/mysql/as_mysql_rollup.c:549
> (That string is split across lines. There are still some of these from older
> code, and it is obnoxious)
> 
> The main thing you could do that I can't from here until I replicate this is
> to attach with gdb and set a watchpoint on the time_alloc field during an
> hourly rollup.

Unfortunately I'm rather rusty on using debuggers on running daemons, and I don't want to risk corrupting the database or interrupting slurmdbd or slurmctld.  But I can try to understand the code a bit and look at the assignments to the loc_tres->time_alloc field.

> > Since we plan to release the 192 nodes for general use during next week, it 
> > would be great if we could debug the issue while I can still uphold the 
> > reservation on the nodes.
> I _expect_ this problem to go away when the reservation on the new nodes is
> removed. Let me know if/when you are going to release these nodes this week.

I agree: If the miscalculated reservation time goes to zero, 2*zero is also zero and the error will be swept under the rug.
Comment 13 Broderick Gardner 2019-04-19 10:33:21 MDT
Could you send a database dump so I can directly try to replicate these errors and examine your database for issues? Once you attach it, we can make it private.

mysqldump -p <cluster_db_name> >  6839_database_dump.sql

Thanks
Comment 15 Ole.H.Nielsen@fysik.dtu.dk 2019-04-23 02:40:13 MDT
(In reply to Broderick Gardner from comment #13)
> Could you send a database dump so I can directly try to replicate these
> errors and examine your database for issues? Once you attach it, we can make
> it private.
> 
> mysqldump -p <cluster_db_name> >  6839_database_dump.sql

The file has been uploaded now, please make it private.

Thanks,
Ole
Comment 17 Ole.H.Nielsen@fysik.dtu.dk 2019-07-19 05:05:09 MDT
We currently have 25 nodes (out of 750) which are drained, and this once more triggers the incorrect error message:

[2019-07-19T13:00:00.089] error: We have more time than is possible (69990838+2073600+0)(72064438) > 72028800 for cluster niflheim(20008) from 2019-07-19T12:00:00 - 2019-07-19T13:00:00 tres 5

We run Slurm 18.08.8 on all nodes.

Have you had time to look into this?

Thanks,
Ole
Comment 18 Broderick Gardner 2019-07-19 14:31:35 MDT
I am still analyzing your database; I was busy with other issues, but I have gotten back to it now. Thanks for letting me know that it is still occurring. 

Have your active reservations changed since you sent me your database? What is the state now? I have narrowed the problem down to reservation records; when I remove a certain one, the extra time_alloc disappears. 

There is a new record each time a reservation changes; the record in question is the first in this list: (if you ran that, you would see more records; I deleted them until the error went away)
MariaDB [slurm_acct_db]> select * from niflheim_resv_table;
+---------+---------+--------------------------------------------------------------+-------+-----------------------+----------+-----------+------------+------------+--------+--------------+
| id_resv | deleted | assoclist                                                    | flags | nodelist              | node_inx | resv_name | time_start | time_end   | tres   | unused_wall  |
+---------+---------+--------------------------------------------------------------+-------+-----------------------+----------+-----------+------------+------------+--------+--------------+
|      19 |       0 | 58                                                           | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1554714483 | 1554998058 | 1=7680 |       296186 |
|      19 |       0 | 58,128,75                                                    | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1554998058 | 1555049603 | 1=7680 |   48716.8125 |
|      19 |       0 | 58,128,75,417,414,59                                         | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555049603 | 1555311323 | 1=7680 |       261779 |
|      19 |       0 | 58,128,75,417,414,59,325,250,236,439,377,388                 | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555311323 | 1555312904 | 1=7680 |         1581 |
|      19 |       0 | 58,128,75,417,414,59,325,250,236,439,377,388,412,322,124     | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555312904 | 1555484373 | 1=7680 | 119837.28125 |
|      19 |       0 | 58,128,75,417,414,59,325,250,236,439,377,388,412,322,124,416 | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555484373 | 1557306483 | 1=7680 |      1590453 |
+---------+---------+--------------------------------------------------------------+-------+-----------------------+----------+-----------+------------+------------+--------+--------------+
6 rows in set (0.001 sec)

Look up that user with:
select * from niflheim_assoc_table where id_assoc=58;

I am guessing that that is your user by the name. You then added more users to the reservation. I don't know yet what is wrong with it, but I've narrowed it down enough that I can figure it out. That's my update; I should have more information within a few days.

Thanks
Comment 19 Ole.H.Nielsen@fysik.dtu.dk 2019-07-19 14:31:54 MDT
I'm out of the office until August 12th.
I am not in the office, back again on March 12.

Best regards / Venlig hilsen,
Ole Holm Nielsen
Comment 20 Ole.H.Nielsen@fysik.dtu.dk 2019-08-15 01:09:46 MDT
(In reply to Broderick Gardner from comment #18)
> Have your active reservations changed since you sent me your database? 

We have no reservations as of 2019-07-19.

> What is the state now?

As of August 15 we are still seeing the issue, the reason being that 5-10 nodes have been drained (due to scheduled hardware service):

[2019-08-14T21:00:00.741] error: We have more time than is possible (71156595+964800+0)(72121395) > 72028800 for cluster niflheim(20008) from 2019-08-14T20:00:00 - 2019-08-14T21:00:00 tres 5
[2019-08-15T00:00:00.861] error: We have more time than is possible (71069307+964800+0)(72034107) > 72028800 for cluster niflheim(20008) from 2019-08-14T23:00:00 - 2019-08-15T00:00:00 tres 5

> I have narrowed the problem down to reservation records;
> when I remove a certain one, the extra time_alloc disappears. 

We have had no reservations in the past weeks!

> There is a new record each time a reservation changes; the record in
> question is the first in this list: (if you ran that, you would see more
> records; I deleted them until the error went away)
...
> Look up that user with:
> select * from niflheim_assoc_table where id_assoc=58;
> 
> I am guessing that that is your user by the name. You then added more users
> to the reservation. I don't know yet what is wrong with it, but I've
> narrowed it down enough that I can figure it out. That's my update; I should
> have more information within a few days.

I have no idea what id_assoc=58 refers to; it's probably a Slurm database index?  Anyway, users are irrelevant in this context, although user jobs would be involved in calculating the total available time.

Thanks for your efforts,
Ole
Comment 21 Broderick Gardner 2019-08-15 01:39:10 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #20)
> > I have narrowed the problem down to reservation records;
> > when I remove a certain one, the extra time_alloc disappears. 
> 
> We have had no reservations in the past weeks!
I see. What is the output of
mysql> select * from niflheim_resv_table;
?

> I have no idea what id_assoc=58 refers to; it's probably a Slurm database
> index?  Anyway, users are irrelevant in this context, although user jobs
> would be involved in calculating the total available time.

It is a slurm database index, which is why I included how to look up which user it refers to; I did not want to publicly post a username, even though it is almost certainly yours. It is only relevant because the difference between the reservation record that appears to cause the problem and the others is the associations assigned to it (they are all the same reservation). That is the only reason I brought it up. 


Because of what I found in my testing, I have focused my attention on the part of the rollup code that handles reservations. This new information is concerning but hopefully enlightening.
Comment 22 Ole.H.Nielsen@fysik.dtu.dk 2019-08-15 02:09:27 MDT
(In reply to Broderick Gardner from comment #21)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #20)
> > > I have narrowed the problem down to reservation records;
> > > when I remove a certain one, the extra time_alloc disappears. 
> > 
> > We have had no reservations in the past weeks!
> I see. What is the output of
> mysql> select * from niflheim_resv_table;
> ?

MariaDB [(none)]> use slurm_acct_db
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from niflheim_resv_table;
+---------+---------+-----------+-------+------------+----------+-------------+------------+------------+--------+-------------+
| id_resv | deleted | assoclist | flags | nodelist   | node_inx | resv_name   | time_start | time_end   | tres   | unused_wall |
+---------+---------+-----------+-------+------------+----------+-------------+------------+------------+--------+-------------+
|      24 |       0 | 2         | 32833 | x[001-192] | 558-749  | Maintenance | 1562569200 | 1562578794 | 1=4608 |        9594 |
|      24 |       0 | 2         | 32833 | x[001-170] | 558-727  | Maintenance | 1562578794 | 1562579267 | 1=4080 |         473 |
|      24 |       1 | 2         | 32833 | x[001-130] | 558-687  | Maintenance | 1562579267 | 1562580486 | 1=3120 |        1219 |
+---------+---------+-----------+-------+------------+----------+-------------+------------+------------+--------+-------------+


> > I have no idea what id_assoc=58 refers to, it's probably be a Slurm database
> > index?  Anyway, users are irrelevant in this context, although user jobs
> > would be involved in calculating the total available time.
> 
> It is a slurm database index, which is why I included how to look up which
> user it refers to; I did not want to publicly post a username, even though
> it is almost certainly yours. It is only relevant because the difference
> between the reservation record that appears to cause the problem and the
> others is the associations assigned to it (they are all the same
> reservation). That is the only reason I brought it up. 

I see.  I did:
MariaDB [slurm_acct_db]> select * from niflheim_assoc_table where id_assoc=58;
and this indeed displays my username.  I rarely run jobs under this username, so I don't understand why my username pops up here.

> Because of what I found in my testing, I have focused my attention on the
> part of the rollup code that handles reservations. This new information is
> concerning but hopefully enlightening.

Thanks for your support!

/Ole
Comment 25 Ole.H.Nielsen@fysik.dtu.dk 2019-08-30 05:09:39 MDT
FYI: We have made a new maintenance reservation for the entire cluster:
# scontrol show reservations
ReservationName=Electrical StartTime=2019-09-04T08:00:00 EndTime=2019-09-04T15:00:00 Duration=07:00:00
   Nodes=a[001-140],b[001-012],c[001-180],d[001-019,021-068],g[001-021,024-110],h[001-002],i[003-050],x[001-192] NodeCnt=749 CoreCnt=16472 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=16472
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

and now the usual warning pops up in the slurmdbd log again:

[2019-08-30T03:00:00.153] error: We have more allocated time than is possible (72097682 > 72014400) for cluster niflheim(20004) from 2019-08-30T02:00:00 - 2019-08-30T03:00:00 tres 5
[2019-08-30T04:00:00.199] error: We have more allocated time than is possible (72134160 > 72014400) for cluster niflheim(20004) from 2019-08-30T03:00:00 - 2019-08-30T04:00:00 tres 5
[2019-08-30T08:00:00.360] error: We have more allocated time than is possible (72099652 > 72014400) for cluster niflheim(20004) from 2019-08-30T07:00:00 - 2019-08-30T08:00:00 tres 5
[2019-08-30T09:00:00.400] error: We have more allocated time than is possible (72055810 > 72014400) for cluster niflheim(20004) from 2019-08-30T08:00:00 - 2019-08-30T09:00:00 tres 5
[2019-08-30T10:00:00.440] error: We have more allocated time than is possible (72172120 > 72014400) for cluster niflheim(20004) from 2019-08-30T09:00:00 - 2019-08-30T10:00:00 tres 5
[2019-08-30T11:00:00.481] error: We have more allocated time than is possible (72101704 > 72014400) for cluster niflheim(20004) from 2019-08-30T10:00:00 - 2019-08-30T11:00:00 tres 5
Comment 26 Douglas Wightman 2019-08-30 13:22:53 MDT
We've been able to reproduce the first error message, "We have more time than is possible", and determine what the cause is. In this case the entire reservation (because it does not have the IGNORE_JOBS flag) is counted as Allocated, and the down nodes are also counted towards the Down Time. Because the reservation was over the entire partition, once these are both added up they overflow the total time for the partition. 

This second message, "We have more allocated time than is possible", is slightly different but is almost definitely related to reservation accounting like the first.

We are researching the best way to enhance reservation accounting to avoid these errors.
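
A back-of-the-envelope illustration of that double counting, using the node counts from the first weeks of this ticket and assuming (worst case) that the rest of the cluster is fully busy; this is only a sketch of the accounting behaviour described above, not the actual rollup code:

#include <stdio.h>

int main(void)
{
    const long period = 3600;               /* one rollup hour */
    const long cluster_cores = 16496;
    const long resv_cores = 7680;           /* Newnodes reservation, no IGNORE_JOBS flag */
    const long drained_in_resv = 480;       /* b[001-012], inside the reservation */
    const long busy_cores_elsewhere = 8816; /* assume the rest of the cluster runs jobs */

    /* The whole reservation is counted as allocated ... */
    long time_alloc = (resv_cores + busy_cores_elsewhere) * period;
    /* ... and the drained nodes inside it are counted again as down time. */
    long time_down = drained_in_resv * period;
    long total_time = cluster_cores * period;

    printf("%ld + %ld = %ld > %ld\n",
           time_alloc, time_down, time_alloc + time_down, total_time);
    return 0;
}

The 480 drained cores inside the reservation are the overlap that gets counted twice, which is enough to push the hour's total over the cluster's capacity.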
Comment 28 Gavin D. Howard 2019-12-02 13:58:14 MST
*** Ticket 8121 has been marked as a duplicate of this ticket. ***
Comment 37 Douglas Wightman 2020-02-21 10:24:53 MST
Because of the tricky nature of the patch, the decision was made to put this into 20.11 early to allow as much time for testing as possible. The patch has now been committed: 

commit 71e7577efe6289a17f111c7cce2af03220343a82
Author: Douglas Wightman <wightman@schedmd.com>

    Correctly handle down nodes in a reservation.
Comment 38 surendra 2020-02-21 10:25:04 MST
I am currently out of office and will return back on Feb 27.
Comment 39 Albert Gil 2020-09-15 01:44:23 MDT
*** Ticket 9798 has been marked as a duplicate of this ticket. ***