We added 192 new nodes in two new partitions, "xeon40" and "xeon40_768", to our cluster on Monday, April 8. The partitions were initially defined with State=DRAIN, and I then added a 30-day reservation on the new nodes:

# scontrol show reservation Newnodes
ReservationName=Newnodes StartTime=2019-04-08T11:08:03 EndTime=2019-05-08T11:08:03 Duration=30-00:00:00
   Nodes=b[001-012],c[001-180] NodeCnt=192 CoreCnt=7680 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=7680
   Users=ohni Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

After creating the reservation I changed the partitions to State=UP so that a few selected users can begin to run test jobs. The cluster seems to be running perfectly after the addition of the new partitions and nodes.

However, starting from the hour following the creation of the reservation, we get an error message in slurmdbd.log every hour:

[2019-04-08T12:00:00.648] error: We have more time than is possible (55452791+19996320+0)(75449111) > 59385600 for cluster niflheim(16496) from 2019-04-08T11:00:00 - 2019-04-08T12:00:00 tres 1

I have looked at the other similar bugs in the database, so I think you may ask for this information:

# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster niflheim

# scontrol show partition xeon40
PartitionName=xeon40
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=12:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-02:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[001-180],b[001-012]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7680 TotalNodes=192 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9000 MaxMemPerCPU=9500
   TRESBillingWeights=CPU=1.67

# scontrol show partition xeon40_768
PartitionName=xeon40_768
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=12:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-02:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=b[001-012]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=480 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=18000 MaxMemPerCPU=19000
   TRESBillingWeights=CPU=1.67

It would appear that the addition of 192 new nodes, all in a Reserved state, confuses slurmdbd's calculation of available CPU time. Since we plan to release the 192 nodes for general use next week, it would be great if we could debug the issue while I can still uphold the reservation on the nodes.

Thanks,
Ole
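As a quick sanity check on the figures above, the reservation and partition counts are mutually consistent. A small Python sketch (assuming 40 cores per node, as the "xeon40" partition name suggests):

```python
# Sanity-check the reservation/partition figures reported above.
# Assumes 40 cores per node, as the "xeon40" partition name suggests.
CORES_PER_NODE = 40

nodes_b = 12    # b[001-012]
nodes_c = 180   # c[001-180]

node_cnt = nodes_b + nodes_c
core_cnt = node_cnt * CORES_PER_NODE

# These match NodeCnt=192 / CoreCnt=7680 in "scontrol show reservation",
# and TotalCPUs=7680 in the xeon40 partition output.
print(node_cnt)                  # 192
print(core_cnt)                  # 7680
print(nodes_b * CORES_PER_NODE)  # 480, TotalCPUs of xeon40_768
```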
Thanks for the information, I'm investigating this issue for you. I'm going to try to replicate this, so I want to clarify what you did:

1. Added nodes to slurm.conf
2. Added partitions to slurm.conf with those nodes and State=DRAIN
3. Created a 30-day reservation for those nodes
4. Changed the partitions to State=UP
5. The next hourly rollup produced the error, as did all following rollups

Did you do anything else? Were there any overlapping reservations or partitions, or were the reservation and partition with these new nodes completely isolated? Did you run any jobs in the reservation at any point?

I'm just looking for all the details that might help me reproduce this. Once I have, I should be able to pinpoint the problem quickly.

Thanks
(In reply to Broderick Gardner from comment #1)
> I'm going to try to replicate this, so I want to clarify what you did:
>
> Added nodes to slurm.conf
> Added partition to slurm.conf with those nodes and State=DRAIN
> Created 30 day reservation for those nodes
> Changed partition to State=UP
> The next hourly rollup produced the error, and all the following rollups

Correct.

> Did you do anything else? Were there any overlapping reservations or
> partitions, or were the reservation and partition with these new nodes
> completely isolated? Did you run any jobs in the reservation at any point?

We have overlapping partitions due to different node memory sizes; the 192 new nodes overlap in the two partitions as shown previously. I'll attach our slurm.conf file.

Interestingly, when I run ~100 jobs in the reservation the error message is not printed, but it comes back on an hourly basis once these jobs have completed.

> I'm just looking for all the details that might help me reproduce this. Once
> I have, I should be able to pinpoint the problem quickly.

I'll try to provide any additional information you need. I'm wondering which code/calculation is behind the message "error: We have more time than is possible", since I don't really know the significance of the message.
Created attachment 9870 [details] slurm.conf file
I don't know how familiar you are with Slurm administration, but I'll explain broadly for context: every hour, slurmdbd runs the hourly rollup to collect usage from that hour into the database. In addition to TRES (cpu, memory, billing, etc.) usage, it calculates the allocated time, the down time, and the planned down time. All this information is available from the sreport client command. For longer periods, the hourly usage is regularly aggregated into daily usage, which is in turn aggregated into monthly usage. Sorry if you knew that; I just wanted to make sure it is clear.

The error message shows this:

(55452791+19996320+0)(75449111) > 59385600

and these fields are:

(time_alloc + time_down + time_planned_down)(total_used) > total_time

So the sum of those fields should not be greater than the total time possible in the time period, hence the error message. The reason I'm having trouble is that the code behind the calculation of these fields is quite complicated and hard to follow.

On that note, are these numbers the same each hour, or are they changing/growing? If they are different, could you send me a few more error messages?

Thanks
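The sanity check described above can be sketched in a few lines of Python. This is an illustrative model using the field names from the explanation, not the actual Slurm source (the real calculation lives in slurmdbd's MySQL rollup code):

```python
# Illustrative model of the rollup sanity check described above; the
# field names mirror the explanation, not actual Slurm identifiers.
def check_rollup(time_alloc, time_down, time_planned_down, total_time):
    """Return the error text if used time exceeds possible time, else None."""
    total_used = time_alloc + time_down + time_planned_down
    if total_used > total_time:
        return ("error: We have more time than is possible "
                f"({time_alloc}+{time_down}+{time_planned_down})"
                f"({total_used}) > {total_time}")
    return None

# Values from the 2019-04-08T12:00:00 log entry:
msg = check_rollup(55452791, 19996320, 0, 59385600)
print(msg)  # fires, since 75449111 > 59385600
```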
Please attach the output of this command:

sacctmgr show event start=<START> end=<END>

where START is the time just before adding the new nodes to slurm.conf, and END is either 'now' or a time just after the first hourly rollup following creation of the reservation/partitions for the new nodes. This output could be very large depending on the activity on the cluster. It shows node state changes, including down states.

Thanks
Created attachment 9906 [details] log file of sacctmgr show event
Created attachment 9907 [details] slurmdbd.log since April 8
(In reply to Broderick Gardner from comment #4)
> The error message shows this:
> (55452791+19996320+0)(75449111) > 59385600
> these are
> (time_alloc + time_down + time_planned_down)(total_used) > total_time
>
> So the sum of those fields should not be greater than the total time
> possible in the time period, hence the error message. The reason I'm having
> trouble is that the code behind the calculation of these fields is actually
> quite complicated - hard to follow.

Thanks for explaining the meaning of the numbers. The time_down is mostly 1728000 while no jobs run, but other values appear while some jobs were running in the reservation, and several hourly error messages were absent in this case.

The total_time=59385600 seconds is exactly 1 hour on 16,496 cores, which is the new size of the cluster after adding the 192 new nodes. The time_down=1728000 seconds corresponds to 1 hour on 480 cores; these 480 cores are the 12 40-core nodes which are drained (due to a Node Health Check error):

# sinfo -t drain -p xeon40
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
xeon40       up 2-02:00:00     12  drain b[001-012]

For example, in the slurmdbd.log at 2019-04-14T17:00:00 the time_alloc=59105208, corresponding to roughly 16,418 cores. But the xeon40 partition with its 7,200 reserved cores and 480 drained cores is totally idle (squeue -p xeon40)! Hence the maximum possible number of allocated cores would be 8,816, and the upper bound on time_alloc must be 31,737,600 seconds. The calculated time_alloc=59105208 is therefore definitely wrong!

Since time_planned_down=0 (always), the conclusion must be that time_alloc is calculated incorrectly. Could you look into why time_alloc seems to incorrectly include time on idle/reserved nodes which have zero allocated time?

> On that note, are these numbers the same each hour, or are they
> changing/growing? If they are different, could you send me a few more error
> messages?

I've attached the slurmdbd.log from this week, which includes all error messages.
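The core-hour arithmetic in the comment above can be checked directly. A short sketch (one core for one hour contributes 3600 seconds of TRES time):

```python
# Verify the core-hour arithmetic from the comment above:
# one core for one hour contributes 3600 seconds of TRES time.
SECONDS_PER_HOUR = 3600

total_time = 59385600
print(total_time // SECONDS_PER_HOUR)  # 16496 cores: the whole cluster

time_down = 1728000
print(time_down // SECONDS_PER_HOUR)   # 480 cores: the 12 drained 40-core nodes

# With the 7680-core xeon40 partition idle (7200 reserved + 480 drained),
# at most 16496 - 7680 = 8816 cores can be allocated, so time_alloc is
# bounded by:
max_alloc = (16496 - 7680) * SECONDS_PER_HOUR
print(max_alloc)                       # 31737600

# The logged time_alloc clearly exceeds that bound:
print(59105208 > max_alloc)            # True
```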
(In reply to Broderick Gardner from comment #4)
> The error message shows this:
> (55452791+19996320+0)(75449111) > 59385600
> these are
> (time_alloc + time_down + time_planned_down)(total_used) > total_time

I was searching the Slurm source code for the text strings in the message:

error: We have more time than is possible ... for cluster

but there are zero hits for strings such as "is possible" etc. Could you point me to the source file which calculates and prints the error message?

Thanks,
Ole
Thanks for the information so far, particularly about the down time and the nodes that are currently drained. Allocated time is what is miscalculated, which to me indicates a reservation problem, because you don't have runaway jobs. This is because nodes in an active reservation (that is not a maintenance reservation) are considered allocated in this calculation. However, overlapping reservations and running jobs in a reservation should be handled correctly. Are you able to show me all of your reservations?

$ scontrol show reservations

> I was searching in the Slurm source code for the text strings in the message:
>
> error: We have more time than is possible ... for cluster
>
> but there are zero hits for strings such as "is possible" etc.
>
> Could you point me to the source file which calculates and prints the error
> message?

src/plugins/accounting_storage/mysql/as_mysql_rollup.c:549
(That string is split across lines. There are still some of these from older code, and it is obnoxious.)

The main thing you could do that I can't from here, until I replicate this, is to attach with gdb and set a watchpoint on the time_alloc field during an hourly rollup.

> Since we plan to release the 192 nodes for general use during next week, it
> would be great if we could debug the issue while I can still uphold the
> reservation on the nodes.

I _expect_ this problem to go away when the reservation on the new nodes is removed. Let me know if/when you are going to release these nodes this week.

Thanks
(In reply to Broderick Gardner from comment #10)
> Thanks for the information so far, particularly about the down time and the
> nodes that are currently drained. Allocated time is what is miscalculated,
> which to me indicates a reservation problem because you don't have runaway
> jobs. This is because nodes in an active reservation (that is not
> maintenance) are considered allocated in this calculation. However,
> overlapping reservations and running jobs in a reservation should be handled
> correctly. Are you able to show me all of your reservations?
> $ scontrol show reservations

We only have 1 reservation, plus a Maintenance reservation which was added much later and does not seem to change the error message:

# scontrol show reservations
ReservationName=Newnodes StartTime=2019-04-08T11:08:03 EndTime=2019-05-08T11:08:03 Duration=30-00:00:00
   Nodes=b[001-012],c[001-180] NodeCnt=192 CoreCnt=7680 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=7680
   Users=ohni,alexty,schiotz,jensj,alexbag,logiar,heih,haowana,manand,zwang,heenen,kasv,chrsop
   Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=Maintenance StartTime=2019-04-23T08:00:00 EndTime=2019-04-24T20:00:00 Duration=1-12:00:00
   Nodes=b[001-012],c[001-180] NodeCnt=192 CoreCnt=7680 Features=(null) PartitionName=(null)
   Flags=MAINT,IGNORE_JOBS,SPEC_NODES TRES=cpu=7680
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

So it seems that running jobs in a reservation are NOT calculated correctly; perhaps such jobs are counted twice, so that total_time exceeds the available resources.

> > I was searching in the Slurm source code for the text strings in the message:
> >
> > error: We have more time than is possible ... for cluster
> >
> > but there are zero hits for strings such as "is possible" etc.
> >
> > Could you point me to the source file which calculates and prints the error
> > message?
>
> src/plugins/accounting_storage/mysql/as_mysql_rollup.c:549
> (That string is split across lines. There are still some of these from older
> code, and it is obnoxious)
>
> The main thing you could do that I can't from here until I replicate this is
> to attach with gdb and set a watchpoint on the time_alloc field during an
> hourly rollup.

Unfortunately I'm rather rusty on the use of debuggers on running daemons, and I don't want to risk corrupting the database or interrupting slurmdbd or slurmctld. But I can try to understand the code and the assignments to the loc_tres->time_alloc field a little.

> > Since we plan to release the 192 nodes for general use during next week, it
> > would be great if we could debug the issue while I can still uphold the
> > reservation on the nodes.
> I _expect_ this problem to go away when the reservation on the new nodes is
> removed. Let me know if/when you are going to release these nodes this week.

I agree: if the miscalculated reservation time goes to zero, 2*zero is also zero, and the error will be swept under the rug.
Could you send a database dump so I can directly try to replicate these errors and examine your database for issues? Once you attach it, we can make it private.

mysqldump -p <cluster_db_name> > 6839_database_dump.sql

Thanks
(In reply to Broderick Gardner from comment #13)
> Could you send a database dump so I can directly try to replicate these
> errors and examine your database for issues? Once you attach it, we can make
> it private.
>
> mysqldump -p <cluster_db_name> > 6839_database_dump.sql

The file has been uploaded now; please make it private.

Thanks,
Ole
We currently have 25 nodes (out of 750) which are drained, and this once more triggers the incorrect error message:

[2019-07-19T13:00:00.089] error: We have more time than is possible (69990838+2073600+0)(72064438) > 72028800 for cluster niflheim(20008) from 2019-07-19T12:00:00 - 2019-07-19T13:00:00 tres 5

We run Slurm 18.08.8 on all nodes. Have you had time to look into this?

Thanks,
Ole
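The same core-hour arithmetic applies to this newer message. A sketch checking the July 19 numbers (the 576-core down time is consistent with 25 drained nodes of mixed core counts, which is an assumption on my part):

```python
# Check the July 19 error message the same way as the earlier one.
SECONDS_PER_HOUR = 3600

total_time = 72028800
print(total_time // SECONDS_PER_HOUR)  # 20008 cores: the grown cluster size

time_alloc, time_down = 69990838, 2073600
total_used = time_alloc + time_down + 0
print(total_used)                      # 72064438, matching the log line

# The overshoot is small but nonzero:
print(total_used - total_time)         # 35638 seconds of impossible time

# Down time corresponds to 576 cores; plausibly the 25 drained nodes,
# assuming they have mixed core counts.
print(time_down // SECONDS_PER_HOUR)   # 576
```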
I am still analyzing your database; I was busy with other issues, but I have gotten back to it now. Thanks for letting me know that it is still occurring. Have your active reservations changed since you sent me your database? What is the state now?

I have narrowed the problem down to reservation records: when I remove a certain one, the extra time_alloc disappears. There is a new record each time a reservation changes; the record in question is the first in this list (if you ran that query, you would see more records; I deleted them until the error went away):

MariaDB [slurm_acct_db]> select * from niflheim_resv_table;
+---------+---------+--------------------------------------------------------------+-------+-----------------------+----------+-----------+------------+------------+--------+--------------+
| id_resv | deleted | assoclist                                                    | flags | nodelist              | node_inx | resv_name | time_start | time_end   | tres   | unused_wall  |
+---------+---------+--------------------------------------------------------------+-------+-----------------------+----------+-----------+------------+------------+--------+--------------+
|      19 |       0 | 58                                                           | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1554714483 | 1554998058 | 1=7680 | 296186       |
|      19 |       0 | 58,128,75                                                    | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1554998058 | 1555049603 | 1=7680 | 48716.8125   |
|      19 |       0 | 58,128,75,417,414,59                                         | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555049603 | 1555311323 | 1=7680 | 261779       |
|      19 |       0 | 58,128,75,417,414,59,325,250,236,439,377,388                 | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555311323 | 1555312904 | 1=7680 | 1581         |
|      19 |       0 | 58,128,75,417,414,59,325,250,236,439,377,388,412,322,124     | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555312904 | 1555484373 | 1=7680 | 119837.28125 |
|      19 |       0 | 58,128,75,417,414,59,325,250,236,439,377,388,412,322,124,416 | 32768 | b[001-012],c[001-180] | 140-331  | Newnodes  | 1555484373 | 1557306483 | 1=7680 | 1590453      |
+---------+---------+--------------------------------------------------------------+-------+-----------------------+----------+-----------+------------+------------+--------+--------------+
6 rows in set (0.001 sec)

Look up that user with:

select * from niflheim_assoc_table where id_assoc=58;

I am guessing that that is your user, by the name. You then added more users to the reservation. I don't know yet what is wrong with it, but I've narrowed it down enough that I can figure it out. That's my update; I should have more information within a few days.

Thanks
I'm out of the office until August 12th.

I am not in the office, back again on March 12.

Best regards,
Ole Holm Nielsen
(In reply to Broderick Gardner from comment #18)
> Have your active reservations changed since you sent me your database?

We have no reservations as of 2019-07-19.

> What is the state now?

As of August 15 we are still seeing the issue, the reason being that 5-10 nodes have been drained (due to some scheduled hardware service):

[2019-08-14T21:00:00.741] error: We have more time than is possible (71156595+964800+0)(72121395) > 72028800 for cluster niflheim(20008) from 2019-08-14T20:00:00 - 2019-08-14T21:00:00 tres 5
[2019-08-15T00:00:00.861] error: We have more time than is possible (71069307+964800+0)(72034107) > 72028800 for cluster niflheim(20008) from 2019-08-14T23:00:00 - 2019-08-15T00:00:00 tres 5

> I have narrowed the problem down to reservation records;
> when I remove a certain one, the extra time_alloc disappears.

We have had no reservations in the past weeks!

> There is a new record each time a reservation changes; the record in
> question is the first in this list: (if you ran that, you would see more
> records; I deleted them until the error went away)
...
> Look up that user with:
> select * from niflheim_assoc_table where id_assoc=58;
>
> I am guessing that that is your user by the name. You then added more users
> to the reservation. I don't know yet what is wrong with it, but I've
> narrowed it down enough that I can figure it out. That's my update; I should
> have more information within a few days.

I have no idea what id_assoc=58 refers to; it's probably a Slurm database index? Anyway, users are irrelevant in this context, although user jobs would be involved in calculating the total available time.

Thanks for your efforts,
Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #20)
> > I have narrowed the problem down to reservation records;
> > when I remove a certain one, the extra time_alloc disappears.
>
> We have had no reservations in the past weeks!

I see. What is the output of

mysql> select * from niflheim_resv_table;

?

> I have no idea what id_assoc=58 refers to, it's probably be a Slurm database
> index? Anyway, users are irrelevant in this context, although user jobs
> would be involved in calculating the total available time.

It is a Slurm database index, which is why I included how to look up which user it refers to; I did not want to publicly post a username, even though it is almost certainly yours. It is only relevant because the difference between the reservation record that appears to cause the problem and the others is the associations assigned to it (they are all the same reservation). That is the only reason I brought it up.

Because of what I found in my testing, I have focused my attention on the part of the rollup code that handles reservations. This new information is concerning but hopefully enlightening.
(In reply to Broderick Gardner from comment #21)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #20)
> > > I have narrowed the problem down to reservation records;
> > > when I remove a certain one, the extra time_alloc disappears.
> >
> > We have had no reservations in the past weeks!
> I see. What is the output of
> mysql> select * from niflheim_resv_table;
> ?

MariaDB [(none)]> use slurm_acct_db
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from niflheim_resv_table;
+---------+---------+-----------+-------+------------+----------+-------------+------------+------------+--------+-------------+
| id_resv | deleted | assoclist | flags | nodelist   | node_inx | resv_name   | time_start | time_end   | tres   | unused_wall |
+---------+---------+-----------+-------+------------+----------+-------------+------------+------------+--------+-------------+
|      24 |       0 | 2         | 32833 | x[001-192] | 558-749  | Maintenance | 1562569200 | 1562578794 | 1=4608 | 9594        |
|      24 |       0 | 2         | 32833 | x[001-170] | 558-727  | Maintenance | 1562578794 | 1562579267 | 1=4080 | 473         |
|      24 |       1 | 2         | 32833 | x[001-130] | 558-687  | Maintenance | 1562579267 | 1562580486 | 1=3120 | 1219        |
+---------+---------+-----------+-------+------------+----------+-------------+------------+------------+--------+-------------+

> > I have no idea what id_assoc=58 refers to, it's probably be a Slurm database
> > index? Anyway, users are irrelevant in this context, although user jobs
> > would be involved in calculating the total available time.
>
> It is a slurm database index, which is why I included how to look up which
> user it refers to; I did not want to publicly post a username, even though
> it is almost certainly yours. It is only relevant because the difference
> between the reservation record that appears to cause the problem and the
> others is the associations assigned to it (they are all the same
> reservation). That is the only reason I brought it up.

I see. I did:

MariaDB [slurm_acct_db]> select * from niflheim_assoc_table where id_assoc=58;

and this indeed displays my username. I rarely run jobs under this username, so I don't understand why my username pops up here.

> Because of what I found in my testing, I have focused my attention on the
> part of the rollup code that handles reservations. This new information is
> concerning but hopefully enlightening.

Thanks for your support!
/Ole
FYI: We have made a new maintenance reservation for the entire cluster:

# scontrol show reservations
ReservationName=Electrical StartTime=2019-09-04T08:00:00 EndTime=2019-09-04T15:00:00 Duration=07:00:00
   Nodes=a[001-140],b[001-012],c[001-180],d[001-019,021-068],g[001-021,024-110],h[001-002],i[003-050],x[001-192]
   NodeCnt=749 CoreCnt=16472 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=16472
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

and now the usual warning pops up in the slurmdbd log again:

[2019-08-30T03:00:00.153] error: We have more allocated time than is possible (72097682 > 72014400) for cluster niflheim(20004) from 2019-08-30T02:00:00 - 2019-08-30T03:00:00 tres 5
[2019-08-30T04:00:00.199] error: We have more allocated time than is possible (72134160 > 72014400) for cluster niflheim(20004) from 2019-08-30T03:00:00 - 2019-08-30T04:00:00 tres 5
[2019-08-30T08:00:00.360] error: We have more allocated time than is possible (72099652 > 72014400) for cluster niflheim(20004) from 2019-08-30T07:00:00 - 2019-08-30T08:00:00 tres 5
[2019-08-30T09:00:00.400] error: We have more allocated time than is possible (72055810 > 72014400) for cluster niflheim(20004) from 2019-08-30T08:00:00 - 2019-08-30T09:00:00 tres 5
[2019-08-30T10:00:00.440] error: We have more allocated time than is possible (72172120 > 72014400) for cluster niflheim(20004) from 2019-08-30T09:00:00 - 2019-08-30T10:00:00 tres 5
[2019-08-30T11:00:00.481] error: We have more allocated time than is possible (72101704 > 72014400) for cluster niflheim(20004) from 2019-08-30T10:00:00 - 2019-08-30T11:00:00 tres 5
We've been able to reproduce the first error message, "We have more time than is possible", and determine its cause. In this case the entire reservation (because it does not have the IGNORE_JOBS flag) is counted as allocated time, and the down nodes are also counted towards the down time. Because the reservation covered the entire partition, once these are both added up they overflow the total time for the partition.

The second message, "We have more allocated time than is possible", is slightly different, but it is almost certainly related to reservation accounting like the first. We are researching the best way to enhance reservation accounting to avoid these errors.
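The diagnosed overflow can be illustrated with the numbers from earlier in this ticket. This is a toy model of the described double counting, not the actual rollup code:

```python
# Toy model of the diagnosed bug (not the actual rollup code): a
# reservation covering a whole partition is counted fully as allocated
# time, while the drained nodes inside it are *also* counted as down.
SECONDS_PER_HOUR = 3600

resv_cores = 7680   # Newnodes reservation covered the whole xeon40 partition
down_cores = 480    # 12 drained 40-core nodes inside that reservation

time_alloc_from_resv = resv_cores * SECONDS_PER_HOUR  # whole reservation
time_down = down_cores * SECONDS_PER_HOUR             # same nodes again
partition_total = resv_cores * SECONDS_PER_HOUR       # max possible

# The 480 down cores are counted twice, so alloc + down overflows the
# partition's total possible time for the hour:
print(time_alloc_from_resv + time_down > partition_total)    # True
print((time_alloc_from_resv + time_down) - partition_total)  # 1728000 excess
```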
*** Ticket 8121 has been marked as a duplicate of this ticket. ***
Because of the tricky nature of the patch, the decision was made to put this into 20.11 early to allow as much time for testing as possible. The patch has now been committed:

commit 71e7577efe6289a17f111c7cce2af03220343a82
Author: Douglas Wightman <wightman@schedmd.com>

    Correctly handle down nodes in a reservation.
I am currently out of office and will return back on Feb 27.
*** Ticket 9798 has been marked as a duplicate of this ticket. ***