Created attachment 17868 [details]
sshare output

We just updated our Grace cluster from 20.02.3 to 20.02.6 and are now seeing many impossibly high RawUsage numbers and nan values for EffectvUsage in sshare (see attached). During the update we also moved from cons_res to cons_tres (not sure if that is relevant, but it was one of the only configuration changes made).

We have also noticed a pattern: many (if not all) of the users with RawUsage=9223372036854775808 are the users who should have (i.e. used to have) RawUsage=0.

Our issue seems similar to this recently reported one, although we are on a different version: https://bugs.schedmd.com/show_bug.cgi?id=10824

Kaylea
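As a side note (our own guess, not confirmed by SchedMD): 9223372036854775808 is exactly 2**63, i.e. a 64-bit word with only the sign bit set, which is what INT64_MIN looks like when printed as unsigned. A quick Python sketch verifying the bit pattern:

```python
import struct

raw_usage = 9223372036854775808  # the impossible value shown in sshare
assert raw_usage == 2 ** 63      # only the sign bit of a 64-bit word is set

# Reinterpret the same 64-bit pattern as a signed integer:
packed = struct.pack("<Q", raw_usage)
signed = struct.unpack("<q", packed)[0]
print(signed)  # -9223372036854775808, i.e. INT64_MIN
```

This would be consistent with a negative (or NaN-derived) usage value passing through an unsigned 64-bit display path, which would also fit the nan EffectvUsage values.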
Created attachment 17869 [details] current conf
We also found that, prior to a slurmctld and slurmdbd restart on 2/8, there were many errors similar to:

error: We have more time than is possible (634982400+582615985152+0)(583250967552) > 583027891200 for cluster grace(161952192) from 2021-02-02T22:00:00 - 2021-02-02T23:00:00 tres 2
error: We have more time than is possible (115200+62791336+0)(62906536) > 62802000 for cluster grace(17445) from 2021-02-03T13:00:00 - 2021-02-03T14:00:00 tres 5

The cluster was undergoing maintenance from 2/2-2/4, so there were no users on the system, but Yale staff may have been running test jobs during some of that time.
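For what it's worth, the numbers in those log lines look internally consistent: the value after the cluster name appears to be a TRES count, and the "possible" figure is exactly that count times 3600 seconds (the one-hour rollup window those timestamps span). A quick arithmetic check (variable names are our guesses, not Slurm's):

```python
# First error line (tres 2): the three summed terms exceed count * 3600
terms = (634982400, 582615985152, 0)
total = sum(terms)
count = 161952192            # the number printed after "grace(...)"
possible = count * 3600      # one hour of time per TRES unit
print(total, possible, total > possible)  # 583250967552 583027891200 True

# Second error line (tres 5) follows the same pattern:
print(17445 * 3600)          # 62802000, matching the logged "possible" value
```

So the accounting rollup for those hours really did record more TRES-time than the cluster could have provided, rather than the log message itself being garbled.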
Hi Kaylea,

Yes, I'm already tracking your case on bug 10824. Although you have a different version than Harvard and Princeton, the root error seems to be the same. It may also be a clue that you are seeing those "more time than is possible" errors, because Harvard also had them on bug 10753.

If this is OK with you, I'm closing this bug as a duplicate of bug 10824 and concentrating our investigation there. If we eventually find that the problem is not shared between versions, I'll reopen this one.

Regards,
Albert
Marking as duplicate of bug 10824. *** This ticket has been marked as a duplicate of ticket 10824 ***