Ticket 10836 - RawUsage numbers suddenly impossibly high after upgrade
Summary: RawUsage numbers suddenly impossibly high after upgrade
Status: RESOLVED DUPLICATE of ticket 10824
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 20.02.6
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-02-10 12:42 MST by Kaylea Nelson
Modified: 2021-02-12 02:36 MST

See Also:
Site: Yale
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: RHEL
Machine Name: Grace
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
sshare output (2.33 KB, text/plain)
2021-02-10 12:42 MST, Kaylea Nelson
current conf (8.19 KB, text/plain)
2021-02-10 12:43 MST, Kaylea Nelson

Description Kaylea Nelson 2021-02-10 12:42:32 MST
Created attachment 17868 [details]
sshare output

We just updated our Grace cluster from 20.02.3 to 20.02.6 and are now seeing many impossibly high numbers in sshare for RawUsage, along with nan values for EffectvUsage (see attached).

During the update we also moved from cons_res to cons_tres (not sure if that is relevant, but it was one of the only configuration changes made).
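For reference, the cons_res to cons_tres move comes down to the SelectType line in slurm.conf; this is only an illustrative before/after excerpt, our actual settings are in the attached conf:

    # before the upgrade
    #SelectType=select/cons_res
    # after the upgrade
    SelectType=select/cons_tres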


Also, we have noticed a pattern: it appears that many (if not all) of the users with RawUsage=9223372036854775808 are the users who should have (i.e. used to have) RawUsage=0.
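For what it's worth, 9223372036854775808 is exactly 2^63, which is what a signed 64-bit value of INT64_MIN looks like when printed as unsigned. Purely as an illustration (this is not Slurm code, just a sketch of that interpretation):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* If a 64-bit usage counter ends up at INT64_MIN and is later
         * treated as unsigned, it prints as 9223372036854775808 (2^63). */
        int64_t usage = INT64_MIN;
        printf("%" PRIu64 "\n", (uint64_t)usage);
        return 0;
    }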

Our issue seems similar to this recently reported issue, although we are on a different version: https://bugs.schedmd.com/show_bug.cgi?id=10824


Kaylea
Comment 1 Kaylea Nelson 2021-02-10 12:43:00 MST
Created attachment 17869 [details]
current conf
Comment 2 Kaylea Nelson 2021-02-10 12:51:37 MST
We also found that, prior to a slurmctld and slurmdbd restart on 2/8, there were many errors similar to:

error: We have more time than is possible (634982400+582615985152+0)(583250967552) > 583027891200 for cluster grace(161952192) from 2021-02-02T22:00:00 - 2021-02-02T23:00:00 tres 2

error: We have more time than is possible (115200+62791336+0)(62906536) > 62802000 for cluster grace(17445) from 2021-02-03T13:00:00 - 2021-02-03T14:00:00 tres 5

The cluster was undergoing maintenance from 2/2 to 2/4, so there were no users on the system, but Yale staff may have been running test jobs for some of that time.
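As a sanity check on those messages, the right-hand limit appears to be the cluster's TRES count multiplied by the seconds in the one-hour rollup window, and the parenthesized sum (apparently allocated + down + planned-down time) exceeds it. Quick arithmetic, using the numbers from the two errors above, reproduces the limits:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t window = 3600;        /* one-hour rollup period */
        uint64_t tres2  = 161952192;   /* grace(161952192), tres 2 */
        uint64_t tres5  = 17445;       /* grace(17445), tres 5 */

        printf("%" PRIu64 "\n", tres2 * window);  /* 583027891200 */
        printf("%" PRIu64 "\n", tres5 * window);  /* 62802000 */
        return 0;
    }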
Comment 3 Albert Gil 2021-02-12 02:35:23 MST
Hi Kaylea,

Yes, I'm already tracking your case in bug 10824; although you are on a different version than Harvard and Princeton, the root error seems to be the same.

In fact, it may also be a clue that you have those "more time than is possible" errors, because Harvard also had them in bug 10753.

If this is ok for you, I'm closing this bug as a duplicate of bug 10824 to concentrate our investigation there.

If we eventually see that the problem is not shared between versions, I'll reopen this one.

Regards,
Albert
Comment 4 Albert Gil 2021-02-12 02:36:36 MST
Marking as duplicate of bug 10824.

*** This ticket has been marked as a duplicate of ticket 10824 ***