We updated today to 20.11.3 and after the update we noticed:

sshare
             Account       User  RawShares  NormShares             RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- -------------------- ------------- ----------
root                                          0.000000  9223372036854775808      1.000000
 root            root          1    0.000913  9223372036854775808           nan   0.000917
 astro                        70    0.063927  9223372036854775808           nan
 eost                     parent    0.063927  9223372036854775808           nan
 kunz                     parent    0.063927  9223372036854775808           nan
 cbe                          42    0.038356  9223372036854775808           nan
 cee                           9    0.008219  9223372036854775808           nan
 chem                         45    0.041096  9223372036854775808           nan
 eac                          220   0.200913  9223372036854775808           nan
 geo                          90    0.082192  9223372036854775808           nan
...

At least RawUsage is wrong, and that is likely the reason why other numbers, like FairShare, EffectvUsage and others, are wrong too. I have no idea what went wrong here - downgrading the client to 20.11.2 does not help, nor does downgrading slurmdbd or slurmctld - it still reports bad numbers (though I did not try downgrading all of it at the same time). Which is why I am putting this in "Component: Other" - I don't know what is wrong here.

Josko
We (Yale) are also seeing this issue with nan values in sshare, and we just upgraded from 20.02.3 to 20.02.6. We have noticed that the users with RawUsage=9223372036854775808 are the users who should have (i.e. used to have) RawUsage=0.
Created attachment 17857 [details] Yale 'sshare -a -l' output

Here's the 'sshare -a -o "RawShares,NormShares,RawUsage,EffectvUsage,FairShare,NormUsage,GrpTRESMins,GrpTRESRaw,TRESRunMins,LevelFS"' output from our impacted system at Yale. From this we can see that the GrpTRESRaw cpu values are reaching 2^63 (9223372036854775808). It may be interesting to compare this with Princeton.

Best,
Adam
Hi all,

Thanks for reporting the issue and for the information already attached. It seems that the same issue has also been reported on bug 10831. I'll concentrate my investigation here.

A couple of things:

Could you check if you are also getting this error reported on bug 10831:

Feb 10 09:28:32 holy-slurm02 slurmctld[21208]: error: JobId=16980533 priority '9223372036854775808' exceeds 32 bits. Reducing it to 4294967295 (2^32 - 1)

Also, could you dump the output of "scontrol show assoc" into a file and attach it?

Thanks,
Albert
*** Ticket 10831 has been marked as a duplicate of this ticket. ***
Hi,

Could you do this:

1) Run "scontrol setdebugflags +Priority"
2) Run sprio and sshare
3) Wait for the time that you have on PriorityCalcPeriod
4) Run sprio and sshare again
5) Run "scontrol setdebugflags -Priority"
6) Attach your slurmctld log

Thanks,
Albert
Created attachment 17880 [details] yale_scontrol_show_assoc.gz

Hi Albert,

I checked that yesterday but we have not seen the same "exceeds" error that Harvard reported. In our case the 'user' fairshare values are normal (not nan, between 0 and 1) and our sprio values appear to be normal. I don't know for sure whether our priority calculations are being negatively impacted (our 'account' fairshare values are undefined) but we do not appear to be doing any FIFO scheduling as a result of all this. Our 'scontrol show assoc' is attached.

..we also noticed this:

[root@mgt2.grace ~]# sshare -a -o "GrpTRESRaw%250" | head -5
GrpTRESRaw
-----------------------------------------------------
cpu=465375763,mem=2919562476443,energy=9223372036854775808,node=123386296,billing=338831697,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=1090738
cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808

Thank you,
Adam
Created attachment 17882 [details] show assoc output As requested - show assoc output from one of the affected clusters (and all of them are...).
Created attachment 17883 [details] slurmctld log

As requested in comment #6. I had to raise the debugging level (to debug3) - with our default of "error" it was not outputting anything.
Just FYI, I'm going to tag our logs with harvard in the name so you know which is which. I will pull the requested data as soon as I am able.
I am raising this to high impact - we are getting many more complaints than usual about job scheduling issues, right after we upgraded to 20.11.3. While we are not certain that this is the reason, it is hard not to be suspicious of it. BTW, only users that have not run any jobs since the update have meaningful numbers. Thanks.
Created attachment 17885 [details] Harvard sacctmgr show assoc Results of sacctmgr show assoc for Harvard.
Created attachment 17886 [details] yale_scontrol_log_feb_11.gz

Hi Albert,

Attached as requested (Yale, Grace cluster).

Thank you,
Adam
Created attachment 17887 [details] yale_scontrol_show_assoc.gz
Created attachment 17888 [details] Harvard slurmctld log with Priority logging on

Our PriorityCalcPeriod is set to 1 minute. I followed this pattern:

1) Run "scontrol setdebugflags +Priority"
2) Run sprio and sshare
3) Wait for the time that you have on PriorityCalcPeriod
4) Run sprio and sshare again
5) Run "scontrol setdebugflags -Priority"
6) Attach your slurmctld log
Created attachment 17889 [details] yale_sprio_sshare_with_debugflagpriority.txt Hi Albert, In case you wanted to see them, here are the before/after sshare/sprio outputs. Thank you, Adam
Thank you all for the information.

We've been digging and it seems clear that the problem is that an internal variable storing the usage is being corrupted for some reason, which gives us those NaN values. Some commands show those NaN values (internally floats) as big integers, but that's just a printing issue. The important thing is the internal wrong/NaN values in slurmctld, and this is impacting fairshare for sure. The fact that several of you noted it is happening to users/assocs that had 0 usage before will help. I've not been able to reproduce it, though.

Could you confirm the versions that you upgraded from?

Yale:      20.02.3 -> 20.02.6
Princeton: 20.11.2 (?) -> 20.11.3
Harvard:   ??? -> 20.11.3 (not an upgrade, just a restart?)

I'll keep you posted,
Albert
Harvard upgraded from 20.11.2 -> 20.11.3 on February 1st. I filed a ticket then regarding a slurmdbd issue: https://bugs.schedmd.com/show_bug.cgi?id=10753

However that error went away. We had restarted slurmctld multiple times between Feb 1st and when we first saw this error yesterday.

-Paul Edmon-
Indeed, 20.11.2 -> 20.11.3 for us. Note that I said that the only users that do not seem to have been affected are those that did *not* run any jobs recently. On the other hand, we have a small test setup that ran no jobs recently and yet I now find it messed up as well...
20.02.3 (with bug8847_20023_v13.patch applied) -> 20.02.6 confirmed for Yale.

This is happening on our largest cluster (Grace) but we have three other clusters not showing this problem (one already running 20.02.6 and two running 20.02.5).

Thank you,
Adam
An additional point on Harvard's end: we have one other cluster that went through the same upgrade route but hasn't seen this error. It's unique to our main production cluster.

-Paul Edmon-
*** Ticket 10836 has been marked as a duplicate of this ticket. ***
Hi,

I've not been able to reproduce your issue yet, but I'm already working on a debug/workaround patch to identify what is leading to your situation, and to try to avoid it in the first place with extra defensive code.

In the meantime, I think we could try a simpler workaround to let your clusters recover some of their fairshare. Could you try running these commands, using your accounts and users with wrong values as $WRONG_ACCT and/or $WRONG_USER:

$ sacctmgr modify account $WRONG_ACCT set RawUsage=0
$ sacctmgr modify user $WRONG_USER where account=$WRONG_ACCT set RawUsage=0

You can do this for all your accounts if you want a full reset of the fairshare usage. I suspect it won't be enough to make things work fine again, though. But in some cases it may, and if it restores sane values in sshare and, after a while (or a restart), wrong values appear again, please send me new logs because that could help us find the root cause. Also slurmdbd logs.

In the meantime, I'll keep working on a more specific patch. Also, to help me reproduce it, could you attach your slurm.conf?

Thanks,
Albert
Created attachment 17914 [details] Debug patch for 20.11.3 to detect NaN on raw usage (v1)

Hi,

This is an initial debug/workaround patch that detects and tries to avoid association usage being set to NaN. At this point the patch is focused on the main assoc usage, not on the specific TRES usage. When the patched slurmctld is restarted, any association usage that was saved as NaN (in StateSaveLocation) will be set to 0, but not the specific TRES usage values. To also reset wrong usage on specific TRES, please use the sacctmgr commands mentioned on comment 28 before applying the patch.

The patch will log new error messages starting with "NAN". Please send me the slurmctld logs with the patch applied and let me know if it helped.

Please note that this patch is not an actual fix. I'm focusing on reproducing the issue now to find the root cause and write an actual fix.

Regards,
Albert
Created attachment 17915 [details] Debug patch for 20.02.6 to detect NaN on assoc usage (v1)

This is the same patch for 20.02.6 (as the previous one doesn't apply directly). Please use the right patch for your version.

Regards,
Albert
Thank you, We’ll try this on our test cluster first and then switch over to using the patched version on our main production cluster. Does the patch negate the need to perform the sacctmgr workaround? Cheers, Adam
We've applied the patch to our 20.02.6 build. This has improved things, but we still see the 2^63 values show up in the GrpTRESRaw columns.

[root@mgt2.grace ~]# sshare | head -10
             Account       User  RawShares  NormShares     RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ------------ ------------- ----------
root                                          0.000000  21507212375      1.000000
 root            root          1   0.002020            0     0.000000   0.834494
 abadi                         1   0.002020            0     0.000000
 abaluck                       1   0.002020            0     0.000000
 acar                          1   0.002020          539     0.000000
 admins                        1   0.002020       775989     0.000036
  admins         root          1   0.045455          104     0.000134   0.520201
 ague                          1   0.002020    149707450     0.006961

[root@mgt2.grace ~]# sshare -a -o "RawUsage%20,GrpTRESRaw%250" | head -10
            RawUsage GrpTRESRaw
-------------------- -----------------------------------------------------
         21507194467 cpu=470440048,mem=2947845411827,energy=9223372036854775808,node=123276933,billing=339443226,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=1129485
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=0,mem=0,energy=9223372036854775808,node=0,billing=0,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
Created attachment 17917 [details] Yale_slurm_conf_paramemters Our slurmctld/slurmd configuration is broken up into four config files, including this one. The other 3 relate to the node definitions, partition definitions, and cluster definition.
Created attachment 17919 [details] Harvard slurm.conf Here is Harvard's slurm.conf
After running the sacctmgr RawUsage=0 command to zero out bad accounts and restarting slurmctld, all the NaNs in fairshare were cleared, along with the errors we were seeing in our logs about nonsensical priorities. I haven't applied the patch yet but I will do so and let you know if I continue to see problems.
Adam,

> We’ll try this on our test cluster first and then switch over to using the
> patched version on our main production cluster.

Great!

> Does the patch negate the need to perform the sacctmgr workaround?

No; as you mentioned in comment 32, v1 of the patch does not (yet) work around the GrpTRESRaw NaNs. We still need the sacctmgr command to clean those. What I want to discover is whether they appear again or not after the command + patch, i.e. the root cause.

Please send me the slurmctld logs once the patch is applied, and especially if any NaN or insane number appears again with the patch applied and the sacctmgr command run.

Thanks,
Albert

BTW, in case you are curious, this small test confirms that the "magic" 9223372036854775808 shown is just a float NaN cast to an integer (we use floats internally for the fairshare maths, and integers only to show the values in the commands):

#include <stdio.h>
#include <inttypes.h>
#include <math.h>

int main(void)
{
    long double ld = NAN;
    uint64_t ui = (uint64_t)ld;
    printf("ld = %Lf\n", ld);
    printf("ui = %"PRIu64" (0x%"PRIx64")\n", ui, ui);
    return 0;
}

Output:

ld = nan
ui = 9223372036854775808 (0x8000000000000000)
Paul,

> After running the sacctmgr RawUsage=0 command to zero out bad accounts and
> restart slurmctld that cleared all the NaN's in fairshare and the errors we
> were seeing in our logs about nonsensical priorities.

Good!

> I haven't applied the patch yet but I will do so and let you know if I
> continue to see problems.

The patch adds extra defensive code that will:
- Avoid some NaNs being propagated
- Log some attempts to set a NaN

The first will help the cluster's stability, and the second will help me find more clues about the root cause. So, if you can apply the patch, it will help us.

Thanks,
Albert
Yup, I've applied it to our cluster and I will let you know if we see a recurrence of the error and what the logs say.

-Paul Edmon-
Hm, now that you mention that casting/printf test - maybe this is related to:

https://access.redhat.com/errata/RHBA-2021:0439

We upgraded glibc to the problematic version at the same time as we upgraded Slurm to 20.11.3.

We did apply the patch as well, BUT we also just updated glibc to the version mentioned in the errata (2.17-323.el7_9).

Was everyone that saw this issue on RHEL7 with glibc version 2.17-322.el7_9?
Actually - pretty sure it is related to the glibc.

On one of our smaller clusters I applied both the slurm and glibc updates to the newest versions (with the patch). Restarted slurmctld - no more "error: NAN ignored loading assoc usage". I can restart slurmctld all I want, no messages.

Downgrade glibc to 322. Restart slurmctld (to get it to use the 322 glibc). Restart it again (now that it runs with the 322 version) - the "NAN ignored" messages appear... Every time:

[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# rpm -q glibc
glibc-2.17-322.el7_9.x86_64
glibc-2.17-322.el7_9.i686

Once I am back on 323, after the first restart (to clear the problem and start using the 323 libs), the later tails only show the stale entries from that one restart - no new NAN messages:

[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
(In reply to Josko Plazonic from comment #40)
> Actually - pretty sure it is related to the glibc.
>
> On one of our smaller clusters - applied both slurm and glibc updates to
> newest versions (with patch). Restart slurmctld - no more "error: NAN
> ignored loading assoc usage". I can restart slurmctld all I want, no
> messages.
>
> Downgrade glibc to 322. Restart slurmctld (to get it to use 322 glibc).
> Restart it again (now that it runs with 322 version) - NAN ignored messages
> appear... Every time:

Wow! That sounds like a great finding! It would explain everything I was seeing (the same problem across different versions, no code flaws detected, no way to reproduce it...).

I'll take a closer look. Could the rest of you confirm your glibc version?

Thanks Josko!
Albert
This is what we are running:

[root@holy-slurm02 ~]# rpm -qa | grep glibc
glibc-2.17-323.el7_9.x86_64
glibc-devel-2.17-323.el7_9.x86_64
glibc-2.17-323.el7_9.i686
glibc-headers-2.17-323.el7_9.x86_64
glibc-common-2.17-323.el7_9.x86_64

-Paul Edmon-

On 2/12/2021 11:01 AM, bugs@schedmd.com wrote:
> Comment #39 on bug 10824 from Josko Plazonic:
>
> Hm, now that you mentioned that casting/printf test - maybe this is related to:
>
> https://access.redhat.com/errata/RHBA-2021:0439
>
> We upgraded glibc to the problematic version at the same time as we upgraded
> slurm to 20.11.3.
>
> We did apply the patch as well BUT we also just updated glibc to the version
> mentioned in the errata (2.17-323.el7_9).
>
> Was everyone that saw this issue on RHEL7 with glibc version 2.17-322.el7_9?
>
> You are receiving this mail because you are on the CC list for the bug.
Right, but when did you upgrade to 323? That one was released by Red Hat on Feb 8th (not sure how quickly CentOS had theirs out, surely not before the 8th). Had you run with 322 when/while you had the problem?
We upgraded to 323 on Feb 9 at 9:35. I started seeing the NAN errors at 9:21 on Feb 10, after I did a slurmctld restart.

-Paul Edmon-
A possibility is that we already had the errors introduced by this glibc bug, but we only noticed them after a slurmctld restart.

-Paul Edmon-
Hi,

This is making perfect sense to me. The patch always reports the "NAN ignored loading assoc usage" error, so the NaN appears right when slurmctld reads the long double back from the StateSaveLocation, and that's the key:

if (sscanf(val_str, "%Lf", &nl) != 1)
        return SLURM_ERROR;

Note that the values saved in the StateSaveLocation were written with:

snprintf(val_str, sizeof(val_str), "%Lf", val);

The glibc bug makes long doubles holding 0 print as "nan":

#include <stdio.h>

int main(void)
{
        long double d = 0;
        printf("%Lg\n", d);
        printf("%Lf\n", d);
        return 0;
}

$ gcc ld2str.c
$ ./a.out

Actual results:
nan
nan

Expected results:
0
0.000000

So it makes perfect sense that associations with usage 0 are saved to disk as NaNs and then read back as NaNs when slurmctld is restarted. And that can happen with any other long double saved/restored by slurmctld in the StateSaveLocation (!).

Thanks for the clue!
Albert
We also did an OS update prior to changing to 20.02.6 and that update upgraded our glibc to -322. Outstanding catch Josko, well done!
I wish I could take the credit but one of our users complained of a new problem with opencv - stopped compiling - and even provided a link to an opencv open bug that, helpfully enough, had a link to the errata. Add Albert's test code and it is easy to connect the dots. So a bright user with good bug report gets the credit.
Interesting. We keep our compute nodes on a locked version, so we didn't see this there. However, our Slurm masters are not on a locked version and automatically update to the latest glibc to pick up bug and security fixes as soon as possible.

This was a great find.

-Paul Edmon-
After upgrading glibc to 2.17-323 and restarting slurmctld:

[root@mgt2.grace ~]# sshare -a -o "GrpTRESRaw%250" | head -10
GrpTRESRaw
-----------------------------------------------------
cpu=471514502,mem=2950804311542,energy=0,node=123183524,billing=339396740,fs/disk=0,vmem=0,pages=0,gres/gpu=1131310
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
Hi all,

This has been a great team effort, thank you! :-)

Once your usage is fine and you are on a safe glibc version, please remove the debug patch I posted before.

We are discussing internally another patch to defend Slurm against that glibc bug (which may also be impacting other Slurm areas). I'll keep you posted.

Thanks!
Albert
Hi Albert, Is undoing the patch urgent or can that wait until Monday? I'm already way over-quota in terms of Friday changes to production systems. Thank you, Adam
Hi Adam,

> Is undoing the patch urgent or can that wait until Monday? I'm already way
> over-quota in terms of Friday changes to production systems.

Not urgent at all. The patch won't do anything; I just wanted to remind you so you don't end up carrying debug and otherwise unnecessary patches. If we go ahead and publish a better patch that avoids/works around the problem in the first place, even on a buggy glibc, you'll get it from an official Slurm release.

Have a soft Friday,
Albert
Hi,

I'm glad to let you know that we have pushed an official workaround for the glibc bug, and it will be released as part of 20.11.4:

https://github.com/SchedMD/slurm/commit/c57311f19d2ec9a258162909699aba9505e368b8

commit c57311f19d2ec9a258162909699aba9505e368b8
Author:     Albert Gil <albert.gil@schedmd.com>
AuthorDate: Fri Feb 12 18:41:37 2021 +0100

    Work around glibc bug where "0" as a long double is printed as "nan".

    On broken glibc versions, the zeroes in the association state file
    will be saved as "nan" in packlongdouble(). Detect if this has
    happened in unpacklongdouble() and convert back to zero.

    https://bugzilla.redhat.com/show_bug.cgi?id=1925204

With this I'm closing the bug as fixed. Thank you all for your great team work on this one!
*** Ticket 10919 has been marked as a duplicate of this ticket. ***