We updated today to 20.11.3 and after the update we noticed:

sshare
             Account       User  RawShares  NormShares             RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- -------------------- ------------- ----------
root                                          0.000000  9223372036854775808      1.000000
 root            root          1    0.000913  9223372036854775808           nan   0.000917
 astro                        70    0.063927  9223372036854775808           nan
 eost                     parent    0.063927  9223372036854775808           nan
 kunz                     parent    0.063927  9223372036854775808           nan
 cbe                          42    0.038356  9223372036854775808           nan
 cee                           9    0.008219  9223372036854775808           nan
 chem                         45    0.041096  9223372036854775808           nan
 eac                          220   0.200913  9223372036854775808           nan
 geo                          90    0.082192  9223372036854775808           nan
...

At least RawUsage is wrong, and that is likely the reason why other numbers, like FairShare, EffectvUsage and others, are wrong too. I have no idea what went wrong here - downgrading the client to 20.11.2 does not help, nor does downgrading slurmdbd or slurmctld - it still reports bad numbers (though I did not try downgrading all of it at the same time). Which is why I am putting this in "Component: Other" - I don't know what is wrong here.

Josko
We (Yale) are also seeing this issue with nan values in sshare, and we just upgraded from 20.02.3 to 20.02.6. We have noticed that the users with RawUsage=9223372036854775808 are the users who should have (i.e. used to have) RawUsage=0.
Created attachment 17857 [details] Yale 'sshare -a -l' output

Here's the 'sshare -a -o "RawShares,NormShares,RawUsage,EffectvUsage,FairShare,NormUsage,GrpTRESMins,GrpTRESRaw,TRESRunMins,LevelFS"' output from our impacted system at Yale. From this we can see that the GrpTRESRaw cpu values are reaching 2^63 (9223372036854775808). It may be interesting to compare this with Princeton.

Best,
Adam
Hi all,

Thanks for reporting the issue and for the information already attached. It seems that the same issue has also been reported on bug 10831. I'll concentrate my investigation here.

A couple of things:

Could you check if you are also getting this error reported on bug 10831:

Feb 10 09:28:32 holy-slurm02 slurmctld[21208]: error: JobId=16980533 priority '9223372036854775808' exceeds 32 bits. Reducing it to 4294967295 (2^32 - 1)

Also, could you dump the output of "scontrol show assoc" into a file and attach it?

Thanks,
Albert
*** Ticket 10831 has been marked as a duplicate of this ticket. ***
Hi,

Could you do this:

1) Run "scontrol setdebugflags +Priority"
2) Run sprio and sshare
3) Wait for the time that you have on PriorityCalcPeriod
4) Run sprio and sshare again
5) Run "scontrol setdebugflags -Priority"
6) Attach your slurmctld log

Thanks,
Albert
Created attachment 17880 [details] yale_scontrol_show_assoc.gz

Hi Albert,

I checked that yesterday but we have not seen the same "exceeds" error that Harvard reported. In our case the 'user' fairshare values are normal (not nan, between 0 and 1) and our sprio values appear to be normal. I don't know for sure whether our priority calculations are being negatively impacted (our 'account' fairshare values are undefined) but we do not appear to be doing any FIFO scheduling as a result of all this. Our 'scontrol show assoc' is attached.

..we also noticed this:

[root@mgt2.grace ~]# sshare -a -o "GrpTRESRaw%250" | head -5
GrpTRESRaw
-----------------------------------------------------
cpu=465375763,mem=2919562476443,energy=9223372036854775808,node=123386296,billing=338831697,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=1090738
cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808

Thank you,
Adam
Created attachment 17882 [details] show assoc output As requested - show assoc output from one of the affected clusters (and all of them are...).
Created attachment 17883 [details] slurmctld log

As requested in comment #6. I had to raise the debugging level (to debug3) - with our default of "error" it was not outputting anything.
Just FYI, I'm going to tag our logs with harvard in the name so you know which is which. I will pull the requested data as soon as I am able.
I am raising this to high impact - we are getting many more complaints than usual about job scheduling issues, right after we upgraded to 20.11.3. While we are not certain that this is the reason, it is hard not to be suspicious of it. BTW, only users that have not run any jobs since the update have meaningful numbers. Thanks.
Created attachment 17885 [details] Harvard sacctmgr show assoc Results of sacctmgr show assoc for Harvard.
Created attachment 17886 [details] yale_scontrol_log_feb_11.gz

Hi Albert,

Attached as requested (Yale, Grace cluster).

Thank you,
Adam
Created attachment 17887 [details] yale_scontrol_show_assoc.gz
Created attachment 17888 [details] Harvard slurmctld log with Priority logging on

Our PriorityCalcPeriod is set to 1 minute. I followed this pattern:

1) Run "scontrol setdebugflags +Priority"
2) Run sprio and sshare
3) Wait for the time that you have on PriorityCalcPeriod
4) Run sprio and sshare again
5) Run "scontrol setdebugflags -Priority"
6) Attach your slurmctld log
Created attachment 17889 [details] yale_sprio_sshare_with_debugflagpriority.txt Hi Albert, In case you wanted to see them, here are the before/after sshare/sprio outputs. Thank you, Adam
Thank you all for the information.

We've been digging and it seems clear that the problem is that an internal variable storing the usage is being corrupted for some reason, which gives us those NaN values. Some commands show those NaN values (internally floats) as big integers, but that's just a printing issue. The important thing is the internal wrong/NaN values in slurmctld, and this is impacting fairshare for sure. The fact that several of you noted it is happening to users/assocs that had 0 usage before will help. I've not been able to reproduce it, though.

Could you confirm the versions that you upgraded from?

Yale:      20.02.3 -> 20.02.6
Princeton: 20.11.2 (?) -> 20.11.3
Harvard:   ??? -> 20.11.3 (not an upgrade, just a restart?)

I'll keep you posted,
Albert
Harvard upgraded from 20.11.2 -> 20.11.3 on February 1st. I filed a ticket then regarding a slurmdbd issue: https://bugs.schedmd.com/show_bug.cgi?id=10753

However that error went away. We had restarted slurmctld multiple times between Feb 1st and when we first saw this error yesterday.

-Paul Edmon-
Indeed, 20.11.2 -> 20.11.3 for us. Note that I said that the only users that do not seem to have been affected are those that did *not* run any jobs recently. On the other hand, we have a small test setup that ran no jobs recently and yet I now find it messed up as well...
20.02.3 (with bug8847_20023_v13.patch applied) -> 20.02.6 confirmed for Yale.

This is happening on our largest cluster (Grace) but we have three other clusters not showing this problem (one already running 20.02.6 and two running 20.02.5).

Thank you,
Adam
An additional point on Harvard's end: we have one other cluster that went through the same upgrade route but hasn't seen this error. It's unique to our main production cluster.

-Paul Edmon-
*** Ticket 10836 has been marked as a duplicate of this ticket. ***
Hi,

I've not been able to reproduce your issue yet, but I'm already working on a debug/workaround patch to identify what is leading to your situation, and to try to avoid it in the first place with extra defensive code.

In the meantime, I think we could try a simpler workaround to let your clusters recover some of their fairshare. Could you try running these commands, using your accounts and users with wrong values as $WRONG_ACCT and/or $WRONG_USER:

$ sacctmgr modify account $WRONG_ACCT set RawUsage=0
$ sacctmgr modify user $WRONG_USER where account=$WRONG_ACCT set RawUsage=0

You can do this for all your accounts if you want a full reset of the fairshare usage. I suspect it won't be enough to make things work fine again, though. But in some cases it may, and if it restores sane values in sshare and, after a while (or a restart), wrong values appear again, please send me new logs because that could help us find the root cause. Also slurmdbd logs.

In the meantime, I'll keep working on a more specific patch. Also, to help me reproduce it, could you attach your slurm.conf?

Thanks,
Albert
Created attachment 17914 [details] Debug patch for 20.11.3 to detect NaN on raw usage (v1)

Hi,

This is an initial debug/workaround patch that detects and tries to avoid association usage being set to NaN. At this point the patch is focused on the main assoc usage, not on the specific TRES usage. When the patched slurmctld is restarted, any association usage that was saved as NaN (in StateSaveLocation) will be set to 0, but not the specific TRES usage values. To also reset wrong usage on specific TRES, please use the sacctmgr commands mentioned on comment 28 before applying the patch.

The patch will log new error messages starting with "NAN". Please send me the slurmctld logs with the patch applied and let me know if it helped.

Please note that this patch is not an actual fix. I'm focusing on reproducing the issue now to find the root cause and write an actual fix.

Regards,
Albert
Created attachment 17915 [details] Debug patch for 20.02.6 to detect NaN on assoc usage (v1)

This is the same patch for 20.02.6 (as the previous one doesn't apply directly). Please use the right patch for your version.

Regards,
Albert
Thank you, We’ll try this on our test cluster first and then switch over to using the patched version on our main production cluster. Does the patch negate the need to perform the sacctmgr workaround? Cheers, Adam
We've applied the patch to our 20.02.6 build. This has improved things, but we still see the 2^63 values show up in the GrpTRESRaw columns.

[root@mgt2.grace ~]# sshare | head -10
             Account       User  RawShares  NormShares     RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ------------ ------------- ----------
root                                          0.000000  21507212375      1.000000
 root            root          1   0.002020            0     0.000000   0.834494
 abadi                         1   0.002020            0     0.000000
 abaluck                       1   0.002020            0     0.000000
 acar                          1   0.002020          539     0.000000
 admins                        1   0.002020       775989     0.000036
  admins         root          1   0.045455          104     0.000134   0.520201
 ague                          1   0.002020    149707450     0.006961

[root@mgt2.grace ~]# sshare -a -o "RawUsage%20,GrpTRESRaw%250" | head -10
            RawUsage GrpTRESRaw
-------------------- -----------------------------------------------------
         21507194467 cpu=470440048,mem=2947845411827,energy=9223372036854775808,node=123276933,billing=339443226,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=1129485
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=0,mem=0,energy=9223372036854775808,node=0,billing=0,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
                   0 cpu=9223372036854775808,mem=9223372036854775808,energy=9223372036854775808,node=9223372036854775808,billing=9223372036854775808,fs/disk=9223372036854775808,vmem=9223372036854775808,pages=9223372036854775808,gres/gpu=9223372036854775808
Created attachment 17917 [details] Yale_slurm_conf_paramemters Our slurmctld/slurmd configuration is broken up into four config files, including this one. The other 3 relate to the node definitions, partition definitions, and cluster definition.
Created attachment 17919 [details] Harvard slurm.conf Here is Harvard's slurm.conf
After running the sacctmgr RawUsage=0 command to zero out bad accounts and restarting slurmctld, all the NaNs in fairshare were cleared, along with the errors we were seeing in our logs about nonsensical priorities. I haven't applied the patch yet but I will do so and let you know if I continue to see problems.
Adam,

> We’ll try this on our test cluster first and then switch over to using the
> patched version on our main production cluster.

Great!

> Does the patch negate the need to perform the sacctmgr workaround?

No; as you mentioned in comment 32, v1 of the patch does not (yet) work around the GrpTRESRaw NaNs. We still need the sacctmgr command to clean those. What I want to discover is whether they appear again or not after the command + patch, i.e. the root cause.

Please send me the slurmctld logs once the patch is applied, and especially if any NaN or insane number appears again with the patch applied and the sacctmgr command run.

Thanks,
Albert

BTW, in case you are curious, this small test confirms that the "magic" 9223372036854775808 shown is just a float NaN cast to an integer (we use floats internally for the fairshare maths, and integers only to show the values in the commands):

#include <stdio.h>
#include <inttypes.h>
#include <math.h>

int main(void)
{
    long double ld = NAN;
    uint64_t ui = (uint64_t)ld;
    printf("ld = %Lf\n", ld);
    printf("ui = %"PRIu64" (0x%"PRIx64")\n", ui, ui);
    return 0;
}

Output:

ld = nan
ui = 9223372036854775808 (0x8000000000000000)
Paul,

> After running the sacctmgr RawUsage=0 command to zero out bad accounts and
> restart slurmctld that cleared all the NaN's in fairshare and the errors we
> were seeing in our logs about nonsensical priorities.

Good!

> I haven't applied the patch yet but I will do so and let you know if I
> continue to see problems.

The patch adds extra defensive code that will:
- Avoid some NaNs being propagated
- Log some attempts to set a NaN

The first will help the cluster's stability, and the second will help me find more clues about the root cause. So, if you can apply the patch, it will help us.

Thanks,
Albert
Yup, I've applied it to our cluster and I will let you know if we see a recurrence of the error and what the logs say.

-Paul Edmon-
Hm, now that you mention that casting/printf test - maybe this is related to:

https://access.redhat.com/errata/RHBA-2021:0439

We upgraded glibc to the problematic version at the same time as we upgraded Slurm to 20.11.3.

We did apply the patch as well, BUT we also just updated glibc to the version mentioned in the errata (2.17-323.el7_9).

Was everyone that saw this issue on RHEL7 with glibc version 2.17-322.el7_9?
Actually - pretty sure it is related to the glibc.

On one of our smaller clusters I applied both the slurm and glibc updates to the newest versions (with the patch). Restarted slurmctld - no more "error: NAN ignored loading assoc usage". I can restart slurmctld all I want, no messages.

Downgrade glibc to 322. Restart slurmctld (to get it to use the 322 glibc). Restart it again (now that it runs with the 322 version) - the "NAN ignored" messages appear... Every time:

[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[2021-02-12T11:06:11.075] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[2021-02-12T11:07:26.936] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# rpm -q glibc
glibc-2.17-322.el7_9.x86_64
glibc-2.17-322.el7_9.i686

Once I am back on 323, after the first restart (to clear the problem and start using the 323 libs), the later tails only show the stale entries from that one restart - no new NAN messages:

[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[root@eddy-nfs slurm]# systemctl restart slurmctld
[root@eddy-nfs slurm]# tail slurmctld.log
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
[2021-02-12T11:08:16.803] error: NAN ignored loading assoc usage
(In reply to Josko Plazonic from comment #40)
> Actually - pretty sure it is related to the glibc.
>
> On one of our smaller clusters - applied both slurm and glibc updates to
> newest versions (with patch). Restart slurmctld - no more "error: NAN
> ignored loading assoc usage". I can restart slurmctld all I want, no
> messages.
>
> Downgrade glibc to 322. Restart slurmctld (to get it to use 322 glibc).
> Restart it again (now that it runs with 322 version) - NAN ignored messages
> appear... Every time:

Wow! That sounds like a great finding! It would explain everything I was seeing (the same problem across different versions, no code flaws detected, no way to reproduce it...).

I'll take a closer look. Could the rest of you confirm your glibc version?

Thanks Josko!
Albert
This is what we are running:

[root@holy-slurm02 ~]# rpm -qa | grep glibc
glibc-2.17-323.el7_9.x86_64
glibc-devel-2.17-323.el7_9.x86_64
glibc-2.17-323.el7_9.i686
glibc-headers-2.17-323.el7_9.x86_64
glibc-common-2.17-323.el7_9.x86_64

-Paul Edmon-

On 2/12/2021 11:01 AM, bugs@schedmd.com wrote:
> Comment #39 on bug 10824 from Josko Plazonic:
>
> Hm, now that you mentioned that casting/printf test - maybe this is related to:
>
> https://access.redhat.com/errata/RHBA-2021:0439
>
> We upgraded glibc to the problematic version at the same time as we upgraded
> slurm to 20.11.3.
>
> We did apply the patch as well BUT we also just updated glibc to the version
> mentioned in the errata (2.17-323.el7_9).
>
> Was everyone that saw this issue on RHEL7 with glibc version 2.17-322.el7_9?
>
> You are receiving this mail because you are on the CC list for the bug.
Right, but when did you upgrade to 323? That one was released by Red Hat on Feb 8th (not sure how quickly CentOS had theirs out, surely not before the 8th). Had you run with 322 when/while you had the problem?
We upgraded to 323 on Feb 9 at 9:35. I started seeing the NAN errors at 9:21 on Feb 10, after I did a slurmctld restart.

-Paul Edmon-
A possibility is that we already had the errors introduced by this glibc bug, but we only noticed them after a slurmctld restart.

-Paul Edmon-
Hi,

This is making perfect sense to me. The patch always reports the "NAN ignored loading assoc usage" error, so the NaN appears right when slurmctld reads the long double back from the StateSaveLocation, and that's the key:

if (sscanf(val_str, "%Lf", &nl) != 1)
        return SLURM_ERROR;

Note that the values saved in the StateSaveLocation were written with:

snprintf(val_str, sizeof(val_str), "%Lf", val);

The glibc bug makes long doubles holding 0 print as "nan":

#include <stdio.h>

int main(void)
{
        long double d = 0;
        printf("%Lg\n", d);
        printf("%Lf\n", d);
        return 0;
}

$ gcc ld2str.c
$ ./a.out

Actual results:
nan
nan

Expected results:
0
0.000000

So it makes perfect sense that associations with usage 0 are saved to disk as NaNs and then read back as NaNs when slurmctld is restarted. And that can happen with any other long double saved/restored by slurmctld in the StateSaveLocation (!).

Thanks for the clue!
Albert
We also did an OS update prior to changing to 20.02.6 and that update upgraded our glibc to -322. Outstanding catch Josko, well done!
I wish I could take the credit but one of our users complained of a new problem with opencv - stopped compiling - and even provided a link to an opencv open bug that, helpfully enough, had a link to the errata. Add Albert's test code and it is easy to connect the dots. So a bright user with good bug report gets the credit.
Interesting. We keep our compute nodes on a locked version, so we didn't see this there. However, our Slurm masters are not on a locked version and automatically update to the latest glibc to pick up bug and security fixes as soon as possible.

This was a great find.

-Paul Edmon-
After upgrading glibc to 2.17-323 and restarting slurmctld:

[root@mgt2.grace ~]# sshare -a -o "GrpTRESRaw%250" | head -10
GrpTRESRaw
-----------------------------------------------------
cpu=471514502,mem=2950804311542,energy=0,node=123183524,billing=339396740,fs/disk=0,vmem=0,pages=0,gres/gpu=1131310
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0
Hi all,

This has been a great team effort, thank you! :-)

Once your usage is fine and you are on a safe glibc version, please remove the debug patch I posted before.

We are discussing internally another patch to defend Slurm against that glibc bug (which may also be impacting other Slurm areas). I'll keep you posted.

Thanks!
Albert
Hi Albert, Is undoing the patch urgent or can that wait until Monday? I'm already way over-quota in terms of Friday changes to production systems. Thank you, Adam
Hi Adam,

> Is undoing the patch urgent or can that wait until Monday? I'm already way
> over-quota in terms of Friday changes to production systems.

Not urgent at all. The patch won't do anything; I just wanted to remind you so you don't end up carrying debug and otherwise unnecessary patches. If we go ahead and publish a better patch that avoids/works around the problem in the first place, even on a buggy glibc, you'll get it from an official Slurm release.

Have a soft Friday,
Albert
Hi,

I'm glad to let you know that we have pushed an official workaround for the glibc bug, and it will be released as part of 20.11.4:

https://github.com/SchedMD/slurm/commit/c57311f19d2ec9a258162909699aba9505e368b8

commit c57311f19d2ec9a258162909699aba9505e368b8
Author:     Albert Gil <albert.gil@schedmd.com>
AuthorDate: Fri Feb 12 18:41:37 2021 +0100

    Work around glibc bug where "0" as a long double is printed as "nan".

    On broken glibc versions, the zeroes in the association state file
    will be saved as "nan" in packlongdouble(). Detect if this has
    happened in unpacklongdouble() and convert back to zero.

    https://bugzilla.redhat.com/show_bug.cgi?id=1925204

With this I'm closing the bug as fixed. Thank you all for your great team work on this one!
*** Ticket 10919 has been marked as a duplicate of this ticket. ***