We just updated a bunch of clusters from 20.02.5 to 20.02.6, and are seeing problems with the usage reported by sshare:

[day36@rztopaz578:~]$ sshare
             Account       User  RawShares  NormShares             RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- -------------------- ------------- ----------
root                                          0.000000  9223372036854775808      1.000000
 asc                      89905    0.899050              9223372036854775808           nan
  hpcic2b                  2983    0.033179                                0           nan
  sparc                       5    0.000056              9223372036854775808           nan
  wci                     86917    0.966765              9223372036854775808           nan
   cmetal                    17    0.000196              9223372036854775808           nan
    cbronze                  10    0.588235              9223372036854775808           nan
    csilver                   7    0.411765              9223372036854775808           nan
   gmetal                  4900    0.056376              9223372036854775808           nan
    gbronze                2700    0.551020              9223372036854775808           nan
    ggold                   200    0.040816                                0           nan
    gsilver                2000    0.408163                                0           nan
   imetal                 11000    0.126558              9223372036854775808           nan
    ibronze                6100    0.554545              9223372036854775808           nan
    igold                   400    0.036364                                0           nan
    isilver                4500    0.409091              9223372036854775808           nan
   pmetal                  5000    0.057526              9223372036854775808           nan
    pbronze                5000    1.000000              9223372036854775808           nan
   wmetal                 66000    0.759345              9223372036854775808           nan
    wbronze               46400    0.703030              9223372036854775808           nan
    wgold                  2500    0.037879                                0           nan
    wsilver               17100    0.259091                                0           nan

I'm just starting to look into it, but I'm also seeing lots of errors in the slurmctld logs like:

[2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES cpu grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.
[2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES node grp_used_tres_run_secs underflow, tried to remove 3000 seconds when only 2997 remained.
[2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES billing grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.

which seem like they might be related.
I can upload slurm.conf and such, but really quick, here's our config:

[day36@rztopaz578:~]$ scontrol show config
Configuration data as of 2021-02-22T08:34:31
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = rzslurmdb
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BOOT_TIME = 2021-02-18T07:51:39
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = rztopaz
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand,UserSpace
CredType = cred/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = NO
Epilog = /etc/slurm/epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 65533 sec
JobAcctGatherFrequency = task=60
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm/joblog
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 1
JobRequeue = 0
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 60 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = lscratchrza:100000,lustre1:100000
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxDBDMsgs = 42968
MaxJobCount = 20000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 200000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = pmi2
MpiParams = (null)
MsgAggregationParams = (null)
NEXT_JOB_ID = 4983521
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = CANCEL
PreemptType = preempt/qos
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightAssoc = 0
PriorityWeightFairShare = 990000
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 2000000
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = /etc/slurm/prolog
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 0
RoutePlugin = route/default
SallocDefaultCommand = /usr/bin/srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none --mpibind=off $SHELL
SbcastParameters = (null)
SchedulerParameters = kill_invalid_depend,bf_continue,bf_max_job_test=5000,bf_max_job_user=50,bf_interval=60
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE
SlurmUser = slurm(101)
SlurmctldAddr = (null)
SlurmctldDebug = verbose
SlurmctldHost[0] = rztopaz187(erztopaz187)
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 120 sec
SlurmctldParameters = (null)
SlurmdDebug = verbose
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 20.02.6
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /tmp/rztopaz
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = affinity,cgroup
TaskPluginParam = (null type)
TaskProlog = /etc/slurm/mpibind.prolog
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = Yes
UnkillableStepProgram = (null)
UnkillableStepTimeout = 3600 sec
VSizeFactor = 0 percent
WaitTime = 30 sec
X11Parameters = (null)

Cgroup Support Configuration:
AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = yes
CgroupMountpoint = /sys/fs/cgroup
ConstrainCores = yes
ConstrainDevices = no
ConstrainKmemSpace = no
ConstrainRAMSpace = yes
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 95.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = no

Slurmctld(primary) at rztopaz187 is UP
[day36@rztopaz578:~]$

Clusters that are still on 20.02.5 are still showing reasonable values for usage.
It looks like people who are actually running have reasonable values for RawUsage, but users who haven't run are getting the underflowed value, which is then propagated up the tree:

[day36@rztopaz194:~]$ sshare -a
             Account       User  RawShares  NormShares             RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- -------------------- ------------- ----------
root                                          0.000000  9223372036854775808      1.000000
 root            root         1    0.000010                         33923250          nan   0.000477
...
  wci                     86917    0.966765              9223372036854775808           nan
   cmetal                    17    0.000196              9223372036854775808           nan
    cbronze                  10    0.588235              9223372036854775808           nan
     cbronze  abcbitz         1    0.004902                                0           nan   0.934700
     cbronze  abdulla         1    0.004902              9223372036854775808           nan   0.936130
     cbronze  adams106        1    0.004902              9223372036854775808           nan   0.971401
     cbronze  adler5          1    0.004902                          7814080           nan   0.963775
     cbronze  afeyan          1    0.004902              9223372036854775808           nan   0.970448
     cbronze  agrusa1         1    0.004902                                0           nan   0.960439
     cbronze  alan2           1    0.004902              9223372036854775808           nan   0.952812
     cbronze  alead           1    0.004902                                0           nan   0.934223
     cbronze  ames6           1    0.004902                            85760           nan   0.933746

Values for usage in the slurmdb appear to be okay as well.
Ryan, this might be a duplicate of bug#10824.

https://github.com/SchedMD/slurm/commit/c57311f19d2ec9a258162909699aba9505e368b8

commit c57311f19d2ec9a258162909699aba9505e368b8
Author:     Albert Gil <albert.gil@schedmd.com>
AuthorDate: Fri Feb 12 18:41:37 2021 +0100

    Work around glibc bug where "0" as a long double is printed as "nan".

    On broken glibc versions, the zeroes in the association state file
    will be saved as "nan" in packlongdouble(). Detect if this has
    happened in unpacklongdouble() and convert back to zero.

    https://bugzilla.redhat.com/show_bug.cgi?id=1925204
Yes. It does look like we also updated to the broken 322 build of glibc at the same time as we updated to 20.02.6. I'm not quite clear from the discussion of that bug whether just fixing glibc will be sufficient to completely fix this, or if we'll still have to do more to clean up the NaNs that were introduced by the 322 build of glibc.

Thanks,
Ryan
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES cpu grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES node grp_used_tres_run_secs underflow, tried to remove 3000 seconds when only 2997 remained.
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES billing grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.

This might also be a duplicate of bug#7375. We will let you know.
(In reply to Ryan Day from comment #3)
> Yes. It does look like we also updated to the broken 322 build of glibc at
> the same time as we updated to 20.02.6. I'm not quite clear from the
> discussion of that bug whether just fixing glibc will be sufficient to
> completely fix this, or if we'll still have to do more to clean up the NaNs
> that were introduced by the 322 build of glibc.

Since NaNs have now been introduced, I believe you'll need to update Slurm as well. Alternatively, you could just cherry-pick Albert's commit and apply it to your 20.02 Slurm locally; it's a really small commit, so it should be easy enough to apply. That way you won't have to upgrade Slurm and deal with everything that goes along with an upgrade.

You can always just upgrade glibc and try it out (see if the NaNs go away), but I think you'll also need the patch. Can you let us know when you've upgraded glibc and/or applied the patch to Slurm, and whether it works for you?

I'll look into the assoc underflow errors and get back to you; they might be related to 7375, but they might be something else.
Thanks. I've pulled Albert's commit and applied it to 20.02.6. We'll include that with the newer glibc and let you know if we still see any issues.
Hey Ryan, did everything work out with Albert's patch and the glibc update?
(In reply to Marshall Garey from comment #7)
> Hey Ryan, did everything work out with Albert's patch and the glibc update?

Hey Marshall,

Yes, it's looking good. Thanks for the fast responses on everything. We're really glad we got the glibc issue caught before it could bite our users.

Ryan
Sounds good. I'll close this as a duplicate of bug 10824. If you keep seeing those association underflow errors, you can open a new bug report about it. I don't think it's a duplicate of 7375, which deals with QOS underflow errors.

*** This ticket has been marked as a duplicate of ticket 10824 ***