Ticket 10919 - problems with usage in 20.02.6
Summary: problems with usage in 20.02.6
Status: RESOLVED DUPLICATE of ticket 10824
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.6
Hardware: Linux
Importance: --- 2 - High Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-02-22 09:38 MST by Ryan Day
Modified: 2021-03-04 12:57 MST
CC List: 1 user

See Also:
Site: LLNL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Ryan Day 2021-02-22 09:38:00 MST
We just updated a bunch of clusters from 20.02.5 to 20.02.6, and are seeing problems with the usage reported by sshare:

[day36@rztopaz578:~]$ sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          0.000000 9223372036854775808      1.000000            
 asc                                 89905    0.899050 9223372036854775808           nan            
  hpcic2b                             2983    0.033179           0           nan            
  sparc                                  5    0.000056 9223372036854775808           nan            
  wci                                86917    0.966765 9223372036854775808           nan            
   cmetal                               17    0.000196 9223372036854775808           nan            
    cbronze                             10    0.588235 9223372036854775808           nan            
    csilver                              7    0.411765 9223372036854775808           nan            
   gmetal                             4900    0.056376 9223372036854775808           nan            
    gbronze                           2700    0.551020 9223372036854775808           nan            
    ggold                              200    0.040816           0           nan            
    gsilver                           2000    0.408163           0           nan            
   imetal                            11000    0.126558 9223372036854775808           nan            
    ibronze                           6100    0.554545 9223372036854775808           nan            
    igold                              400    0.036364           0           nan            
    isilver                           4500    0.409091 9223372036854775808           nan            
   pmetal                             5000    0.057526 9223372036854775808           nan            
    pbronze                           5000    1.000000 9223372036854775808           nan            
   wmetal                            66000    0.759345 9223372036854775808           nan            
    wbronze                          46400    0.703030 9223372036854775808           nan            
    wgold                             2500    0.037879           0           nan            
    wsilver                          17100    0.259091           0           nan        

I'm just starting to look into it, but I'm also seeing lots of errors in the slurmctld logs like:

[2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES cpu grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.
[2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES node grp_used_tres_run_secs underflow, tried to remove 3000 seconds when only 2997 remained.
[2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES billing grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.

which seem like they might be related.
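
For context on what that error means, here is a minimal illustrative sketch (not the actual slurmctld code; remove_tres_run_secs and its arguments are made up for this example) of the kind of guarded decrement that logs an underflow and clamps the counter to zero instead of letting it wrap:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: decrement a running-TRES-seconds counter, logging
 * and clamping if asked to remove more seconds than are tracked. */
static void remove_tres_run_secs(uint64_t *used_secs, uint64_t to_remove,
                                 const char *tres_name)
{
    if (to_remove > *used_secs) {
        fprintf(stderr,
                "error: TRES %s grp_used_tres_run_secs underflow, "
                "tried to remove %" PRIu64 " seconds when only "
                "%" PRIu64 " remained.\n",
                tres_name, to_remove, *used_secs);
        *used_secs = 0;  /* clamp instead of wrapping around */
    } else {
        *used_secs -= to_remove;
    }
}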

I can upload slurm.conf and such if needed, but for a quick look, here's our config:

[day36@rztopaz578:~]$ scontrol show config
Configuration data as of 2021-02-22T08:34:31
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = rzslurmdb
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2021-02-18T07:51:39
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = rztopaz
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = Yes
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = /etc/slurm/epilog
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 65533 sec
JobAcctGatherFrequency  = task=60
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm/joblog
JobCompPort             = 0
JobCompType             = jobcomp/filetxt
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 1
JobRequeue              = 0
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 60 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = lscratchrza:100000,lustre1:100000
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 42968
MaxJobCount             = 20000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 200000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = pmi2
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 4983521
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = CANCEL
PreemptType             = preempt/qos
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = 
PriorityMaxAge          = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 990000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 2000000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = /etc/slurm/prolog
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 0
RoutePlugin             = route/default
SallocDefaultCommand    = /usr/bin/srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none --mpibind=off $SHELL
SbcastParameters        = (null)
SchedulerParameters     = kill_invalid_depend,bf_continue,bf_max_job_test=5000,bf_max_job_user=50,bf_interval=60
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE
SlurmUser               = slurm(101)
SlurmctldAddr           = (null)
SlurmctldDebug          = verbose
SlurmctldHost[0]        = rztopaz187(erztopaz187)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = verbose
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.02.6
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /tmp/rztopaz
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)
TaskProlog              = /etc/slurm/mpibind.prolog
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = Yes
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 3600 sec
VSizeFactor             = 0 percent
WaitTime                = 30 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 95.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at rztopaz187 is UP
[day36@rztopaz578:~]$

Clusters that are still on 20.02.5 are still showing reasonable values for usage.
Comment 1 Ryan Day 2021-02-22 10:30:33 MST
It looks like people who are actually running have reasonable values for RawUsage, but users who haven't run are getting the underflowed value, which is then propagated up the tree:

[day36@rztopaz194:~]$ sshare -a
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          0.000000 9223372036854775808      1.000000            
 root                      root          1    0.000010    33923250           nan   0.000477
...
  wci                                86917    0.966765 9223372036854775808           nan            
   cmetal                               17    0.000196 9223372036854775808           nan            
    cbronze                             10    0.588235 9223372036854775808           nan            
     cbronze            abcbitz          1    0.004902           0           nan   0.934700 
     cbronze            abdulla          1    0.004902 9223372036854775808           nan   0.936130 
     cbronze           adams106          1    0.004902 9223372036854775808           nan   0.971401 
     cbronze             adler5          1    0.004902     7814080           nan   0.963775 
     cbronze             afeyan          1    0.004902 9223372036854775808           nan   0.970448 
     cbronze            agrusa1          1    0.004902           0           nan   0.960439 
     cbronze              alan2          1    0.004902 9223372036854775808           nan   0.952812 
     cbronze              alead          1    0.004902           0           nan   0.934223 
     cbronze              ames6          1    0.004902       85760           nan   0.933746

Values for usage in the slurmdb appear to be okay as well.
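
The propagation itself is just ordinary NaN arithmetic: once one association's long double usage is NaN, every sum that includes it is NaN as well. A minimal sketch (plain C, not Slurm code; the sample values are borrowed from the sshare output above):

#include <math.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* two normal child usages plus one NaN for an association whose
     * stored usage was corrupted */
    long double child_usage[] = { 85760.0L, 7814080.0L, NAN };
    long double parent_usage = 0.0L;

    for (size_t i = 0; i < sizeof(child_usage) / sizeof(child_usage[0]); i++)
        parent_usage += child_usage[i];

    printf("parent RawUsage = %Lf\n", parent_usage); /* prints nan */
    return 0;
}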
Comment 2 Jason Booth 2021-02-22 10:39:20 MST
Ryan, this might be a duplicate of bug#10824.


https://github.com/SchedMD/slurm/commit/c57311f19d2ec9a258162909699aba9505e368b8

commit c57311f19d2ec9a258162909699aba9505e368b8
Author:     Albert Gil <albert.gil@schedmd.com>
AuthorDate: Fri Feb 12 18:41:37 2021 +0100

    Work around glibc bug where "0" as a long double is printed as "nan".
    
    On broken glibc versions, the zeroes in the association state file will
    be saved as "nan" in packlongdouble(). Detect if this has happened in
    unpacklongdouble() and convert back to zero.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1925204
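
The commit message suggests these long doubles are written to the association state file as text, so a broken glibc turns a stored zero into the string "nan"; the workaround is to map that string back to zero when the value is read in. A simplified sketch of the idea (not the code from the commit; the function names here are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pack the value as text; on a broken glibc, 0.0L is printed as "nan". */
static void pack_long_double_str(long double val, char *buf, size_t len)
{
    snprintf(buf, len, "%Lf", val);
}

/* Unpack the text; map a literal "nan" back to zero to undo the damage. */
static long double unpack_long_double_str(const char *buf)
{
    if (!strcmp(buf, "nan"))
        return 0.0L;
    return strtold(buf, NULL);
}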
Comment 3 Ryan Day 2021-02-22 10:57:36 MST
Yes. It does look like we also updated to the broken 322 build of glibc at the same time that we updated to 20.02.6. I'm not quite clear from the discussion of that bug whether just fixing glibc will be sufficient to completely fix this, or whether we'll also have to do more to clean up the NaNs that the 322 build introduced.

Thanks,
Ryan
Comment 4 Jason Booth 2021-02-22 11:12:32 MST
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES cpu grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES node grp_used_tres_run_secs underflow, tried to remove 3000 seconds when only 2997 remained.
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES billing grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.

This might also be a duplicate of bug#7375. We will let you know.
Comment 5 Marshall Garey 2021-02-22 15:03:42 MST
(In reply to Ryan Day from comment #3)
> Yes. It does look like we also updated to the broken 322 build of glibc at
> the same time as we updated to 20.02.6. I'm not quite clear from the
> discussion of that bug whether just fixing glibc will be sufficient to
> completely fix this, or if we'll still have to do more to clean up the NaNs
> that were introduced by the 322 build of glibc.


Since NaNs have now been introduced, I believe you'll need to update Slurm as well. Alternatively, you could just cherry-pick Albert's commit and apply it to your 20.02 Slurm locally; it's a really small commit, so it should be easy enough to apply. This way you won't have to upgrade Slurm and deal with everything that goes along with an upgrade.

You can always just upgrade glibc and try it out (see if the NaNs go away) but I think you'll also need the patch.

Can you let us know when you've upgraded glibc and/or applied the patch to Slurm and if it works for you?

I'll look into the assoc underflow errors and get back to you - they might be related to 7375, but they might be something else.
Comment 6 Ryan Day 2021-02-22 16:20:22 MST
Thanks. I've pulled Albert's commit and applied it to 20.02.6. We'll include that with the newer glibc and let you know if we still see any issues.
Comment 7 Marshall Garey 2021-02-25 12:40:02 MST
Hey Ryan, did everything work out with Albert's patch and the glibc update?
Comment 8 Ryan Day 2021-02-25 12:58:06 MST
(In reply to Marshall Garey from comment #7)
> Hey Ryan, did everything work out with Albert's patch and the glibc update?

Hey Marshall,

Yes. It's looking good. Thanks for the fast responses on everything. We're really glad we got the glibc issue caught before it could bite our users.

Ryan
Comment 9 Marshall Garey 2021-02-25 13:22:40 MST
Sounds good. I'll close this as a duplicate of bug 10824.

If you keep seeing those association underflow errors, you can open a new bug report about it. I don't think it's a duplicate of 7375, which deals with qos underflow errors.

*** This ticket has been marked as a duplicate of ticket 10824 ***