Ticket 10831 - Bogus Fairshare
Summary: Bogus Fairshare
Status: RESOLVED DUPLICATE of ticket 10824
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.3
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-02-10 07:30 MST by Paul Edmon
Modified: 2021-02-11 01:51 MST
CC List: 0 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Paul Edmon 2021-02-10 07:30:09 MST
After a restart of our slurmctld today, I started seeing NaNs in sprio:

[root@holy-slurm02 ~]# sprio | head
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE        QOS
       13138401 gpu_reque -922337203          0   10000000        nan          0
       13139089 gpu_reque -922337203          0   10000000        nan          0
       13139091 gpu_reque -922337203          0   10000000        nan          0
       13139329 gpu_reque -922337203          0   10000000        nan          0
       13139339 gpu_reque -922337203          0   10000000        nan          0
       13139356 gpu_reque -922337203          0   10000000        nan          0
       13139668 gpu_reque -922337203          0   10000000        nan          0
       13140165 gpu_reque -922337203          0   10000000        nan          0
       13248924 gpu_reque -922337203          0   10000000        nan          0

Here's an example job:

[root@holy7c22501 ~]# scontrol show job 13138401
JobId=13138401 JobName=submit_gpu_requeue.slurm
   UserId=briedel(30101) GroupId=icecube(10263) MCS_label=N/A
   Priority=4294967295 Nice=0 Account=icecube QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-01-20T11:36:28 EligibleTime=2021-01-20T11:38:29
   AccrueTime=2021-01-20T11:38:29
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-10T09:28:08
   Partition=gpu_requeue AllocNode:Sid=icecube:1988
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=holygpu7c1711
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=11600M,node=1,billing=39,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=11600M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home00/briedel/submit_gpu_requeue.slurm
   WorkDir=/n/home00/briedel
   StdErr=/n/home00/briedel/out/myerrors_gpu_13138401.err
   StdIn=/dev/null
   StdOut=/n/home00/briedel/out/myoutput_gpu_13138401.out
   Power=
   CpusPerTres=gpu:1
   MemPerTres=gpu:100
   TresPerNode=gpu:1
   NtasksPerTRES:0

The slurmctld is throwing this error:

Feb 10 09:28:32 holy-slurm02 slurmctld[21208]: error: JobId=16980533 priority '9223372036854775808' exceeds 32 bits. Reducing it to 4294967295 (2^32 - 1)

Something got messed up with the way it's calculating fairshare. We are using classic fairshare.

Everything is still scheduling, but the queues are now basically all FIFO. We can probably continue this way for a while, but this is a pretty major issue if NaNs are being thrown.
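
[Editor's note] The 9223372036854775808 in the slurmctld error below is exactly 2^63, which is the typical result of converting a NaN double to a 64-bit integer on x86_64; the controller then clamps the value to 2^32 - 1, which matches the Priority=4294967295 shown by scontrol above. A minimal sketch of that arithmetic (not Slurm's actual code; the weight 10000000 is taken from the FAIRSHARE weight visible in the sprio output):

#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double fs_factor = NAN;                    /* what sshare/sprio report      */
    double weighted  = fs_factor * 10000000.0; /* still NaN after weighting     */
    int64_t prio     = (int64_t)weighted;      /* UB in C; on x86_64 this is
                                                  usually INT64_MIN             */

    /* Printed as unsigned, INT64_MIN is 9223372036854775808 = 2^63,
     * the exact value in the slurmctld error message. */
    printf("priority = %" PRIu64 "\n", (uint64_t)prio);
    return 0;
}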
Comment 1 Paul Edmon 2021-02-10 07:49:41 MST
In addition sshare looks like this:

[root@holy7c22501 ~]# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          1.000000 9223372036854775808      1.000000   0.500000
 root                      root        100    0.001376    15315873           nan        nan
 ac290r                                100    0.001376           0           nan        nan
 acc_lab                               100    0.001376  1566896368           nan        nan
 ahill_lab                             100    0.001376    73846100           nan        nan
 aizenberg_lab                         100    0.001376     9495643           nan        nan
 alsan_lab                             100    0.001376    44422022           nan        nan
 am205                                 100    0.001376        4071           nan        nan
 amin_lab                              100    0.001376    38153763           nan        nan
 anderson_lab                          112    0.001541  1215878208           nan        nan
 ang_lab                               100    0.001376    21049678           nan        nan
 anl                                   168    0.002312      900148           nan        nan
 ap278                                 100    0.001376 9223372036854775808           nan        nan
 arguelles_delgado_+                   116    0.001597   622380030           nan        nan
 arielamir_lab                         100    0.001376  1130953646           nan        nan
 arlotta_lab                           100    0.001376   113139463           nan        nan
 arthanari_lab                         100    0.001376    23241570           nan        nan
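
[Editor's note] The classic fair-share factor in the multifactor plugin is documented as F = 2^(-(EffectvUsage / NormShares) / damping), so a single NaN effective usage (here apparently triggered by the RawUsage of 2^63 on root and ap278) poisons the factor for every association, which is what the FairShare column shows. A small sketch, with NormShares taken from the output above and the damping factor assumed to be 1:

#include <math.h>
#include <stdio.h>

/* Sketch of the documented classic fair-share formula, not Slurm's code. */
static double classic_fairshare(double effectv_usage, double norm_shares,
                                double damping)
{
    return pow(2.0, -(effectv_usage / norm_shares) / damping);
}

int main(void)
{
    printf("healthy:   %f\n", classic_fairshare(0.0005, 0.001376, 1.0));
    printf("corrupted: %f\n", classic_fairshare(NAN,    0.001376, 1.0)); /* nan */
    return 0;
}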
Comment 2 Albert Gil 2021-02-11 01:51:38 MST
Hi Paul,

This seems to be a duplicate of bug 10824.
I'm concentrating my investigation there.
Please add yourself to the CC list of bug 10824 to stay updated and to help me with the investigation.

Regards,
Albert

*** This ticket has been marked as a duplicate of ticket 10824 ***