After a restart of our slurmctld today I started seeing NaNs in sprio:

[root@holy-slurm02 ~]# sprio | head
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE        QOS
       13138401 gpu_reque -922337203          0   10000000        nan          0
       13139089 gpu_reque -922337203          0   10000000        nan          0
       13139091 gpu_reque -922337203          0   10000000        nan          0
       13139329 gpu_reque -922337203          0   10000000        nan          0
       13139339 gpu_reque -922337203          0   10000000        nan          0
       13139356 gpu_reque -922337203          0   10000000        nan          0
       13139668 gpu_reque -922337203          0   10000000        nan          0
       13140165 gpu_reque -922337203          0   10000000        nan          0
       13248924 gpu_reque -922337203          0   10000000        nan          0

Here's an example job:

[root@holy7c22501 ~]# scontrol show job 13138401
JobId=13138401 JobName=submit_gpu_requeue.slurm
   UserId=briedel(30101) GroupId=icecube(10263) MCS_label=N/A
   Priority=4294967295 Nice=0 Account=icecube QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-01-20T11:36:28 EligibleTime=2021-01-20T11:38:29
   AccrueTime=2021-01-20T11:38:29
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-10T09:28:08
   Partition=gpu_requeue AllocNode:Sid=icecube:1988
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) BatchHost=holygpu7c1711
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=11600M,node=1,billing=39,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=11600M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home00/briedel/submit_gpu_requeue.slurm
   WorkDir=/n/home00/briedel
   StdErr=/n/home00/briedel/out/myerrors_gpu_13138401.err
   StdIn=/dev/null
   StdOut=/n/home00/briedel/out/myoutput_gpu_13138401.out
   Power=
   CpusPerTres=gpu:1
   MemPerTres=gpu:100
   TresPerNode=gpu:1
   NtasksPerTRES:0

The slurmctld is throwing this error:

Feb 10 09:28:32 holy-slurm02 slurmctld[21208]: error: JobId=16980533 priority '9223372036854775808' exceeds 32 bits. Reducing it to 4294967295 (2^32 - 1)

Something got messed up in the way it's calculating fairshare. We are using classic fairshare. Everything is still scheduling, but the queues are now essentially FIFO. We can probably continue this way for a while, but NaNs showing up in the priority calculation is a pretty major issue.
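For what it's worth, 9223372036854775808 is exactly 2^63, the bit pattern of INT64_MIN read as unsigned, and that is what x86-64 produces when a NaN double is converted to a 64-bit integer. A minimal standalone sketch of that suspected mechanism (this is not Slurm's code, and the conversion is formally undefined behavior in C, so the result is architecture-specific):

#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A priority factor that has gone NaN, as in the sprio output above.
     * volatile keeps the compiler from folding the conversion away. */
    volatile double fs_factor = NAN;

    /* Converting NaN to an integer is undefined behavior in C; on x86-64
     * the cvttsd2si instruction yields the "integer indefinite" value
     * 0x8000000000000000, i.e. INT64_MIN. */
    int64_t prio = (int64_t)fs_factor;

    printf("signed:   %" PRId64 "\n", prio);           /* -9223372036854775808 */
    printf("unsigned: %" PRIu64 "\n", (uint64_t)prio); /* 9223372036854775808 */
    return 0;
}

If sprio prints that same value as a signed number and truncates it to the column width, that would also account for the -922337203 in the PRIORITY column.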
In addition, sshare looks like this:

[root@holy7c22501 ~]# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          1.000000 9223372036854775808     1.000000   0.500000
 root                      root        100    0.001376    15315873           nan        nan
 ac290r                                100    0.001376           0           nan        nan
 acc_lab                               100    0.001376  1566896368           nan        nan
 ahill_lab                             100    0.001376    73846100           nan        nan
 aizenberg_lab                         100    0.001376     9495643           nan        nan
 alsan_lab                             100    0.001376    44422022           nan        nan
 am205                                 100    0.001376        4071           nan        nan
 amin_lab                              100    0.001376    38153763           nan        nan
 anderson_lab                          112    0.001541  1215878208           nan        nan
 ang_lab                               100    0.001376    21049678           nan        nan
 anl                                   168    0.002312      900148           nan        nan
 ap278                                 100    0.001376 9223372036854775808    nan        nan
 arguelles_delgado_+                   116    0.001597   622380030           nan        nan
 arielamir_lab                         100    0.001376  1130953646           nan        nan
 arlotta_lab                           100    0.001376   113139463           nan        nan
 arthanari_lab                         100    0.001376    23241570           nan        nan
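Note that ap278's RawUsage is that same 2^63 value, so the corruption appears to start in the usage tree rather than in the priority calculation itself. The classic (non-FAIR_TREE) fairshare factor is documented as F = 2^(-EffectvUsage/NormShares), which means a single NaN EffectvUsage poisons every FairShare derived from it. A toy sketch using the NormShares value from the table above (made-up usage inputs, not Slurm code):

#include <math.h>
#include <stdio.h>

/* Classic fairshare as documented for the priority/multifactor plugin:
 * F = 2^(-EffectvUsage / NormShares). Toy reimplementation, not Slurm code. */
static double fairshare_factor(double effectv_usage, double norm_shares)
{
    return pow(2.0, -effectv_usage / norm_shares);
}

int main(void)
{
    /* NormShares 0.001376 taken from the sshare output above;
     * the usage values are hypothetical. */
    printf("healthy usage:  %f\n", fairshare_factor(0.001, 0.001376));
    printf("poisoned usage: %f\n", fairshare_factor(NAN, 0.001376));
    return 0;
}

Built with -lm, the second line prints nan: once any EffectvUsage in the hierarchy goes NaN, every FairShare below it follows, which matches the sshare and sprio output above.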
Hi Paul,

This seems to be a duplicate of bug 10824, so I'm concentrating my investigation there. Please add yourself to the CC list of bug 10824 to stay updated and to help me with the investigation.

Regards,
Albert

*** This ticket has been marked as a duplicate of ticket 10824 ***