Created attachment 8284 [details]
slurm.conf

Since upgrading to 18.08.3 we're seeing a lot of these in our slurmctld logs, prefacing the start of a job:

[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: QOS serial accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: QOS serial acct oz016 accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: QOS serial user 10675 accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 169(oz016/dweight/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 155(oz016/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 710(ozgrav/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 1(root/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] backfill: Started JobId=1712496_66(1722685) in skylake on john20

or:

[2018-11-12T21:00:37.449] error: acct_policy_handle_accrue_time: QOS serial accrue_cnt underflow
[2018-11-12T21:00:37.449] error: acct_policy_handle_accrue_time: Assoc 1(root/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:00:37.449] backfill: Started JobId=1722675 in skylake on john49

Not sure if this is a bug, or whether the new version is surfacing issues that were already present. It looks like others are seeing it too (see bug#5838).

Let me know if there's anything else you need!

All the best,
Chris
Hi Chris,

Are the jobs that cause these errors still from the queue from before the upgrade? Can you show me an 'scontrol show job' for one of them, and the serial QOS details?

In theory this issue should be fixed by commit 3e3400cdaa, available as of 18.08.2.
Hi Felip,

Yes, you're right, it does look like it's just for jobs submitted before the upgrade (of which there are lots, including a number of job arrays). A newly submitted sbatch just shows my logging from the Lua script.

Hi Felip,

This seems to be happening for newly submitted jobs as well. For instance, for a simple "sbatch --wrap hostname" I see:

[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Incoming job wants 4294967294 tasks with 65534 cores per task (min_cpus=1)
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): will set tmp GRES to 100M
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): after setting tmp GRES we have GRES (tmp:100M)
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Starting job packing tests
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Setting ntasks_per_node to be num_tasks (1) for user 10631
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Setting cores-per-socket to be 1 when job requested 1 for user 10631
[2018-11-13T08:18:19.317] _slurm_rpc_submit_batch_job: JobId=1724956 InitPrio=3370997 usec=440
[2018-11-13T08:18:19.393] sched: Allocate JobId=1724956 NodeList=john101 #CPUs=1 Partition=skylake
[2018-11-13T08:18:19.414] prolog_running_decr: Configuration for JobId=1724956 is complete
[2018-11-13T08:18:20.120] _job_complete: JobId=1724956 WEXITSTATUS 0
[2018-11-13T08:18:20.120] _job_complete: JobId=1724956 done

Whereas a job array element that started just after it shows the issue I reported:

[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: QOS serial accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: QOS serial acct oz016 accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: QOS serial user 10675 accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 169(oz016/dweight/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 155(oz016/(null)/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 710(ozgrav/(null)/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 1(root/(null)/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] backfill: Started JobId=1712514_38(1724959) in skylake on john101
[2018-11-13T08:20:19.044] prolog_running_decr: Configuration for JobId=1712514_38(1724959) is complete

Here's the scontrol show job for that job:

$ scontrol show job 1712514_38
JobId=1724959 ArrayJobId=1712514 ArrayTaskId=38 JobName=banks.sh
   UserId=dweight(10675) GroupId=UWA(504) MCS_label=N/A
   Priority=101665 Nice=0 Account=oz016 QOS=serial
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:13 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2018-11-10T10:39:48 EligibleTime=2018-11-10T10:39:51
   AccrueTime=2018-11-10T10:39:51
   StartTime=2018-11-13T08:20:19 EndTime=2018-11-16T08:20:19 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-11-13T08:20:19
   Partition=skylake AllocNode:Sid=farnarkle2:26602
   ReqNodeList=(null) ExcNodeList=(null) NodeList=john101 BatchHost=john101
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:1:*
   TRES=cpu=1,mem=2G,node=1,billing=1,gres/tmp=104857600
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=100M
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/fred/oz016/dweight/sbnk/banks.sh H1 parameter_testing 1 378
   WorkDir=/fred/oz016/dweight/sbnk
   StdErr=/fred/oz016/dweight/sbnk/logs_parameter_testing/gstlal_iir_bank_1712514_H1_378_38.err
   StdIn=/dev/null
   StdOut=/fred/oz016/dweight/sbnk/logs_parameter_testing/gstlal_iir_bank_1712514_H1_378_38.out
   Power=
   TresPerNode=tmp:100M

Here's the serial QOS:

$ sacctmgr list qos serial
      Name   Priority  GraceTime    Preempt PreemptMode      Flags UsageThres UsageFactor    GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit    GrpWall    MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall  MaxTRESPU MaxJobsPU MaxSubmitPU  MaxTRESPA MaxJobsPA MaxSubmitPA    MinTRES
---------- ---------- ---------- ---------- ----------- ---------- ---------- ----------- ---------- ------------- ------------- ------- --------- ---------- ---------- -------------- ------------- ----------- ---------- --------- ----------- ---------- --------- ----------- ----------
    serial          0   00:00:00               cluster   NoReserve               1.000000   cpu=1280                                                               cpu=1                                               cpu=256                   256

Hope that helps!

All the best,
Chris
Just a "me too" comment. Just today, over 21k lines of "accrue_cnt underflow" in our slurmctld.log.

I see it for two different QOSes and associations:

# sacctmgr list -P qos normal,high
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
normal|100|00:00:00|low,lowest|cluster|||1.000000||||||||||||||cpu=3500,mem=35T||10000|
high|1000|00:00:00|low,lowest|cluster|||1.000000|||||||||||cpu=48||||||
(In reply to Jurij Pečar from comment #4)
> Just a "me too" comment. Just today over 21k lines of "accrue_cnt underflow" in our slurmctld.log.
>
> I see it for two different QOSes and associations:
>
> # sacctmgr list -P qos normal,high
> Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
> normal|100|00:00:00|low,lowest|cluster|||1.000000||||||||||||||cpu=3500,mem=35T||10000|
> high|1000|00:00:00|low,lowest|cluster|||1.000000|||||||||||cpu=48||||||

Thanks. Which Slurm version are you using?
(In reply to Felip Moll from comment #5)
> Which Slurm version are you using?

18.08.3, same as Christopher.
Chris, can you send me the slurmctld log with this error and the associated job information, if possible at debug2 level (from submission of the affected job until it ends, or at least until the error happens)?

Also an 'scontrol show job' of the affected job, and a show of the requested QOS and partition.

Thanks
(In reply to Jurij Pečar from comment #7)
> (In reply to Felip Moll from comment #5)
> > Which Slurm version are you using?
>
> 18.08.3, same as Christopher.

(In reply to Felip Moll from comment #8)
> Chris, can you send me slurmctld log with this error and associated job information, if possible at a debug2 (from submission of the affected job until it ends or at least error happens)?
>
> Also scontrol show job of the affected job, and a show qos and partition of the requested ones.
>
> Thanks

Jurij,

Your info would also be valuable if you want to add it here.

Thanks
Got one. slurmctld.log at debug2:

[2018-11-14T10:09:26.606] backfill: beginning
[2018-11-14T10:09:26.607] debug:  backfill: 2 jobs to backfill
[2018-11-14T10:09:26.607] =========================================
[2018-11-14T10:09:26.607] Begin:2018-11-14T10:09:26 End:2018-11-21T10:09:26 Nodes:bc01-[01-14],bc02-[01-08,10-14],bc03-[01-14],bc06-08,bx01-[01-18],bx02-[01-07,09-18],bx03-[01-18],bx04-[01-18],bx05-[01-18],bx06-[01-18],bx07-[01-18],bx08-[01-02,04-18],bx09-[01-17],gpu[2-5],nile,sb01-[01-20],sb02-[01-02,04-20],sm-epyc-[01-05]
[2018-11-14T10:09:26.607] =========================================
[2018-11-14T10:09:26.607] backfill test for JobId=23795814_* Prio=3728 Partition=htc
[2018-11-14T10:09:26.607] debug2: backfill: entering _try_sched for JobId=23795814_*.
[2018-11-14T10:09:26.608] Test JobId=23795814_* at 2018-11-14T10:09:26 on bx01-[02-18],bx02-[01-07,09-18],bx03-[01-18],bx04-[01-18],bx05-[01-18],bx06-[01-18],bx07-[01-18],bx08-[01-02,04-18],bx09-[01-17],sb01-[01-20],sb02-[02,04-20],sm-epyc-[01-05]
[2018-11-14T10:09:26.608] debug2: select_p_job_test for JobId=23795814_*
[2018-11-14T10:09:26.608] debug2: found 34 usable nodes from config containing bx[01-02]-[01-18]
[2018-11-14T10:09:26.608] debug2: found 106 usable nodes from config containing bx[03-04,06-09]-[01-18]
[2018-11-14T10:09:26.608] debug2: found 18 usable nodes from config containing bx[05]-[01-18]
[2018-11-14T10:09:26.608] debug2: found 38 usable nodes from config containing sb[01-02]-[01-20]
[2018-11-14T10:09:26.608] debug2: found 1 usable nodes from config containing sm-epyc-01
[2018-11-14T10:09:26.608] debug2: found 4 usable nodes from config containing sm-epyc-0[2-5]
[2018-11-14T10:09:26.608] debug2: Advanced reservation removed 4 nodes from consideration for JobId=23795814_543(23795814)
[2018-11-14T10:09:26.608] debug2: select_p_job_test for JobId=23795814_543(23795814)
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: QOS normal acct gibson accrue_cnt underflow
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: QOS normal user 22510 accrue_cnt underflow
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: Assoc 2622(gibson/hsanchez/(null)) accrue_cnt underflow
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: Assoc 163(gibson/(null)/(null)) accrue_cnt underflow
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(cpu) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(mem) is 2306714571936
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(node) is 48075727
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(billing) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(gres/gpu) is 3055255
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(cpu) is 13500642
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(mem) is 11389407232
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(node) is 3576382
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(billing) is 13500642
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(gres/gpu) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(cpu) is 24728109
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(mem) is 51609897232
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(node) is 4978279
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(billing) is 24728109
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(mem) is 2306714571936
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(node) is 48075727
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 3055255
[2018-11-14T10:09:26.611] backfill: Started JobId=23795814_543(23798011) in htc on sm-epyc-01
[2018-11-14T10:09:26.615] backfill: Jobs backfilled since boot: 1

# scontrol show job 23795814_543
JobId=23798011 ArrayJobId=23795814 ArrayTaskId=543 JobName=Pmey
   UserId=hsanchez(22510) GroupId=gibson(452) MCS_label=N/A
   Priority=3728 Nice=0 Account=gibson QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:00 TimeLimit=06:00:00 TimeMin=N/A
   SubmitTime=2018-11-14T08:05:17 EligibleTime=2018-11-14T10:09:26
   AccrueTime=2018-11-14T08:05:18
   StartTime=2018-11-14T10:09:26 EndTime=2018-11-14T16:09:26 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-11-14T10:09:26
   Partition=htc AllocNode:Sid=login:46818
   ReqNodeList=(null) ExcNodeList=(null) NodeList=sm-epyc-01 BatchHost=sm-epyc-01
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=4G,node=1,billing=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=4G MinTmpDiskNode=0
   Features=avx2 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/hsamano/Scripts/Multiple_submission_229.sh
   WorkDir=/home/hsanchez
   StdErr=/scratch/hsamano/tests/MultipleSubmission/logs/Pmey_23795814_543.err
   StdIn=/dev/null
   StdOut=/scratch/hsamano/tests/MultipleSubmission/logs/Pmey_23795814_543.out
   Power=

# sacctmgr show -P qos normal
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
normal|100|00:00:00|low,lowest|cluster|||1.000000||||||||||||||cpu=3500,mem=35T||10000|

# scontrol show partition htc
PartitionName=htc
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=20-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=bc[01-03]-[01-14],bc06-08,bx[01-09]-[01-18],sm-epyc-0[1-5],sb[01-02]-[01-20]
   PriorityJobFactor=3000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=8000 TotalNodes=250 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerCPU=16384

Let me know if you need more.
That's enough, thank you. I have reproduced the issue too. I will let you know as soon as I have more information.
On 14/11/18 5:51 am, bugs@schedmd.com wrote:
> Chris, can you send me slurmctld log with this error and associated
> job information, if possible at a debug2 (from submission of the
> affected job until it ends or at least error happens)?
>
> Also scontrol show job of the affected job, and a show qos and
> partition of the requested ones.

Sorry I missed this earlier (been chasing other issues) — do you still need these now that you've reproduced the problem?

All the best,
Chris
(In reply to Christopher Samuel from comment #14)
> Sorry I missed this earlier (been chasing other issues), do you still
> need these now you've reproduced them?

Hi Chris,

No, I don't need anything more for the moment.

I discovered that the accrue counters do not behave correctly for job arrays when you cancel a job array or a task belonging to it. They also break when slurmctld is restarted or reconfigured, so most probably you had some job arrays running in your system and issued a reconfigure/restart of slurmctld. At that point the counters are reset to 0, and when a job then terminates it causes the underflow.

It doesn't happen with regular jobs, but regular jobs can trigger the error if, after a restart, more job array tasks finish than there are regular jobs in the system.

We are working on a solution.
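For readers following along, the failure mode described above can be sketched roughly as follows. This is a simplified Python illustration, not Slurm's actual C code (the real counters live in acct_policy.c and are unsigned); the class and method names here are made up for the example:

```python
class AccrueCounter:
    """Toy model of a per-QOS/assoc accrue_cnt (hypothetical, for illustration)."""

    def __init__(self):
        # After a slurmctld restart/reconfigure the counter is rebuilt
        # from scratch, i.e. starts at 0, even though job array tasks
        # submitted earlier are still pending.
        self.count = 0

    def incr(self):
        # Normally incremented when a job starts accruing age priority.
        self.count += 1

    def decr(self, name):
        # Decremented when a job starts or is cancelled. The guard here
        # mirrors the reported slurmctld log line: the real counter is
        # unsigned, so decrementing past zero must be caught and logged.
        if self.count == 0:
            return ("error: acct_policy_handle_accrue_time: "
                    f"{name} accrue_cnt underflow")
        self.count -= 1
        return None


qos = AccrueCounter()
# A pre-restart job array task now starts: its decrement finds a
# zero counter, producing the underflow error at every level of the
# hierarchy (QOS, account, user, each association up to root).
print(qos.decr("QOS serial"))
```

This also shows why newly submitted regular jobs can appear in the errors: they decrement the same shared, already-zeroed counters.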
Hi Felip,

On Thursday, 15 November 2018 11:42:52 PM AEDT bugs@schedmd.com wrote:
> No, I don't need anything more for the moment.
[...]
> We are working on a solution.

Thanks so much for the confirmation and the update!

All the best,
Chris
*** Ticket 6072 has been marked as a duplicate of this ticket. ***
Hi,

The issue is finally solved in commit 1e21965cc619b0. The fix will be included starting with 18.08.4.

It turned out that there were a couple of issues; everything should be fine now. I am going to close the bug.

Best regards,
Felip M
Closing issue as fix provided.
On 28/11/18 2:33 am, bugs@schedmd.com wrote:
> The issue is finally solved in commit 1e21965cc619b0.

Thank you Felip! I'll let the Swinburne folks know.

All the best,
Chris