Ticket 6016 - Since upgrading to 18.08.3 from 17.11.12 seeing lots of "accrue_cnt underflow" errors
Summary: Since upgrading to 18.08.3 from 17.11.12 seeing lots of "accrue_cnt underflow" errors
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.3
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Duplicates: 6072
Depends on:
Blocks:
 
Reported: 2018-11-12 03:12 MST by Christopher Samuel
Modified: 2018-11-27 15:36 MST
CC List: 5 users

See Also:
Site: Swinburne
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.4
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (5.20 KB, text/plain)
2018-11-12 03:12 MST, Christopher Samuel

Description Christopher Samuel 2018-11-12 03:12:50 MST
Created attachment 8284 [details]
slurm.conf

Since upgrading to 18.08.3 we're seeing a lot of these errors in our slurmctld logs, immediately preceding the start of a job.

[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: QOS serial accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: QOS serial acct oz016 accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: QOS serial user 10675 accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 169(oz016/dweight/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 155(oz016/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 710(ozgrav/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] error: acct_policy_handle_accrue_time: Assoc 1(root/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:02:07.578] backfill: Started JobId=1712496_66(1722685) in skylake on john20

or:

[2018-11-12T21:00:37.449] error: acct_policy_handle_accrue_time: QOS serial accrue_cnt underflow
[2018-11-12T21:00:37.449] error: acct_policy_handle_accrue_time: Assoc 1(root/(null)/(null)) accrue_cnt underflow
[2018-11-12T21:00:37.449] backfill: Started JobId=1722675 in skylake on john49

I'm not sure whether this is a bug or whether the new version is just surfacing issues that were already present.

Looks like others are seeing it too (see bug#5838).

Let me know if there's anything else you need!

All the best,
Chris
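
For context on what the message itself means: accrue_cnt appears to be an unsigned per-QOS/per-association count of jobs currently accruing age priority, and the "underflow" error is logged when the scheduler needs to decrement a counter that is already zero. The sketch below only illustrates that general pattern; the struct and function names are hypothetical and this is not Slurm's actual source.

/* Minimal illustrative sketch (hypothetical names, not Slurm's source):
 * decrementing an unsigned counter that is already zero would wrap around
 * to 4294967295, so the code logs an "underflow" error instead. */
#include <stdint.h>
#include <stdio.h>

struct usage_counters {
    uint32_t accrue_cnt;    /* jobs currently accruing age priority */
};

static void accrue_cnt_decr(struct usage_counters *u, const char *name)
{
    if (u->accrue_cnt == 0) {
        fprintf(stderr, "error: %s accrue_cnt underflow\n", name);
        return;
    }
    u->accrue_cnt--;
}

int main(void)
{
    struct usage_counters qos_serial = { .accrue_cnt = 0 };

    /* The counter is already zero, so this logs the error rather than
     * wrapping the counter around. */
    accrue_cnt_decr(&qos_serial, "QOS serial");
    return 0;
}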
Comment 2 Felip Moll 2018-11-12 11:02:24 MST
Hi Chris,

Are the jobs that cause these errors ones that were still queued from before the upgrade?

Can you show me an 'scontrol show job' for one of them, along with the details of the serial QOS?

Theoretically this issue should already be fixed by commit 3e3400cdaa, available as of 18.08.2.
Comment 3 Christopher Samuel 2018-11-12 14:23:37 MST
Hi Felip,

Yes, you're right, it did initially look like it was just for jobs submitted from before the upgrade (of which there are lots, including a number of job arrays). However, this seems to be happening for newly submitted jobs as well.

A newly submitted sbatch just shows my logging from the Lua script. For instance, for a simple "sbatch --wrap hostname" I see:

[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Incoming job wants 4294967294 tasks with 65534 cores per task (min_cpus=1)
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): will set tmp GRES to 100M
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): after setting tmp GRES we have GRES (tmp:100M)
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Starting job packing tests
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Setting ntasks_per_node to be num_tasks (1) for user 10631
[2018-11-13T08:18:19.317] job_submit.lua: slurm_job_submit (lua): Setting cores-per-socket to be 1 when job requested 1 for user 10631
[2018-11-13T08:18:19.317] _slurm_rpc_submit_batch_job: JobId=1724956 InitPrio=3370997 usec=440
[2018-11-13T08:18:19.393] sched: Allocate JobId=1724956 NodeList=john101 #CPUs=1 Partition=skylake
[2018-11-13T08:18:19.414] prolog_running_decr: Configuration for JobId=1724956 is complete
[2018-11-13T08:18:20.120] _job_complete: JobId=1724956 WEXITSTATUS 0
[2018-11-13T08:18:20.120] _job_complete: JobId=1724956 done


Whereas a job array element that started just after it shows the issue I reported:

[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: QOS serial accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: QOS serial acct oz016 accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: QOS serial user 10675 accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 169(oz016/dweight/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 155(oz016/(null)/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 710(ozgrav/(null)/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] error: acct_policy_handle_accrue_time: Assoc 1(root/(null)/(null)) accrue_cnt underflow
[2018-11-13T08:20:19.027] backfill: Started JobId=1712514_38(1724959) in skylake on john101
[2018-11-13T08:20:19.044] prolog_running_decr: Configuration for JobId=1712514_38(1724959) is complete

Here's the scontrol show job for that job:

$ scontrol show job 1712514_38
JobId=1724959 ArrayJobId=1712514 ArrayTaskId=38 JobName=banks.sh
   UserId=dweight(10675) GroupId=UWA(504) MCS_label=N/A
   Priority=101665 Nice=0 Account=oz016 QOS=serial
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:13 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2018-11-10T10:39:48 EligibleTime=2018-11-10T10:39:51
   AccrueTime=2018-11-10T10:39:51
   StartTime=2018-11-13T08:20:19 EndTime=2018-11-16T08:20:19 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-11-13T08:20:19
   Partition=skylake AllocNode:Sid=farnarkle2:26602
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=john101
   BatchHost=john101
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:1:*
   TRES=cpu=1,mem=2G,node=1,billing=1,gres/tmp=104857600
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=100M
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/fred/oz016/dweight/sbnk/banks.sh H1 parameter_testing 1 378
   WorkDir=/fred/oz016/dweight/sbnk
   StdErr=/fred/oz016/dweight/sbnk/logs_parameter_testing/gstlal_iir_bank_1712514_H1_378_38.err
   StdIn=/dev/null
   StdOut=/fred/oz016/dweight/sbnk/logs_parameter_testing/gstlal_iir_bank_1712514_H1_378_38.out
   Power=
   TresPerNode=tmp:100M


Here's the serial QOS:

$ sacctmgr list qos serial
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
    serial          0   00:00:00                cluster                                NoReserve               1.000000      cpu=1280                                                                   cpu=1                                                cpu=256       256



Hope that helps!

All the best,
Chris
Comment 4 Jurij Pečar 2018-11-13 07:42:59 MST
Just a "me too" comment. Just today over 21k lines of "accrue_cnt underflow" in our slurmctld.log.

I see it for two different QOSes and associations:

# sacctmgr list -P qos normal,high
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
normal|100|00:00:00|low,lowest|cluster|||1.000000||||||||||||||cpu=3500,mem=35T||10000|
high|1000|00:00:00|low,lowest|cluster|||1.000000|||||||||||cpu=48||||||
Comment 5 Felip Moll 2018-11-13 11:06:24 MST
(In reply to Jurij Pečar from comment #4)
> Just a "me too" comment. Just today over 21k lines of "accrue_cnt underflow"
> in our slurmctld.log.
> 
> I see it for two different QOSes and associations:
> 
> # sacctmgr list -P qos normal,high
> [...]

Thanks,

Which Slurm version are you using?
Comment 7 Jurij Pečar 2018-11-13 11:50:00 MST
(In reply to Felip Moll from comment #5)
> Which Slurm version are you using?

18.08.3, same as Christopher.
Comment 8 Felip Moll 2018-11-13 11:51:02 MST
Chris, can you send me the slurmctld log with this error and the associated job information, if possible at debug2 level (from submission of the affected job until it ends, or at least until the error happens)?

Also an 'scontrol show job' of the affected job, and a 'show qos' and 'show partition' of the requested ones.

Thanks
Comment 9 Felip Moll 2018-11-13 11:53:52 MST
(In reply to Jurij Pečar from comment #7)
> (In reply to Felip Moll from comment #5)
> > Which Slurm version are you using?
> 
> 18.08.3, same as Christopher.

(In reply to Felip Moll from comment #8)
> Chris, can you send me the slurmctld log with this error and the
> associated job information, if possible at debug2 level (from submission
> of the affected job until it ends, or at least until the error happens)?
> 
> Also an 'scontrol show job' of the affected job, and a 'show qos' and
> 'show partition' of the requested ones.
> 
> Thanks

Jurij,

Your info would be also valuable if you want to add it here.

Thanks
Comment 10 Jurij Pečar 2018-11-14 02:18:38 MST
Got one:

slurmctld.log at debug2:

[2018-11-14T10:09:26.606] backfill: beginning
[2018-11-14T10:09:26.607] debug:  backfill: 2 jobs to backfill
[2018-11-14T10:09:26.607] =========================================
[2018-11-14T10:09:26.607] Begin:2018-11-14T10:09:26 End:2018-11-21T10:09:26 Nodes:bc01-[01-14],bc02-[01-08,10-14],bc03-[01-14],bc06-08,bx01-[01-18],bx02-[01-07,09-18],bx03-[01-18],bx04-[01-18],bx05-[01-18],bx06-[01-18],bx07-[01-18],bx08-[01-02,04-18],bx09-[01-17],gpu[2-5],nile,sb01-[01-20],sb02-[01-02,04-20],sm-epyc-[01-05]
[2018-11-14T10:09:26.607] =========================================
[2018-11-14T10:09:26.607] backfill test for JobId=23795814_* Prio=3728 Partition=htc
[2018-11-14T10:09:26.607] debug2: backfill: entering _try_sched for JobId=23795814_*.
[2018-11-14T10:09:26.608] Test JobId=23795814_* at 2018-11-14T10:09:26 on bx01-[02-18],bx02-[01-07,09-18],bx03-[01-18],bx04-[01-18],bx05-[01-18],bx06-[01-18],bx07-[01-18],bx08-[01-02,04-18],bx09-[01-17],sb01-[01-20],sb02-[02,04-20],sm-epyc-[01-05]
[2018-11-14T10:09:26.608] debug2: select_p_job_test for JobId=23795814_*
[2018-11-14T10:09:26.608] debug2: found 34 usable nodes from config containing bx[01-02]-[01-18]
[2018-11-14T10:09:26.608] debug2: found 106 usable nodes from config containing bx[03-04,06-09]-[01-18]
[2018-11-14T10:09:26.608] debug2: found 18 usable nodes from config containing bx[05]-[01-18]
[2018-11-14T10:09:26.608] debug2: found 38 usable nodes from config containing sb[01-02]-[01-20]
[2018-11-14T10:09:26.608] debug2: found 1 usable nodes from config containing sm-epyc-01
[2018-11-14T10:09:26.608] debug2: found 4 usable nodes from config containing sm-epyc-0[2-5]
[2018-11-14T10:09:26.608] debug2: Advanced reservation removed 4 nodes from consideration for JobId=23795814_543(23795814)
[2018-11-14T10:09:26.608] debug2: select_p_job_test for JobId=23795814_543(23795814)
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: QOS normal acct gibson accrue_cnt underflow
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: QOS normal user 22510 accrue_cnt underflow
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: Assoc 2622(gibson/hsanchez/(null)) accrue_cnt underflow
[2018-11-14T10:09:26.610] error: acct_policy_handle_accrue_time: Assoc 163(gibson/(null)/(null)) accrue_cnt underflow
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(cpu) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(mem) is 2306714571936
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(node) is 48075727
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(billing) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), qos normal grp_used_tres_run_secs(gres/gpu) is 3055255
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(cpu) is 13500642
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(mem) is 11389407232
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(node) is 3576382
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(billing) is 13500642
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 2622(gibson/hsanchez/(null)) grp_used_tres_run_secs(gres/gpu) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(cpu) is 24728109
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(mem) is 51609897232
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(node) is 4978279
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(billing) is 24728109
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 163(gibson/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(mem) is 2306714571936
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(node) is 48075727
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is 728580939
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(vmem) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(pages) is 0
[2018-11-14T10:09:26.610] debug2: acct_policy_job_begin: after adding JobId=23795814_543(23798011), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 3055255
[2018-11-14T10:09:26.611] backfill: Started JobId=23795814_543(23798011) in htc on sm-epyc-01
[2018-11-14T10:09:26.615] backfill: Jobs backfilled since boot: 1


# scontrol show job 23795814_543
JobId=23798011 ArrayJobId=23795814 ArrayTaskId=543 JobName=Pmey
   UserId=hsanchez(22510) GroupId=gibson(452) MCS_label=N/A
   Priority=3728 Nice=0 Account=gibson QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:00 TimeLimit=06:00:00 TimeMin=N/A
   SubmitTime=2018-11-14T08:05:17 EligibleTime=2018-11-14T10:09:26
   AccrueTime=2018-11-14T08:05:18
   StartTime=2018-11-14T10:09:26 EndTime=2018-11-14T16:09:26 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-11-14T10:09:26
   Partition=htc AllocNode:Sid=login:46818
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=sm-epyc-01
   BatchHost=sm-epyc-01
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=4G,node=1,billing=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=4G MinTmpDiskNode=0
   Features=avx2 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/hsamano/Scripts/Multiple_submission_229.sh
   WorkDir=/home/hsanchez
   StdErr=/scratch/hsamano/tests/MultipleSubmission/logs/Pmey_23795814_543.err
   StdIn=/dev/null
   StdOut=/scratch/hsamano/tests/MultipleSubmission/logs/Pmey_23795814_543.out
   Power=

# sacctmgr show -P qos normal
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
normal|100|00:00:00|low,lowest|cluster|||1.000000||||||||||||||cpu=3500,mem=35T||10000|

# scontrol show  partition htc
PartitionName=htc
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=20-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=bc[01-03]-[01-14],bc06-08,bx[01-09]-[01-18],sm-epyc-0[1-5],sb[01-02]-[01-20]
   PriorityJobFactor=3000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=8000 TotalNodes=250 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerCPU=16384


Let me know if you need more.
Comment 11 Felip Moll 2018-11-14 07:52:12 MST
That's enough, thank you.

I reproduced the issue too.

Will inform you as soon as I have more information.
Comment 14 Christopher Samuel 2018-11-14 16:03:00 MST
On 14/11/18 5:51 am, bugs@schedmd.com wrote:

> Chris, can you send me the slurmctld log with this error and the
> associated job information, if possible at debug2 level (from submission
> of the affected job until it ends, or at least until the error happens)?
> 
> Also an 'scontrol show job' of the affected job, and a 'show qos' and
> 'show partition' of the requested ones.

Sorry I missed this earlier (been chasing other issues), do you still
need these now you've reproduced them?

All the best,
Chris
Comment 15 Felip Moll 2018-11-15 05:42:52 MST
(In reply to Christopher Samuel from comment #14)
> Sorry I missed this earlier (been chasing other issues), do you still
> need these now you've reproduced them?

Hi Chris,

No, I don't need anything more for the moment.

I discovered that the accrue counters are not behaving correctly for
job arrays when you cancel a job array or a task belonging to it. They
also go wrong when slurmctld is restarted or reconfigured, so the most
likely explanation is that you had some job arrays in the system and
then issued a reconfigure/restart of slurmctld.

At that point the counters are reset to 0, and when a job terminates the
decrement causes the underflow.

It doesn't happen with regular jobs on their own, but regular jobs can
trigger the error if, after a restart, more job array tasks finish than
there are regular jobs in the system.

We are working on a solution.
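
To illustrate the failure mode Felip describes, here is a hypothetical sketch (illustrative names only, not the actual slurmctld code): pending job array tasks increment the accrue counter, a restart/reconfigure rebuilds the controller state with the counter back at zero, and each later decrement for those tasks then hits the underflow guard and logs the error seen above.

/* Hypothetical sketch of the failure mode described above; the names and
 * the recovery step are illustrative, not Slurm's actual code. */
#include <stdint.h>
#include <stdio.h>

struct usage_counters {
    uint32_t accrue_cnt;
};

static void accrue_cnt_incr(struct usage_counters *u)
{
    u->accrue_cnt++;    /* a job becomes eligible and starts accruing */
}

static void accrue_cnt_decr(struct usage_counters *u, const char *name)
{
    if (u->accrue_cnt == 0) {
        fprintf(stderr, "error: %s accrue_cnt underflow\n", name);
        return;
    }
    u->accrue_cnt--;    /* the job stops accruing age priority */
}

int main(void)
{
    struct usage_counters qos_normal = { 0 };

    /* Three pending tasks of a job array start accruing age priority. */
    for (int i = 0; i < 3; i++)
        accrue_cnt_incr(&qos_normal);

    /* slurmctld is restarted/reconfigured: per the bug, the rebuilt state
     * does not account for the still-pending array tasks, so the counter
     * comes back as zero. */
    qos_normal.accrue_cnt = 0;

    /* When those tasks later stop accruing (e.g. they are started by
     * backfill or terminate), each decrement finds an already-zero
     * counter and logs the error seen in the slurmctld log. */
    for (int i = 0; i < 3; i++)
        accrue_cnt_decr(&qos_normal, "QOS normal");

    return 0;
}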
Comment 18 Christopher Samuel 2018-11-16 21:57:30 MST
Hi Felip,

On Thursday, 15 November 2018 11:42:52 PM AEDT bugs@schedmd.com wrote:

> No, I don't need anything more for the moment.
[...]
> We are working on a solution.

Thanks so much for the confirmation and the update!

All the best,
Chris
Comment 31 Felip Moll 2018-11-22 02:43:06 MST
*** Ticket 6072 has been marked as a duplicate of this ticket. ***
Comment 35 Felip Moll 2018-11-27 08:33:38 MST
Hi,

The issue is finally solved in commit 1e21965cc619b0.

The fix will be included starting with 18.08.4. It turned out that there
were a couple of issues involved; everything should be fine now.

I am going to close the ticket.

Best regards,
Felip M
Comment 36 Felip Moll 2018-11-27 15:33:50 MST
Closing issue as fix provided.
Comment 37 Christopher Samuel 2018-11-27 15:36:01 MST
On 28/11/18 2:33 am, bugs@schedmd.com wrote:

> The issue is finally solved in commit 1e21965cc619b0.

Thank you Felip!  I'll let the Swinburne folks know.

All the best,
Chris