Ticket 8033 - slurmctld.log - error: slurmdbd: agent queue filling (55619), RESTART SLURMDBD NOW
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-11-01 00:02 MDT by Damien
Modified: 2019-11-27 02:48 MST

See Also:
Site: Monash University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log (83.48 MB, application/x-gzip)
2019-11-01 00:16 MDT, Damien
Details
slurmdb logs (89.05 KB, application/x-gzip)
2019-11-01 00:18 MDT, Damien
Details
patch proposal (590 bytes, patch)
2019-11-01 01:58 MDT, Dominik Bartkiewicz
Details | Diff
runaway jobs (6.62 MB, text/plain)
2019-11-20 19:17 MST, Damien
Details
Current slurmdbd logs (752.39 KB, application/x-bzip2)
2019-11-21 02:35 MST, Damien
Details
workaround (426 bytes, patch)
2019-11-21 06:04 MST, Dominik Bartkiewicz
Details | Diff

Description Damien 2019-11-01 00:02:08 MDT
Good Afternoon Slurm Support

We are getting these errors in our slurmctld.log; see the following:
--
[2019-11-01T11:05:11.960] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-01T11:06:49.397] error: slurmdbd: agent queue filling (55619), RESTART SLURMDBD NOW
[2019-11-01T11:08:54.015] error: slurmdbd: agent queue filling (55659), RESTART SLURMDBD NOW
[2019-11-01T11:10:55.293] error: slurmdbd: agent queue filling (55697), RESTART SLURMDBD NOW
[2019-11-01T11:13:03.528] error: slurmdbd: agent queue filling (55721), RESTART SLURMDBD NOW
[2019-11-01T11:15:13.307] error: slurmdbd: agent queue filling (55771), RESTART SLURMDBD NOW
[2019-11-01T11:17:40.362] error: slurmdbd: agent queue filling (55818), RESTART SLURMDBD NOW
[2019-11-01T11:19:42.175] error: slurmdbd: agent queue filling (55852), RESTART SLURMDBD NOW
[2019-11-01T11:21:53.956] error: slurmdbd: agent queue filling (55908), RESTART SLURMDBD NOW
[2019-11-01T11:23:55.364] error: slurmdbd: agent queue filling (55936), RESTART SLURMDBD NOW
[2019-11-01T11:26:01.045] error: slurmdbd: agent queue filling (55992), RESTART SLURMDBD NOW
[2019-11-01T11:28:09.010] error: slurmdbd: agent queue filling (56016), RESTART SLURMDBD NOW
[2019-11-01T11:30:11.090] error: slurmdbd: agent queue filling (56052), RESTART SLURMDBD NOW
[2019-11-01T11:31:52.000] slurmdbd: agent queue size 56100
[2019-11-01T11:32:35.690] error: slurmdbd: agent queue filling (56108), RESTART SLURMDBD NOW
[2019-11-01T11:34:43.201] error: slurmdbd: agent queue filling (56128), RESTART SLURMDBD NOW
[2019-11-01T11:36:50.111] error: slurmdbd: agent queue filling (56193), RESTART SLURMDBD NOW
[2019-11-01T11:45:15.943] Registering slurmctld at port 6817 with slurmdbd.


This is making us very nervous.



These circumstances might have contributed to these errors:

- On Monday (28th Oct 2019), we suffered from https://bugs.schedmd.com/show_bug.cgi?id=8006
- Two days ago, we rebuilt the VM running our MariaDB instance, so the database was down for at least 12 hours


I will be sending more logs into this ticket.


Kindly advise

Many Thanks

Damien
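A quick way to quantify how fast the agent queue is growing is to pull the queue sizes out of slurmctld.log. A minimal sketch, assuming GNU grep and using two sample lines from the log excerpt above (the /tmp paths are illustrative, not the real log location):

```shell
# Sketch: extract the DBD agent queue sizes from slurmctld.log so the
# growth rate is visible (sample lines copied from the excerpt above).
cat > /tmp/sample_slurmctld.log <<'EOF'
[2019-11-01T11:06:49.397] error: slurmdbd: agent queue filling (55619), RESTART SLURMDBD NOW
[2019-11-01T11:36:50.111] error: slurmdbd: agent queue filling (56193), RESTART SLURMDBD NOW
EOF

# Print one queue size per line, oldest first
grep -oE 'agent queue filling \([0-9]+\)' /tmp/sample_slurmctld.log |
    grep -oE '[0-9]+'
```

Here the queue grew from 55619 to 56193 (574 entries) in about 30 minutes, roughly 19 entries per minute.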
Comment 1 Damien 2019-11-01 00:16:54 MDT
Created attachment 12192 [details]
slurmctld log

slurmctld log with errors slurmdbd
Comment 2 Damien 2019-11-01 00:18:20 MDT
Created attachment 12193 [details]
slurmdb logs
Comment 3 Dominik Bartkiewicz 2019-11-01 01:58:00 MDT
Created attachment 12194 [details]
patch proposal

As in bug 8006, you have hit a known bug.

This patch drops multiple irrelevant messages that spam slurmdbd.

Please recompile slurmdbd and restart it.
Comment 4 Dominik Bartkiewicz 2019-11-01 02:08:29 MDT
Hi

If you haven't rebuilt slurmdbd with the above patch,
you can also cherry-pick this patch from 19.05:
https://github.com/SchedMD/slurm/commit/89910fd50a3120bd1d8b29fcf6d543e53dbfe732

Could you send me the output from "scontrol show res" and "sacctmgr show reservation"?

Dominik
Comment 5 Damien 2019-11-01 02:16:08 MDT
$ scontrol show res
ReservationName=mxbeampostproc StartTime=2017-12-18T12:20:35 EndTime=2025-01-08T11:38:57 Duration=2577-23:18:22
   Nodes=m3d[006-007] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=48
   Users=(null) Accounts=-training Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=cryosparc StartTime=2018-11-09T11:33:37 EndTime=2020-11-08T11:33:37 Duration=730-00:00:00
   Nodes=m3h015 NodeCnt=1 CoreCnt=28 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=28
   Users=jafarl,lancew Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=highmem StartTime=2018-11-30T11:34:35 EndTime=2019-11-30T11:34:35 Duration=365-00:00:00
   Nodes=m3m000 NodeCnt=1 CoreCnt=36 Features=(null) PartitionName=m3m Flags=SPEC_NODES
   TRES=cpu=36
   Users=trungn,ctan,damienl,chines,lancew,jafarl,nwong,kerriw,philipc,smichnow,jchang,mmcg0002,angu0022,ska565,ozgeo,rbeare,zseeger,keving,swil0005,lawyl1,nlkar1,stya0001 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=sexton StartTime=2019-02-07T11:36:44 EndTime=2019-12-31T12:00:00 Duration=327-00:23:16
   Nodes=m3g[000,006] NodeCnt=2 CoreCnt=72 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=72
   Users=galisa,lynnliang,mbelouso,xzha0043,ctan,spiper Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=M3-backup StartTime=2019-03-15T08:43:14 EndTime=2020-03-14T08:43:14 Duration=365-00:00:00
   Nodes=m3d[002,008] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=48
   Users=ctan,trungn Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=CryoemFacility StartTime=2019-05-28T10:12:23 EndTime=2020-05-27T10:12:23 Duration=365-00:00:00
   Nodes=m3c004,m3g014 NodeCnt=2 CoreCnt=60 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=60
   Users=lancew,hven0001,jafarl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=AWX StartTime=2019-06-07T14:48:59 EndTime=2019-12-31T12:00:00 Duration=206-20:11:01
   Nodes=m3a[009-010] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=48
   Users=trungn,ctan,damienl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=simple StartTime=2019-06-30T10:41:27 EndTime=2020-06-29T10:41:27 Duration=365-00:00:00
   Nodes=m3a[016-020] NodeCnt=5 CoreCnt=120 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=120
   Users=hael,creboul,zqi1,sarahle,lancew,marionb,ctan,skie0002,mcaggian Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=uow StartTime=2019-09-13T11:06:48 EndTime=2025-01-08T10:00:00 Duration=1943-21:53:12
   Nodes=m3c011,m3h011 NodeCnt=2 CoreCnt=52 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=52
   Users=lancew,simonb,jafarl,jbouwer Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=nfs StartTime=2019-09-23T18:17:40 EndTime=2020-09-22T18:17:40 Duration=365-00:00:00
   Nodes=m3f[027-028] NodeCnt=2 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=6
   Users=chines Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=p100testing StartTime=2019-10-28T09:00:00 EndTime=2019-11-11T09:00:00 Duration=14-00:00:00
   Nodes=m3h[001,003] NodeCnt=2 CoreCnt=56 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=56
   Users=philipc,mbelouso Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
Comment 6 Dominik Bartkiewicz 2019-11-01 03:07:33 MDT
Hi

Have you managed to recompile slurmdbd with one of those patches?
Did this help?

Dominik
Comment 7 Damien 2019-11-01 03:43:48 MDT
Hi Dominik

Does v18.08.08 have the fixes for these too?

Cheers

Damien

Comment 8 Dominik Bartkiewicz 2019-11-01 04:07:22 MDT
Hi

Commit 89910fd50a is only available in 19.05.
18.08 does contain a few fixes that help prevent such a situation (incoherent reservation data between slurmctld and slurmdbd).

I recommend applying 89910fd50a before you start losing accounting data.

Please send me the output from "sacctmgr show reservation"

Dominik
Comment 9 Damien 2019-11-01 04:42:10 MDT
$ sacctmgr show reservation
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
        m3             AWX                         cpu=48 2019-10-28T20:16:55 2019-12-31T12:00:00 3.4818e+05 
        m3  CryoemFacility                         cpu=60 2019-10-28T20:16:55 2020-05-27T10:12:23 4.4692e+04 
        m3       cryosparc                         cpu=28 2019-10-28T20:16:55 2020-11-08T11:33:37 3.4818e+05 
        m3         highmem                         cpu=36 2019-10-28T20:16:55 2019-11-30T11:34:35 1.5440e+05 
        m3       M3-backup                         cpu=48 2019-10-28T20:16:55 2020-03-14T08:43:14 3.4818e+05 
        m3  mxbeampostproc                         cpu=48 2019-10-28T20:16:55 2025-01-08T11:38:57 3.4818e+05 
        m3             nfs                          cpu=6 2019-10-28T20:16:55 2020-09-22T18:17:40 3.4818e+05 
        m3     p100testing                         cpu=56 2019-10-28T20:16:55 2019-11-11T09:00:00 2.6519e+05 
        m3          sexton                         cpu=72 2019-10-28T20:16:55 2019-12-31T12:00:00 8.7279e+04 
        m3          simple                        cpu=120 2019-10-28T20:16:55 2020-06-29T10:41:27   0.000000 
        m3             uow                         cpu=52 2019-10-28T20:16:55 2025-01-08T10:00:00 3.4818e+05
Comment 10 Damien 2019-11-01 04:46:03 MDT
sacctmgr show reservation -p
Cluster|Name|TRES|TimeStart|TimeEnd|UnusedWall|
m3|AWX|cpu=48|2019-10-28T20:16:55|2019-12-31T12:00:00|348185.000000|
m3|CryoemFacility|cpu=60|2019-10-28T20:16:55|2020-05-27T10:12:23|44692.000000|
m3|cryosparc|cpu=28|2019-10-28T20:16:55|2020-11-08T11:33:37|348185.000000|
m3|highmem|cpu=36|2019-10-28T20:16:55|2019-11-30T11:34:35|154400.000000|
m3|M3-backup|cpu=48|2019-10-28T20:16:55|2020-03-14T08:43:14|348185.000000|
m3|mxbeampostproc|cpu=48|2019-10-28T20:16:55|2025-01-08T11:38:57|348185.000000|
m3|nfs|cpu=6|2019-10-28T20:16:55|2020-09-22T18:17:40|348185.000000|
m3|p100testing|cpu=56|2019-10-28T20:16:55|2019-11-11T09:00:00|265186.571429|
m3|sexton|cpu=72|2019-10-28T20:16:55|2019-12-31T12:00:00|87279.000000|
m3|simple|cpu=120|2019-10-28T20:16:55|2020-06-29T10:41:27|0.000000|
m3|uow|cpu=52|2019-10-28T20:16:55|2025-01-08T10:00:00|348185.000000|
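When slurmctld and slurmdbd hold incoherent reservation data, the two listings above will disagree. A minimal sketch of that comparison, using a small hypothetical sample of names; on the live system the two lists would be parsed from "scontrol show res" and "sacctmgr show reservation -p":

```shell
# Sketch: compare reservation names known to slurmctld with those known
# to slurmdbd (sample names only; /tmp paths are illustrative).
printf '%s\n' AWX CryoemFacility cryosparc highmem | LC_ALL=C sort > /tmp/ctld_res
printf '%s\n' AWX cryosparc highmem | LC_ALL=C sort > /tmp/dbd_res

# Reservations slurmctld knows about but slurmdbd does not
LC_ALL=C comm -23 /tmp/ctld_res /tmp/dbd_res
```

Any name printed here would be a reservation the controller is trying (and failing) to reconcile with the database.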
Comment 11 Damien 2019-11-02 02:16:24 MDT
Hi Dominik

Looks like the direction is to upgrade to the latest stable v19.05.x; we just have to sort out any deprecated features.

Can I conclude that the root cause of this whole episode is jobs being submitted to our production cluster with GPU requirements but without GRES values, which caused this bug to surface in both slurmctld and slurmdbd?



Cheers

Damien
Comment 12 Dominik Bartkiewicz 2019-11-02 04:40:10 MDT
Hi

I assume that this was the root cause of the segfault from bug 8006.
Multiple restarts/crashes of slurmctld and slurmdbd can cause a situation like the one in this bug (slurmdbd being spammed with incorrect reservation update requests).

Updating to 19.05 will solve these issues, but it must be done quickly; otherwise
slurmctld will start dropping accounting information.

In my opinion, the easiest solution for now is to use
https://github.com/SchedMD/slurm/commit/89910fd50a

Dominik
Comment 13 Dominik Bartkiewicz 2019-11-04 05:19:15 MST
Hi

Any news?

Dominik
Comment 14 Damien 2019-11-04 07:14:11 MST
Hi Dominik

Thanks for the follow up.

I am preparing to roll out v19.05.03. Once the CR is tested and approved, I will do a rolling upgrade.



Cheers

Damien
Comment 15 Damien 2019-11-05 19:43:03 MST
Hi Dominik

We are planning to upgrade to v19.05.03.

One of the first tasks (as with previous updates) is to upgrade slurmdbd and convert the database (18.08.06 -> 19.05.03). Since we are having this problem right now:

1) Do we need to fix slurmdbd/MySQL first, before this upgrade?

or

2) Can we just go ahead and upgrade to v19.05.03, and it will fix both this issue and the existing database?



Kindly advise


Thanks

Damien
Comment 16 Dominik Bartkiewicz 2019-11-05 23:32:22 MST
(In reply to Damien from comment #15)

Hi

1) no
2) yes
You can just update slurmdbd without any extra/non-standard steps.

Notes:
Bug 8029: all 19.05 releases have a known problem which is not yet fixed.
The old sacct (18.08) has some problems displaying new job records, but all data is correctly stored in the database.

Dominik
Comment 17 Damien 2019-11-06 00:44:17 MST
Hi Dominik

Thanks for your info.


So I can just upgrade directly to v19.05.03, and it will solve our current issues.




To clarify:

1) Upgrade slurmdbd and convert/repair the database
An hour

2) Upgrade primary slurmctld and secondary slurmctld
30 minutes

3) Upgrade all compute nodes and restart slurmds
An hour

With this proposed timeline, I should not face this problem: https://bugs.schedmd.com/show_bug.cgi?id=8029

Right?



Cheers

Damien
Comment 18 Dominik Bartkiewicz 2019-11-06 00:55:23 MST
Hi

Yes, I don't see any problem

Dominik
Comment 19 Dominik Bartkiewicz 2019-11-12 02:58:50 MST
Hi

Could you tell me if the upgrade went well?

Dominik
Comment 20 Damien 2019-11-20 16:00:54 MST
Hi Dominik

Thanks for this follow up.

Finally, I have the CR to upgrade to v19.05.04, I am doing it now.



Cheers

Damien
Comment 21 Damien 2019-11-20 19:15:12 MST
Hi Dominik 

I am seeing this error after the upgrade to v19.05.04:

# sacctmgr show runaway > runaway.txt
sacctmgr: error: slurmdbd: Sending message type MsgType=1488: 11: Resource temporarily unavailable
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

See attached (runaway.txt)

[ec2-user@m3-mgmt1 sbin]$ ./slurmctld -V
slurm 19.05.4

[ec2-user@m3-mgmt1 sbin]$ sudo ./slurmdbd -V
slurm 19.05.4



Kindly advise


Thanks

Damien
Comment 22 Damien 2019-11-20 19:17:55 MST
Created attachment 12361 [details]
runaway jobs

Kindly review
Comment 23 Dominik Bartkiewicz 2019-11-21 01:46:39 MST
Hi

The first error, "Resource temporarily unavailable", occurred because the database schema was being updated.
While that conversion is running, slurmdbd responds this way to requests.

I think all these runaway jobs came from lost accounting data,
and unfortunately some of those records may be lost permanently.
Could you send me the output from sdiag?

Dominik
Comment 24 Damien 2019-11-21 01:54:27 MST
[damienl@m3-login1 ~]$ 
[damienl@m3-login1 ~]$ sdiag
*******************************************************
sdiag output at Thu Nov 21 19:51:43 2019 (1574326303)
Data since      Thu Nov 21 11:34:22 2019 (1574296462)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0

DBD Agent queue size: 0

Jobs submitted: 1576
Jobs started:   1244
Jobs completed: 1298
Jobs canceled:  654
Jobs failed:    107

Job states ts:  Thu Nov 21 19:51:17 2019 (1574326277)
Jobs pending:   1790
Jobs running:   366

Main schedule statistics (microseconds):
	Last cycle:   6617
	Max cycle:    689757
	Total cycles: 1963
	Mean cycle:   8939
	Mean depth cycle:  301
	Cycles per minute: 3
	Last queue length: 1080

Backfilling stats
	Total backfilled jobs (since last slurm start): 850
	Total backfilled jobs (since last stats cycle start): 850
	Total backfilled heterogeneous job components: 0
	Total cycles: 186
	Last cycle when: Thu Nov 21 19:49:59 2019 (1574326199)
	Last cycle: 50962799
	Max cycle:  139778412
	Mean cycle: 88238873
	Last depth cycle: 1122
	Last depth cycle (try sched): 334
	Depth Mean: 974
	Depth Mean (try depth): 269
	Last queue length: 1084
	Queue length mean: 1013

Latency for gettimeofday() (x1000): 27 microseconds

Remote Procedure Call statistics by message type
	REQUEST_FED_INFO                        ( 2049) count:261407 ave_time:111    total_time:29191258
	REQUEST_PARTITION_INFO                  ( 2009) count:222390 ave_time:212    total_time:47164334
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:197733 ave_time:469336 total_time:92803388501
	REQUEST_JOB_USER_INFO                   ( 2039) count:63669  ave_time:177437 total_time:11297276984
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:17828  ave_time:371567 total_time:6624311001
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:2783   ave_time:45263  total_time:125969702
	REQUEST_UPDATE_NODE                     ( 3002) count:2387   ave_time:13073  total_time:31206991
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:1694   ave_time:168319 total_time:285133339
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:1577   ave_time:298427 total_time:470620147
	REQUEST_STEP_COMPLETE                   ( 5016) count:1564   ave_time:64703  total_time:101195877
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:1547   ave_time:292122 total_time:451913175
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:1276   ave_time:424674 total_time:541884666
	REQUEST_NODE_INFO                       ( 2007) count:961    ave_time:17452  total_time:16771928
	REQUEST_KILL_JOB                        ( 5032) count:670    ave_time:43586  total_time:29202883
	REQUEST_CONTROL_STATUS                  ( 2053) count:294    ave_time:109    total_time:32107
	REQUEST_SHARE_INFO                      ( 2022) count:142    ave_time:4335   total_time:615659
	REQUEST_JOB_INFO                        ( 2003) count:141    ave_time:274539 total_time:38710057
	REQUEST_RESERVATION_INFO                ( 2024) count:75     ave_time:34618  total_time:2596380
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:14     ave_time:6621   total_time:92706
	REQUEST_JOB_PACK_ALLOC_INFO             ( 4027) count:13     ave_time:294840 total_time:3832931
	REQUEST_CANCEL_JOB_STEP                 ( 5005) count:9      ave_time:79528  total_time:715755
	REQUEST_JOB_READY                       ( 4019) count:7      ave_time:333120 total_time:2331845
	REQUEST_PING                            ( 1008) count:6      ave_time:121    total_time:727
	REQUEST_UPDATE_JOB                      ( 3001) count:3      ave_time:249064 total_time:747192
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:2      ave_time:187    total_time:374
	REQUEST_TRIGGER_PULL                    ( 2030) count:1      ave_time:125    total_time:125
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:1      ave_time:4500   total_time:4500
	REQUEST_STATS_INFO                      ( 2035) count:0      ave_time:0      total_time:0

Remote Procedure Call statistics by user
	spiper          (   12815) count:157833 ave_time:45376  total_time:7161941665
	swil0005        (   10264) count:129331 ave_time:231910 total_time:29993234988
	root            (       0) count:46844  ave_time:178468 total_time:8360168179
	a1058400        (   11860) count:20405  ave_time:160558 total_time:3276190690
	vpam0001        (   11208) count:16821  ave_time:158968 total_time:2674014874
	rukshana        (   10942) count:12049  ave_time:152603 total_time:1838720823
	hfettke         (   11128) count:11799  ave_time:149586 total_time:1764969090
	bgully          (   10652) count:10817  ave_time:151090 total_time:1634344561
	tday0002        (   12019) count:10533  ave_time:150989 total_time:1590377477
	awri0001        (   11597) count:10363  ave_time:148071 total_time:1534460168
	nsingh          (   12834) count:9908   ave_time:153935 total_time:1525191685
	asegal          (   11199) count:9452   ave_time:154914 total_time:1464249055
	aafo0001        (   11936) count:9107   ave_time:153237 total_time:1395533288
	robe0002        (   12508) count:9007   ave_time:154414 total_time:1390812825
	mahmoudm        (   11193) count:8882   ave_time:155416 total_time:1380408567
	hbhukya1        (   10583) count:8579   ave_time:154224 total_time:1323089062
	ksabaroe        (   10365) count:8556   ave_time:156796 total_time:1341550759
	msam0006        (   11491) count:8554   ave_time:155057 total_time:1326363120
	dkor0003        (   10806) count:8551   ave_time:156050 total_time:1334391860
	ttha0011        (   11105) count:8549   ave_time:156814 total_time:1340606888
	vieth           (   11159) count:8548   ave_time:155134 total_time:1326093038
	ska565          (   10255) count:8547   ave_time:157061 total_time:1342406227
	elia0001        (   10680) count:8545   ave_time:155794 total_time:1331265156
	emct0001        (   11214) count:8544   ave_time:154893 total_time:1323410598
	chaos           (   10272) count:8541   ave_time:155929 total_time:1331790547
	hven0001        (   10350) count:8541   ave_time:155931 total_time:1331814677
	aazhar          (   11233) count:8541   ave_time:156106 total_time:1333303787
	scle0003        (   11552) count:8538   ave_time:157057 total_time:1340953256
	aper0034        (   11507) count:8538   ave_time:156919 total_time:1339781446
	yche0241        (   11880) count:8538   ave_time:157089 total_time:1341228250
	csen0005        (   10998) count:8538   ave_time:156955 total_time:1340084586
	joshuamhardy    (   10156) count:8538   ave_time:156934 total_time:1339908677
	strikm          (   12849) count:8515   ave_time:143787 total_time:1224347318
	gmaj0001        (   10708) count:8297   ave_time:156558 total_time:1298967447
	sarahsyh        (   11365) count:8088   ave_time:158678 total_time:1283390121
	xiaoliuz        (   10462) count:7794   ave_time:156480 total_time:1219608911
	huiw            (   11363) count:7590   ave_time:160663 total_time:1219435862
	stuarto         (   10273) count:7050   ave_time:156394 total_time:1102580241
	wparker         (   11551) count:6003   ave_time:168641 total_time:1012357688
	tallison        (   10896) count:5663   ave_time:156993 total_time:889051643
	ychen           (   11648) count:5328   ave_time:160490 total_time:855093028
	mfarrell        (   10679) count:5170   ave_time:171170 total_time:884949319
	scottk          (   11656) count:5151   ave_time:148395 total_time:764384252
	bsun            (   11451) count:4896   ave_time:148291 total_time:726033780
	jawon5          (   11240) count:4781   ave_time:142611 total_time:681826696
	jcoxon          (   10835) count:4687   ave_time:166827 total_time:781921815
	ademarco        (   10256) count:4542   ave_time:170087 total_time:772538509
	lancew          (   10213) count:4532   ave_time:150366 total_time:681462011
	rshi0007        (   10423) count:4261   ave_time:145292 total_time:619091689
	clam0015        (   11607) count:4234   ave_time:172173 total_time:728983097
	mbernhar        (   10398) count:4232   ave_time:143730 total_time:608267818
	bktan12         (   11006) count:4170   ave_time:164563 total_time:686230129
	anirka          (   12061) count:3588   ave_time:129919 total_time:466151661
	hwan0088        (   11628) count:2787   ave_time:155925 total_time:434565615
	wpit0001        (   11017) count:2782   ave_time:119519 total_time:332503014
	jinmingz        (   12653) count:2760   ave_time:126613 total_time:349454482
	uqpmart6        (   11694) count:2634   ave_time:122579 total_time:322875008
	tnewing         (   10925) count:2533   ave_time:125210 total_time:317157403
	melvinl         (   12907) count:2502   ave_time:129006 total_time:322773722
	esomayeh        (   10379) count:2193   ave_time:100939 total_time:221359348
	tsiejka         (   12091) count:2142   ave_time:118669 total_time:254190436
	phoebeimms      (   10690) count:1919   ave_time:159973 total_time:306989524
	arevans         (   11437) count:1878   ave_time:115411 total_time:216742760
	mom008          (   11465) count:1779   ave_time:40346  total_time:71777277
	ctav0001        (   12340) count:1779   ave_time:128953 total_time:229407799
	sgujjari        (   12661) count:1671   ave_time:108665 total_time:181579846
	khsmsampath     (   11028) count:1520   ave_time:174775 total_time:265659500
	mju2            (   11470) count:1232   ave_time:72626  total_time:89475773
	titust          (   12865) count:1016   ave_time:116938 total_time:118809711
	suzanm          (   11205) count:853    ave_time:137001 total_time:116862685
	mkam0016        (   12925) count:756    ave_time:77172  total_time:58342773
	yaghmaien       (   12664) count:719    ave_time:152840 total_time:109892012
	jpollock        (   12914) count:701    ave_time:31738  total_time:22248743
	ctiw0001        (   12015) count:513    ave_time:141064 total_time:72365989
	wtsch1          (   11038) count:415    ave_time:31441  total_time:13048043
	land            (   10877) count:368    ave_time:41022  total_time:15096298
	mbos0003        (   11595) count:343    ave_time:57252  total_time:19637660
	slurm           (     497) count:295    ave_time:109    total_time:32232
	aklaus          (   10976) count:285    ave_time:117614 total_time:33520249
	mziemann        (   10442) count:240    ave_time:16777  total_time:4026686
	flii0002        (   11596) count:230    ave_time:58724  total_time:13506725
	sgwe0001        (   10173) count:224    ave_time:63965  total_time:14328207
	xuanlih         (   10644) count:182    ave_time:84236  total_time:15331060
	lindacac        (   11464) count:170    ave_time:112589 total_time:19140257
	taoh            (   11834) count:122    ave_time:40877  total_time:4987089
	scohen1         (   11404) count:121    ave_time:140231 total_time:16967970
	jrigby          (   10065) count:114    ave_time:93654  total_time:10676590
	shaunc          (   10436) count:113    ave_time:122155 total_time:13803593
	lwan0025        (   10803) count:102    ave_time:62875  total_time:6413274
	hhew0002        (   11689) count:76     ave_time:145397 total_time:11050182
	jueqingl        (   12339) count:70     ave_time:16080  total_time:1125601
	gholamrh        (   10097) count:66     ave_time:62349  total_time:4115065
	cristobg        (   12021) count:63     ave_time:49781  total_time:3136230
	amaligg         (   10873) count:55     ave_time:107446 total_time:5909576
	wcao0006        (   10768) count:54     ave_time:132485 total_time:7154223
	xvuthith        (   10699) count:49     ave_time:54393  total_time:2665286
	aher0013        (   11326) count:45     ave_time:14942  total_time:672420
	rkai0001        (   10719) count:45     ave_time:25170  total_time:1132662
	pt5684          (   11786) count:45     ave_time:14605  total_time:657243
	kwyr0001        (   11370) count:44     ave_time:310054 total_time:13642378
	damienl         (   10005) count:44     ave_time:93348  total_time:4107332
	ec2-user        (    1000) count:41     ave_time:179807 total_time:7372101
	trungn          (   10419) count:40     ave_time:114830 total_time:4593212
	eyk             (   10865) count:30     ave_time:35397  total_time:1061918
	benfulcher      (   10424) count:30     ave_time:215895 total_time:6476872
	pzaremoo        (   10041) count:29     ave_time:83384  total_time:2418159
	keving          (   10454) count:28     ave_time:48410  total_time:1355481
	tmason1         (   11592) count:21     ave_time:104538 total_time:2195314
	andrewc         (   10452) count:20     ave_time:113029 total_time:2260589
	dmpar7          (   11685) count:18     ave_time:53242  total_time:958358
	billyjow        (   10312) count:18     ave_time:166579 total_time:2998431
	jafarl          (   10347) count:18     ave_time:94010  total_time:1692197
	twang           (   11986) count:17     ave_time:157256 total_time:2673357
	rgomi           (   11448) count:17     ave_time:156806 total_time:2665714
	zhuy            (   11494) count:15     ave_time:114    total_time:1723
	vchr0002        (   11487) count:15     ave_time:118    total_time:1782
	gahun3          (   12820) count:15     ave_time:124    total_time:1873
	tandrill        (   11162) count:15     ave_time:140    total_time:2113
	debopris        (   11349) count:15     ave_time:56240  total_time:843607
	narjesa         (   10666) count:15     ave_time:124    total_time:1871
	dnguyen         (   11186) count:15     ave_time:10844  total_time:162674
	jste0021        (   10774) count:12     ave_time:70416  total_time:844993
	andrewpe        (   10152) count:11     ave_time:212765 total_time:2340424
	philipc         (   10003) count:11     ave_time:184018 total_time:2024204
	kwong           (   12517) count:8      ave_time:1158   total_time:9264
	hhartono        (   11775) count:6      ave_time:239071 total_time:1434428
	dbw             (   10800) count:6      ave_time:122027 total_time:732163
	kcorbett        (   12751) count:6      ave_time:141    total_time:850
	mbelouso        (   10091) count:6      ave_time:588688 total_time:3532130
	fruzicka        (   12906) count:4      ave_time:408643 total_time:1634572
	leei            (   10874) count:4      ave_time:3808   total_time:15235
	szhang          (   12694) count:4      ave_time:567    total_time:2268
	xjia0032        (   11502) count:4      ave_time:111882 total_time:447528
	ctan            (   10189) count:4      ave_time:550    total_time:2200
	zwan394         (   11209) count:3      ave_time:126    total_time:378
	rbeare          (   10299) count:3      ave_time:75592  total_time:226777
	charlesbj       (   10297) count:3      ave_time:214    total_time:643
	miguelg         (   10815) count:3      ave_time:232890 total_time:698671
	khodgins        (   11538) count:2      ave_time:486800 total_time:973600
	fairuzhh        (   10228) count:2      ave_time:827    total_time:1655
	ywan0280        (   11526) count:2      ave_time:680    total_time:1360

Pending RPC statistics
	No pending RPCs




Is this good or bad?
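One way to answer this from the output itself: the number that matters for this ticket is "DBD Agent queue size", which was over 55000 when the ticket was opened and is 0 here, meaning the queue has drained. A minimal sketch for checking it from a saved sdiag capture (the /tmp file is hypothetical, holding lines like the ones above):

```shell
# Sketch: extract the DBD agent queue size from a saved sdiag capture.
cat > /tmp/sdiag.out <<'EOF'
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 0
EOF

# Prints the queue size; 0 means the agent queue has drained
awk -F': *' '/^DBD Agent queue size/ {print $2}' /tmp/sdiag.out
```

Watching this value over time (e.g. from cron) would catch the queue filling again long before the "RESTART SLURMDBD NOW" errors appear.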
Comment 25 Damien 2019-11-21 02:00:30 MST
 
[ec2-user@m3-mgmt1 ~]$ 
[ec2-user@m3-mgmt1 ~]$ sdiag
*******************************************************
sdiag output at Thu Nov 21 19:55:12 2019 (1574326512)
Data since      Thu Nov 21 11:34:22 2019 (1574296462)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 0

Jobs submitted: 1579
Jobs started:   1249
Jobs completed: 1301
Jobs canceled:  654
Jobs failed:    107

Job states ts:  Thu Nov 21 19:54:49 2019 (1574326489)
Jobs pending:   1787
Jobs running:   367

Main schedule statistics (microseconds):
	Last cycle:   6105
	Max cycle:    689757
	Total cycles: 1972
	Mean cycle:   8989
	Mean depth cycle:  301
	Cycles per minute: 3
	Last queue length: 1079

Backfilling stats
	Total backfilled jobs (since last slurm start): 853
	Total backfilled jobs (since last stats cycle start): 853
	Total backfilled heterogeneous job components: 0
	Total cycles: 188
	Last cycle when: Thu Nov 21 19:53:33 2019 (1574326413)
	Last cycle: 48423258
	Max cycle:  139778412
	Mean cycle: 87833870
	Last depth cycle: 1118
	Last depth cycle (try sched): 330
	Depth Mean: 975
	Depth Mean (try depth): 270
	Last queue length: 1080
	Queue length mean: 1014

Latency for 1000 calls to gettimeofday(): 27 microseconds

Remote Procedure Call statistics by message type
	REQUEST_FED_INFO                        ( 2049) count:263391 ave_time:111    total_time:29414907
	REQUEST_PARTITION_INFO                  ( 2009) count:223889 ave_time:211    total_time:47380872
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:199240 ave_time:468220 total_time:93288178721
	REQUEST_JOB_USER_INFO                   ( 2039) count:64146  ave_time:176847 total_time:11344088999
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:18008  ave_time:367854 total_time:6624329513
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:2872   ave_time:44080  total_time:126597873
	REQUEST_UPDATE_NODE                     ( 3002) count:2387   ave_time:13073  total_time:31206991
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:1697   ave_time:168467 total_time:285889254
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:1580   ave_time:297870 total_time:470634828
	REQUEST_STEP_COMPLETE                   ( 5016) count:1567   ave_time:65200  total_time:102169432
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:1550   ave_time:291558 total_time:451915394
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:1281   ave_time:424041 total_time:543197008
	REQUEST_NODE_INFO                       ( 2007) count:977    ave_time:17193  total_time:16797772
	REQUEST_KILL_JOB                        ( 5032) count:670    ave_time:43586  total_time:29202883
	REQUEST_CONTROL_STATUS                  ( 2053) count:296    ave_time:109    total_time:32352
	REQUEST_JOB_INFO                        ( 2003) count:144    ave_time:275460 total_time:39666367
	REQUEST_SHARE_INFO                      ( 2022) count:142    ave_time:4335   total_time:615659
	REQUEST_RESERVATION_INFO                ( 2024) count:75     ave_time:34618  total_time:2596380
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:14     ave_time:6621   total_time:92706
	REQUEST_JOB_PACK_ALLOC_INFO             ( 4027) count:13     ave_time:294840 total_time:3832931
	REQUEST_CANCEL_JOB_STEP                 ( 5005) count:9      ave_time:79528  total_time:715755
	REQUEST_JOB_READY                       ( 4019) count:7      ave_time:333120 total_time:2331845
	REQUEST_PING                            ( 1008) count:6      ave_time:121    total_time:727
	REQUEST_UPDATE_JOB                      ( 3001) count:3      ave_time:249064 total_time:747192
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:2      ave_time:187    total_time:374
	REQUEST_TRIGGER_PULL                    ( 2030) count:1      ave_time:125    total_time:125
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:1      ave_time:4500   total_time:4500
	REQUEST_STATS_INFO                      ( 2035) count:1      ave_time:171    total_time:171

Remote Procedure Call statistics by user
	spiper          (   12815) count:159003 ave_time:45318  total_time:7205759520
	swil0005        (   10264) count:130667 ave_time:231151 total_time:30203855369
	root            (       0) count:47307  ave_time:176784 total_time:8363141346
	a1058400        (   11860) count:20465  ave_time:160384 total_time:3282269462
	vpam0001        (   11208) count:16895  ave_time:158633 total_time:2680106009
	rukshana        (   10942) count:12049  ave_time:152603 total_time:1838720823
	hfettke         (   11128) count:11859  ave_time:149461 total_time:1772460975
	bgully          (   10652) count:10891  ave_time:150891 total_time:1643361843
	tday0002        (   12019) count:10593  ave_time:150707 total_time:1596443726
	awri0001        (   11597) count:10419  ave_time:147792 total_time:1539848281
	nsingh          (   12834) count:9908   ave_time:153935 total_time:1525191685
	asegal          (   11199) count:9512   ave_time:154723 total_time:1471728180
	aafo0001        (   11936) count:9167   ave_time:153006 total_time:1402611180
	robe0002        (   12508) count:9067   ave_time:154064 total_time:1396899926
	mahmoudm        (   11193) count:8942   ave_time:155053 total_time:1386484435
	hbhukya1        (   10583) count:8639   ave_time:153849 total_time:1329105381
	ksabaroe        (   10365) count:8616   ave_time:156409 total_time:1347626254
	msam0006        (   11491) count:8614   ave_time:154683 total_time:1332442486
	dkor0003        (   10806) count:8611   ave_time:155670 total_time:1340481775
	ttha0011        (   11105) count:8609   ave_time:156426 total_time:1346672171
	vieth           (   11159) count:8608   ave_time:154761 total_time:1332182944
	ska565          (   10255) count:8607   ave_time:156672 total_time:1348481981
	elia0001        (   10680) count:8605   ave_time:155414 total_time:1337343602
	emct0001        (   11214) count:8604   ave_time:154520 total_time:1329495418
	chaos           (   10272) count:8601   ave_time:155548 total_time:1337870388
	hven0001        (   10350) count:8601   ave_time:155550 total_time:1337887414
	aazhar          (   11233) count:8601   ave_time:155725 total_time:1339396221
	scle0003        (   11552) count:8598   ave_time:156667 total_time:1347030704
	aper0034        (   11507) count:8598   ave_time:156530 total_time:1345852305
	yche0241        (   11880) count:8598   ave_time:156699 total_time:1347303495
	csen0005        (   10998) count:8598   ave_time:156565 total_time:1346150112
	joshuamhardy    (   10156) count:8598   ave_time:156545 total_time:1345974817
	strikm          (   12849) count:8575   ave_time:143653 total_time:1231830826
	gmaj0001        (   10708) count:8360   ave_time:156046 total_time:1304550653
	sarahsyh        (   11365) count:8088   ave_time:158678 total_time:1283390121
	xiaoliuz        (   10462) count:7854   ave_time:156058 total_time:1225686000
	huiw            (   11363) count:7632   ave_time:160908 total_time:1228051874
	stuarto         (   10273) count:7110   ave_time:155931 total_time:1108670010
	wparker         (   11551) count:6003   ave_time:168641 total_time:1012357688
	tallison        (   10896) count:5663   ave_time:156993 total_time:889051643
	ychen           (   11648) count:5388   ave_time:159979 total_time:861969654
	scottk          (   11656) count:5211   ave_time:147853 total_time:770463124
	mfarrell        (   10679) count:5170   ave_time:171170 total_time:884949319
	bsun            (   11451) count:4956   ave_time:147722 total_time:732110599
	jawon5          (   11240) count:4781   ave_time:142611 total_time:681826696
	jcoxon          (   10835) count:4687   ave_time:166827 total_time:781921815
	ademarco        (   10256) count:4542   ave_time:170087 total_time:772538509
	lancew          (   10213) count:4532   ave_time:150366 total_time:681462011
	rshi0007        (   10423) count:4261   ave_time:145292 total_time:619091689
	clam0015        (   11607) count:4234   ave_time:172173 total_time:728983097
	mbernhar        (   10398) count:4232   ave_time:143730 total_time:608267818
	bktan12         (   11006) count:4170   ave_time:164563 total_time:686230129
	anirka          (   12061) count:3662   ave_time:129339 total_time:473642748
	wpit0001        (   11017) count:2842   ave_time:119629 total_time:339986124
	jinmingz        (   12653) count:2820   ave_time:126576 total_time:356945877
	hwan0088        (   11628) count:2787   ave_time:155925 total_time:434565615
	uqpmart6        (   11694) count:2694   ave_time:122625 total_time:330354260
	tnewing         (   10925) count:2593   ave_time:125200 total_time:324644160
	melvinl         (   12907) count:2502   ave_time:129006 total_time:322773722
	esomayeh        (   10379) count:2207   ave_time:100827 total_time:222527283
	tsiejka         (   12091) count:2202   ave_time:118838 total_time:261681573
	arevans         (   11437) count:1938   ave_time:115704 total_time:224234400
	phoebeimms      (   10690) count:1919   ave_time:159973 total_time:306989524
	mom008          (   11465) count:1779   ave_time:40346  total_time:71777277
	ctav0001        (   12340) count:1779   ave_time:128953 total_time:229407799
	sgujjari        (   12661) count:1671   ave_time:108665 total_time:181579846
	khsmsampath     (   11028) count:1520   ave_time:174775 total_time:265659500
	mju2            (   11470) count:1240   ave_time:72158  total_time:89477051
	titust          (   12865) count:1016   ave_time:116938 total_time:118809711
	suzanm          (   11205) count:853    ave_time:137001 total_time:116862685
	mkam0016        (   12925) count:756    ave_time:77172  total_time:58342773
	yaghmaien       (   12664) count:719    ave_time:152840 total_time:109892012
	jpollock        (   12914) count:701    ave_time:31738  total_time:22248743
	ctiw0001        (   12015) count:513    ave_time:141064 total_time:72365989
	wtsch1          (   11038) count:415    ave_time:31441  total_time:13048043
	land            (   10877) count:368    ave_time:41022  total_time:15096298
	mbos0003        (   11595) count:343    ave_time:57252  total_time:19637660
	slurm           (     497) count:297    ave_time:109    total_time:32477
	aklaus          (   10976) count:285    ave_time:117614 total_time:33520249
	mziemann        (   10442) count:240    ave_time:16777  total_time:4026686
	flii0002        (   11596) count:230    ave_time:58724  total_time:13506725
	sgwe0001        (   10173) count:224    ave_time:63965  total_time:14328207
	ywan0280        (   11526) count:214    ave_time:8211   total_time:1757341
	xuanlih         (   10644) count:192    ave_time:84847  total_time:16290776
	lindacac        (   11464) count:170    ave_time:112589 total_time:19140257
	taoh            (   11834) count:122    ave_time:40877  total_time:4987089
	scohen1         (   11404) count:121    ave_time:140231 total_time:16967970
	jrigby          (   10065) count:114    ave_time:93654  total_time:10676590
	shaunc          (   10436) count:113    ave_time:122155 total_time:13803593
	lwan0025        (   10803) count:102    ave_time:62875  total_time:6413274
	hhew0002        (   11689) count:76     ave_time:145397 total_time:11050182
	jueqingl        (   12339) count:70     ave_time:16080  total_time:1125601
	gholamrh        (   10097) count:66     ave_time:62349  total_time:4115065
	cristobg        (   12021) count:63     ave_time:49781  total_time:3136230
	rkai0001        (   10719) count:60     ave_time:18946  total_time:1136795
	amaligg         (   10873) count:55     ave_time:107446 total_time:5909576
	wcao0006        (   10768) count:54     ave_time:132485 total_time:7154223
	xvuthith        (   10699) count:49     ave_time:54393  total_time:2665286
	damienl         (   10005) count:45     ave_time:91277  total_time:4107503
	aher0013        (   11326) count:45     ave_time:14942  total_time:672420
	pt5684          (   11786) count:45     ave_time:14605  total_time:657243
	kwyr0001        (   11370) count:44     ave_time:310054 total_time:13642378
	ec2-user        (    1000) count:41     ave_time:179807 total_time:7372101
	trungn          (   10419) count:40     ave_time:114830 total_time:4593212
	benfulcher      (   10424) count:30     ave_time:215895 total_time:6476872
	eyk             (   10865) count:30     ave_time:35397  total_time:1061918
	pzaremoo        (   10041) count:29     ave_time:83384  total_time:2418159
	keving          (   10454) count:28     ave_time:48410  total_time:1355481
	tmason1         (   11592) count:21     ave_time:104538 total_time:2195314
	andrewc         (   10452) count:20     ave_time:113029 total_time:2260589
	billyjow        (   10312) count:18     ave_time:166579 total_time:2998431
	jafarl          (   10347) count:18     ave_time:94010  total_time:1692197
	dmpar7          (   11685) count:18     ave_time:53242  total_time:958358
	rgomi           (   11448) count:17     ave_time:156806 total_time:2665714
	twang           (   11986) count:17     ave_time:157256 total_time:2673357
	vchr0002        (   11487) count:15     ave_time:118    total_time:1782
	gahun3          (   12820) count:15     ave_time:124    total_time:1873
	tandrill        (   11162) count:15     ave_time:140    total_time:2113
	debopris        (   11349) count:15     ave_time:56240  total_time:843607
	narjesa         (   10666) count:15     ave_time:124    total_time:1871
	dnguyen         (   11186) count:15     ave_time:10844  total_time:162674
	zhuy            (   11494) count:15     ave_time:114    total_time:1723
	jste0021        (   10774) count:12     ave_time:70416  total_time:844993
	philipc         (   10003) count:11     ave_time:184018 total_time:2024204
	andrewpe        (   10152) count:11     ave_time:212765 total_time:2340424
	kwong           (   12517) count:8      ave_time:1158   total_time:9264
	dbw             (   10800) count:6      ave_time:122027 total_time:732163
	kcorbett        (   12751) count:6      ave_time:141    total_time:850
	mbelouso        (   10091) count:6      ave_time:588688 total_time:3532130
	hhartono        (   11775) count:6      ave_time:239071 total_time:1434428
	leei            (   10874) count:4      ave_time:3808   total_time:15235
	szhang          (   12694) count:4      ave_time:567    total_time:2268
	xjia0032        (   11502) count:4      ave_time:111882 total_time:447528
	ctan            (   10189) count:4      ave_time:550    total_time:2200
	fruzicka        (   12906) count:4      ave_time:408643 total_time:1634572
	rbeare          (   10299) count:3      ave_time:75592  total_time:226777
	charlesbj       (   10297) count:3      ave_time:214    total_time:643
	miguelg         (   10815) count:3      ave_time:232890 total_time:698671
	zwan394         (   11209) count:3      ave_time:126    total_time:378
	fairuzhh        (   10228) count:2      ave_time:827    total_time:1655
	khodgins        (   11538) count:2      ave_time:486800 total_time:973600
	zdyson          (   11463) count:1      ave_time:24290  total_time:24290

Pending RPC statistics
	No pending RPCs
Comment 26 Damien 2019-11-21 02:02:19 MST
Hi Dominik


If this is the case, how can we clear these runaway jobs now?


Cheers

Damien
Comment 27 Dominik Bartkiewicz 2019-11-21 02:15:05 MST
Hi

Could you send me slurmdbd.log?

Dominik
Comment 28 Damien 2019-11-21 02:35:49 MST
Created attachment 12363 [details]
Current slurmdbd logs

Kindly review this
Comment 29 Dominik Bartkiewicz 2019-11-21 03:06:19 MST
Hi

Which version of sacctmgr do you use?

Dominik
Comment 30 Damien 2019-11-21 03:08:40 MST
[ec2-user@m3-mgmt1 ~]$ 
[ec2-user@m3-mgmt1 ~]$ which sacctmgr 
/opt/slurm-19.05.4/bin/sacctmgr
[ec2-user@m3-mgmt1 ~]$ sacctmgr -V
slurm 19.05.4
[ec2-user@m3-mgmt1 ~]$
Comment 31 Damien 2019-11-21 03:12:14 MST
[root@m3-mgmt1 ~]# 
[root@m3-mgmt1 ~]# sacctmgr show runaway   |more
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database
          ID       Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ------------------- 
11231563     TEMPOcatm+       comp         m3    RUNNING 2019-09-09T10:42:00 2019-11-11T13:11:19             Unknown 
11231565     TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown 
11231566     TEMPOcatm+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown 
11231573     TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown 
11231576     TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:31             Unknown             Unknown 
11730893             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-10T10:58:55             Unknown 
11764396             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T09:40:21             Unknown 
11764903             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T11:53:57             Unknown 
11800431             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T07:28:27             Unknown
.....
.....
.....

11870196     L037.slurm        m3g         m3    PENDING 2019-11-21T10:39:56             Unknown             Unknown 
11870220     Jupyter L+       comp         m3    RUNNING 2019-11-21T10:52:57 2019-11-21T10:52:57             Unknown 
11870230        desktop    desktop         m3    RUNNING 2019-11-21T10:56:01 2019-11-21T10:56:02             Unknown 
11870239      interact+       comp         m3    PENDING 2019-11-21T11:04:22             Unknown             Unknown 
11870251        desktop    desktop         m3    PENDING 2019-11-21T11:07:57             Unknown             Unknown 
11870262        desktop    desktop         m3    PENDING 2019-11-21T11:11:26             Unknown             Unknown 
sacctmgr: error: slurmdbd: Sending message type MsgType=1488: 11: Resource temporarily unavailable
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
Comment 32 Dominik Bartkiewicz 2019-11-21 04:17:22 MST
Hi

My hypothesis from comment 23 is false; the reason for this error must be different.

Could you try one more time with debug output?
sacctmgr -vvvv list runawayjobs

Dominik
Comment 33 Damien 2019-11-21 04:35:20 MST
[root@m3-mgmt1 ~]# 
[root@m3-mgmt1 ~]# sacctmgr -vvvv list runawayjobs  |more
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/accounting_storage_slurmdbd.so
sacctmgr: Accounting storage SLURMDBD plugin loaded
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/auth_munge.so
sacctmgr: debug:  Munge authentication plugin loaded
sacctmgr: debug3: Success.
sacctmgr: debug:  slurmdbd: Sent PersistInit msg
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_cons_res.so
sacctmgr: Consumable Resources (CR) Node Selection plugin loaded with argument 20
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_cray_aries.so
sacctmgr: Cray/Aries node selection plugin loaded
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_cons_tres.so
sacctmgr: select/cons_tres loaded with argument 20
sacctmgr: debug3: Success.
sacctmgr: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/select_linear.so
sacctmgr: Linear node selection plugin loaded with argument 20
sacctmgr: debug3: Success.
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database
          ID       Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ------------------- 
11231563     TEMPOcatm+       comp         m3    RUNNING 2019-09-09T10:42:00 2019-11-11T13:11:19             Unknown 
11231565     TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown 
11231566     TEMPOcatm+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown 
11231573     TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:30             Unknown             Unknown 
11231576     TEMPOcatp+       comp         m3    PENDING 2019-09-09T10:42:31             Unknown             Unknown 
11730893             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-10T10:58:55             Unknown 
11764396             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T09:40:21             Unknown 
11764903             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-12T11:53:57             Unknown 
11800431             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T07:28:27             Unknown 
11800732             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T07:36:07             Unknown 
11802587             mm       comp         m3    RUNNING 2019-09-10T10:44:48 2019-11-16T13:32:26             Unknown 
.....
.....
.....

11870196     L037.slurm        m3g         m3    PENDING 2019-11-21T10:39:56             Unknown             Unknown 
11870220     Jupyter L+       comp         m3    RUNNING 2019-11-21T10:52:57 2019-11-21T10:52:57             Unknown 
11870230        desktop    desktop         m3    RUNNING 2019-11-21T10:56:01 2019-11-21T10:56:02             Unknown 
11870239      interact+       comp         m3    PENDING 2019-11-21T11:04:22             Unknown             Unknown 
11870251        desktop    desktop         m3    PENDING 2019-11-21T11:07:57             Unknown             Unknown 
11870262        desktop    desktop         m3    PENDING 2019-11-21T11:11:26             Unknown             Unknown 
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: debug2: persistent connection is closed
sacctmgr: error: slurmdbd: Sending message type MsgType=1488: 11: Resource temporarily unavailable
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

sacctmgr: debug2: persistent connection is closed
sacctmgr: debug:  slurmdbd: Sent fini msg
[root@m3-mgmt1 ~]#
Comment 35 Dominik Bartkiewicz 2019-11-21 05:41:36 MST
Hi

Can you try this patch? Recompile and restart slurmdbd,
then run "sacctmgr -vvvv list runawayjobs" one more time.
It raises the message size limit (it looks like we did not expect this many runaway jobs).


diff --git a/src/common/slurm_persist_conn.c b/src/common/slurm_persist_conn.c
index c134c88a03..8f34a0ecfe 100644
--- a/src/common/slurm_persist_conn.c
+++ b/src/common/slurm_persist_conn.c
@@ -60,7 +60,7 @@
  *  Maximum message size. Messages larger than this value (in bytes)
  *  will not be received.
  */
-#define MAX_MSG_SIZE     (16*1024*1024)
+#define MAX_MSG_SIZE     (50*1024*1024)
 
 typedef struct {
        void *arg;

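As a rough illustration of why the default limit can be hit: every runaway job record is serialized into a single persistent-connection reply, so a large enough list overflows 16 MiB. The per-record byte size below is an assumption for illustration, not a value taken from the Slurm source; only the two limit constants come from the diff above.

```python
# Back-of-the-envelope check against the persist-conn message limit.
# BYTES_PER_RECORD is an assumed average serialized size per job record,
# NOT a Slurm constant; the two limits mirror the #define in the patch.
OLD_MAX_MSG_SIZE = 16 * 1024 * 1024   # original value
NEW_MAX_MSG_SIZE = 50 * 1024 * 1024   # value from the patch
BYTES_PER_RECORD = 300                # assumed average per record

def reply_fits(n_jobs: int, limit: int) -> bool:
    """True if n_jobs serialized records would fit in one message."""
    return n_jobs * BYTES_PER_RECORD <= limit

# A modest runaway list fits; a very large one overflows the old limit
# but fits under the raised one.
print(reply_fits(10_000, OLD_MAX_MSG_SIZE))   # True
print(reply_fits(100_000, OLD_MAX_MSG_SIZE))  # False
print(reply_fits(100_000, NEW_MAX_MSG_SIZE))  # True
```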

Dominik
Comment 36 Damien 2019-11-21 05:56:35 MST
Hi Dominik

Thanks for this.

I will use the backup slurmdbd for this purpose; the patch seems to be missing from this thread.



Cheers

Damien
Comment 37 Dominik Bartkiewicz 2019-11-21 06:04:51 MST
Created attachment 12367 [details]
workaround
Comment 38 Damien 2019-11-21 19:21:36 MST
11870230        desktop    desktop         m3    RUNNING 2019-11-21T10:56:01 2019-11-21T10:56:02             Unknown 
11870239      interact+       comp         m3    PENDING 2019-11-21T11:04:22             Unknown             Unknown 
11870251        desktop    desktop         m3    PENDING 2019-11-21T11:07:57             Unknown             Unknown 
11870262        desktop    desktop         m3    PENDING 2019-11-21T11:11:26             Unknown             Unknown 

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the earliest submit time of all the runaway jobs.)

 (You have 30 seconds to decide)
(N/y): y
[root@m3-mgmt2 slurm-19.05.4]#
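The rule quoted in sacctmgr's prompt above (the end time becomes the latest of the submit, eligible, and start times, and the state is forced to completed) can be sketched as follows. This is an illustrative sketch only; the function and field names are made up, not Slurm's internals.

```python
from datetime import datetime

def fix_runaway(submit, eligible=None, start=None):
    """Sketch of sacctmgr's runaway-job fix: pick the latest known
    timestamp as the end time and mark the job COMPLETED.
    Names are illustrative, not Slurm's."""
    known = [t for t in (submit, eligible, start) if t is not None]
    end_time = max(known)
    return end_time, "COMPLETED"

# A runaway RUNNING job (like 11870230 above): end time becomes its start time.
end, state = fix_runaway(
    submit=datetime(2019, 11, 21, 10, 56, 1),
    start=datetime(2019, 11, 21, 10, 56, 2),
)
print(end, state)  # 2019-11-21 10:56:02 COMPLETED
```

For a runaway PENDING job, only the submit time is known, so it becomes the end time.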
Comment 39 Damien 2019-11-21 20:43:39 MST
Hi Dominik

Thanks, the patch works.


Cheers

Damien
Comment 40 Dominik Bartkiewicz 2019-11-22 01:16:37 MST
Hi

Great, good to hear.

Dominik
Comment 41 Dominik Bartkiewicz 2019-11-26 01:40:06 MST
Hi

The last issue with sacctmgr was already fixed by the following commits:

https://github.com/SchedMD/slurm/commit/41b5f95c936fc4931639697f47209ebe0dc63676
https://github.com/SchedMD/slurm/commit/5a1c0acc581ac9ae6886607c41fe32d8c353c631

Those patches will be included in 20.02.

Let me know if we can close this ticket.

Dominik
Comment 42 Damien 2019-11-26 20:48:16 MST
Thanks 

Our slurmdbd is alright now.


Kindly close this ticket


Damien
Comment 43 Dominik Bartkiewicz 2019-11-27 02:48:20 MST
I'm closing this ticket as infogiven.

Dominik