Bug 5147 - Agent Queue Size bursts and no cleanup
Summary: Agent Queue Size bursts and no cleanup
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.5
Hardware: Linux
Importance: --- 2 - High Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-05-08 17:06 MDT by Martin Siegert
Modified: 2018-05-23 05:13 MDT
CC List: 2 users

See Also:
Site: Simon Fraser University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
backtrace of slurmstepd (4.44 KB, text/plain) - 2018-05-08 17:06 MDT, Martin Siegert
slurmctld.log (5.78 MB, application/xz) - 2018-05-09 19:37 MDT, Martin Siegert
sdiag output (19.78 KB, text/plain) - 2018-05-09 19:38 MDT, Martin Siegert
slurmd.log from cdr761 (2.90 MB, text/plain) - 2018-05-09 19:38 MDT, Martin Siegert
patch (2.20 KB, patch) - 2018-05-16 10:31 MDT, Dominik Bartkiewicz

Description Martin Siegert 2018-05-08 17:06:08 MDT
Created attachment 6799 [details]
backtrace of slurmstepd

This is in reference to bug 5111.
We have upgraded to 17.11.5 and additionally cherry picked commits 1675ada0a, a7c8964e, 3be9e1ee0 and e5f03971b. We continue to see the same problem: the agent queue size increases continuously (we've seen it go as high as 200000) and jobs stay in completing state. I am attaching the output from node cdr1545 of `gdb -batch -ex "thread apply all bt full" -p 99504` where is the slurmstepd process:
S USER       PID  PPID  NI   RSS  VSZ STIME %CPU     TIME COMMAND
S root       9489      1   0  7716 1506312 May03 0.0 00:00:01 /opt/software/slurm/sbin/slurmd
S root      99504      1   0  4536 306160 07:28  0.0 00:00:00 slurmstepd: [7774580.extern]
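For reference, this is roughly how the backtrace was collected; the PID and step name come from the ps listing above, and the output file name is only an example:

  # locate the slurmstepd for the stuck extern step
  ps -eo pid,cmd | grep 'slurmstepd: \[7774580.extern\]'

  # dump full backtraces of all threads without attaching interactively
  gdb -batch -ex "thread apply all bt full" -p 99504 > slurmstepd-99504.bt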

- Martin
Comment 2 Martin Siegert 2018-05-08 18:48:14 MDT
The biggest problem right now is that jobs that get "started" by the scheduler hang in the prolog (`scontrol show job <jobid>` shows Reason=Prolog), but nothing ever gets sent to the nodes. After a while the JobState changes to COMPLETING and the job disappears from the system without any record for the user. Even after a slurmctld restart the agent queue size rarely drops below 2000, and it then quickly rises again.
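For reference, the relevant checks look roughly like this (exact field names may differ slightly between Slurm versions):

  # agent queue size as reported by the controller
  sdiag | grep -i 'agent queue'

  # state and reason of a job that appears stuck in the prolog
  scontrol show job <jobid> | grep -E 'JobState|Reason'
  squeue -j <jobid> -o '%i %T %r'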
Comment 3 Dominik Bartkiewicz 2018-05-09 01:47:26 MDT
Hi

The good news is that the slurmd bt looks fine now; this should reduce at least some RPCs.
Could you send me the current slurmctld.log, slurmd.log and sdiag output?

Dominik
Comment 4 Martin Siegert 2018-05-09 19:35:31 MDT
Yesterday evening we paused the starting of new jobs by setting all partitions to State=Down. It took a while (more than 30 min.), during which Slurm mostly completed jobs, but then the agent queue size dropped to 0. We brought the partitions back up and Slurm has been stable since. I agree that the problems we were seeing yesterday may not be related to the thread deadlocks. It looks more like, once the agent queue size grows above a certain limit, it continues to increase and never recovers.
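For reference, the pause was done roughly along these lines (the partition name is a placeholder):

  # stop new jobs from being scheduled; running and completing jobs are not affected
  scontrol update PartitionName=<partition> State=DOWN

  # once the agent queue has drained, re-enable scheduling
  scontrol update PartitionName=<partition> State=UP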

We just had a "mini blip": the agent queue size rose to about 1500. I am attaching slurmctld.log and sdiag output during that blip. Slurm did recover from this blip, though, and the agent queue size dropped back to 0. I am also attaching slurmd.log from one particular node (cdr761) which may have contributed to the blip because a user was running sbatch on it.
Comment 5 Martin Siegert 2018-05-09 19:37:27 MDT
Created attachment 6815 [details]
slurmctld.log
Comment 6 Martin Siegert 2018-05-09 19:38:10 MDT
Created attachment 6816 [details]
sdiag output
Comment 7 Martin Siegert 2018-05-09 19:38:51 MDT
Created attachment 6817 [details]
slurmd.log from cdr761
Comment 8 Dominik Bartkiewicz 2018-05-16 10:31:48 MDT
Created attachment 6877 [details]
patch

Hi

This patch fixes a minor race/deadlock in slurmstepd that was introduced in 17.11.6.
Could you apply it and check whether it helps?
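Something along these lines should work against your 17.11 source tree (the patch file name here is only an example, and you should reuse your usual configure options):

  cd slurm-17.11.5
  patch -p1 < bug5147.patch      # or "git apply bug5147.patch" in a git checkout
  ./configure --prefix=/opt/software/slurm && make -j && make install
  # only newly started slurmstepd processes pick up the fix; already-running steps keep the old code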

Dominik
Comment 9 Adam 2018-05-16 12:14:50 MDT
Thanks Dominik.

I've applied the patch on our cluster; by the looks of it, it will only really apply to new slurmstepd processes, so it will take a little while to have an effect. I'll have to have Martin update you on this, as I'm away for the next 2.5 weeks and we're doing an outage for the last week of May, so it won't be under any load during that time.
Comment 10 Martin Siegert 2018-05-22 18:49:59 MDT
We have additionally patched 17.11.5 with commit 6a74be8.
The system has been stable ever since.
We also plan to upgrade to 17.11.7 on May 30.
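For reference, that was applied roughly as follows:

  # on top of our 17.11.5 tree with the earlier cherry-picks
  git cherry-pick 6a74be8
  # then rebuild, reinstall, and restart the daemons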
Comment 11 Dominik Bartkiewicz 2018-05-23 05:13:04 MDT
Hi

Glad to hear that everything is back to normal.
If it's alright with you, I'm going to move this to resolved/fixed.
If there's anything else I can help with, please reopen or file a new ticket.

Dominik