Ticket 5443 - Accumulation of jobs in CG state
Summary: Accumulation of jobs in CG state
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.8
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-07-17 13:05 MDT by Kilian Cavalotti
Modified: 2019-11-29 06:38 MST
CC: 5 users

See Also:
Site: Stanford
Version Fixed: 18.08.4


Attachments
number of nodes in CG state (105.24 KB, image/png), 2018-07-17 13:05 MDT, Kilian Cavalotti
Agent queue size (199.37 KB, image/png), 2018-07-17 13:06 MDT, Kilian Cavalotti
Logs (20.35 MB, application/x-bzip), 2018-07-18 13:04 MDT, Kilian Cavalotti
Logs (20.35 MB, application/x-bzip), 2018-07-18 15:52 MDT, Kilian Cavalotti
sdiag output (2.85 KB, application/x-bzip), 2018-07-18 16:38 MDT, Kilian Cavalotti
sdiag output (v2) (3.85 KB, application/x-bzip), 2018-07-19 09:32 MDT, Kilian Cavalotti
slurmd debug logs (29.30 MB, application/x-bzip), 2018-07-19 13:21 MDT, Kilian Cavalotti
slurm.conf (50.54 KB, text/plain), 2018-07-24 10:02 MDT, Kilian Cavalotti
sdiag output (v3) (5.19 KB, application/x-bzip), 2018-07-24 10:10 MDT, Kilian Cavalotti
Node states across the cluster (115.47 KB, image/png), 2018-07-26 09:27 MDT, Kilian Cavalotti
extra debug (1.24 KB, patch), 2018-07-26 12:35 MDT, Dominik Bartkiewicz
slurmctld.log with extra debug (1.41 MB, application/x-bzip), 2018-07-29 16:48 MDT, Kilian Cavalotti
extra debug (953 bytes, patch), 2018-07-30 11:27 MDT, Dominik Bartkiewicz
slurmctld logs (3.88 MB, application/x-bzip), 2018-07-30 15:23 MDT, Kilian Cavalotti
sh-101-58 slurmd logs (2.81 KB, application/x-bzip), 2018-07-31 09:50 MDT, Kilian Cavalotti
slurmctld prolog (1.30 KB, application/x-shellscript), 2018-08-01 09:40 MDT, Kilian Cavalotti
prolog/epilog scripts (2.29 KB, application/x-bzip), 2018-08-01 11:02 MDT, Kilian Cavalotti
new slurmctld logs (2.12 MB, application/x-bzip), 2018-08-06 14:40 MDT, Kilian Cavalotti
slurmd logs for sh-16-01,sh-15-02,sh-15-04 (2.20 KB, application/x-bzip), 2018-08-08 09:59 MDT, Kilian Cavalotti
job script for job 23171442 (2.04 KB, application/x-bzip), 2018-08-08 09:59 MDT, Kilian Cavalotti
extra debug (2.12 KB, patch), 2018-08-14 08:51 MDT, Dominik Bartkiewicz
DebugFlag Route logs (100.23 MB, application/x-bzip), 2018-08-14 10:14 MDT, Kilian Cavalotti
slurmctld logs for the last Aug 15 occurrence (27.98 MB, application/gzip), 2018-08-20 09:09 MDT, Kilian Cavalotti
tracejobs+gdb logs (7.82 MB, application/x-bzip), 2018-08-22 15:30 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2018-07-17 13:05:56 MDT
Created attachment 7324 [details]
number of nodes in CG state

Hi!

We'd like to report another case of recurrent accumulation of completing jobs. It looks very similar to what's already been reported in those bugs:

* https://bugs.schedmd.com/show_bug.cgi?id=5401
* https://bugs.schedmd.com/show_bug.cgi?id=5121
* https://bugs.schedmd.com/show_bug.cgi?id=5111
* https://bugs.schedmd.com/show_bug.cgi?id=5177

I'm attaching a visual representation of the effects:
* CG_nodes shows the growing number of nodes stuck in CG state each time this occurs. Vertical blue lines are manual restarts of slurmctld.
* agent_queue shows the increasing size of the agent queue, with highs around 800k (left axis), and no sign of slowing down before we restart the controller.

This is extremely concerning and is starting to make things really difficult for us, as we need to periodically restart the controller, with a success rate of about 50%: it often doesn't clear up completing jobs and requires many restarts in a row.

We've ruled out filesystem issues, prolog/epilog issues, stuck processes on the nodes, and networking appears to be fine. When this happens, the jobs are usually completely done on the node, with no more sign of any related activity. Most of the compute nodes have just one slurmd process running, all the slurmstepd and job-related children having correctly exited.

So it doesn't look like some unkillable process on the compute nodes, or communication issue between nodes and controller, but rather like a contention or a deadlock within the controller itself.

We even enabled the "throttle" plugin, to limit bursts of job submissions, but it doesn't seem to have had much effect.

We would appreciate any insight or knob we could tweak to avoid those situations and limit the growing impact it has on our users.

Thanks!
-- 
Kilian
Comment 1 Kilian Cavalotti 2018-07-17 13:06:13 MDT
Created attachment 7325 [details]
Agent queue size
Comment 3 Dominik Bartkiewicz 2018-07-18 04:06:11 MDT
Hi
I will need more info. Could you provide us with:

actual slurm.conf
slurmctld.log
slurmd.log from affected node
output from: gdb -batch -p <slurmctld pid> -ex 't a a bt'
output from: gdb -batch -p <slurmd pid of CG node> -ex 't a a bt'

Dominik
Comment 4 Dominik Bartkiewicz 2018-07-18 04:42:28 MDT
Hi

This command can be useful to localize which node is blocking a job:
"scontrol completing"

Dominik
Comment 5 Kilian Cavalotti 2018-07-18 09:24:47 MDT
Hi Dominik, 

Thanks for getting back to us. I will definitely provide the information you've asked for the next time this happens.

Two remarks in the meantime:

1. this seems to be strongly correlated with user activity: we had long periods of fine behavior for weeks running 17.11.6, and then all of a sudden those CG episodes started a few days ago. They seem to occur more frequently during work hours, which is when most of the job submission/cancellation happens.

2. when this happens, it's not one node blocking jobs and getting stuck in Completing state, it's literally hundreds, which surely indicates that the problem is not with the individual compute nodes themselves: that would mean they all somehow fail at the same time, which is quite unlikely, and a restart of the controller also wouldn't resolve the situation in most cases. So I can send logs of a couple of CG compute nodes, but that will only be a partial picture.

That being said, I'm at the ready with gdb for the next time it happens.

Cheers,
-- 
Kilian
Comment 6 Kilian Cavalotti 2018-07-18 13:03:51 MDT
Here we go, it started again, so I'm attaching all of the requested info:

* slurm.conf
* full slurmctld.log
* output of "gdb -batch -p $(pidof slurmctld) -ex 't a a bt'"
* output of "scontrol completing"
* for nodes listed in "scontrol completing"
  - slurmd.log
  - output of "gdb -batch -p $(pidof slurmd) -ex 't a a bt'"

Right now, we have 300 nodes stuck in completing state (out of ~1,300) and a little over 1,000 CG jobs (out of ~15,000 in queue).

Please let me know if you need any more information.

Thanks!
-- 
Kilian
Comment 7 Kilian Cavalotti 2018-07-18 13:04:15 MDT
Created attachment 7345 [details]
Logs
Comment 8 Kilian Cavalotti 2018-07-18 15:52:41 MDT
Created attachment 7349 [details]
Logs

The previous archive was missing the gdb info from slurmctld
Comment 9 Dominik Bartkiewicz 2018-07-18 16:09:01 MDT
Hi

Thanks for all the data.
One more thing, could you attach sdiag output as well?

Dominik
Comment 10 Kilian Cavalotti 2018-07-18 16:37:23 MDT
(In reply to Dominik Bartkiewicz from comment #9)
> Thanks for all the data.
> One more thing, could you attach sdiag output as well?

Sure! It won't be completely consistent with the previous logs, since we've restarted the controller a couple of times already, but that didn't resolve the situation and we still have close to ~1,000 completing jobs, so it will probably still be representative.

Thanks,
-- 
Kilian
Comment 11 Kilian Cavalotti 2018-07-18 16:38:06 MDT
Created attachment 7350 [details]
sdiag output
Comment 12 Dominik Bartkiewicz 2018-07-19 06:46:27 MDT
Hi

I see in the slurmctld log that some nodes have a different slurm.conf than slurmctld.
Have you made any configuration updates? Maybe you have added new nodes?

Could you also increase the debug level to debug for slurmd?
I see that slurmctld made multiple attempts to send REQUEST_TERMINATE_JOB.
I want to check if slurmd got these messages.
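For example, one way to do this without restarting the daemons (a sketch, assuming the nodes share the same slurm.conf and a reconfigure is acceptable):

SlurmdDebug=debug        # in slurm.conf, raised from info
scontrol reconfigure     # makes slurmctld and the slurmds re-read slurm.conf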

Dominik
Comment 13 Kilian Cavalotti 2018-07-19 09:07:50 MDT
(In reply to Dominik Bartkiewicz from comment #12)
> I see in slurmctld log that some nodes have different slurmconf than
> slurmctld.
> Have you made any configuration updates? 

Yeah, I've been trying to adjust some of the SchedulerParameters options to see if it had any impact on the issue between slurmctld restarts. But I didn't distribute the slurm.conf file to all the compute nodes, as I don't believe slurmd cares about SchedulerParameters, does it?

> Maybe you have added new nodes?

We regularly add new nodes to the cluster, but for this, we make sure that all the nodes in the cluster have the same slurm.conf before restarting slurmctld and slurmds everywhere.

> Could you also increase debug level to debug for slurmd?
> I see that slurmctld made multiple attempts to send REQUEST_TERMINATE_JOB.
> I want check if slurmd got this messages.

Ok, I'll do this the next time it happens, thanks for the suggestion.


Another somewhat relevant point is that for two days in a row, the Slurm controller has been completely fine all night long, but started misbehaving mid-morning, when users started working again. This seems to indicate some correlation between the CG state pileup and user activity.

If it's of any help, I've made the evolution of the CG jobs/nodes over time, correlated with the sdiag information, available here:
https://snapshot.raintank.io/dashboard/snapshot/3GeFuGAKITvAqnTrLrBt8TZUa7g8s4wr

Cheers,
-- 
Kilian
Comment 14 Dominik Bartkiewicz 2018-07-19 09:24:22 MDT
Hi

I suspect this is something more tricky, but I just want to be sure. Could you generate sdiag output one more time?
The last one showed a quite high RPC rate, but it only covered ~15 min.

Dominik
Comment 15 Kilian Cavalotti 2018-07-19 09:32:54 MDT
Created attachment 7356 [details]
sdiag output (v2)
Comment 16 Kilian Cavalotti 2018-07-19 09:33:12 MDT
(In reply to Dominik Bartkiewicz from comment #14)
> I suspect this is something more tricky but I just want to be sure. Could
> you generate sdiag one more time?
> The last one showed quite high rpc rate but it was taken only for ~15 min.


Sure! Here it is.

Thanks,
-- 
Kilian
Comment 17 Kilian Cavalotti 2018-07-19 09:38:58 MDT
Forgot to ask: is there a way to see a breakdown of RPC types per user? I see total #RPCs per user and #RPCs per message type, but it would be helpful to see what type of RPC each user issues too.

Cheers,
-- 
Kilian
Comment 18 Kilian Cavalotti 2018-07-19 13:21:38 MDT
Created attachment 7357 [details]
slurmd debug logs

It started happening again, so here's a new log/gdb bundle, with slurmd's Debug level set to "debug".

Thanks,
--
Kilian
Comment 19 Kilian Cavalotti 2018-07-19 17:03:30 MDT
Just a quick note to mention that we just deployed 17.11.8, and that it doesn't seem to have any specific effect on this: CG jobs started accumulating shortly after the update.
Comment 20 Dominik Bartkiewicz 2018-07-20 10:36:41 MDT
Hi

Thank you for all the data, it has helped a lot.

Currently, there is no way to display such data.
In 18.08 we added some extra info about pending RPCs and the agent queue.


Dominik
Comment 21 Kilian Cavalotti 2018-07-20 11:01:41 MDT
(In reply to Dominik Bartkiewicz from comment #20)
> Thank you for all the data, it has helped a lot.

Great! Let me know if you need anything else.

> Currently, there is no way to display such data,
> In 18.08 we added some extra info about pending RPCs and agent queue.


Got it, thanks.

Cheers,
-- 
Kilian
Comment 22 Kilian Cavalotti 2018-07-23 15:26:01 MDT
Hi!

I'm raising the severity of this issue to Sev2, as the impact on our system is really high: maintaining the scheduler in a half-decent operating state requires constant attention and multiple slurmctld restarts all day long. It stabilizes for a couple hours, and then the CG jobs start piling up again and render all job submissions pretty much impossible.

This is really impacting our users' ability to work on the cluster, and has made our support load increase dramatically over the last couple of weeks.

We would really appreciate any insight you could provide in order to help quickly resolve this problem.

Thanks!
--
Kilian
Comment 23 Dominik Bartkiewicz 2018-07-24 05:01:43 MDT
Hi

I didn't find anything suspicious in the backtraces, and the logs also look fine (except for the resending of REQUEST_TERMINATE_JOB). Maybe we can tune SchedulerParameters and try to make slurmctld more responsive.
Could you send me your slurm.conf and the sdiag output one more time?
Dominik
Comment 24 Kilian Cavalotti 2018-07-24 10:02:59 MDT
Created attachment 7380 [details]
slurm.conf
Comment 25 Kilian Cavalotti 2018-07-24 10:09:39 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #23)
> I didn't find anything suspicious in the backtraces, logs also look fine
> (except resending REQUEST_TERMINATE_JOB). 

Thanks for the update. This is quite concerning, though, given the scale of the issue and the impact it has on our system and our users.

> Maybe we can tune
> SchedulerParameters and try to make slurmctld more responsive.
> Could you send me slurm.conf and one more time sdiag?

Yes, definitely! Please see the latest attached bundle, and I'll definitely take any recommendation about SchedulerParameters options.

I feel the need to mention again that those situations, where jobs in CG state accumulate, really seem to be the direct result of user activity: they usually don't happen at night or during weekends, but seem to correlate with periods of higher interactive activity, when most users are connected and active. They seem to be the consequence of large waves of job submission or cancellation.

Please let me know any information I could provide to help diagnose and resolve this issue, we still have plenty of occasions to see it happening and collect logs. :)

Cheers,
-- 
Kilian
Comment 26 Kilian Cavalotti 2018-07-24 10:10:34 MDT
Created attachment 7381 [details]
sdiag output (v3)
Comment 27 Jason Booth 2018-07-24 10:28:32 MDT
Hi Kilian,

 I hope this email finds you well. Jess mentioned to me that you were experiencing some struggles related to node states. Unfortunately, he was not sure of the specific issue/case number. I believe Jess was referring to this issue; however, I wanted to confirm that this is indeed the issue you were referring to when you spoke with Jess.

Best regards,
Jason
Comment 28 Kilian Cavalotti 2018-07-24 10:33:12 MDT
Hi Jason, 

(In reply to Jason Booth from comment #27)
>  I hope this email finds you well. Jess mentioned to me that you were
> experiencing some struggles related to node states. Unfortunately, he was
> not sure of the specific issue/case number. I believe Jess was referring to
> this issue, however, I wanted to confirm that indeed this is the issue you
> were referring to when you spoke with Jess.

Yes, this is the very issue that my colleague Stéphane discussed with Jess at PEARC, you're absolutely right!

Cheers,
--
Kilian
Comment 29 Dominik Bartkiewicz 2018-07-24 13:23:17 MDT
Hi

Do you know what generated that RPC rate overnight?
REQUEST_JOB_USER_INFO ~ 3 per second
REQUEST_PARTITION_INFO ~ 3 per second
It looks like some automatic script.


Could you check "System configuration" from:
https://slurm.schedmd.com/high_throughput.html

Beyond that, I can't give you any significant recommendations.
You could consider adding "defer" and slightly increasing bf_yield_sleep from its default of 500000.

PS: It is not really important, but you have "batch_sched_delay=10" repeated in SchedulerParameters.
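Purely as an illustration, the relevant SchedulerParameters line could then look something like this (merged with your existing options and with the duplicate batch_sched_delay removed; bf_yield_sleep is in microseconds and the value here is only an example):

SchedulerParameters=defer,batch_sched_delay=10,bf_yield_sleep=750000,<existing options unchanged>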

Dominik
Comment 30 Kilian Cavalotti 2018-07-24 13:26:07 MDT
(In reply to Dominik Bartkiewicz from comment #29)
> Do you know what generated that RPCs rate over the night?
> REQUEST_JOB_USER_INFO ~ 3 per second 
> REQUEST_PARTITION_INFO ~ 3 per second
> It looks like some automatic script.

Likely users, yes. I'll check, but given our user population, this has to be expected.

> Could you check "System configuration" from:
> https://slurm.schedmd.com/high_throughput.html

Sure, I know that page well :)

> I can't give you any significant recommendations.
> You can consider adding "defer" and slightly increase bf_yield_sleep from
> default 500000.
> It is not really important but you have repeated "batch_sched_delay=10" in
> SchedulerParameters.

Ok, I'll take a stab at those.

What else do you suggest we could do to resolve this issue, then?

Thanks,
--
Kilian
Comment 31 Dominik Bartkiewicz 2018-07-25 01:09:48 MDT
Hi

I am still analysing code and logs and I hope I will soon find something.
I asked Alex to have a look at this.

Dominik
Comment 34 Dominik Bartkiewicz 2018-07-25 05:30:23 MDT
Hi

Alex prepared some recommendations 

1. Try more aggressive message aggregation:
increase WindowMsgs to 50 and WindowTime to 300.

2. You have LaunchParameters=slurmstepd_memlock_all.
Perhaps memlock() has some weird interaction with ulimits.
Could you check if removing this has any positive effect on CG nodes?

3. The last suggestion is switching from JobAcctGatherParams=UsePss to
JobAcctGatherParams=UsePss,NoOverMemoryKill with MemLimitEnforce=no, and seeing if
this helps improve things.

We also noticed multiple OOM messages in the logs. Are you sure the system has enough memory to work properly? Maybe you can consider using
MemSpecLimit.
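For reference, a minimal slurm.conf sketch of these three suggestions (double-check the exact message-aggregation parameter name against the 17.11 slurm.conf man page; the values are only examples):

MsgAggregationParams=WindowMsgs=50,WindowTime=300
#LaunchParameters=slurmstepd_memlock_all      # commented out to test point 2
JobAcctGatherParams=UsePss,NoOverMemoryKill
MemLimitEnforce=no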


Dominik
Comment 35 Kilian Cavalotti 2018-07-25 15:46:13 MDT
(In reply to Dominik Bartkiewicz from comment #34)
> Alex prepared some recommendations 

Thanks!

> 1. Try to set more agresive messages aggregates :
> Increase WindowMsgs to 50 and WindowTime to 300.

Ok, I'll do this. 

> 2. You have LaunchParameters=slurmstepd_memlock_all. 
> Perhaps memlock() has some weird interaction with ulimits
> Could you check if removing this has any positive effect on CG nodes?

Will try this too, yes.

> 3. the last suggestion is switching from JobAcctGatherParams=UsePss to
> JobAcctGatherParams=UsePss,NoOverMemoryKill and MemLimitEnforce=no and see if
> this helps improving things.

The idea being that the task/cgroup plugin will take care of memory enforcement, got it.

 
> We also notice multiple oom messages in logs. Are you sure that system has
> enough memory to work properly? Maybe you can consider using 
> MemSpecLimit.

You mean on the controller or the nodes? I don't see any sign of memory exhaustion on the Slurm controller, but OOM messages are frequent on the nodes, yes, each time a job exceeds its memory allocation; I take it this is normal, isn't it? I don't see any slurmd process being killed by the OOM killer, and if that were the case, the node would quickly be marked as DOWN, right?

Thanks for the suggestions, I'll give them a try.


Interestingly, the situation has been quiet today, no CG episode so far, but I'm not sure if it's a function of lower user activity, or a change in our config. The last significant one was to set bf_yield_interval=1500000 and bf_yield_sleep=750000 in SchedulerParameters.

We'll keep you updated in any case.


Cheers,
-- 
Kilian
Comment 36 Dominik Bartkiewicz 2018-07-26 03:59:22 MDT
Hi

I'm glad to hear it is better now, but we still need to keep an eye on this. A 500k agent queue size is hard to explain by suboptimal scheduler parameters alone.
Do you still notice resending REQUEST_TERMINATE_JOB in slurmctld.log?

Dominik
Comment 37 Kilian Cavalotti 2018-07-26 09:26:48 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #36)
> I'm glad to hear it is better now, but we must still control this. 500k
> agent queue size is hard to explain only by not optimal scheduler parameters.

Unfortunately, that didn't last. It happened again last night around 11pm, triggering a segfault of the primary controller in the process (I'll open a separate bug for this), and the CG jobs have been piling up all night long:

# scontrol completing | wc -l
2067

> Do you still notice resending REQUEST_TERMINATE_JOB in slurmctld.log?

Yes, that's the core of the issue, I guess:

# grep TERMINATE_JOB /var/log/slurm/ctld/sh-sl02.log | grep "Jul 26" | wc -l
429559

430,000 occurrences that night only, since the backup controller assumed control after the primary one segfaulted.

Either those messages don't reach the compute nodes (but that's unlikely, since most compute nodes aren't running anything anymore, and it would be surprising if that many nodes stopped receiving those messages at the same time), or the controller fails to process them correctly (which seems more likely).

I'll attach another graph showing the evolution of the nodes' states in the cluster over night.

Cheers,
-- 
Kilian
Comment 38 Kilian Cavalotti 2018-07-26 09:27:15 MDT
Created attachment 7421 [details]
Node states across the cluster
Comment 39 Kilian Cavalotti 2018-07-26 09:59:26 MDT
(In reply to Kilian Cavalotti from comment #37)
> triggered a segfault of the primary controller in the process (I'll open a
> separate bug for this)

It's https://bugs.schedmd.com/show_bug.cgi?id=5486
Comment 40 Dominik Bartkiewicz 2018-07-26 12:35:58 MDT
Created attachment 7426 [details]
extra debug

This patch should generate extra info about the agent queue in slurmctld.log when you call sdiag.
Could you try to apply this patch and send me the slurmctld log and sdiag output after it happens again?
I'm sorry it's taking so long, but I want to assure you that I'm working on your bug.

Dominik
Comment 41 Kilian Cavalotti 2018-07-26 12:40:06 MDT
(In reply to Dominik Bartkiewicz from comment #40)
> Created attachment 7426 [details]
> extra debug
> 
> This patch should generate extra info about agent queue in slurmctld.log
> when you will call sdiag.
> Could you try to apply this patch and send me slurmctld log and sdiag output
> after it happened again?
> I'm sorry it's taking so long but I want to ensure you that I'm working on
> your bug. 

Awesome! Thanks a lot for the patch, I'm applying it now, and will provide the outputs at the next occurrence.

Cheers,
--
Kilian
Comment 42 Dominik Bartkiewicz 2018-07-27 04:39:59 MDT
Hi

I suspect this may be related to bug 5438 in some way.
Did you change EnforcePartLimits recently, or have you always had it set to ANY?

Dominik
Comment 43 Kilian Cavalotti 2018-07-27 11:25:50 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #42)
> I have suspicions that this can be related to bug 5438 in some way.

Ah interesting. I kind of assumed the corrupted job structures were a consequence of the "deadlock" situation with CG jobs, maybe exacerbated by our many controller restarts. But they may very well be the source of the issue, indeed.

> Did you change EnforcePartLimits or did you have always set ANY?

Nope, that hasn't changed recently. EnforcePartLimits has been set since our early days with Slurm. I believe it was set to "Yes" before "ANY" became a thing, but that's probably more than a year ago.

Thanks,
-- 
Kilian
Comment 44 Kilian Cavalotti 2018-07-29 16:47:25 MDT
Hi, 

It just happened again. Here's the additional info logged by the "extra debug" patch (full log attached):

Jul 29 15:42:17 sh-sl01 slurmctld[65270]: BUG 5443 queue:
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_TERMINATE_JOB count: 73475
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_LAUNCH_PROLOG count: 2061
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_BATCH_JOB_LAUNCH count: 2042
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: SRUN_JOB_COMPLETE count: 33
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_KILL_TIMELIMIT count: 35
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_SIGNAL_TASKS count: 54
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_KILL_PREEMPTED count: 71
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: SRUN_STEP_SIGNAL count: 1
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: RESPONSE_RESOURCE_ALLOCATION count: 12
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_HEALTH_CHECK count: 1
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: BUG 5443 dump end
Comment 45 Kilian Cavalotti 2018-07-29 16:48:38 MDT
Created attachment 7450 [details]
slurmctld.log with extra debug
Comment 46 Kilian Cavalotti 2018-07-29 23:06:01 MDT
Just to add that this last occurrence looks like it could be correlated with waves of multi-partition job submissions, in case that's useful.
Comment 47 Dominik Bartkiewicz 2018-07-30 11:27:16 MDT
Created attachment 7457 [details]
extra debug

Hi

I have prepared another patch with an extra log.
This patch should catch the situation where select_nodes is called for a running job.

Dominik
Comment 48 Kilian Cavalotti 2018-07-30 11:50:14 MDT
(In reply to Dominik Bartkiewicz from comment #47)
> I have prepared another patch with an extra log.
> This patch should catch situation when select_nodes is called for a running
> job.

Thanks Dominik, I just deployed the patch and I'll send logs at the next occurrence.

Cheers,
-- 
Kilian
Comment 49 Kilian Cavalotti 2018-07-30 15:23:18 MDT
Created attachment 7460 [details]
slurmctld logs

Issue started again around 13:45 today. Here are the latest logs with the new debug patch applied.

It apparently caught some things around 12:41, before the issue started.
Comment 51 Dominik Bartkiewicz 2018-07-31 05:01:28 MDT
Hi

These two issues caught in the log are normal and can't be the root of this problem.
Could you send me the slurmd log from sh-101-58?

Are you sure that all nodes have their time set properly?
An unsynchronized node can produce log messages like this:
"Zero Bytes were transmitted or received"

Dominik
Comment 52 Kilian Cavalotti 2018-07-31 09:50:15 MDT
Created attachment 7468 [details]
sh-101-58 slurmd logs

> Could you send me slurmd log form sh-101-58?

Here it is.

> Are you sure that all nodea have set time properly?
> Non synchronized node can produce log like this:
> "Zero Bytes were transmitted or received"

Yes, we use NTP all across the cluster, and all the compute nodes have their clocks set to the same time.
Plus, a few unsynchronized nodes wouldn't explain why thousands of jobs suddenly get stuck in CG state, would they?
Comment 53 Tim Wickberg 2018-07-31 12:29:51 MDT
Kilian -

Could you increase the number of MUNGE threads on the slurmctld host considerably?

The default is 2, can you please adjust to 10 per our tuning guide? https://slurm.schedmd.com/high_throughput.html

I don't think this is directly implicated, but it'd be good to get out of the way.
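One common way to apply that, assuming a systemd-managed munge unit (the unit name, binary path, and override mechanism may differ per distribution):

# systemctl edit munge          # creates a drop-in override
[Service]
ExecStart=
ExecStart=/usr/sbin/munged --num-threads 10
# systemctl restart munge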
Comment 54 Kilian Cavalotti 2018-07-31 12:32:57 MDT
Hi Tim, 

(In reply to Tim Wickberg from comment #53)
> Could you increase the number of MUNGE threads on the slurmctld host
> considerably?
> 
> The default is 2, can you please adjust to 10 per our tuning guide?
> https://slurm.schedmd.com/high_throughput.html
> 
> I don't think this is directly implicated, but it'd be good to get out of
> the way.

Thanks for the suggestion, but we have this in place already:

[root@sh-sl01 ~]# ps aux | grep -i munge
munge       936  0.5  0.0 751624  3568 ?        Sl   Jul19  99:48 /usr/sbin/munged --num-threads 10

Cheers,
--
Kilian
Comment 55 Tim Wickberg 2018-07-31 14:13:28 MDT
Is there a reason you'd originally turned on message aggregation?

I can see you've been running with it for quite some time, but I'd be concerned this may be exacerbating some other issues. I think Alex previously overlooked an interaction here when recommending an increased WindowTime - that might be the reason you're seeing more frequent issues around MUNGE credential timeouts if the MUNGE lifetime is now matching that setting.

I'd suggest disabling this entirely unless you had a specific reason it was originally enabled. I don't think it's helping you out here, and may be causing some issues and/or masking the real root cause.
Comment 56 Kilian Cavalotti 2018-07-31 15:11:55 MDT
(In reply to Tim Wickberg from comment #55)
> Is there a reason you'd originally turned on message aggregation?

I believe it may have been originally done to help with performance and scalability, when we were reaching the 800+ node level and running our Slurm controllers out of small-ish VMs.

> I can see you've been running with it for quite some time, but I'd be
> concerned this may be exacerbating some other issues. I think Alex
> previously overlooked an interaction here when recommending an increased
> WindowTime - that might be the reason you're seeing more frequent issues
> around MUNGE credential timeouts if the MUNGE lifetime is now matching that
> setting.
> 
> I'd suggest disabling this entirely unless you had a specific reason it was
> originally enabled. I don't think it's helping you out here, and may be
> causing some issues and/or masking the real root cause.

Ok, thanks for the suggestion. I've disabled it and restarted the daemons.

Cheers,
-- 
Kilian
Comment 57 Kilian Cavalotti 2018-07-31 15:40:15 MDT
(In reply to Kilian Cavalotti from comment #56)
> Ok, thanks for the suggestion. I've disabled it and restarted the daemons.

And it got put to the test quite quickly: the issue just started happening again, with message aggregation disabled.

I'll keep it this way for now anyway.

Cheers,
-- 
Kilian
Comment 60 Dominik Bartkiewicz 2018-08-01 05:30:25 MDT
Hi

Could you send me your PrologSlurmctld script, or describe what is in it?

Dominik
Comment 61 Kilian Cavalotti 2018-08-01 09:40:38 MDT
Created attachment 7479 [details]
slurmctld prolog

Hi Dominik, 

The PrologSlurmctld script is attached.

We tried various variations around it, including removing the scontrol part and disabling the prolog script entirely, but it still led to CG pile-up situations.

Right now, there's a random delay introduced before calling scontrol, to avoid too many simultaneous RPCs when large job series are starting at the same time, and it's done asynchronously to let the prolog script return immediately and avoid blocking further job processing.
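The pattern boils down to something like this (a rough sketch, not the attached script itself; the delay and destination path are illustrative):

#!/bin/bash
# PrologSlurmctld: record job info with some jitter, without blocking job launch
(
    sleep $((RANDOM % 30))    # spread the scontrol RPCs out over ~30 s
    scontrol show job "$SLURM_JOB_ID" > /central/location/"$SLURM_JOB_ID".prolog
) &
exit 0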
Comment 62 Dominik Bartkiewicz 2018-08-01 10:42:51 MDT
Hi

After an internal discussion, we agreed that we would like to check the other epilog and prolog scripts. Could you send us the rest of these scripts?
Maybe there is some detail there which induces this.

Is there any chance you could comment out "PrologFlags=contain" and check if this issue still occurs? I understand this can be troublesome.


Dominik
Comment 63 Kilian Cavalotti 2018-08-01 11:02:06 MDT
Created attachment 7481 [details]
prolog/epilog scripts

> After the internal talk, we agreed that we would like to check other epilogue and prologue. Could you send us the rest of these scripts? Maybe there is some detail which induces this.

Sure, they're attached.

> Is there any chance to comment "PrologFlags=contain" and check if this issue will still occur? I understand this can be troublesome.

Mmmh, I would really prefer not having to do this, as it will prevent proper containerization of tasks via pam_slurm_adopt, and users heavily rely on the ability to ssh to nodes and run things in the context of their job.

I'd like to state again that the whole configuration didn't significantly change over the last year (besides adding more nodes), and that the issue suddenly started about a month ago.

Thanks!
-- 
Kilian
Comment 66 Marshall Garey 2018-08-03 12:42:58 MDT
Sorry Kilian, ignore comment 65. I posted that on the wrong bug.
Comment 67 Kilian Cavalotti 2018-08-03 12:45:52 MDT
(In reply to Marshall Garey from comment #66)
> Sorry Kilian, ignore comment 65. I posted that on the wrong bug.

No worries. But while you're here :) is there any update or progress on that problem?

Thanks,
--
Kilian
Comment 68 Marshall Garey 2018-08-03 12:55:55 MDT
Nothing specific right now - Dominik is primarily working on this one, but Tim and I have spent some time on it, too. I think the other Stanford bugs that I own are probably related to this, and I've continued to do research on those.
Comment 70 Dominik Bartkiewicz 2018-08-06 09:47:37 MDT
Hi

You can try to compile slurmctld with a higher value of MAX_SERVER_THREADS (e.g. 512).
MAX_SERVER_THREADS is defined in src/slurmctld/slurmctld.h.
If this stops the agent queue from increasing, it will be easier to find what generates so many REQUEST_TERMINATE_JOB RPCs, and why.
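Assuming the stock definition, this is a one-line change before rebuilding and reinstalling slurmctld (check your tree for the exact stock value):

/* in src/slurmctld/slurmctld.h */
#define MAX_SERVER_THREADS 512    /* raised from the stock default */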

Dominik
Comment 71 Kilian Cavalotti 2018-08-06 09:51:19 MDT
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #70)
> You can try to compile slurmctld with higher value of MAX_SERVER_THREADS
> (eg. 512)
> MAX_SERVER_THREADS is defined in src/slurmctld/slurmctld.h
> If this stops agent queue from increasing it will be easier to find what and
> why generate so many REQUEST_TERMINATE_JOB rpcs. 

Ok, I'll try that, but wouldn't it exacerbate the problem, by allowing even more threads to add work to the queue?

How does that interact with max_rpc_cnt in SchedulerParameters?

Thanks,
--
Kilian
Comment 72 Dominik Bartkiewicz 2018-08-06 10:18:52 MDT
Hi

MAX_SERVER_THREADS separately limits the "Server thread count" and the number of agents.

max_rpc_cnt is compared against the count of incoming RPCs.

From the sdiag output and one of your plots, it seems that slurmctld never hit this limit for "Server thread count".

Increasing MAX_SERVER_THREADS will allow creating more agents, and I hope it will prevent older RPCs from starving in the queue.

Dominik
Comment 73 Kilian Cavalotti 2018-08-06 10:30:17 MDT
(In reply to Dominik Bartkiewicz from comment #72)
> Hi
> 
> MAX_SERVER_THREADS limits separately "Server thread count" and numbers of
> agents.
> 
> max_rpc_cnt is compering with the count of incoming RPCs.

Ok, I asked because the slurm.conf man page says:

    max_rpc_cnt=#   
        
         If the number of active threads in the slurmctld daemon is equal to or 
         larger than this value, defer scheduling  of  jobs. 

So it sounds pretty similar to a MAX_SERVER_THREADS value.


> From sdiag output and one of your plots, it seems that slurmctld never hit
> this limit for "Server thread count".
> 
> Increasing MAX_SERVER_THREADS will allow creating more agents and I hope it
> will prevent hungering older RPCs in the queue.

Ok. So I ignored the comment in slurmctld.h that says "increasing this value is strongly discouraged" :), set the value to 512, recompiled, and restarted the daemons.


As a side note, the issue hasn't really been happening since last Tuesday (7/31). We had a few occurrences of a few dozen nodes in CG state at once, but the situation cleared itself up in about 30 min, so nothing comparable to the previous occurrences.

That is not to say that we consider the issue resolved, though. Since we haven't modified the config nor the code at all after the last occurrence, we can only consider that the problem is a function of a specific job pattern and that it can still happen at any time.

Cheers,
--
Kilian
Comment 74 Kilian Cavalotti 2018-08-06 12:17:07 MDT
And we just need to talk about it to make it happen again:

# sinfo -t completing -N -O nodelist -h | sort -u | wc -l
148
# scontrol completing | wc -l
289

This is *with* MAX_SERVER_THREADS=512

Do you need any dump/trace before I restart the controller?
Comment 75 Jason Booth 2018-08-06 13:43:22 MDT
Hi Kilian,

 We would like to see the slurmctld logs from today, so please attach those to the ticket. I do not believe a trace will be needed for this instance.

-Jason
Comment 76 Kilian Cavalotti 2018-08-06 14:40:37 MDT
Created attachment 7521 [details]
new slurmctld logs

Sure, here they are.
Comment 77 Kilian Cavalotti 2018-08-07 10:01:25 MDT
Hi!

Did you find anything useful in the logs? Is there any update we can report to our users towards a fix?

Thanks,
-- 
Kilian
Comment 78 Jason Booth 2018-08-07 14:28:05 MDT
Hi Kilian,

 Unfortunately, the issue is still troublesome to reproduce on our end, even with the additional logging and traces. Finding the root cause has been difficult without a clear way to duplicate this, which is why we have had you modify some code, such as MAX_SERVER_THREADS. We are continuing to look into this issue but do not, as of yet, have a suggested patch/fix.

Kind regards,
Jason
Comment 81 Dominik Bartkiewicz 2018-08-08 09:21:22 MDT
Hi

Can you try temporarily disabling the killing of user processes in the job epilog?
Right now it has a race between squeue and pkill/ipcrm.

Could you send me the current slurmd.log of these nodes:
sh-16-01, sh-15-02, sh-15-04?
In which state are these nodes now?

Can you try to enable the "Route" debug flag, and when this occurs again, could you send me the slurmd logs from a couple of completing nodes?

Do you have the job script/job specification of 23171442?
Perhaps this is an unrelated problem, but I will try to find out why this job
has overallocated memory.



Dominik
Comment 82 Kilian Cavalotti 2018-08-08 09:58:26 MDT
Hi Dominik, 

Thanks for following up and for keeping looking at this.

> Can you try to disable temporary killing user processes in the job epilog?
> Now it has some race between squeue and pkill/ipcrm.

I'm curious about that race: would that be between different jobs that could be terminating at the same time on the same nodes? The pkill/ipcrm only happens if no other job is running on the node, so unless another job starts and terminates between the squeue and pkill calls, there shouldn't be any problem, right? Now, it's true that the local scratch cleanup happens between those calls and could take some time, so I'll move it after the pkill/ipcrm parts.

But for now, I disabled the pkill and ipcrm parts in the epilog script.
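For context, the idiom in question is roughly the following (a generic sketch of the usual "check squeue, then pkill/ipcrm" epilog pattern; our actual script differs in the details):

# fragment of a node Epilog: only clean up if the user has no other running job here
if [ -z "$(squeue -h -u "$SLURM_JOB_USER" -w "$SLURMD_NODENAME" -t R)" ]; then
    pkill -KILL -u "$SLURM_JOB_USER"
    # remove leftover SysV semaphores owned by the user
    ipcs -s | awk -v u="$SLURM_JOB_USER" '$3 == u {print $2}' | xargs -r -n1 ipcrm -s
fi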

> Could you send my current slurmd.log of those nodes?
> sh-16-01, sh-15-02, sh-15-04
> In which state nodes are now?

They're all in COMPLETING state now:

# sinfo -p owners -Nn sh-16-01,sh-15-02,sh-15-04
NODELIST   NODES PARTITION STATE
sh-15-02       1    owners comp
sh-15-04       1    owners comp
sh-16-01       1    owners comp

I'll attach the slurmd logs for those nodes.

> Can you try to enable debug flag "Route" and when this occurs again could
> you send me slurmd log from a couple of completed nodes?

Sure, done. Didn't know this was a thing, I don't think the Route DebugFlag is mentioned in the slurm.conf man page :)

Do you need increased general log level too? That's what we have right now:

# scontrol show config | grep -i debug
DebugFlags              = Route
SlurmctldDebug          = info
SlurmctldSyslogDebug    = info
SlurmdDebug             = info
SlurmdSyslogDebug       = info


> Do you have job_script/job specification of 23171442?
> Perhaps this is some unrelated problem but I will try to find out why this
> job has overallocated memory.

Yes, it looks like this:

#SBATCH -p iric
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=6000
#SBATCH --exclusive
#SBATCH --ntasks-per-core=1

I'll attach the full script too.

Thanks!
-- 
Kilian
Comment 83 Kilian Cavalotti 2018-08-08 09:59:07 MDT
Created attachment 7540 [details]
slurmd logs for sh-16-01,sh-15-02,sh-15-04
Comment 84 Kilian Cavalotti 2018-08-08 09:59:31 MDT
Created attachment 7541 [details]
job script for job 23171442
Comment 85 Kilian Cavalotti 2018-08-08 11:37:05 MDT
Another note that may be of interest: since July 31st, which is when I disabled message aggregation (https://bugs.schedmd.com/show_bug.cgi?id=5443#c55), the kernel on the Slurm controller has been logging this:

[Tue Jul 31 16:25:47 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 11:44:51 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 12:00:44 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 12:15:43 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 12:45:44 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 13:00:43 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 13:15:50 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 13:29:50 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 13:44:48 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Mon Aug  6 22:45:05 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Tue Aug  7 11:45:54 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.
[Wed Aug  8 10:15:36 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies.  Check SNMP counters.

Do you think that may be related?
Should we re-enable message aggregation?
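(For reference, the backlog those messages refer to is governed by kernel settings like the ones below, in the same vein as the system-level tuning mentioned in the high-throughput guide; the values are purely illustrative:)

sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=4096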

Thanks,
-- 
Kilian
Comment 86 Dominik Bartkiewicz 2018-08-08 15:16:26 MDT
Hi

Since we now know that message aggregation is not directly involved in this issue, I think you can re-enable it.


You can set SlurmdDebug to debug.
It would be great if you increased SlurmctldDebug too,
but first, test whether increasing SlurmctldDebug has any notable impact on performance.

Dominik
Comment 87 Kilian Cavalotti 2018-08-08 17:31:01 MDT
(In reply to Dominik Bartkiewicz from comment #86)
> Because now we know that message aggregation is not directly involved in
> this issue I think you can enable it.

Ok, thanks.

> You can set SlurmdDebug to debug.
> It would be great if you increase SlurmctldDebug too,
> but at first, test if increasing SlurmctldDebug has no notable impact on
> performance.

Looks like the DebugFlag=Route option makes srun display debug messages like this:

$ srun -p test --pty bash
srun: ROUTE: split_hostlist: hl=sh-06-34 tree_width 0

Is it expected? That's confusing our users a little more. :)

Cheers,
-- 
Kilian
Comment 88 Dominik Bartkiewicz 2018-08-09 08:16:20 MDT
Hi

Sorry, I didn't suspect that users could also see these messages.
You can disable this now; if we have some problems in this area, it will already be in the log.

Dominik
Comment 91 Kilian Cavalotti 2018-08-10 12:42:59 MDT
Hi!

As a quick update, we've just updated to 17.11.9 to see if it makes any difference (and with the hope that https://github.com/SchedMD/slurm/commit/06582da8ff47132a124497f7dbd41dd5ac227c33 may be beneficial to our case).

We also added a cron job that regularly checks the number of completing jobs and will restart the controllers if it gets above a threshold. But that's obviously just a temporary workaround and not something we would like to have to keep for too long. :)
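The check itself is nothing fancy, roughly along these lines (a simplified sketch; the threshold and the restart command are placeholders for what our cron job actually does):

#!/bin/bash
# restart the controller if too many jobs are stuck completing
threshold=500
cg_count=$(squeue -h -t CG | wc -l)
if [ "$cg_count" -gt "$threshold" ]; then
    systemctl restart slurmctld
fi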

Cheers,
-- 
Kilian
Comment 92 Kilian Cavalotti 2018-08-10 13:50:13 MDT
(In reply to Kilian Cavalotti from comment #91)
> As a quick update, we've just updated to 17.11.9 to see if it makes any
> difference (and with the hope that
> https://github.com/SchedMD/slurm/commit/
> 06582da8ff47132a124497f7dbd41dd5ac227c33 may be beneficial to our case).


Our attempt to upgrade to 17.11.9 was short-lived, because of https://bugs.schedmd.com/show_bug.cgi?id=5552, so we're back on 17.11.8.
Comment 94 Dominik Bartkiewicz 2018-08-14 07:25:26 MDT
Hi

Could you send me the logs that you have grabbed with the "Route" debug flag from both slurmd and slurmctld? Currently, I expect the problem is in other parts of Slurm, but I want to check anyway.

I will attach another patch with additional logging; could you apply it and enable the TraceJobs debug flag?

Could you also send me the node record for one of the completing nodes,
e.g.:
gdb -p <slurmctld_pid> --batch -ex 'call find_node_record("<node_name>")' -ex 'p *$1'
Comment 95 Dominik Bartkiewicz 2018-08-14 08:51:58 MDT
Created attachment 7600 [details]
extra debug
Comment 96 Kilian Cavalotti 2018-08-14 10:14:09 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #94)
> Could you send me logs that you have grabbed with debug flag "Route" from
> both slurmd and slurmctld? Currently, I expect that problem is in other
> parts of slurm but want to check in any way.

Sure. I'll attach the whole thing. It will be quite big though, so it may take me a couple tries.

> I will attach another patch with additional logs, could you apply it and
> enable TraceJobs debugflag?

Sure, thanks for the patch. It's applied and deployed, and TraceJobs debugflag has been activated. I'm a bit worried this will make our already huge logs explode, but I'll keep an eye on them.

> could you also send me nodes record for one of compliting node
> eg.:
> gdb -p <slurmctld_pid> --batch -ex 'call find_node_record("<nod_name>")' -ex
> 'p *$1'

Will do next time this happens, for sure!

Thanks,
--
Kilian
Comment 97 Kilian Cavalotti 2018-08-14 10:14:53 MDT
Created attachment 7603 [details]
DebugFlag Route logs
Comment 98 Dominik Bartkiewicz 2018-08-14 10:43:36 MDT
That means you don't have nodes or jobs in a permanently completing state? TraceJobs will only be interesting once it starts again.

Dominik
Comment 99 Kilian Cavalotti 2018-08-14 10:50:33 MDT
(In reply to Dominik Bartkiewicz from comment #98)
> That means you don’t have nodes or jobs in the permanent completing state?

I'm a bit worried we've not been communicating clearly here, so let me try again: we don't have nodes or jobs permanently stuck in CG state, we have situations where many jobs and nodes accumulate in CG state and stay that way until no resources are available anymore in the cluster (I've provided a number of graphs that show this). But a controller restart usually resolves the situation and the CG states are cleared.

We obviously can't afford to keep our whole 1,300-node, 3,500-user cluster down for weeks while the issue is being diagnosed. So we routinely restart the Slurm controller when the issue happens and that usually clears up the completing nodes, until the problem happens again.

> Tracejobs will be interesting only when it will start again.

Ok, I'll enable it the next time the issue happens, then.

Cheers,
-- 
Kilian
Comment 100 Dominik Bartkiewicz 2018-08-14 12:02:52 MDT
Hi

Maybe I wasn't precise enough in the last comment.
I used "permanent" for nodes that stay in this state significantly longer than we expect.
For me, the first symptom of this problem is "Resending TERMINATE_JOB", which according to the log starts occurring a couple of minutes after restarting slurmctld. Because TERMINATE_JOB should only be sent for COMPLETING jobs, I assume that you always have some amount of completing jobs, and that you restart the controllers when the number of CG jobs gets above a threshold.

Dominik
Comment 101 Kilian Cavalotti 2018-08-14 12:09:59 MDT
(In reply to Dominik Bartkiewicz from comment #100)
> Maybe I wasn't precise enough in the last comment.
> I used "permanent" for nodes that stay in this state significantly longer
> than we expect.

Ah, thanks for the clarification.

> For me, the first symptom of this problem is "Resending TERMINATE_JOB" that
> according to the log starts occur a couple minutes after restarting
> slurmctld. Because TERMINATE_JOB should be sent only for COMPLETING jobs, I
> assume that you always have some amount of completing jobs and you restart
> the controllers when the number of CG job gets above a threshold.

I see. I think that the "Resending TERMINATE_JOB" messages that are sent right after the slurmctld start actually work: they seem to be correctly processed and make the number of CG jobs/nodes decrease and go back to ~0.

The "Resending TERMINATE_JOB" messages that are sent *before* the slurmctld restart are the ones that are likely problematic, as they don't seem to be processed at all and cause the accumulation of those CG jobs.

So, to summarize: at times, the number of CG jobs and nodes start to steadily increase, TERMINATE_JOB messages seem to be ignored and not processed, and CG jobs keep accumulating until we restart the Slurm controller.
After the restart, TERMINATE_JOB messages are sent again for all the CG jobs, they are correctly processed, and the number of completing jobs goes back to normal.


Cheers,
-- 
Kilian
Comment 102 Dominik Bartkiewicz 2018-08-20 05:05:14 MDT
Hi

Do you have any news?
Could you send me your slurmctld log one more time?

Dominik
Comment 103 Kilian Cavalotti 2018-08-20 09:09:02 MDT
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #102)
> Do you have any news?
> Could you send me your slurmctld log one more time?

We were kind of waiting on you for updates. :)

The issue hasn't been happening as of late, the last significant occurrence happened on Aug 15th around 01:10 but was mitigated by our slurmctld restart cron script. I'm attaching the logs, which still show repeated "Resending TERMINATE_JOB" messages.

Let me know if you need anything else.

Cheers,
--
Kilian
Comment 104 Kilian Cavalotti 2018-08-20 09:09:39 MDT
Created attachment 7647 [details]
slurmctld logs for the last Aug 15 occurrence
Comment 105 Kilian Cavalotti 2018-08-22 15:30:30 MDT
Created attachment 7676 [details]
tracejobs+gdb logs

The issue just happened again. Here are the requested logs:
* slurmctld logs with TraceJob debug flag
* scontrol completing output
* for the last 8 nodes listed in that output, the result of gdb 'call find_node_record("<node_name>")'

Please advise.
Comment 106 Kilian Cavalotti 2018-08-22 17:46:32 MDT
We've been battling with this situation for about 4 hours now, but it doesn't look like the number of CG jobs is decreasing.

We've even marked the completing nodes as DOWN to clear up the CG jobs; it worked for a minute, and then the count of completing jobs skyrocketed again. We're currently at more than 3,800 jobs in CG state, with no clear way to make that count go down. We restarted the controllers multiple times, and even rebooted the physical machines, without any significant change.

I'm raising the severity of this issue to Sev1; it's pretty much a site-down situation right now.

Would it help to give you access to the machine directly? We really need to find a solution to this problem, it has been going on for way too long. :(

Thanks!
Comment 107 Kilian Cavalotti 2018-08-22 18:25:33 MDT
One aggravating factor we noticed is that those completing jobs tend to be re-queued when the controller restarts, although they were in CG state before, which means that they were already terminated (or in the process of being terminated).

This obviously adds fuel to the fire and generates even more work for the scheduler, like an avalanche, when it tries to cleanup those CG jobs.

After having restarted the controller a few times, they end up being set aside and put in the "JobHoldMaxRequeue" state, which finally allows the scheduler to recover:

# squeue | grep -c JobHoldMaxRequeue
921

What is the reason for those jobs to be requeued? Shouldn't they just be marked as COMPLETED (or FAILED) instead?
Comment 108 Kilian Cavalotti 2018-08-22 21:46:29 MDT
New fact that we discovered: some users are using snakemake (https://snakemake.readthedocs.io/en/stable/) to submit jobs automatically and keep the number of jobs in queue at the maximum permitted level.

It seems to be quite harmful for the scheduler when it tries to clean up completing jobs while constantly seeing new job submissions to replace the terminated ones.

Stopping those snakemake automated submission processes instantly brought back the number of CG jobs to regular levels.

I don't think that's the root cause of the problem, but definitely an exacerbating factor. Any idea how we could mitigate those situations? We're using jobs_per_user_per_hour in SchedulerParameters already, but is there any other way?
Comment 110 Dominik Bartkiewicz 2018-08-23 10:27:38 MDT
Hi

Sorry that I didn't respond yesterday; it was after my normal working hours.

I am still analyzing all logs.
I have only one idea, and it's to try disabling the slurmctld epilog.
This should make the "completing logic" more straightforward.

Marshall noted that enabling SchedulerParameters=defer can also help.

Dominik
Comment 111 Kilian Cavalotti 2018-08-23 10:37:34 MDT
(In reply to Dominik Bartkiewicz from comment #110)
> Sorry, that yesterday I didn't respond, it was after my normal working hours.
> 
> I am still analyzing all logs.
> I have only one idea and it’s to try to disable slurmctld epilog.
> This should make "completing logic" more straightforward.
> 
> Marshall noticed that enables SchedulerParameters=defer also can help.

Thanks for the suggestions! I've implemented both.

Cheers,
-- 
Kilian
Comment 113 Marshall Garey 2018-08-23 12:57:52 MDT
Is the system running again? Are the number of jobs in COMPLETING state decreasing?
Comment 114 Kilian Cavalotti 2018-08-23 13:19:22 MDT
(In reply to Marshall Garey from comment #113)
> Is the system running again? Are the number of jobs in COMPLETING state
> decreasing?

Yes, as I noted in https://bugs.schedmd.com/show_bug.cgi?id=5443#c108, we were able to get the system back to normal after killing those snakemake processes. We've been back at regular number of CG jobs since 8pm last night.

I realize how easy it could have been to miss this in the middle of the other comments, and I forgot to set the bug back to Sev2, so I'm doing that now.

Cheers,
--
Kilian
Comment 115 Marshall Garey 2018-08-23 13:55:09 MDT
Okay. I did see comment 108, but since it was still at a sev-1, I wasn't sure.

Can you continue monitoring the system and let us know if the two changes (removing slurmctld epilog and adding SchedulerParameters=defer) are helping?


Just a bit more information about each of those changes:

defer works a bit differently than sched_min_interval (which I see you already have), so having both should help. defer in particular can help with large amounts of job submissions or job completions. Basically, it lessens some of the work slurmctld has to do (including creating threads and portions of code inside locks, including the main scheduler), hopefully giving it more time to handle everything else.

Removing the slurmctld epilog should also reduce overhead and give the slurmctld more time to respond to everything else.
Comment 116 Kilian Cavalotti 2018-08-23 17:01:09 MDT
(In reply to Marshall Garey from comment #115)
> Can you continue monitoring the system and let us know if the two changes
> (removing slurmctld epilog and adding SchedulerParameters=defer) are helping?

Absolutely! Although I feel like they would be most helpful in the recovery phase, to avoid cascading situations, but not so much for preventing the issue from happening in the first place, right?


> Just a bit more information about each of those changes:
> 
> defer works a bit differently than sched_min_interval (which I see you
> already have), so having both should help. defer is particularly helpful
> with large numbers of job submissions or job completions. Basically, it
> lessens some of the work slurmctld has to do (fewer threads created and
> less time spent in code sections that hold locks, including the main
> scheduler), hopefully giving it more time to handle everything else.

Ideally, we'd like to avoid having to use defer, because our users love the responsiveness and comfort of getting an interactive shell on a compute node in seconds rather than minutes. But we'll try it for now.

> Removing the slurmctld epilog should also reduce overhead and give the
> slurmctld more time to respond to everything else.

That we can live with without any problem: our slurmctld epilog was mostly dumping a snapshot of "sacct -j jobid" into a central location, to complement the job information that is already collected in the prolog. But it's more of a convenience thing than anything else, because we can still query sacct after the fact.
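
For context, a minimal sketch of what that epilog did (the destination path is illustrative; the actual prolog/epilog scripts are attached to this ticket):

#!/bin/bash
# EpilogSlurmctld runs on the controller once per job, with SLURM_JOB_ID in its environment.
DEST=/central/jobinfo/${SLURM_JOB_ID}
mkdir -p "$DEST"
# Dump an accounting snapshot next to the information the prolog already collected.
sacct -j "$SLURM_JOB_ID" --long --parsable2 > "$DEST/sacct.txt" 2>&1
exit 0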

Thanks!
-- 
Kilian
Comment 117 Dominik Bartkiewicz 2018-08-24 04:34:12 MDT
Hi

I am glad to hear that these options have stabilized slurmctld.

The node record looks suspicious, especially "last_response = 0", but I still don't know whether this is just a result of an overloaded slurmctld or a cause of it.
Just to be sure: were these gdb outputs generated right after slurmctld was restarted?
I would also like to look at the job records.

Could you generate this for one of the completing jobs that is being resent REQUEST_TERMINATE_JOB:
gdb -p <slurmctld_pid> --batch -ex 'call find_job_record(<job_id>)' -ex 'p *$1'
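
For example (PID lookup and job ID here are just illustrative):

# substitute the ID of one of the CG jobs being resent REQUEST_TERMINATE_JOB
gdb -p $(pidof slurmctld) --batch -ex 'call find_job_record(23944668)' -ex 'p *$1'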


Dominik
Comment 118 Marshall Garey 2018-08-24 09:57:08 MDT
(In reply to Kilian Cavalotti from comment #116)
> (In reply to Marshall Garey from comment #115)
> > Can you continue monitoring the system and let us know if the two changes
> > (removing slurmctld epilog and adding SchedulerParameters=defer) are helping?
> 
> Absolutely! Although I feel like they would be most helpful during the
> recovery phase, to avoid cascading situations, rather than to prevent the
> issue from happening in the first place, right?

They should help prevent it too, actually. We're doing more testing with and without the EpilogSlurmctld to see how that affects things.


> Ideally, we'd like to avoid having to use defer, because our users love the
> responsiveness and comfort of getting an interactive shell on a compute node
> in seconds rather than minutes. But we'll try it for now.

On my test system (I know, not a real system), it still responds within a few seconds (2-4 seconds on average), rather than (seemingly) instantaneously.
Comment 120 Kilian Cavalotti 2018-08-24 10:24:03 MDT
(In reply to Dominik Bartkiewicz from comment #117)
> The node record looks suspicious, especially "last_response = 0", but I
> still don't know whether this is just a result of an overloaded slurmctld
> or a cause of it.
> Just to be sure: were these gdb outputs generated right after slurmctld was
> restarted?

Well, it depends on how you define "just" :)

The gdb traces were taken at 14:17 on Aug 22, and here are the start times of the controller:

Aug 22 14:15:11 sh-sl01 slurmctld[102853]: slurmctld version 17.11.8 started on cluster sherlock
Aug 22 14:18:11 sh-sl01 slurmctld[111305]: slurmctld version 17.11.8 started on cluster sherlock

So slurmctld had started 2 minutes before the traces were taken and was restarted again 1 minute after.
But I'm afraid that's because it crashed both times, not because I restarted it:

-rw------- 1 slurm slurm 3513364480 Aug 22 14:12 /var/spool/slurm.state/core.64771
-rw------- 1 slurm slurm  902508544 Aug 22 14:18 /var/spool/slurm.state/core.160096


> I would also like to look at the job records.
> 
> Could you generate this for one of the completing jobs that is being resent
> REQUEST_TERMINATE_JOB:
> gdb -p <slurmctld_pid> --batch -ex 'call find_job_record(<job_id>)' -ex 'p
> *$1'

We don't have any CG jobs right now, but we still have the core dumps from Aug 22, when this last happened, and the logs from right before the crash, with a list of job IDs that were being resent REQUEST_TERMINATE_JOB.
The thing is that I can't call the find_job_record() function from a core dump; that needs a live process.

Is there any way to get that information from the core dump?

Cheers,
--
Kilian
Comment 122 Dominik Bartkiewicz 2018-08-27 04:26:45 MDT
Hi

You can try a gdb script like this:
gdb --batch -x gdb.script <slurmctld_path> <core>

gdb.script:
set $i = job_list->head      
while ($i)       
 set $job =  (struct job_record *)($i->data)
 if $job->job_id == 114
  p *$job
 end
 set $i = $i -> next
end
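
In the script above, 114 is just a placeholder; replace it with one of the job IDs that were being resent REQUEST_TERMINATE_JOB. Against one of the core files from comment 120, the invocation would look something like this (the slurmctld binary path is illustrative; use the binary that produced the core):

# run the script against a core dump; adjust the binary path for your install
gdb --batch -x gdb.script /usr/sbin/slurmctld /var/spool/slurm.state/core.64771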


Dominik
Comment 123 Kilian Cavalotti 2018-08-27 09:52:38 MDT
(In reply to Dominik Bartkiewicz from comment #122)
> Hi
> 
> You can try a gdb script like this:
> gdb --batch -x gdb.script <slurmctld_path> <core>
> 
> gdb.script:
> set $i = job_list->head      
> while ($i)       
>  set $job =  (struct job_record *)($i->data)
>  if $job->job_id == 114
>   p *$job
>  end
>  set $i = $i -> next
> end


Thank you, that works.

I just realized, though, that the core dumps we have were generated by segfaults (reported in other tickets) *after* a controller restart, so I'm not sure the information in there will be too meaningful.

But anyway, here are the records for a few of the jobids that were being resent TERMINATE_JOB requests:


$1 = {account = 0x7fab64361280 "skachru", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7fab64315ae0 "sh-ln06",
  alloc_resp_port = 37716, alloc_sid = 39284, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 14435,
  assoc_ptr = 0x292b990, batch_flag = 0, batch_host = 0x7fab6441b950 "sh-115-04", billable_tres = 1, bit_flags = 0,
  burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
  cpu_cnt = 1, cr_enabled = 1, db_index = 463645307, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7fab64639100,
  direct_set_prio = 0, end_time = 1534971236, end_time_exp = 4294967294, epilog_running = false, exit_code = 4294967294,
  fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7fab643920f0 "",
  gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7fab640cc3c0 "", gres_used = 0x0, group_id = 24086, job_id = 23944668,
  job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7fab6463c630, job_state = 32772,
  kill_on_node_fail = 1, last_sched_eval = 1534971049, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
    tres = 0x7fab64638960}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7fab6438b840 "bash",
  network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fab6463c2c0 "sh-115-04", node_addr = 0x7fab646396e0,
  node_bitmap = 0x7fab6463c2f0, node_bitmap_cg = 0x7fabb402f1c0, node_cnt = 1, node_cnt_wag = 1,
  nodes_completing = 0x7fabb0154550 "sh-115-04", origin_cluster = 0x7fab646394b0 "sherlock", other_port = 43338, pack_job_id = 0,
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fab640e77b0 "skachru", part_ptr_list = 0x0,
  part_nodes_missing = false, part_ptr = 0x7fab04482440, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0,
  preempt_in_progress = false, priority = 20428, priority_array = 0x0, prio_factors = 0x7fab646392b0, profile = 0, qos_id = 3,
  qos_ptr = 0x26c4c60, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0,
  resv_ptr = 0x0, requid = 333157, resp_host = 0x7fab64639480 "10.10.0.66", sched_nodes = 0x0, select_jobinfo = 0x7fab64639510,
  spank_job_env = 0x7fab641a5730, spank_job_env_size = 1, start_protocol_ver = 8192, start_time = 1534971049, state_desc = 0x0,
  state_reason = 35, state_reason_prev = 35, step_list = 0x617af20, suspend_time = 0, time_last_active = 1534972273,
  time_limit = 120, time_min = 0, tot_sus_time = 3, total_cpus = 1, total_nodes = 1, tres_req_cnt = 0x7fab64638990,
  tres_req_str = 0x7fab646393c0 "1=1,2=25600,4=1", tres_fmt_req_str = 0x7fab64639420 "cpu=1,mem=25G,node=1",
  tres_alloc_cnt = 0x7fab6463c550, tres_alloc_str = 0x7fabb4424b20 "1=1,2=25600,3=18446744073709551614,4=1,5=1",
  tres_fmt_alloc_str = 0x7fabb422f470 "cpu=1,mem=25G,node=1,billing=1", user_id = 333157, user_name = 0x7fab842dc0a0 "bbagley",
  wait_all_nodes = 1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0,
  best_switch = true, wait4switch_start = 0}


$1 = {account = 0x7fab841ae700 "linder", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7fab842fee10 "sh-ln03", alloc_resp_port = 0, alloc_sid = 274530, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 3988, assoc_ptr = 0x29cd8e0, batch_flag = 1, batch_host = 0x7fab94b40ee0 "sh-105-06", billable_tres = 2, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 2, cr_enabled = 1, db_index = 463650236, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7fab842580f0, direct_set_prio = 0, end_time = 1534972175, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7fab94b3ed70 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7fab94b7ec40 "", gres_used = 0x0, group_id = 6959, job_id = 23945006, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7fab94b0a1a0, job_state = 32772, kill_on_node_fail = 1, last_sched_eval = 1534972073, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fab84257b80}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7fab842577e0 "s71x22.sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fab94176050 "sh-105-06", node_addr = 0x7fab9414b7d0, node_bitmap = 0x7fab94402360, node_bitmap_cg = 0x7fab4004de10, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x7fabb02cbcb0 "sh-105-06", origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fab9401ed90 "owners,cee,normal", part_ptr_list = 0x42fbe20, part_nodes_missing = false, part_ptr = 0x7fab044832e0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 7167, priority_array = 0x7fab842584c0, prio_factors = 0x7fab842582a0, profile = 0, qos_id = 3, qos_ptr = 0x26c4c60, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 241285, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x7fab84257b50, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1534972073, state_desc = 0x0, state_reason = 35, state_reason_prev = 35, step_list = 0x7fabac4da2c0, suspend_time = 0, time_last_active = 1534972273, time_limit = 59, time_min = 0, tot_sus_time = 8, total_cpus = 2, total_nodes = 1, tres_req_cnt = 0x7fab84257860, tres_req_str = 0x7fab84257a40 "1=2,2=8000,4=1", tres_fmt_req_str = 0x7fab84257af0 "cpu=2,mem=8000M,node=1", tres_alloc_cnt = 0x7fab94b3ed20, tres_alloc_str = 0x7fab4004ee60 "1=2,2=8000,3=18446744073709551614,4=1,5=2", tres_fmt_alloc_str = 0x7fab40810fe0 "cpu=2,mem=8000M,node=1,billing=2", user_id = 241285, user_name = 0x7faad880e9a0 "berkin", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}


$1 = {account = 0x7fab847fd890 "akundaje", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7fab84826860 "sh-105-53", alloc_resp_port = 0, alloc_sid = 50361, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 8371, assoc_ptr = 0x29581e0, batch_flag = 1, batch_host = 0x7fab8c038c50 "sh-105-67", billable_tres = 2, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 2, cr_enabled = 1, db_index = 463644395, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7fab8405d3a0, direct_set_prio = 0, end_time = 1534971537, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7fab8c7e44b0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7fab8c406920 "", gres_used = 0x0, group_id = 48931, job_id = 23944626, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7fab8c05ce40, job_state = 32771, kill_on_node_fail = 1, last_sched_eval = 1534970961, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fab840d73e0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7fab847cc6c0 "zN-331", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fab8c3e4250 "sh-105-67", node_addr = 0x7fab8c329a20, node_bitmap = 0x7fab8c713800, node_bitmap_cg = 0x7fab28300800, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x7fabb01d6d70 "sh-105-67", origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fab8c03a4e0 "sfgf,akundaje,normal,bigmem,owners", part_ptr_list = 0x4608df0, part_nodes_missing = false, part_ptr = 0x7fab040f52e0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 17066, priority_array = 0x7fab84717470, prio_factors = 0x7fab847fd800, profile = 0, qos_id = 3, qos_ptr = 0x26c4c60, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x7fab845b3830, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1534970961, state_desc = 0x0, state_reason = 0, state_reason_prev = 35, step_list = 0x5aa0f90, suspend_time = 0, time_last_active = 1534972273, time_limit = 240, time_min = 0, tot_sus_time = 5, total_cpus = 2, total_nodes = 1, tres_req_cnt = 0x7fab84826810, tres_req_str = 0x7fab84386850 "1=1,2=40000,4=1", tres_fmt_req_str = 0x7fab843868b0 "cpu=1,mem=40000M,node=1", tres_alloc_cnt = 0x7fab8c286bb0, tres_alloc_str = 0x7fab28233a20 "1=2,2=40000,3=18446744073709551614,4=1,5=2", tres_fmt_alloc_str = 0x7fab284add40 "cpu=2,mem=40000M,node=1,billing=2", user_id = 312359, user_name = 0x7fab584f9030 "marinovg", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

Cheers,
-- 
Kilian
Comment 127 Kilian Cavalotti 2018-08-29 17:47:39 MDT
Hi all, 

Just a quick update to note that disabling the slurmctld epilog and enabling defer doesn't make the problem go away:

# scontrol show config | egrep  "EpilogSlurmctld|SchedulerParameters"
EpilogSlurmctld         = (null)
SchedulerParameters     = default_queue_depth=5000,partition_job_depth=2000,bf_continue,bf_yield_interval=1500000,bf_yield_sleep=750000,bf_busy_nodes,bf_interval=60,bf_max_time=120,bf_window=10080,bf_job_part_count_reserve=10,bf_max_job_test=5000,bf_max_job_array_resv=10,bf_max_job_user_part=100,bf_resolution=3600,bf_min_prio_reserve=3000,kill_invalid_depend,nohold_on_prolog_fail,pack_serial_at_end,preempt_youngest_first,reduce_completing_frag,batch_sched_delay=20,sched_min_interval=100000,build_queue_timeout=5000000,jobs_per_user_per_hour=2000,max_rpc_cnt=128,defer

# scontrol completing | wc -l
206

Cheers,
-- 
Kilian
Comment 128 Dominik Bartkiewicz 2018-08-30 05:30:30 MDT
Hi

After analyzing the job and node state data, I was expecting this. Unfortunately, none of my suspicions were confirmed. Now I am working on eliminating some bottlenecks in the RPC/agent code (but so far I haven't found anything "big" enough to explain this); maybe this is just an effect of an overloaded slurmctld.

Dominik
Comment 129 Kilian Cavalotti 2018-09-04 09:54:04 MDT
An additional thing I just noticed: we're using smail from contribs/ as our MailProg, and it looks like for each email it generates, smail gets the cluster name with the following command:

ClusterName=`/usr/bin/scontrol show config | grep ClusterName | awk '{printf("[%s]", $3)}'`

Since emails are generated and sent from the controller nodes, where slurm.conf is present, maybe that potentially load-generating scontrol call could be replaced by something lighter like this?

ClusterName=$(awk -F= '/ClusterName/ {print $2}' ${SLURM_CONF:-/etc/slurm/slurm.conf})
Comment 130 Dominik Bartkiewicz 2018-09-04 11:55:31 MDT
Hi

Thanks for pointing this out.
This can definitely create significant additional load on slurmctld.
I will include this in my test environment; maybe this is the key to recreating this problem.

Dominik
Comment 141 Dominik Bartkiewicz 2018-10-04 07:34:24 MDT
Hi

I have some news.
Thanks for pointing out that smail uses scontrol to get the cluster name; we addressed this problem in:
https://github.com/SchedMD/slurm/commit/9a36112969ea0519ce70830153e91eeb67a0211d
https://github.com/SchedMD/slurm/commit/d07505a8750112f0c42de314a4b520660efde141

We also located a few spots where we can increase the performance of sending RPCs; I hope these enhancements will be in 18.08.2.

Dominik
Comment 143 Kilian Cavalotti 2018-10-04 09:38:34 MDT
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #141)
> I have some news.
> Thanks for pointing out that smail uses scontrol to get the cluster name; we
> addressed this problem in:
> https://github.com/SchedMD/slurm/commit/9a36112969ea0519ce70830153e91eeb67a0211d
> https://github.com/SchedMD/slurm/commit/d07505a8750112f0c42de314a4b520660efde141

Thank you!
That first commit is about the job_submit Lua plugin though, so I assume that was https://github.com/SchedMD/slurm/commit/0324a04c007943862cf6e7ad0eae43799895b56c instead, but that's great news!

> We also located a few spots where we can increase the performance of sending
> RPCs; I hope these enhancements will be in 18.08.2.

Great to hear too!

On our end, we've mostly managed to mitigate the issue by using `SchedulerParameters=defer`, which is not entirely satisfactory. So we will be glad to give those RPC improvements a try in 18.08.2 when they're available, and once we manage to solve our DB upgrade problems (#5662).

Thanks again!
-- 
Kilian
Comment 152 Dominik Bartkiewicz 2018-10-19 06:40:21 MDT
Hi

This commit should increase the RPC performance of slurmctld:
https://github.com/SchedMD/slurm/commit/2b25aa4e555
Unfortunately, we decided that this change can only go into 19.05.


Dominik
Comment 157 Dominik Bartkiewicz 2018-10-30 09:00:55 MDT
Hi

Did you manage to upgrade to 18.08?
From other customers we have heard that 18.08 seems to be faster and more responsive.

This commit makes another optimization in the RPC/agent code:
https://github.com/SchedMD/slurm/commit/85e97dc2ac3f31b

Dominik
Comment 158 Kilian Cavalotti 2018-10-30 09:07:10 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #157)
> Did you manage to upgrade to 18.08?

Not yet, unfortunately, but we're looking forward to it!

> From other customers we have heard that 18.08 seems to be faster and more
> responsive.
> 
> This commit makes another optimization in the RPC/agent code:
> https://github.com/SchedMD/slurm/commit/85e97dc2ac3f31b

Nice! I assume this will be in 18.08.4?

Cheers,
-- 
Kilian
Comment 159 Dominik Bartkiewicz 2018-10-31 04:45:28 MDT
Hi

Yes, it will be in 18.08.4.
So I'll be waiting for your upgrade.
I hope that on 18.08 you will be able to remove defer from SchedulerParameters.

Dominik
Comment 160 Dominik Bartkiewicz 2018-11-07 13:51:26 MST
Hi

While we are waiting for an update to 18.08, I will drop severity to 3,
but if you think this should be higher, feel free to bump it up again. 

Dominik
Comment 161 Kilian Cavalotti 2018-11-08 10:38:57 MST
(In reply to Dominik Bartkiewicz from comment #160)
> Hi
> 
> While we are waiting for an update to 18.08, I will drop severity to 3,
> but if you think this should be higher, feel free to bump it up again. 

Sounds good to me!

FYI, we're planning on migrating to 18.08 the week after Thanksgiving. We'll keep you posted.

Cheers,
-- 
Kilian
Comment 162 Kilian Cavalotti 2018-11-28 15:05:26 MST
(In reply to Kilian Cavalotti from comment #161)
> FYI, we're planning on migrating to 18.08 the week after Thanksgiving. We'll
> keep you posted.

We've just finished our migration and are now running 18.08.3.

Cheers,
-- 
Kilian
Comment 163 Kilian Cavalotti 2018-11-29 18:36:18 MST
(In reply to Kilian Cavalotti from comment #162)
> (In reply to Kilian Cavalotti from comment #161)
> > FYI, we're planning on migrating to 18.08 the week after Thanksgiving. We'll
> > keep you posted.
> 
> We've just finished our migration and are now running 18.08.3.

So far so good. Despite a few initial segfaults, we were able to remove the "defer" scheduling parameter, and the controller seems to be able to manage the load: no accumulation of CG jobs to report so far.

Cheers,
-- 
Kilian
Comment 164 Dominik Bartkiewicz 2018-11-30 05:02:04 MST
Hi

That is good news. Let's keep this open for a while to make sure things keep working well on 18.08.

Dominik
Comment 165 Dominik Bartkiewicz 2018-12-06 07:18:23 MST
Hi

Let me know when this ticket is ready to be closed.

Dominik
Comment 166 Kilian Cavalotti 2018-12-14 10:24:37 MST
Hi Dominik,

We've been running 18.08 in production for about two weeks now, and it definitely behaves better than 17.11 in this regard, so I think we can close this now.

We'll re-open if the issue arises again.

Thanks!
-- 
Kilian
Comment 168 Kilian Cavalotti 2019-01-11 18:01:32 MST
(In reply to Kilian Cavalotti from comment #166)
> We've been running 18.08 in production for about two weeks now, and it
> definitely behaves better than 17.11 in this regard, so I think we can close
> this now.

We're actually still seeing this happen from time to time, although much less often than on 17.11, for sure. We were able to re-enable our PrologSlurmctld and remove defer from SchedulerParameters, and it works fine most of the time. But there definitely are still events that make completing jobs accumulate and hang until we restart slurmctld.

Cheers,
-- 
Kilian
Comment 172 Felip Moll 2019-11-29 06:38:57 MST
*** Ticket 7635 has been marked as a duplicate of this ticket. ***