Bug 11103 - Job steps left running if proctrack_p_wait is unable to remove cgroup
Summary: Job steps left running if proctrack_p_wait is unable to remove cgroup
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.02.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-16 12:53 MDT by David Gloe
Modified: 2021-08-26 08:58 MDT
CC List: 4 users

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
bug11103 20.11 patch (3.53 KB, patch)
2021-07-20 04:49 MDT, Felip Moll

Description David Gloe 2021-03-16 12:53:58 MDT
During a recent 48-hour endurance run on a Cray Internal system, we found many compute nodes with processes still running from jobs that had already completed. Investigation showed that the job step's cgroup still existed, and these error messages were seen in the slurmd log:

proctrack_p_wait: Unable to destroy container 66624 in cgroup plugin, giving up after 128 sec
eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated

I believe the following sequence of events caused this:

1. Lustre connectivity goes down
2. Application process attempts to write to Lustre and hangs in the kernel
3. Job gets canceled, proctrack/cgroup attempts to kill processes
4. proctrack/cgroup gives up killing processes and removing cgroup
5. Lustre connectivity is restored, processes continue running

I think we could improve this situation by making the proctrack_p_wait timeout configurable, and/or draining the node if proctrack_p_wait is unable to kill the processes.
Comment 2 Felip Moll 2021-03-17 09:32:07 MDT
(In reply to David Gloe from comment #0)
> During a recent 48 hour endurance run on a Cray Internal system, we found
> many compute nodes with processes still running from jobs that had already
> completed. Investigation showed that the job step's cgroup still existed,
> and these error messages were seen in the slurmd log:
> 
> proctrack_p_wait: Unable to destroy container 66624 in cgroup plugin, giving
> up after 128 sec
> eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated
> 
> I believe that this sequence of events caused this:
> 
> 1. Lustre connectivity goes down
> 2. Application process attempts to write to Lustre and hangs in the kernel
> 3. Job gets canceled, proctrack/cgroup attempts to kill processes
> 4. proctrack/cgroup gives up killing processes and removing cgroup
> 5. Lustre connectivity is restored, processes continue running
> 
> I think we could improve this situation by making the proctrack_p_wait
> timeout configurable, and/or draining the node if proctrack_p_wait is unable
> to kill the processes.

proctrack_p_wait sleeps repeatedly while waiting for the processes to end. It starts by sleeping 1 second on the first try, then 2, then 4, 8, 16, 32 and 64. That makes a total wait time of 126 seconds.

If you have the UnkillableStepTimeout set to a value less than 126, you will get the desired effect of draining the node. If it is bigger, then you will see this situation.

That's because the step_terminate_monitor will wait for this configured time before draining the node, and if it turns out that proctrack finishes earlier, the node won't be drained.

See _monitor() in step_terminate_monitor.c.

Moreover, you can set UnkillableStepProgram to take further action when this situation happens.
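For example, a minimal slurm.conf sketch combining the two settings (the timeout value and the script path are illustrative, not recommendations):

UnkillableStepTimeout=120                            # below the hardcoded ~126 s proctrack wait
UnkillableStepProgram=/etc/slurm/unkillable_step.sh  # e.g. alert admins about the stuck step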

But I see your point: you're right that the proctrack_p_wait timeout in the cgroup code is hardcoded, which means that if UnkillableStepTimeout is greater than 126, it won't behave as one would expect.

What is your UnkillableStepTimeout?
Comment 3 Felip Moll 2021-03-17 09:33:42 MDT
> proctrack_p_wait sleeps several times waiting for the processes to end. It
> starts sleeping 1 second in the first try, then 2, then 4, 8, 16, 32 and 64.
> That makes a total wait time of 126 seconds.

Sorry, 127 seconds is what I meant in my last comment.
Comment 4 David Gloe 2021-03-17 09:40:55 MDT
We have UnkillableStepTimeout set to 3600. We can try reducing the UnkillableStepTimeout to below 127 and see if that helps.
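For reference, the value currently in effect can be confirmed with:

$ scontrol show config | grep -i Unkillable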
Comment 5 Luke Yeager 2021-03-19 09:56:48 MDT
(In reply to Felip Moll from comment #2)
> If you have the UnkillableStepTimeout set to a value less than 126, you will
> get the desired effect of draining the node. If it is bigger, then you will
> see this situation.
Yikes! I'm fixing this on our clusters ASAP. The solution recommended by David is fine, but please also update slurm.conf(5) and/or add a warning in the slurmctld startup log about this.
Comment 6 Felip Moll 2021-03-22 05:59:31 MDT
(In reply to Luke Yeager from comment #5)
> (In reply to Felip Moll from comment #2)
> > If you have the UnkillableStepTimeout set to a value less than 126, you will
> > get the desired effect of draining the node. If it is bigger, then you will
> > see this situation.
> Yikes! I'm fixing this on our clusters ASAP. The solution recommended by
> David is fine, but please also update slurm.conf(5) and/or add a warning in
> the slurmctld startup log about this.

Yep, that's something I will need to fix/improve.

David, let me know if changing that value fixed things for you.
Comment 7 Luke Yeager 2021-03-23 12:56:04 MDT
Actually, I can't reproduce this. For me, the unkillable step results in the node being drained with "Kill task failed" as expected.

Relevant config:
> $ scontrol show config
> ProctrackType           = proctrack/cgroup
> SelectType              = select/cons_tres
> SLURM_VERSION           = 20.11.5
> UnkillableStepTimeout   = 180 sec
Comment 8 David Gloe 2021-03-24 07:05:26 MDT
Luke, could you check the state of the node between the 128 seconds of proctrack_p_wait and the UnkillableStepTimeout? The behavior we're seeing is that the node goes back to idle state so that more jobs run on that node.
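For example, something like this run during that window should show whether the node has already gone back to idle (the node name is illustrative):

$ scontrol show node nid001234 | grep -E 'State|Reason'
$ sinfo -n nid001234 -o '%N %T %E'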
Comment 9 Luke Yeager 2021-03-24 10:31:21 MDT
(In reply to David Gloe from comment #8)
> Luke, could you check the state of the node between the 128 seconds of
> proctrack_p_wait and the UnkillableStepTimeout? The behavior we're seeing is
> that the node goes back to idle state so that more jobs run on that node.
I increased the timeout from 180sec to 600sec and now I can see the issue that David is reporting. I'm using a slightly modified version of this trick to simulate a hung process: https://unix.stackexchange.com/a/305640.

> Mar 24 09:12:37 compute-node-735 slurm-prolog[3908899]: START user=lyeager job=104002
> Mar 24 09:12:44 compute-node-735 slurm-prolog[3909126]: END user=lyeager job=104002 after 7 seconds
> Mar 24 09:12:44 compute-node-735 slurmd[3908816]: prolog for job 104002 ran for 7 seconds
> Mar 24 09:12:46 compute-node-735 slurmd[3908816]: StepId=104002.interactive job_mem:0MB step_mem:0MB
> Mar 24 09:12:57 compute-node-735 kernel: <1>Inserting memory module
> Mar 24 09:13:06 compute-node-735 kernel: <1>going to sleep
> Mar 24 09:13:19 compute-node-735 slurmstepd[3909135]: error: *** STEP 104002.interactive ON compute-node-735 CANCELLED AT 2021-03-24T09:13:19 ***
> Mar 24 09:13:19 compute-node-735 slurmstepd[3909128]: done with job
> Mar 24 09:15:45 compute-node-735 kernel: INFO: task cat:3909163 blocked for more than 120 seconds.
> Mar 24 09:17:46 compute-node-735 kernel: INFO: task cat:3909163 blocked for more than 241 seconds.
> Mar 24 09:18:04 compute-node-735 slurmstepd[3909135]: error: proctrack_p_wait: Unable to destroy container 3909135 in cgroup plugin, giving up after 128 sec
> Mar 24 09:19:04 compute-node-735 slurmstepd[3909135]: error: eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated
> Mar 24 09:19:06 compute-node-735 slurmstepd[3909135]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> Mar 24 09:19:06 compute-node-735 slurmstepd[3909135]: done with job
> Mar 24 09:19:09 compute-node-735 slurm-epilog[3909392]: START user=lyeager job=104002
> Mar 24 09:19:12 compute-node-735 slurm-epilog[3909820]: END user=lyeager job=104002 after 3 seconds
> Mar 24 09:19:47 compute-node-735 kernel: INFO: task cat:3909163 blocked for more than 362 seconds.
And indeed, the node is no longer drained with this sequence of events. That's not good. Fortunately, we already have an extra-paranoid check for stray processes in our epilog (left over from some previous fire-fighting), so we are able to catch this. And we also haven't seen it in the wild yet. But thanks to David for reporting this! Good to know about!

Note that I'm seeing nearly 5 minutes between the cancellation of the job and the "Unable to destroy container" message. So it seems the threshold is actually higher than 127 seconds for some reason.
Comment 10 Felip Moll 2021-04-29 09:56:37 MDT
> Note that I'm seeing nearly 5minutes between the cancellation of the job and
> the "Unable to destroy container" message. So it seems the threshold is
> actually higher than 127 seconds for some reason.

I missed that the sleep is done at the start of the loop, so the total is not 127 but 256 seconds (the last wait is useless).
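To illustrate, the loop has roughly this shape (a simplified sketch, not the verbatim plugin code; the helper is a stand-in):

#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for "does the step's cgroup still contain pids?" */
static bool container_has_pids(void) { return true; }

/* Simplified sketch: the sleep happens before the give-up check, so by the
 * time "giving up after 128 sec" is logged, 1+2+4+...+128 seconds (roughly
 * 256) have already been slept. */
static void proctrack_wait_sketch(void)
{
    int delay = 1;

    while (container_has_pids()) {
        sleep(delay);
        if (delay >= 128) {
            fprintf(stderr, "giving up after %d sec\n", delay);
            break;
        }
        delay *= 2;
    }
}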

I am preparing a fix for this.
Comment 17 Jim Nordby 2021-07-19 13:10:08 MDT
Just curious - how's that fix going?
Comment 20 Felip Moll 2021-07-20 04:49:49 MDT
Created attachment 20443 [details]
bug11103 20.11 patch

Hi,

Sorry for not having updated you sooner. A fix has been pending review for some time, but with 21.08 approaching and other critical issues, the QA team hasn't pushed this one in yet.

It will probably be part of 21.08, but it will not be fixed in 20.11.

In case you're really interested in the fix now, I'm handing over this standalone patch. It applies cleanly to 20.11; if you need instructions on how to apply it, let me know.

I will let you know when the fix is officially in our production release.
Comment 30 Felip Moll 2021-08-26 05:38:40 MDT
Hi David,

First of all, sorry for the delay in resolving this bug.

A fix has finally been included in the latest 21.08.0 release, so it will be present from now on. The change is applied to the cgroup and pgid plugins.

There is a slight change in logic (which should not cause any negative impact): the delay between retries is now a maximum of 32 seconds, since it made no sense to wait up to 127 seconds before retrying. Also, the final SIGKILL, whose result was not checked, will no longer run if the timeout is reached.

commit dbb9fa27b6d91b0bbbc5b28267bd1dc135f9b353
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Tue May 18 14:36:55 2021 +0200
Commit:     Tim Wickberg <tim@schedmd.com>
CommitDate: Wed Aug 25 09:22:20 2021 -0600

    Wait UnkillableStepTimeout in proctrack_g_wait()
    
    There was a hardcoded value of 256 seconds in proctrack_g_wait() in the
    pgid and cgroup plugins that did not match the wait of the
    step_terminate_monitor.
    
    That caused values of UnkillableStepTimeout larger than 256 to not drain
    the node when an unkillable step existed, because proctrack finished before
    this 256 second timeout and the monitor was terminated.
    
    With this fix, proctrack_g_wait() will wait UnkillableStepTimeout instead.
    
    Also change the delay between tests to a maximum of 32 seconds, instead of
    exponentially increasing it every time.
    
    Bug 11103
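As a rough illustration of the new behavior described in the commit message (a sketch based on that description, not the committed code; the helper is a stand-in):

#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for "does the step's cgroup still contain pids?" */
static bool container_has_pids(void) { return true; }

/* Sketch: wait up to UnkillableStepTimeout, with the delay between checks
 * capped at 32 seconds instead of growing without bound. */
static void proctrack_wait_sketch(int unkillable_step_timeout)
{
    int delay = 1;
    int waited = 0;

    while (container_has_pids()) {
        if (waited >= unkillable_step_timeout) {
            fprintf(stderr, "giving up after %d sec\n", waited);
            break;  /* no extra SIGKILL once the timeout is reached */
        }
        sleep(delay);
        waited += delay;
        if (delay < 32)
            delay *= 2;
    }
}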

Thanks for your patience.
I am closing this as resolved.