Bug 2096 - step_extern not cleaned up at job exit
Summary: step_extern not cleaned up at job exit
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 15.08.2
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-11-03 07:41 MST by Ryan Cox
Modified: 2015-12-01 09:20 MST
CC List: 1 user

See Also:
Site: BYU - Brigham Young University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.5 16.05.0-pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---



Description Ryan Cox 2015-11-03 07:41:18 MST
I have noticed that processes in the step_extern cgroups are not cleaned up automatically when a job exits.  I do not know if this is intentional or not.  I could easily add something to epilog to kill all tasks in step_extern/tasks, if needed, but it would be nice if it gets cleaned up automatically.  Otherwise stray tasks can be left behind.

As far as I can tell, anything that is actually launched under Slurm gets cleaned up but not anything in the step_extern cgroup.
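For reference, a minimal sketch of the kind of check (or epilog helper) mentioned above: it lists any PIDs still present in a step_extern tasks file after a job has exited. The path in the usage comment is only an example; the real location depends on the CgroupMountpoint setting, the uid and the job id.

/* list_stray.c - print any PIDs still listed in a cgroup tasks file,
 * e.g. to spot processes left behind in step_extern after a job exits.
 *
 * Example (hypothetical path; adjust to your cgroup mount point):
 *   ./list_stray /sys/fs/cgroup/freezer/slurm/uid_1000/job_1234/step_extern/tasks
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	FILE *fp, *cf;
	int pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <tasks file>\n", argv[0]);
		return 1;
	}
	fp = fopen(argv[1], "r");
	if (!fp) {
		perror("fopen");
		return 1;
	}
	/* The tasks file contains one PID per line. */
	while (fscanf(fp, "%d", &pid) == 1) {
		char comm_path[64], comm[64] = "?";

		snprintf(comm_path, sizeof(comm_path), "/proc/%d/comm", pid);
		cf = fopen(comm_path, "r");
		if (cf) {
			if (fgets(comm, sizeof(comm), cf))
				comm[strcspn(comm, "\n")] = '\0';
			fclose(cf);
		}
		printf("stray pid %d (%s)\n", pid, comm);
	}
	fclose(fp);
	return 0;
}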
Comment 1 David Bigagli 2015-11-04 03:25:48 MST
Hi Ryan, we are looking at your report and will get back to you later.

David
Comment 2 David Bigagli 2015-11-04 21:39:08 MST
Hi Ryan, I cannot reproduce your issue. In my test the step_extern directory is
always cleaned up, no matter whether I use srun, salloc or sbatch.
The only thing I noticed is that step_batch is not cleaned up when a job is
terminated by scancel, but this seems to be a different issue.
I am running on CentOS 6.7 (Final). Which distro are you running, and how
were your jobs terminated?

David
Comment 3 David Bigagli 2015-11-04 22:56:40 MST
I tried CentOS 7 and observed the same behaviour. Bug #2105 is logged to track
step_batch not being cleaned up.

David
Comment 4 Ryan Cox 2015-11-05 02:05:45 MST
It appears that I left out a crucial detail (oops).  The step_extern cgroup won't be cleaned up if something has been adopted into it by pam_slurm_adopt.

The easiest way to reproduce it is to run something like salloc in one window.  If you don't have pam_slurm_adopt set up, simply ssh into the node and write your PID to the step_extern/tasks file.  If pam_slurm_adopt is set up, simply ssh in and check the tasks file for your PID.  Then, from the salloc window, exit the job.  The ssh connection won't be terminated automatically and step_extern will still be there.

This is on RHEL6.
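For reproducing without pam_slurm_adopt, here is a minimal sketch of the manual adoption step described above: it writes a given PID into a cgroup tasks file, which is essentially what pam_slurm_adopt does for an ssh-launched process. The path shown is only an example for a typical cgroup v1 layout.

/* adopt_pid.c - move a process into a cgroup by writing its PID to the
 * cgroup's tasks file (what pam_slurm_adopt does for ssh-launched processes).
 *
 * Example (hypothetical path; adjust for your CgroupMountpoint and job):
 *   ./adopt_pid /sys/fs/cgroup/freezer/slurm/uid_1000/job_1234/step_extern/tasks $$
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	FILE *fp;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <tasks file> <pid>\n", argv[0]);
		return 1;
	}
	fp = fopen(argv[1], "w");
	if (!fp) {
		perror("fopen");
		return 1;
	}
	/* Writing a PID to the tasks file moves that task into the cgroup. */
	if (fprintf(fp, "%s\n", argv[2]) < 0)
		perror("write");
	if (fclose(fp) != 0)
		perror("fclose");	/* the kernel may reject the write here */
	return 0;
}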
Comment 5 David Bigagli 2015-11-05 02:36:47 MST
Thanks for the info, will try it.

David
Comment 6 David Bigagli 2015-11-06 01:12:13 MST
I can confirm this: the slurmstepd that creates step_extern does not
clean up after the job finishes. The exception is when JobAcctGatherType
is set to jobacct_gather/cgroup and there are no pids in the step_extern/tasks
file; in that case the plugin removes step_extern. However, if there are active
pids in the tasks file, the rmdir will fail.

There is also no code to terminate these external pids, and that is why the
ssh connection stays active. I don't have the pam_slurm_adopt code installed,
but I can easily simulate it by starting an xterm and putting its pid
in the tasks file. When the job finishes, the xterm happily stays around.

I am not sure yet why the release agent does not clean up the directory.
Perhaps it needs to be updated as well. We will develop a fix for the 
above problems.

David
Comment 7 Tim Wickberg 2015-11-13 07:57:00 MST
Ryan - 

Do you know if this is still a problem after all the other related step_extern changes?

David - if you have time next week can you attempt to reproduce on 15.08.4? Enough of the step_extern stuff has changed that it may be resolved now.

- Tim
Comment 8 Ryan Cox 2015-11-13 08:37:15 MST
(In reply to Tim Wickberg from comment #7)
> Ryan - 
> 
> Do you know if this is still a problem after all the other related
> step_extern changes?

I tried it earlier today at commit e7a94813bc996a6b22ffeb13cd17d8b0442aaeff and it hasn't been fixed.  It could be related to the processes not being adopted into the cpuset cgroup.  See bug 2097 comment 28.
Comment 9 David Bigagli 2015-11-15 20:26:55 MST
I am looking at this now and will update you guys later.

David
Comment 10 David Bigagli 2015-11-16 01:12:56 MST
I think there are 2 issues here.

1) Too many cgroups are created.
Since the extern processes are handled by the jobacct cgroup plugin, there
is no need to create step_extern directories under the cpuset, devices and freezer
cgroups. Only memory and cpuacct are necessary to capture the remote pids
and track them for accounting purposes.

2) There is no code that cleans up the pids in step_extern. I have a patch for
that which reads the step_extern/task_0/tasks file and sends SIGTERM followed
by SIGKILL to the processes listed in the file.
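A simplified, standalone sketch of the cleanup described in 2), just to make the intended behaviour concrete; this is not the actual Slurm patch, and the grace period is an arbitrary assumption.

/* Read the PIDs listed in a step_extern tasks file and send SIGTERM,
 * then SIGKILL to whatever is still listed after a short grace period.
 * Illustration only, not the actual slurmstepd code.
 */
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void signal_tasks(const char *tasks_path, int sig)
{
	FILE *fp = fopen(tasks_path, "r");
	int pid;

	if (!fp)
		return;
	while (fscanf(fp, "%d", &pid) == 1)
		kill((pid_t) pid, sig);
	fclose(fp);
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <tasks file>\n", argv[0]);
		return 1;
	}
	signal_tasks(argv[1], SIGTERM);	/* ask the external processes to exit */
	sleep(2);			/* arbitrary grace period */
	signal_tasks(argv[1], SIGKILL);	/* force whatever is still listed */
	return 0;
}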

I will have a complete patch in a day or so.

David
Comment 11 Ryan Cox 2015-11-16 05:28:15 MST
>1) Too many cgroups are created.
>Since the extern processes are handled by the jobacct cgroup plugin,
>there
>is no need to create step_extern directories under cpuset, device and
>freezer
>cgroups. Only memory and cpuacct are necessary to capture the remote
>pids
>and track them for accounting purpose.

I'm not quite sure I understand this.  Maybe for the purposes of cleaning up processes and for accounting you don't need the cpuset or devices cgroups.  For controlling which CPUs and GPUs the job has access to, you do.

>2) There is no code that cleans up the pids in step_extern. I have a
>patch for
>that which reads the step_extern/task_0/tasks file and send SIGTERM
>followed
>bu SIGKILL to processes listed in the file.
>
>I will have a complete patch in a day or so.

I'm in the airport and can't test it at the moment, but I think that a new task is created each time stepd_add_extern_pid is called.  You may already be planning to handle that but I figured I should point it out.
Comment 12 David Bigagli 2015-11-16 05:49:04 MST
Hi, in the current implementation only the jobacct gather cgroup plugin handles the external processes, i.e. the adopted ones. What would be the purpose of having the external pids in the other cgroups like freezer or cpuset, unless you want to treat these pids just like all the pids of the job itself? If that is the case, much more work needs to be done.

Comment 13 Ryan Cox 2015-11-16 09:49:36 MST
(In reply to David Bigagli from comment #12)
> Hi in the current implementation only the jobacct gather cgroup plugin
> handles the external processes, the adopted ones. What would be the purpose
> of having the external pids in the other cgroups like freezer or cpuset?
> Unless you want to treat these pids just like all the pids of the job
> itself. If this is the case way more work needs to be done.

Basically the extern step + stepd_add_extern_pid + pam_slurm_adopt is meant to provide the following for ssh-launched processes:
1) Proper limit enforcement  (cpuset, memory, and devices cgroups)
2) Accounting in the slurm database
3) Cleanup of stray processes when the job exits

#1 was done when I used cgroups directly, but the cpuset functionality broke when stepd_add_extern_pid replaced the direct usage of cgroups.

#2 was identified as needing a new RPC in bug 2097 comment 15.  That new RPC (stepd_add_extern_pid) was what broke #1's cpuset cgroup usage (I'm not sure about devices since we don't currently use it).  #2 now works.

#3 is this bug report.
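As a small diagnostic related to #1 and #2, here is a sketch that prints which cgroups a given PID currently belongs to (the same information as reading /proc/<pid>/cgroup), which is a quick way to check whether an ssh-launched process actually landed in the job's cpuset/memory/devices step_extern hierarchies.

/* Print the cgroup membership of a process, equivalent to reading
 * /proc/<pid>/cgroup. Useful to verify where an adopted ssh process ended up.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[512];
	FILE *fp;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);
	fp = fopen(path, "r");
	if (!fp) {
		perror("fopen");
		return 1;
	}
	/* Each line has the form hierarchy-id:controller:path, e.g.
	 *   4:cpuset:/slurm/uid_1000/job_1234/step_extern
	 */
	while (fgets(line, sizeof(line), fp))
		fputs(line, stdout);
	fclose(fp);
	return 0;
}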
Comment 14 Ryan Cox 2015-11-16 10:25:29 MST
Here is some additional information in case you want more details:
http://tech.ryancox.net/2015/11/pamslurmadopt.html
http://tech.ryancox.net/2015/04/caller-id-handling-ssh-launched-processes-in-slurm.html
Comment 15 David Bigagli 2015-11-17 00:44:58 MST
In commit 4f8741873e285d I have added code to clean up the external processes
listed in the step_extern/task_0/tasks file. Once there are no running tasks under
the given cgroup, the cgroup itself can be removed using rmdir().
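As an illustration of that last step (not the actual commit), a sketch of removing the emptied cgroup directory: rmdir() fails with EBUSY while tasks remain, so it can be retried briefly. The retry count and delay here are arbitrary assumptions.

/* Remove a cgroup directory once it is empty. rmdir() returns EBUSY while
 * the cgroup still contains tasks, so retry a few times. Illustration only.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int remove_cgroup_dir(const char *path)
{
	int i;

	for (i = 0; i < 10; i++) {
		if (rmdir(path) == 0)
			return 0;	/* cgroup removed */
		if (errno != EBUSY)
			break;		/* unexpected error, give up */
		usleep(100000);		/* wait for the last tasks to go away */
	}
	perror("rmdir");
	return -1;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <cgroup dir>\n", argv[0]);
		return 1;
	}
	return remove_cgroup_dir(argv[1]) ? 1 : 0;
}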

David
Comment 16 Danny Auble 2015-11-25 06:34:10 MST
Ryan, David's patch appears to clean things up; let us know if you see otherwise.  In my testing it worked, leaving the processes dead and the directory all cleaned up afterwards.

Please reopen this if you find otherwise.
Comment 17 Danny Auble 2015-12-01 04:28:40 MST
Ryan, I just altered this slightly in commits

7ff083533e2e
8a49f409ad21

Let me know if you see anything else on this.  I think this will clean things up correctly now.
Comment 18 Ryan Cox 2015-12-01 09:20:30 MST
Thanks.  Hopefully I'll have time to test it this week along with the newer fixes for bug 2097.