Bug 1054 - proctrack/cgroup not ending processes
Summary: proctrack/cgroup not ending processes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 14.03.4
Hardware: Linux
OS: Linux
Importance: --- 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-08-21 09:34 MDT by Jon Loptien
Modified: 2014-08-27 06:52 MDT
CC List: 1 user

See Also:
Site: University of Colorado
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
cgroup.conf (296 bytes, application/octet-stream)
2014-08-21 10:04 MDT, Jon Loptien
slurm.conf (5.36 KB, application/octet-stream)
2014-08-21 10:05 MDT, Jon Loptien

Description Jon Loptien 2014-08-21 09:34:58 MDT
Hello,
We are using cgroups to track processes and to limit memory and core usage on shared nodes. We have noticed that not all of a job's processes are cancelled when the walltime is reached or when the user scancels the job.

Our tests show that whichever step is running at the time is not killed, although any subsequent tasks do not start. A simple test we ran is:
#!/bin/bash

echo "Start 100"
sleep 100
echo "Start 200"
sleep 200
echo "end 200"

Then cancel the job, or set the time limit shorter than the sleeps, and look for the surviving sleep process on the assigned node.
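For reference, this is roughly how we drive the reproducer (the script name is just an example, and cnode0101 stands in for whichever node the job lands on):

# submit the script above with a 1-minute limit, shorter than the sleeps
sbatch --time=00:01:00 sleep_test.sh
# once the job has timed out, confirm it is gone from the queue
squeue -u $USER
# then check the assigned node for a surviving sleep process
ssh cnode0101 "ps -ef | grep '[s]leep'"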

We are running RHEL 6.3 with:
Linux cnode0101 2.6.32-279.5.2.el6.x86_64 #1 SMP Tue Aug 14 11:36:39 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Slurm version is 14.03.4. The slurmd logs look like the job is being killed okay; the job processes just aren't being cleaned up.

Let me know what other information you need.

Thanks,
Jon
Comment 1 Danny Auble 2014-08-21 09:59:52 MDT
Hey Jon, I am unable to reproduce this using your reproducer.

I have tried with a 2.6.32-358 kernel as well as a 3.13.0-34 and it appears to clean things up as expected.

Could you send your slurm.conf as well as your cgroup.conf?
Comment 2 Jon Loptien 2014-08-21 10:04:02 MDT
Created attachment 1152 [details]
cgroup.conf
Comment 3 Jon Loptien 2014-08-21 10:05:05 MDT
Created attachment 1153 [details]
slurm.conf
Comment 4 Danny Auble 2014-08-21 10:10:10 MDT
Thanks Jon, those look fine to me.  Just to verify, you are starting your script with sbatch, correct?

Could you send me the slurmd.log output from when your job starts until it finishes? Log level "debug" would be nice.
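One way to get that, as a rough sketch (the log path is whatever SlurmdLogFile points to on your nodes; /var/log/slurm/slurmd.log is just an example):

# in slurm.conf on the compute nodes:
#   SlurmdDebug=debug
# then push the change out and watch the node's log while the job runs
scontrol reconfigure
tail -f /var/log/slurm/slurmd.log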
Comment 5 Jon Loptien 2014-08-21 10:23:04 MDT
This was the sleep job with a 2-minute time limit; the second sleep was still running after the job timed out.


[2014-08-21T16:14:16.112] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2014-08-21T16:14:16.113] debug:  Calling /curc/slurm/slurm/14.03.4/sbin/slurmstepd spank prolog
[2014-08-21T16:14:16.133] Reading slurm.conf file: /curc/slurm/slurm/etc/slurm.conf
[2014-08-21T16:14:16.136] Running spank/prolog for jobid [147160] uid [69309]
[2014-08-21T16:14:16.136] spank: opening plugin stack /curc/slurm/slurm/etc/plugstack.conf
[2014-08-21T16:14:16.136] spank: /curc/slurm/slurm/current/lib/slurm/spank_pbs.so: no callbacks in this context
[2014-08-21T16:14:16.144] debug:  [job 147160] attempting to run prolog [/curc/slurm/slurm/scripts/prolog]
[2014-08-21T16:14:21.377] Launching batch job 147160 for UID 69309
[2014-08-21T16:14:21.384] debug level is 6.
[2014-08-21T16:14:21.385] switch NONE plugin loaded
[2014-08-21T16:14:21.385] setup for a batch_job
[2014-08-21T16:14:21.386] AcctGatherProfile NONE plugin loaded
[2014-08-21T16:14:21.386] AcctGatherEnergy NONE plugin loaded
[2014-08-21T16:14:21.387] AcctGatherInfiniband NONE plugin loaded
[2014-08-21T16:14:21.387] AcctGatherFilesystem NONE plugin loaded
[2014-08-21T16:14:21.388] Reading acct_gather.conf file /curc/slurm/slurm/etc/acct_gather.conf
[2014-08-21T16:14:21.389] Reading cgroup.conf file /curc/slurm/slurm/etc/cgroup.conf
[2014-08-21T16:14:21.390] Job accounting gather cgroup plugin loaded
[2014-08-21T16:14:21.390] jobacct_gather dynamic logging disabled
[2014-08-21T16:14:21.393] [147160] laying out the 1 tasks on 1 hosts cnode0104 dist 2
[2014-08-21T16:14:21.394] [147160] Message thread started pid = 23927
[2014-08-21T16:14:21.395] [147160] Reading cgroup.conf file /curc/slurm/slurm/etc/cgroup.conf
[2014-08-21T16:14:21.396] [147160] task/cgroup: now constraining jobs allocated cores
[2014-08-21T16:14:21.396] [147160] task/cgroup/memory: total:96865M allowed:100%(enforced), swap:0%(permissive), max:100%(96865M) max+swap:100%(193730M) min:30M
[2014-08-21T16:14:21.396] [147160] task/cgroup: now constraining jobs allocated memory
[2014-08-21T16:14:21.396] [147160] task/cgroup: loaded
[2014-08-21T16:14:21.396] [147160] Reading cgroup.conf file /curc/slurm/slurm/etc/cgroup.conf
[2014-08-21T16:14:21.397] [147160] Checkpoint plugin loaded: checkpoint/none
[2014-08-21T16:14:21.397] [147160] mpi type = (null)
[2014-08-21T16:14:21.405] [147160] Before call to spank_init()
[2014-08-21T16:14:21.405] [147160] spank: opening plugin stack /curc/slurm/slurm/etc/plugstack.conf
[2014-08-21T16:14:21.405] [147160] spank: /curc/slurm/slurm/etc/plugstack.conf:1: Loaded plugin spank_pbs.so
[2014-08-21T16:14:21.405] [147160] After call to spank_init()
[2014-08-21T16:14:21.405] [147160] mpi type = (null)
[2014-08-21T16:14:21.405] [147160] mpi/none: slurmstepd prefork
[2014-08-21T16:14:21.405] [147160] unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
[2014-08-21T16:14:21.405] [147160] unable to get parameter 'cpus' for '/cgroup/cpuset/slurm'
[2014-08-21T16:14:21.406] [147160] task/cgroup: job abstract cores are '22'
[2014-08-21T16:14:21.406] [147160] task/cgroup: step abstract cores are '22'
[2014-08-21T16:14:21.406] [147160] task/cgroup: job physical cores are '10'
[2014-08-21T16:14:21.406] [147160] task/cgroup: step physical cores are '10'
[2014-08-21T16:14:21.413] [147160] cpu_freq_cgroup_validate: request value =            0         0
[2014-08-21T16:14:21.413] [147160]   jobid=147160, stepid=4294967294, tasks=1 cpu/task=0, cpus=1
[2014-08-21T16:14:21.413] [147160]   cpu_bind_type=   0, cpu_bind map=(null)
[2014-08-21T16:14:21.413] [147160]   step logical cores = 22, step physical cores = 10
[2014-08-21T16:14:21.413] [147160] task/cgroup: /slurm/uid_69309/job_147160: alloc=1700MB mem.limit=1700MB memsw.limit=unlimited
[2014-08-21T16:14:21.413] [147160] task/cgroup: /slurm/uid_69309/job_147160/step_4294967294: alloc=1700MB mem.limit=1700MB memsw.limit=unlimited
[2014-08-21T16:14:21.413] [147160] Uncached user/gid: loptienj/69309
[2014-08-21T16:14:21.672] [147160] Entering _setup_normal_io
[2014-08-21T16:14:21.672] [147160] Uncached user/gid: loptienj/69309
[2014-08-21T16:14:21.737] [147160] Leaving  _setup_normal_io
[2014-08-21T16:14:21.737] [147160] debug level = 2
[2014-08-21T16:14:21.737] [147160] Uncached user/gid: loptienj/69309
[2014-08-21T16:14:21.738] [147160] task 0 (23934) started 2014-08-21T16:14:21
[2014-08-21T16:14:21.738] [147160] Setting slurmstepd oom_adj to -1000
[2014-08-21T16:14:21.743] [147160] adding task 0 pid 23934 on node 0 to jobacct
[2014-08-21T16:14:21.743] [147160] jobacct_gather_cgroup_cpuacct_attach_task: jobid 147160 stepid 4294967294 taskid 0 max_task_id 0
[2014-08-21T16:14:21.747] [147160] jobacct_gather_cgroup_memory_attach_task: jobid 147160 stepid 4294967294 taskid 0 max_task_id 0
[2014-08-21T16:14:21.755] [147160] 23934 mem size 0 112288 time 0(0+0)
[2014-08-21T16:14:21.755] [147160] _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq 
[2014-08-21T16:14:21.755] [147160] _get_sys_interface_freq_line: filename = /proc/cpuinfo 
[2014-08-21T16:14:21.755] [147160] cpunfo_frequency= 2793002
[2014-08-21T16:14:21.755] [147160] Task average frequency = 2793002 pid 23934 mem size 0 112288 time 0(0+0)
[2014-08-21T16:14:21.755] [147160] energycounted = 0
[2014-08-21T16:14:21.755] [147160] getjoules_task energy = 0
[2014-08-21T16:14:21.755] [147160] Job 147160 memory used:0 limit:1740800 KB
[2014-08-21T16:14:21.757] [147160] job_container none plugin loaded
[2014-08-21T16:14:21.763] [147160] Uncached user/gid: loptienj/69309
[2014-08-21T16:14:21.764] [147160] spank: spank_pbs.so: task_init = 0
[2014-08-21T16:14:21.764] [147160] [job 147160] attempting to run slurm task_prolog [/curc/slurm/slurm/scripts/taskprolog]
[2014-08-21T16:14:21.770] [147160] export name:SLURM_SCRATCH:val:/local/scratch/147160:
[2014-08-21T16:14:21.774] [147160] export name:PBS_NODEFILE:val:/local/scratch/147160/147160.nodelist:
[2014-08-21T16:14:21.774] [147160] Couldn't find SLURM_RLIMIT_CPU in environment
[2014-08-21T16:14:21.774] [147160] Couldn't find SLURM_RLIMIT_FSIZE in environment
[2014-08-21T16:14:21.774] [147160] Couldn't find SLURM_RLIMIT_DATA in environment
[2014-08-21T16:14:21.774] [147160] Couldn't find SLURM_RLIMIT_STACK in environment
[2014-08-21T16:14:21.774] [147160] Couldn't find SLURM_RLIMIT_CORE in environment
[2014-08-21T16:14:21.774] [147160] Couldn't find SLURM_RLIMIT_RSS in environment
[2014-08-21T16:14:21.775] [147160] Couldn't find SLURM_RLIMIT_NPROC in environment
[2014-08-21T16:14:21.775] [147160] Couldn't find SLURM_RLIMIT_NOFILE in environment
[2014-08-21T16:14:21.775] [147160] Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[2014-08-21T16:14:21.775] [147160] Couldn't find SLURM_RLIMIT_AS in environment
[2014-08-21T16:14:21.775] [147160] Set task rss(1700 MB)
[2014-08-21T16:14:47.719] debug2: got this type of message 1011
[2014-08-21T16:14:47.719] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2014-08-21T16:14:47.719] debug:  attempting to run health_check [/curc/slurm/slurm/scripts/health_check.sh]
[2014-08-21T16:15:32.767] debug2: got this type of message 1011
[2014-08-21T16:15:32.767] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2014-08-21T16:15:32.768] debug:  attempting to run health_check [/curc/slurm/slurm/scripts/health_check.sh]
[2014-08-21T16:16:17.149] debug2: got this type of message 1011
[2014-08-21T16:16:17.149] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2014-08-21T16:16:17.150] debug:  attempting to run health_check [/curc/slurm/slurm/scripts/health_check.sh]
[2014-08-21T16:16:28.423] debug2: got this type of message 6009
[2014-08-21T16:16:28.423] debug2: Processing RPC: REQUEST_KILL_TIMELIMIT
[2014-08-21T16:16:28.464] [147160] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-08-21T16:16:28.464] debug2: container signal 996 to job 147160.4294967294
[2014-08-21T16:16:28.464] [147160] Handling REQUEST_SIGNAL_CONTAINER
[2014-08-21T16:16:28.464] [147160] _handle_signal_container for step=147160.4294967294 uid=0 signal=996
[2014-08-21T16:16:28.464] [147160] *** JOB 147160 CANCELLED AT 2014-08-21T16:16:28 DUE TO TIME LIMIT ***
[2014-08-21T16:16:28.466] debug2: No steps in jobid 147160 to send signal 15
[2014-08-21T16:16:28.467] Job 147160: timeout: sent SIGTERM to 0 active steps
[2014-08-21T16:16:28.467] debug:  _rpc_terminate_job, uid = 515
[2014-08-21T16:16:28.469] debug:  credential for job 147160 revoked
[2014-08-21T16:16:28.505] debug2: container signal 18 to job 147160.4294967294
[2014-08-21T16:16:28.505] [147160] Handling REQUEST_SIGNAL_CONTAINER
[2014-08-21T16:16:28.505] [147160] _handle_signal_container for step=147160.4294967294 uid=0 signal=18
[2014-08-21T16:16:28.505] [147160] killing process 23934 (slurm_task) with signal 18
[2014-08-21T16:16:28.505] [147160] Sent signal 18 to 147160.4294967294
[2014-08-21T16:16:28.508] debug2: container signal 15 to job 147160.4294967294
[2014-08-21T16:16:28.508] [147160] Handling REQUEST_SIGNAL_CONTAINER
[2014-08-21T16:16:28.508] [147160] _handle_signal_container for step=147160.4294967294 uid=0 signal=15
[2014-08-21T16:16:28.508] [147160] killing process 23934 (slurm_task) with signal 15
[2014-08-21T16:16:28.508] [147160] Sent signal 15 to 147160.4294967294
[2014-08-21T16:16:28.508] [147160] unable to open '/cgroup/memory/slurm/uid_69309/job_147160/step_4294967294/task_0/memory.stat' for reading : No such file or directory
[2014-08-21T16:16:28.509] [147160] unable to get parameter 'memory.stat' for '/cgroup/memory/slurm/uid_69309/job_147160/step_4294967294/task_0'
[2014-08-21T16:16:28.510] Purging vestigial job script /curc/slurm/slurm/state/job147160/slurm_script
[2014-08-21T16:16:28.511] debug2: set revoke expiration for jobid 147160 to 1408660588 UTS
[2014-08-21T16:16:28.514] debug:  Waiting for job 147160's prolog to complete
[2014-08-21T16:16:28.514] debug:  Finished wait for job 147160's prolog to complete
[2014-08-21T16:16:28.514] debug:  Calling /curc/slurm/slurm/14.03.4/sbin/slurmstepd spank epilog
[2014-08-21T16:16:28.517] Reading slurm.conf file: /curc/slurm/slurm/etc/slurm.conf
[2014-08-21T16:16:28.520] Running spank/epilog for jobid [147160] uid [69309]
[2014-08-21T16:16:28.520] spank: opening plugin stack /curc/slurm/slurm/etc/plugstack.conf
[2014-08-21T16:16:28.521] spank: /curc/slurm/slurm/current/lib/slurm/spank_pbs.so: no callbacks in this context
[2014-08-21T16:16:28.526] debug:  [job 147160] attempting to run epilog [/curc/slurm/slurm/scripts/epilog]
[2014-08-21T16:16:28.530] debug:  completed epilog for jobid 147160
[2014-08-21T16:16:28.532] debug:  Job 147160: sent epilog complete msg: rc = 0
[2014-08-21T16:16:46.103] debug2: got this type of message 4005
Comment 6 Danny Auble 2014-08-21 10:48:52 MDT
One interesting thing is

[2014-08-21T16:14:21.405] [147160] unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
[2014-08-21T16:14:21.405] [147160] unable to get parameter 'cpus' for '/cgroup/cpuset/slurm'

I am wondering if that is related or not.  I see a couple of other cgroup errors as well.

Do you happen to have a newer kernel to test against?  I am wondering if the 2.6.32-279 kernel is missing the cgroup fixes that went into 2.6.38.  I know there were many cgroup issues in the kernel before that.
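A couple of quick checks on the compute node might help narrow it down (/cgroup is the mount point shown in your log; adjust if yours differs):

# kernel actually running on the node
uname -r
# is the cpuset controller mounted where Slurm expects it?
grep cgroup /proc/mounts
# does the slurm cpuset hierarchy exist and contain the expected files?
ls /cgroup/cpuset/slurm/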

I am also wondering if the stepd is still there or not.  Usually when one finishes you will see a line like

[2014-08-21T15:31:07.072] [65542] done with job

I don't see that in the log, so either the stepd died somehow or it is still there.

Could you verify this?  If it is not there perhaps look for core files in the log directory.
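Something along these lines on the compute node should show it (the log and spool paths are just examples; use whatever SlurmdLogFile and SlurmdSpoolDir are set to on your install):

# is a slurmstepd for job 147160 still around after the job ends?
ps -ef | grep '[s]lurmstepd'
# any core files left in the slurmd log or spool directories?
ls -l /var/log/slurm/core* /var/spool/slurmd/core* 2>/dev/null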
Comment 7 Danny Auble 2014-08-26 06:39:00 MDT
Jon, have you been able to look at the questions from the last Comment?
Comment 8 Jon Loptien 2014-08-26 06:57:06 MDT
I have been working on getting a newer kernel version to test. Unfortunately, all of our nodes are diskless, so I can't easily update to a newer kernel.

I do not see the stepd process once the job finishes, and I do not see any core files. I have seen the "done with job" line for some of my tests but not for others, and in the majority of cases (95%+ of the time) the job step keeps running after the job finishes. Rarely, the step will die on its own, but that may just be the sleep actually finishing before I track it down.

I will update the ticket when I test a newer kernel; hopefully this afternoon or tomorrow.

Thanks.
Comment 9 Jon Loptien 2014-08-27 06:48:28 MDT
I was able to try a newer kernel and all the processes were cleaned up. We will work to get all of our compute nodes to a working kernel version.

Thanks for your help.
Comment 10 David Bigagli 2014-08-27 06:52:37 MDT
Fixed by new kernel.

David