Created attachment 1414 [details] slurm.conf

I've had cgroup memory enforcement turned on for several months now. It typically works fine and kills jobs that go over the memory limit, but some jobs cause bad things to happen. For these jobs, when the oom-killer gets invoked the job stays alive and makes the node almost completely unresponsive. It isn't possible to ssh to the node, and if an ssh session is already open it is no longer possible to actually run commands. However, if I have an ssh session open with top running, I can kill the job processes and make the node functional again.

The majority of jobs that get into this state are running matlab, but I've seen plenty of non-matlab jobs do this as well. I currently have a matlab job I can use to reproduce the issue, although it takes about 30 minutes to reach the state where the hang actually happens.

I saw something from IBM about a potential clocksource issue with memory cgroup enforcement; the recommendation was to switch from tsc to hpet. I tried that, but there was no change. Here is that link:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20(HPC)%20Central/page/Preventing%20System%20Issues%20Related%20to%20Memory%20Overcommitment

I'll attach my slurm.conf, cgroup.conf, and a dmesg from the node I was testing on. Have you heard of anyone else experiencing problems like this?
Created attachment 1415 [details] cgroup.conf
Created attachment 1416 [details] node dmesg

This is a dmesg from a node where I started the matlab job that causes the problem. I killed the job manually from a top session I already had open on that node.
Hi, no, we have not heard of this; as far as I can recall you are the first to report the problem. So the matlab executable uses a lot of memory and hangs the node even though it is running under cgroup control. Is this a correct description?

David
There is a note in the cgroup guide http://slurm.schedmd.com/cgroups.html that reads:

"There can be a serious performance problem with memory cgroups on conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due to contention between processors for a spinlock. This problem seems to have been completely fixed in the 2.6.38 kernel."

I am mentioning this since your kernel is 2.6.32.

David
(In reply to David Bigagli from comment #4)
> There is a note in the cgroup guide http://slurm.schedmd.com/cgroups.html
> that reads:
>
> "There can be a serious performance problem with memory cgroups on
> conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due
> to contention between processors for a spinlock. This problem seems to have
> been completely fixed in the 2.6.38 kernel."
>
> I am mentioning this since your kernel is 2.6.32.

It is my understanding that the redhat 2.6.32 kernel has several backports to improve cgroup memory performance. Is that not accurate?

Your previous comment is correct: matlab uses a lot of memory and then hangs the node while under cgroup control, even though the oom-killer kills at least some of the processes in the memory cgroup. It does not go to swap. I'll attach an htop screenshot of my processes after the hang occurs.
Created attachment 1417 [details] htop with matlab processes
Honestly, I don't know about the backports; I can do a little research on them.

Can you help me understand the numbers in htop? The machine has 32GB of RAM and the resident set size of the matlab processes is ~22GB, so how can 25GB be free? And since the swap usage is 0, where are the ~76GB of virtual space that matlab appears to use?
(In reply to David Bigagli from comment #7)
> Can you help me understand the numbers in htop?

VIRT is the amount of memory a process has mapped and could access. This includes memory-mapped files, so it doesn't directly translate to OS memory+swap actually allocated. RES really is the physical memory allocated to the process.

There is more than just the 22GB allocated to matlab because of other OS processes. Around 1 GB is allocated to the GPFS daemon (mmfsd) for its pagepool; that is pinned memory, so it can't be swapped out. I'm wondering whether the pinned memory from other processes is what is actually causing this problem. I reduced MaxRAMPercent and AllowedRAMSpace to 95 and the job was killed as expected. I am testing again with 99 percent.

The only problem with this theory is that the job I am testing with only requests 2GB of memory and 12 cores, so it should only be able to use 24 GB anyway.
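For anyone following along, here is a minimal, hypothetical C demo (unrelated to the matlab job itself) of why VIRT can dwarf RES: it maps a large anonymous region without touching it, so VmSize jumps while VmRSS barely moves until pages are actually written.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Print the VmSize (VIRT) and VmRSS (RES) lines from /proc/self/status. */
static void show(const char *tag)
{
    char line[256];
    FILE *fp = fopen("/proc/self/status", "r");

    if (!fp)
        return;
    while (fgets(line, sizeof(line), fp)) {
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            printf("%-22s %s", tag, line);
    }
    fclose(fp);
}

int main(void)
{
    size_t len = 8UL * 1024 * 1024 * 1024;    /* 8 GB of address space */
    void *p;

    show("before mmap:");
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    show("after mmap:");                      /* VIRT grows by ~8 GB, RES barely moves */
    memset(p, 1, 64UL * 1024 * 1024);         /* touch only 64 MB of it */
    show("after touching 64 MB:");            /* RES grows by ~64 MB only */
    munmap(p, len);
    return 0;
}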
Thanks for the explanation. I am still wondering about the 25794MB of free space when it should show 32000-22000=10000 or less, as you say. I was about to suggest lowering AllowedRAMSpace just to get more room.

David
(In reply to David Bigagli from comment #9)

The 25749MB is used memory, not free.

The test job was partially killed at 99%, but it still wasn't quite right, so I've dropped the allowed memory and swap to 98%, which made it die more normally. I'll have to see what happens cluster-wide with this setting. Lately I've been seeing 1-2 jobs per day causing this behavior, so I should be able to tell in a couple of days.
Thanks. I was looking at my htop and for some reason was thinking it was free memory. Now the numbers add up. :-) Keep us posted. David
Created attachment 1430 [details] kernel dmesg
I attached a file with kernel messages from another node that had problems. I had this set in cgroup.conf:

MaxRAMPercent=90
AllowedRAMSpace=90
AllowedSwapSpace=10

At 90%, this one behaved a little differently. It went completely unresponsive, including network (no ping), for about 15 minutes. The network recovered, but the node remained in a bad state for about another 20 minutes. Finally, the bad process was killed and the node became responsive again. slurmd ended up dying on the node due to the loss of network: Slurm is installed on GPFS and the node got kicked out of the GPFS cluster, so the process died.

The logs start with this:

als invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
als cpuset=step_2 mems_allowed=0-1
Pid: 6523, comm: als Not tainted 2.6.32-431.20.5.el6.x86_64 #1

But no process actually gets killed by the oom-killer after that first message. The kernel messages say pid 6523 is failing to allocate memory, and messages like this go on for an extended period of time:

als: page allocation failure. order:0, mode:0x120
Pid: 6523, comm: als Not tainted 2.6.32-431.20.5.el6.x86_64 #1

Finally, pid 6518 does actually get killed, which causes the node to recover:

Memory cgroup out of memory: Kill process 6518 (als) score 1000 or sacrifice child
Killed process 6518, UID 1582893990, (als) total-vm:34200724kB, anon-rss:29414128kB, file-rss:3668kB

At least in this instance the oom-killer was initially trying to do something, but it got hung up in bad ways. I don't think slurm currently uses the memory cgroup oom notifications (memory.oom_control). Maybe one possibility would be for slurm to use those and kill the job cgroup processes itself when an oom notification is received?
I did some experimentation and found a few things; perhaps you know them already, but I think they are worth mentioning. I think it is not only the memory size that triggers the oom but also the memory access patterns.

I slightly modified a memory hog program provided by redhat:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

I attach the program. I created a memory cgroup with a 5MB limit, 'echo 5242880 > memory.limit_in_bytes', and tried several allocation sizes. I see that I can allocate way more than 5MB, and it depends on the block size I use for malloc (1*MB, 5*MB, etc.). Most of the time the OS can handle the load, allocating virtual space and keeping the resident set size around 5MB, more or less. When I increase the allocation block size to 10*MB the program does at most 5 iterations, then it gets killed by the oom-killer; if oom is disabled it just hangs. I mention this as it shows that the simple math of percentages is perhaps not enough; this is something that we, on the Slurm side, should investigate more.

Now let me answer your question about using oom messages. In theory we could do that: first disable the kernel oom killer via memory.oom_control, then have slurmstepd create an event, get notified when the program exceeds the threshold, and kill it. I tried it and it works. I used the notification program that is at the same url as above. Of course this would be a development project, probably for the next major release, 15.08.

David
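For reference, here is a minimal sketch of the notification mechanism described above, using the cgroup v1 memory.oom_control / cgroup.event_control eventfd interface. The cgroup path is purely illustrative, and this is only a stand-alone listener, not the actual slurmstepd change; a real integration would watch the step's own cgroup and signal the step's tasks when the event fires.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    /* Illustrative path only; a real step cgroup path would be used here. */
    const char *cg = "/cgroup/memory/slurm/uid_1000/job_123/step_0";
    char path[512], buf[64];
    int efd, ofd, cfd;
    uint64_t count;

    efd = eventfd(0, 0);                        /* counter fd we will block on */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    ofd = open(path, O_RDONLY);                 /* file being watched */

    /* Register "<eventfd> <oom_control fd>" with cgroup.event_control.
     * (Writing "1" to memory.oom_control beforehand would also disable the
     * kernel oom killer for the cgroup, as described above.) */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    cfd = open(path, O_WRONLY);
    snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
    if (efd < 0 || ofd < 0 || cfd < 0 || write(cfd, buf, strlen(buf)) < 0) {
        perror("oom notification setup");
        return 1;
    }

    /* Blocks until the cgroup hits its memory limit. */
    if (read(efd, &count, sizeof(count)) == sizeof(count)) {
        /* Here slurmstepd could walk the cgroup's tasks file and SIGKILL
         * the step itself instead of relying on the kernel oom killer. */
        printf("cgroup reported %llu oom event(s)\n",
               (unsigned long long)count);
    }
    return 0;
}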
Created attachment 1432 [details] memory hog
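Since the attachment itself isn't visible inline, here is a rough stand-in (not the attached program) for the kind of hog used in the experiment above: it allocates and touches memory in a configurable block size until the cgroup limit intervenes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(int argc, char **argv)
{
    /* Block size in MB is the first argument, default 1 (e.g. "./hog 10"). */
    size_t block = (argc > 1 ? (size_t)atoi(argv[1]) : 1) * MB;
    size_t total = 0;

    for (;;) {
        char *p = malloc(block);
        if (!p) {
            fprintf(stderr, "malloc failed after %zu MB\n", total / MB);
            return 1;
        }
        memset(p, 1, block);            /* touch every page so it becomes resident */
        total += block;
        printf("allocated and touched %zu MB\n", total / MB);
        sleep(1);
    }
}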
Please reopen if necessary. David
Pushing this off several months is a bit problematic. I'm seeing multiple machines per day affected by this problem. I really need a way to reliably kill jobs that exceed their memory limits, because those nodes affect GPFS performance for the entire cluster. This specific problem causes file system snapshots to fail because the nodes can't quiesce I/O traffic.

Previously, when I used proctrack/linuxproc, it failed to track certain processes, which allowed excessive swapping on a machine. When that happens, multiple nodes (even those not swapping) can be expelled from the GPFS cluster due to slow response.
Which processes did proctrack/linuxproc fail to track? If a process forks twice it escapes the parent/child relationship and cannot be tracked. For this you could try proctrack/pgid, which tracks by process group id rather than by parent/child relationship. I believe that using cgroups is the right approach, but it has some bugs that affect its functionality, and I am not sure what we can do about that.

David
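For reference, the switch being suggested is a one-line slurm.conf change (this assumes the rest of the attached slurm.conf stays unchanged; note that changing ProctrackType generally requires restarting the Slurm daemons):

ProctrackType=proctrack/pgid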
One of my colleagues suggested another alternative for tracking processes reliably: proctrack/sgi_job. This needs a kernel module described here:

http://oss.sgi.com/projects/pagg/

David
(In reply to David Bigagli from comment #19)
> One of my colleagues suggested another alternative for tracking processes
> reliably: proctrack/sgi_job. This needs a kernel module described here:
>
> http://oss.sgi.com/projects/pagg/

proctrack/cgroup can already reliably track the process ids for a job in the freezer cgroup. Would it be possible to implement an option that enforces memory in proctrack/cgroup with the logic from proctrack/linuxproc, but retrieving the pid list for a job from the freezer cgroup? That would mean memory cgroups wouldn't even need to be enabled to enforce memory limits.

BTW, matlab and stata were probably the main programs that escaped memory enforcement under proctrack/linuxproc.
If I understand correctly, you mean that while tracking the pids in the freezer cgroup we would also kill them when they exceed the memory limit. Is that correct?

David
(In reply to David Bigagli from comment #21)
> If I understand correctly, you mean that while tracking the pids in the
> freezer cgroup we would also kill them when they exceed the memory limit.
> Is that correct?

Actually, I see that without memory cgroups enabled, jobacct_gather/linux gathers memory stats by walking the pid tree and then kills jobs/steps that have exceeded the memory limit. If I am understanding things correctly, an option would be to make jobacct_gather/linux read the freezer cgroups for the job/step pid list instead of walking a pid tree. Then I could completely disable memory enforcement in cgroup.conf and jobacct_gather/linux would handle killing jobs/steps.

Would jobacct_gather/cgroup already work like this? It is still labeled as experimental in the man pages, so I've never tried it.
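To make the proposal concrete, here is a hedged sketch of the idea (not how the current jobacct_gather/linux code works): take the pid list from the job's freezer cgroup and sum VmRSS from /proc for each pid, instead of walking the process tree. The cgroup path and the 2 GB limit are purely illustrative.

#include <stdio.h>

/* Return the VmRSS of a pid in kB, or 0 if the process is gone. */
static long rss_kb(int pid)
{
    char path[64], line[256];
    long kb = 0;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    fp = fopen(path, "r");
    if (!fp)
        return 0;
    while (fgets(line, sizeof(line), fp))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(fp);
    return kb;
}

int main(void)
{
    /* Illustrative freezer cgroup path; Slurm's real layout may differ. */
    const char *tasks = "/cgroup/freezer/slurm/uid_1000/job_123/tasks";
    long limit_kb = 2L * 1024 * 1024;   /* e.g. a 2 GB memory request */
    long total_kb = 0;
    int pid;

    FILE *fp = fopen(tasks, "r");
    if (!fp) {
        perror(tasks);
        return 1;
    }
    while (fscanf(fp, "%d", &pid) == 1)
        total_kb += rss_kb(pid);
    fclose(fp);

    printf("job RSS: %ld kB (limit %ld kB)\n", total_kb, limit_kb);
    if (total_kb > limit_kb)
        printf("over limit: this is where the job/step would be signalled\n");
    return 0;
}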
Let me investigate this and get back to you. David
In theory it is possible, but currently the code does not work this way. Using jobacct_gather/cgroup does work in the sense that all the pids are under the /cgroup/cpuset and /cgroup/cpuacct trees. This would need new development, which I am not sure we want to do, as it would be a hack to work around a cgroup bug.

I am surprised matlab forks twice; usually these tools have other things to do. It would be interesting if you could track the process tree and also the session to which these runaway processes belong. My suggestion would be to use ProctrackType=proctrack/pgid, which kills all processes in the session regardless of the parent/child relationship. This works very well, and I doubt that these processes set a new session.

David
The proctrack plugin doesn't actually matter for this. The jobacct_gather/linux plugin walks the process tree regardless of the proctrack plugin being used.

I retested with cgroup memory enforcement disabled to look at the jobacct_gather memory enforcement problems. I don't know if I misdiagnosed things before or the circumstances were different, but I can tell what is happening now. For the node I am testing, RealMemory is set to 32000. If I submit a job requesting all of the memory and run something that continues to allocate memory (like the mem-hog test from redhat), then, because of the other processes on the node, it is not possible for that process to ever reach the RSS memory limit: it starts swapping before it can reach it. If I change the memory request in the job to 30000, it does get killed by jobacct_gather/linux.

I don't think there is any equivalent to the cgroup AllowedRAMSpace for jobacct_gather/linux to limit the actual memory usage to a percentage of the user request. It would be kind of annoying from a user perspective if they had to use multiples of a number like 1875 (30000 / 16) for the --mem-per-cpu option. Maybe a lua submission filter could automatically adjust the memory request?

Also, I finally found some details about the memory cgroup kernel problems:

https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html

I see that at least some of the kernel patches mentioned in that thread were applied to a newer version of the redhat kernel than I'm currently using, so I'm probably going to try an updated kernel with memory cgroup enforcement as well to see if that changes the memory cgroup problem.
(In reply to wettstein from comment #25)
> The proctrack plugin doesn't actually matter for this. The jobacct_gather/linux
> plugin walks the process tree regardless of the proctrack plugin being used.

Note there is also a jobacct_gather/cgroup plugin that identifies the jobs on the basis of cgroups.

> I don't think there is any equivalent to the cgroup AllowedRAMSpace for
> jobacct_gather/linux to limit the actual memory usage to a percentage of the
> user request.

Slurm does enforce real and virtual memory limits (here is a snippet of the code from src/slurmd/slurmd/req.c):

    if ((job_mem_info_ptr[i].mem_limit != 0) &&
        (job_mem_info_ptr[i].mem_used > job_mem_info_ptr[i].mem_limit)) {
            info("Job %u exceeded memory limit (%u>%u), "
                 "cancelling it",
                 job_mem_info_ptr[i].job_id,
                 job_mem_info_ptr[i].mem_used,
                 job_mem_info_ptr[i].mem_limit);
            _cancel_step_mem_limit(job_mem_info_ptr[i].job_id, NO_VAL);
    } else if ((job_mem_info_ptr[i].vsize_limit != 0) &&
               (job_mem_info_ptr[i].vsize_used >
                job_mem_info_ptr[i].vsize_limit)) {
            info("Job %u exceeded virtual memory limit (%u>%u), "
                 "cancelling it",
                 job_mem_info_ptr[i].job_id,
                 job_mem_info_ptr[i].vsize_used,
                 job_mem_info_ptr[i].vsize_limit);
            _cancel_step_mem_limit(job_mem_info_ptr[i].job_id, NO_VAL);
    }

Will the VSizeFactor configuration parameter satisfy your needs? This is from the slurm.conf man page:

VSizeFactor
    Memory specifications in job requests apply to real memory size (also known as resident set size). It is possible to enforce virtual memory limits for both jobs and job steps by limiting their virtual memory to some percentage of their real memory allocation. The VSizeFactor parameter specifies the job's or job step's virtual memory limit as a percentage of its real memory limit. For example, if a job's real memory limit is 500MB and VSizeFactor is set to 101 then the job will be killed if its real memory exceeds 500MB or its virtual memory exceeds 505MB (101 percent of the real memory limit). The default value is 0, which disables enforcement of virtual memory limits. The value may not exceed 65533 percent.
I think setting a meaningful limit on VSize would probably be impossible. It is the entire address space of the program which includes memory mapped files and shared libraries.
I think that proctrack does matter, because if you use

ProctrackType=proctrack/pgid

then all pids belonging to the session started by slurmstepd will be signalled, regardless of whether they fork or not.

David
In my last response, I said I retested the problem. The problem with jobacct_gather/linux doing memory enforcement is that, due to pinned memory from other processes, the processes started during a batch job can never actually reach the 32000MB RSS limit. They start to swap before they can reach the limit. Depending on what is run, the process may end up using enough swap to cause severe node performance problems and/or invoke the machine's OOM-killer.

Also, jobacct_gather/linux uses its own method to determine the processes in a job. Changing the proctrack plugin does not change that, so jobacct_gather/linux would collect the same memory stats regardless of the proctrack plugin. The kill mechanism of the proctrack plugin would not matter, since the job is never going to be signalled in the first place.

I have not yet updated to the latest RHEL kernel to see if the cgroup memory hangs are improved with that.
Both statements are correct:

> The proctrack plugin doesn't actually matter for this. The jobacct_gather/linux
> plugin walks the process tree regardless of the proctrack plugin being used.

> Also, jobacct_gather/linux uses its own method to determine the processes in
> a job. Changing the proctrack plugin does not change that, so
> jobacct_gather/linux would collect the same memory stats regardless of the
> proctrack plugin.

What I was trying to say is the following: jobacct_gather and proctrack serve different purposes. One is used to gather accounting, the other to track and act on the processes of a job. Indeed you can use one without the other. That's why doing job control from jobacct_gather is not the right solution. I was suggesting proctrack/pgid to do job control in case the applications you are dealing with fork and cannot be traced by proctrack/linuxproc.

David
Actually, I stand corrected on one part.

The jobacct_gather plugin uses the mechanism of the proctrack plugin to get the pid information. The jobacct_gather plugin invokes the proctrack plugin, which on linux returns a list of pids from /proc; jobacct_gather then uses those to get the accounting info, while proctrack itself uses the information for job control.

The code that collects the pids in the case of jobacct_gather/linux is in src/plugins/jobacct_gather/linux/jobacct_gather_p_poll_data(), which invokes jag_common_poll_data()/_get_precs() in common_jag.c, which in turn invokes the proctrack plugin.

Sorry for the confusion.

David
Ok, that's good to know; it makes sense that it works like that. The proctrack/cgroup plugin is almost certainly the better plugin to use on linux.

The original problem is still what I need to have addressed, though. Here is a recap of the current options/problems:

proctrack/cgroup with ConstrainRAMSpace=yes hangs the machine in certain instances. There may be improvements with the Redhat 6.6 kernel for this situation (2.6.32-504+), but I haven't had time to set up a new node image with that kernel.

Disabling ConstrainRAMSpace makes jobacct_gather/linux collect job memory stats and kill when over the limit. At least in our circumstances, jobs that request all of the memory in a machine won't be killed, because they can never exceed the memory limit. This is the case on nodes with 32 GB of RAM and RealMemory set to 32000. Setting RealMemory to something else would be a usability problem for users.

I'll try to get an image using the new kernel set up this week. An option for jobacct_gather/linux to kill at a percentage of the requested memory might be a possibility if cgroup memory enforcement remains a problem.
I updated a node to the latest kernel release from Redhat (2.6.32-504.1.3.el6.x86_64). My test job still put the machine into a very bad state when it reached the memory limit.
Have you tried to reproduce this outside of Slurm? Create the cgroup manually, set the memory limits, then start the application and put its pid into the tasks file.

I think the 2.6 kernel series is behind; CentOS 7 has 3.10 and Ubuntu 14.10 has 3.16.

David
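Here is a minimal sketch of that kind of reproduction, assuming a cgroup v1 memory controller mounted under /cgroup/memory and run as root; the cgroup name, the 5 MB limit, and the command are only examples. It creates the cgroup, sets memory.limit_in_bytes, puts the child's pid into the tasks file, and then execs the target program, e.g. "./cgrun /cgroup/memory/test ./hog 10".

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Write a short string into a file inside the cgroup directory. */
static int write_str(const char *dir, const char *file, const char *val)
{
    char path[512];
    int fd;

    snprintf(path, sizeof(path), "%s/%s", dir, file);
    fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (write(fd, val, strlen(val)) < 0) {
        perror(path);
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <cgroup dir> <command> [args...]\n", argv[0]);
        return 1;
    }
    const char *cg = argv[1];

    mkdir(cg, 0755);                                       /* create the cgroup (may already exist) */
    if (write_str(cg, "memory.limit_in_bytes", "5242880")) /* 5 MB, as in the earlier test */
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        char buf[32];
        snprintf(buf, sizeof(buf), "%d", (int)getpid());
        write_str(cg, "tasks", buf);                       /* move the child into the cgroup */
        execvp(argv[2], &argv[2]);
        perror("execvp");
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    return 0;
}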
Switching away from a Redhat 6 based operating system is not going to happen for a long time, so we'll continue to use the redhat 2.6.32 kernel line.

I have hardcoded a 90% memory limit factor and disabled memory cgroup enforcement. Slurm now successfully kills the test jobs that previously started swapping before they could exceed the memory limit. I've made this change live on our system. Here is the diff:

diff --git a/src/common/slurm_jobacct_gather.c b/src/common/slurm_jobacct_gather.c
index bc76656..e981a3d 100644
--- a/src/common/slurm_jobacct_gather.c
+++ b/src/common/slurm_jobacct_gather.c
@@ -536,7 +536,7 @@ extern int jobacct_gather_set_mem_limit(uint32_t job_id, uint32_t step_id,
     jobacct_job_id = job_id;
     jobacct_step_id = step_id;
-    jobacct_mem_limit = mem_limit * 1024;    /* MB to KB */
+    jobacct_mem_limit = mem_limit * 1024 * 0.9;    /* MB to KB. Hardcode 90% RSS factor (wettstein) */
     jobacct_vmem_limit = jobacct_mem_limit;
     jobacct_vmem_limit *= (slurm_get_vsize_factor() / 100.0);
     return SLURM_SUCCESS;

diff --git a/src/slurmd/slurmd/req.c b/src/slurmd/slurmd/req.c
index f3ea177..1a0ce87 100644
--- a/src/slurmd/slurmd/req.c
+++ b/src/slurmd/slurmd/req.c
@@ -2046,6 +2046,7 @@ _enforce_job_mem_limit(void)
     vsize_factor = slurm_get_vsize_factor();
     for (i=0; i<job_cnt; i++) {
+        job_mem_info_ptr[i].mem_limit *= 0.9;    /* hardcode 90% RSS factor (wettstein) */
         job_mem_info_ptr[i].vsize_limit = job_mem_info_ptr[i].
             mem_limit;
         job_mem_info_ptr[i].vsize_limit *= (vsize_factor / 100.0);
Thanks for the update. We will close this problem for now; please reopen if necessary.

David