Created attachment 1414 [details] slurm.conf

I've had cgroup memory enforcement turned on for several months now. It typically works fine and kills jobs that go over the memory limit, but some jobs cause bad things to happen. For these jobs, when the oom-killer gets invoked the job stays alive and makes the node almost completely unresponsive. It isn't possible to ssh to the node, and if an ssh session is already open it is no longer possible to actually run commands. However, if I have an ssh session open with top running, I can kill the job processes and make the node functional again.

The majority of jobs that get into this state are running matlab, but I've seen plenty of non-matlab jobs do this as well. I currently have a matlab job I can use to reproduce the issue, although it takes about 30 minutes to reach the state where the hang actually happens.

I saw something from IBM about a potential clocksource issue with memory cgroup enforcement; the recommendation was to switch from tsc to hpet. I tried that, but there was no change. Here is that link:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20(HPC)%20Central/page/Preventing%20System%20Issues%20Related%20to%20Memory%20Overcommitment

I'll attach my slurm.conf, cgroup.conf, and a dmesg from the node I was testing on. Have you heard of anyone else experiencing problems like this?
Created attachment 1415 [details] cgroup.conf
Created attachment 1416 [details] node dmesg

This is a dmesg from a node where I started the matlab job that causes the problem. I killed the job manually from a top session I already had open on that node.
Hi, no, we have not heard of this; as far as I can recall you are the first to report the problem. So the matlab executable uses a lot of memory and hangs the node even though it is running under cgroup control. Is this a correct description?

David
There is a note in the cgroup guide http://slurm.schedmd.com/cgroups.html that reads:

"There can be a serious performance problem with memory cgroups on conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due to contention between processors for a spinlock. This problem seems to have been completely fixed in the 2.6.38 kernel."

I am mentioning this since your kernel is 2.6.32.

David
(In reply to David Bigagli from comment #4)
> There is a note in the cgroup guide http://slurm.schedmd.com/cgroups.html
> that reads:
>
> "There can be a serious performance problem with memory cgroups on
> conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due
> to contention between processors for a spinlock. This problem seems to have
> been completely fixed in the 2.6.38 kernel."
>
> I am mentioning this since your kernel is 2.6.32.

It is my understanding that the redhat 2.6.32 kernel has several backports to improve cgroup memory performance. Is that not accurate?

Your previous comment is correct: matlab uses a lot of memory and then hangs the node while under cgroup control, even though the oom-killer kills at least some of the processes in the memory cgroup. It does not go to swap. I'll attach an htop screenshot of my processes after the hang occurs.
Created attachment 1417 [details] htop with matlab processes
Honestly, I don't know about the backports; I can do a little research on them.

Can you help me understand the numbers in htop? The machine has 32GB of RAM and the resident set size of the matlab processes is ~22GB, so how can 25GB be free? And since the swap usage is 0, where are the ~76GB of virtual space that matlab appears to use?
(In reply to David Bigagli from comment #7)
> Can you help me understand the numbers in htop?

VIRT is the amount of memory a process has mapped and could access. This includes memory-mapped files, so it doesn't directly translate to OS memory+swap actually allocated. RES really is the physical memory allocated to the process.

There is more than just the 22GB allocated to matlab because of other OS processes. Around 1 GB is allocated to the GPFS daemon (mmfsd) for its pagepool; that is pinned memory, so it can't be swapped out. I'm wondering whether the pinned memory from other processes is what is actually causing this problem. I reduced MaxRAMPercent and AllowedRAMSpace to 95 and the job was killed as expected. I am testing again with 99 percent.

The only problem with this theory is that the job I am testing with only requests 2GB of memory and 12 cores, so it should only be able to use 24 GB anyway.
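For anyone following along, here is a minimal, hypothetical C demo (unrelated to the matlab job itself) of why VIRT can dwarf RES: it maps a large anonymous region without touching it, so VmSize jumps while VmRSS barely moves until pages are actually written.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Print the VmSize (VIRT) and VmRSS (RES) lines from /proc/self/status. */
static void show(const char *tag)
{
    char line[256];
    FILE *fp = fopen("/proc/self/status", "r");

    if (!fp)
        return;
    while (fgets(line, sizeof(line), fp)) {
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            printf("%-22s %s", tag, line);
    }
    fclose(fp);
}

int main(void)
{
    size_t len = 8UL * 1024 * 1024 * 1024;    /* 8 GB of address space */
    void *p;

    show("before mmap:");
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    show("after mmap:");                      /* VIRT grows by ~8 GB, RES barely moves */
    memset(p, 1, 64UL * 1024 * 1024);         /* touch only 64 MB of it */
    show("after touching 64 MB:");            /* RES grows by ~64 MB only */
    munmap(p, len);
    return 0;
}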
Thanks for the explanation. I am still wondering about the 25794MB of free space when it should show 32000-22000=10000 or less, as you say. I was about to suggest lowering AllowedRAMSpace just to get more room.

David
(In reply to David Bigagli from comment #9)

The 25749MB is used memory, not free.

The test job was partially killed at 99%, but it still wasn't quite right, so I've dropped the allowed memory and swap to 98%, which made it die more normally. I'll have to see what happens cluster-wide with this setting. Lately I've been seeing 1-2 jobs per day causing this behavior, so I should be able to tell in a couple of days.
Thanks. I was looking at my htop and for some reason was thinking it was free memory. Now the numbers add up. :-) Keep us posted. David
Created attachment 1430 [details] kernel dmesg
I attached a file with kernel messages from another node that had problems. I had this set in cgroup.conf:

MaxRAMPercent=90
AllowedRAMSpace=90
AllowedSwapSpace=10

At 90%, this one behaved a little differently. It went completely unresponsive, including network (no ping), for about 15 minutes. The network recovered, but the node remained in a bad state for about another 20 minutes. Finally, the bad process was killed and the node became responsive again. slurmd ended up dying on the node due to the loss of network: Slurm is installed on GPFS and the node got kicked out of the GPFS cluster, so the process died.

The logs start with this:

als invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
als cpuset=step_2 mems_allowed=0-1
Pid: 6523, comm: als Not tainted 2.6.32-431.20.5.el6.x86_64 #1

But no process actually gets killed by the oom-killer after that first message. The kernel messages say pid 6523 is failing to allocate memory, and messages like this go on for an extended period of time:

als: page allocation failure. order:0, mode:0x120
Pid: 6523, comm: als Not tainted 2.6.32-431.20.5.el6.x86_64 #1

Finally, pid 6518 does actually get killed, which causes the node to recover:

Memory cgroup out of memory: Kill process 6518 (als) score 1000 or sacrifice child
Killed process 6518, UID 1582893990, (als) total-vm:34200724kB, anon-rss:29414128kB, file-rss:3668kB

At least in this instance the oom-killer was initially trying to do something, but it got hung up in bad ways. I don't think slurm currently uses the memory cgroup oom notifications (memory.oom_control). Maybe one possibility would be for slurm to use those and kill the job cgroup processes itself when an oom notification is received?
I did some experimentation and found a few things; perhaps you know them already, but I think they are worth mentioning. I think it is not only the memory size that triggers the oom but also the memory access patterns.

I slightly modified a memory hog program provided by redhat:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

I attach the program. I created a memory cgroup with a 5MB limit, 'echo 5242880 > memory.limit_in_bytes', and tried several allocation sizes. I see that I can allocate way more than 5MB, and it depends on the block size I use for malloc (1*MB, 5*MB, etc.). Most of the time the OS can handle the load, allocating virtual space and keeping the resident set size around 5MB, more or less. When I increase the allocation block size to 10*MB the program does at most 5 iterations, then it gets killed by the oom-killer; if oom is disabled it just hangs. I mention this as it shows that the simple math of percentages is perhaps not enough; this is something that we, on the Slurm side, should investigate more.

Now let me answer your question about using oom messages. In theory we could do that: first disable the kernel oom killer via memory.oom_control, then have slurmstepd create an event, get notified when the program exceeds the threshold, and kill it. I tried it and it works. I used the notification program that is at the same url as above. Of course this would be a development project, probably for the next major release, 15.08.

David
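For reference, here is a minimal sketch of the notification mechanism described above, using the cgroup v1 memory.oom_control / cgroup.event_control eventfd interface. The cgroup path is purely illustrative, and this is only a stand-alone listener, not the actual slurmstepd change; a real integration would watch the step's own cgroup and signal the step's tasks when the event fires.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    /* Illustrative path only; a real step cgroup path would be used here. */
    const char *cg = "/cgroup/memory/slurm/uid_1000/job_123/step_0";
    char path[512], buf[64];
    int efd, ofd, cfd;
    uint64_t count;

    efd = eventfd(0, 0);                        /* counter fd we will block on */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    ofd = open(path, O_RDONLY);                 /* file being watched */

    /* Register "<eventfd> <oom_control fd>" with cgroup.event_control.
     * (Writing "1" to memory.oom_control beforehand would also disable the
     * kernel oom killer for the cgroup, as described above.) */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    cfd = open(path, O_WRONLY);
    snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
    if (efd < 0 || ofd < 0 || cfd < 0 || write(cfd, buf, strlen(buf)) < 0) {
        perror("oom notification setup");
        return 1;
    }

    /* Blocks until the cgroup hits its memory limit. */
    if (read(efd, &count, sizeof(count)) == sizeof(count)) {
        /* Here slurmstepd could walk the cgroup's tasks file and SIGKILL
         * the step itself instead of relying on the kernel oom killer. */
        printf("cgroup reported %llu oom event(s)\n",
               (unsigned long long)count);
    }
    return 0;
}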
Created attachment 1432 [details] memory hog
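Since the attachment itself isn't visible inline, here is a rough stand-in (not the attached program) for the kind of hog used in the experiment above: it allocates and touches memory in a configurable block size until the cgroup limit intervenes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(int argc, char **argv)
{
    /* Block size in MB is the first argument, default 1 (e.g. "./hog 10"). */
    size_t block = (argc > 1 ? (size_t)atoi(argv[1]) : 1) * MB;
    size_t total = 0;

    for (;;) {
        char *p = malloc(block);
        if (!p) {
            fprintf(stderr, "malloc failed after %zu MB\n", total / MB);
            return 1;
        }
        memset(p, 1, block);            /* touch every page so it becomes resident */
        total += block;
        printf("allocated and touched %zu MB\n", total / MB);
        sleep(1);
    }
}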
Please reopen if necessary. David
Pushing this off several months is a bit problematic. I'm seeing multiple machines per day affected by this problem. I really need a way to reliably kill jobs that exceed their memory limits, because those nodes affect GPFS performance for the entire cluster. This specific problem causes file system snapshots to fail because the nodes can't quiesce I/O traffic.

Previously, when I used proctrack/linuxproc, it failed to track certain processes, which allowed excessive swapping on a machine. When that happens, multiple nodes (even those not swapping) can be expelled from the GPFS cluster due to slow response.
Which processes did proctrack/linuxproc fail to track? If a process forks twice it escapes the parent/child relationship and cannot be tracked. For this you could try proctrack/pgid, which tracks by process group id rather than by parent/child relationship. I believe that using cgroups is the right approach, but it has some bugs that affect its functionality, and I am not sure what we can do about that.

David
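For reference, the switch being suggested is a one-line slurm.conf change (this assumes the rest of the attached slurm.conf stays unchanged; note that changing ProctrackType generally requires restarting the Slurm daemons):

ProctrackType=proctrack/pgid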
One of my colleagues suggested another alternative for tracking processes reliably: proctrack/sgi_job. This needs a kernel module described here:

http://oss.sgi.com/projects/pagg/

David
(In reply to David Bigagli from comment #19)
> One of my colleagues suggested another alternative for tracking processes
> reliably: proctrack/sgi_job. This needs a kernel module described here:
>
> http://oss.sgi.com/projects/pagg/

proctrack/cgroup can already reliably track the process ids for a job in the freezer cgroup. Would it be possible to implement an option that enforces memory in proctrack/cgroup with the logic from proctrack/linuxproc, but retrieving the pid list for a job from the freezer cgroup? That would mean memory cgroups wouldn't even need to be enabled to enforce memory limits.

BTW, matlab and stata were probably the main programs that escaped memory enforcement under proctrack/linuxproc.
If I understand correctly, you mean that while tracking the pids in the freezer cgroup we would also kill them when they exceed the memory limit. Is that correct?

David
(In reply to David Bigagli from comment #21)
> If I understand correctly, you mean that while tracking the pids in the
> freezer cgroup we would also kill them when they exceed the memory limit.
> Is that correct?

Actually, I see that without memory cgroups enabled, jobacct_gather/linux gathers memory stats by walking the pid tree and then kills jobs/steps that have exceeded the memory limit. If I am understanding things correctly, an option would be to make jobacct_gather/linux read the freezer cgroups for the job/step pid list instead of walking a pid tree. Then I could completely disable memory enforcement in cgroup.conf and jobacct_gather/linux would handle killing jobs/steps.

Would jobacct_gather/cgroup already work like this? It is still labeled as experimental in the man pages, so I've never tried it.
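To make the proposal concrete, here is a hedged sketch of the idea (not how the current jobacct_gather/linux code works): take the pid list from the job's freezer cgroup and sum VmRSS from /proc for each pid, instead of walking the process tree. The cgroup path and the 2 GB limit are purely illustrative.

#include <stdio.h>

/* Return the VmRSS of a pid in kB, or 0 if the process is gone. */
static long rss_kb(int pid)
{
    char path[64], line[256];
    long kb = 0;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    fp = fopen(path, "r");
    if (!fp)
        return 0;
    while (fgets(line, sizeof(line), fp))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(fp);
    return kb;
}

int main(void)
{
    /* Illustrative freezer cgroup path; Slurm's real layout may differ. */
    const char *tasks = "/cgroup/freezer/slurm/uid_1000/job_123/tasks";
    long limit_kb = 2L * 1024 * 1024;   /* e.g. a 2 GB memory request */
    long total_kb = 0;
    int pid;

    FILE *fp = fopen(tasks, "r");
    if (!fp) {
        perror(tasks);
        return 1;
    }
    while (fscanf(fp, "%d", &pid) == 1)
        total_kb += rss_kb(pid);
    fclose(fp);

    printf("job RSS: %ld kB (limit %ld kB)\n", total_kb, limit_kb);
    if (total_kb > limit_kb)
        printf("over limit: this is where the job/step would be signalled\n");
    return 0;
}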
Let me investigate this and get back to you. David
In theory it is possible, but currently the code does not work this way. Using jobacct_gather/cgroup does work in the sense that all the pids are under the /cgroup/cpuset and /cgroup/cpuacct trees. This would need new development, which I am not sure we want to do, as it would be a hack to work around a cgroup bug.

I am surprised matlab forks twice; usually these tools have other things to do. It would be interesting if you could track the process tree and also the session to which these runaway processes belong. My suggestion would be to use ProctrackType=proctrack/pgid, which kills all processes in the session regardless of the parent/child relationship. This works very well, and I doubt that these processes set a new session.

David
The proctrack plugin doesn't actually matter for this. The jobacct_gather/linux plugin walks the process tree regardless of the proctrack plugin being used.

I retested with cgroup memory enforcement disabled to look at the jobacct_gather memory enforcement problems. I don't know if I misdiagnosed things before or the circumstances were different, but I can tell what is happening now. For the node I am testing, RealMemory is set to 32000. If I submit a job requesting all of the memory and run something that continues to allocate memory (like the mem-hog test from redhat), then, because of the other processes on the node, it is not possible for that process to ever reach the RSS memory limit: it starts swapping before it can reach it. If I change the memory request in the job to 30000, it does get killed by jobacct_gather/linux.

I don't think there is any equivalent to the cgroup AllowedRAMSpace for jobacct_gather/linux to limit the actual memory usage to a percentage of the user request. It would be kind of annoying from a user perspective if they had to use multiples of a number like 1875 (30000 / 16) for the --mem-per-cpu option. Maybe a lua submission filter could automatically adjust the memory request?

Also, I finally found some details about the memory cgroup kernel problems:

https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html

I see that at least some of the kernel patches mentioned in that thread were applied to a newer version of the redhat kernel than I'm currently using, so I'm probably going to try an updated kernel with memory cgroup enforcement as well to see if that changes the memory cgroup problem.
(In reply to wettstein from comment #25)
> The proctrack plugin doesn't actually matter for this. The jobacct_gather/linux
> plugin walks the process tree regardless of the proctrack plugin being used.

Note there is also a jobacct_gather/cgroup plugin that identifies the jobs on the basis of cgroups.

> I don't think there is any equivalent to the cgroup AllowedRAMSpace for
> jobacct_gather/linux to limit the actual memory usage to a percentage of the
> user request.

Slurm does enforce real and virtual memory limits (here is a snippet of the code from src/slurmd/slurmd/req.c):

    if ((job_mem_info_ptr[i].mem_limit != 0) &&
        (job_mem_info_ptr[i].mem_used > job_mem_info_ptr[i].mem_limit)) {
            info("Job %u exceeded memory limit (%u>%u), "
                 "cancelling it",
                 job_mem_info_ptr[i].job_id,
                 job_mem_info_ptr[i].mem_used,
                 job_mem_info_ptr[i].mem_limit);
            _cancel_step_mem_limit(job_mem_info_ptr[i].job_id, NO_VAL);
    } else if ((job_mem_info_ptr[i].vsize_limit != 0) &&
               (job_mem_info_ptr[i].vsize_used >
                job_mem_info_ptr[i].vsize_limit)) {
            info("Job %u exceeded virtual memory limit (%u>%u), "
                 "cancelling it",
                 job_mem_info_ptr[i].job_id,
                 job_mem_info_ptr[i].vsize_used,
                 job_mem_info_ptr[i].vsize_limit);
            _cancel_step_mem_limit(job_mem_info_ptr[i].job_id, NO_VAL);
    }

Will the VSizeFactor configuration parameter satisfy your needs? This is from the slurm.conf man page:

VSizeFactor
    Memory specifications in job requests apply to real memory size (also known as resident set size). It is possible to enforce virtual memory limits for both jobs and job steps by limiting their virtual memory to some percentage of their real memory allocation. The VSizeFactor parameter specifies the job's or job step's virtual memory limit as a percentage of its real memory limit. For example, if a job's real memory limit is 500MB and VSizeFactor is set to 101 then the job will be killed if its real memory exceeds 500MB or its virtual memory exceeds 505MB (101 percent of the real memory limit). The default value is 0, which disables enforcement of virtual memory limits. The value may not exceed 65533 percent.
I think setting a meaningful limit on VSize would probably be impossible. It is the entire address space of the program which includes memory mapped files and shared libraries.
I think that proctrack does matter, because if you use

ProctrackType=proctrack/pgid

then all pids belonging to the session started by slurmstepd will be signalled, regardless of whether they fork or not.

David
In my last response, I said I retested the problem. The problem with jobacct_gather/linux doing memory enforcement is that, due to pinned memory from other processes, the processes started during a batch job can never actually reach the 32000MB RSS limit. They start to swap before they can reach the limit. Depending on what is run, the process may end up using enough swap to cause severe node performance problems and/or invoke the machine's OOM-killer.

Also, jobacct_gather/linux uses its own method to determine the processes in a job. Changing the proctrack plugin does not change that, so jobacct_gather/linux would collect the same memory stats regardless of the proctrack plugin. The kill mechanism of the proctrack plugin would not matter, since the job is never going to be signalled in the first place.

I have not yet updated to the latest RHEL kernel to see if the cgroup memory hangs are improved with that.
Both statements are correct:

> The proctrack plugin doesn't actually matter for this. The jobacct_gather/linux
> plugin walks the process tree regardless of the proctrack plugin being used.

> Also, jobacct_gather/linux uses its own method to determine the processes in
> a job. Changing the proctrack plugin does not change that, so
> jobacct_gather/linux would collect the same memory stats regardless of the
> proctrack plugin.

What I was trying to say is the following: jobacct_gather and proctrack serve different purposes. One is used to gather accounting, the other to track and act on the processes of a job. Indeed you can use one without the other. That's why doing job control from jobacct_gather is not the right solution. I was suggesting proctrack/pgid to do job control in case the applications you are dealing with fork and cannot be traced by proctrack/linuxproc.

David
Actually, I stand corrected on one part.

The jobacct_gather plugin uses the mechanism of the proctrack plugin to get the pid information. The jobacct_gather plugin invokes the proctrack plugin, which on linux returns a list of pids from /proc; jobacct_gather then uses those to get the accounting info, while proctrack itself uses the information for job control.

The code that collects the pids in the case of jobacct_gather/linux is in src/plugins/jobacct_gather/linux/jobacct_gather_p_poll_data(), which invokes jag_common_poll_data()/_get_precs() in common_jag.c, which in turn invokes the proctrack plugin.

Sorry for the confusion.

David
Ok, that's good to know; it makes sense that it works like that. The proctrack/cgroup plugin is almost certainly the better plugin to use on linux.

The original problem is still what I need to have addressed, though. Here is a recap of the current options/problems:

proctrack/cgroup with ConstrainRAMSpace=yes hangs the machine in certain instances. There may be improvements with the Redhat 6.6 kernel for this situation (2.6.32-504+), but I haven't had time to set up a new node image with that kernel.

Disabling ConstrainRAMSpace makes jobacct_gather/linux collect job memory stats and kill when over the limit. At least in our circumstances, jobs that request all of the memory in a machine won't be killed, because they can never exceed the memory limit. This is the case on nodes with 32 GB of RAM and RealMemory set to 32000. Setting RealMemory to something else would be a usability problem for users.

I'll try to get an image using the new kernel set up this week. An option for jobacct_gather/linux to kill at a percentage of the requested memory might be a possibility if cgroup memory enforcement remains a problem.
I updated a node to the latest kernel release from Redhat (2.6.32-504.1.3.el6.x86_64). My test job still put the machine into a very bad state when it reached the memory limit.
Have you tried to reproduce this outside of Slurm? Create the cgroup manually, set the memory limits, then start the application and put its pid into the tasks file.

I think the 2.6 kernel series is behind; CentOS 7 has 3.10 and Ubuntu 14.10 has 3.16.

David
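Here is a minimal sketch of that kind of reproduction, assuming a cgroup v1 memory controller mounted under /cgroup/memory and run as root; the cgroup name, the 5 MB limit, and the command are only examples. It creates the cgroup, sets memory.limit_in_bytes, puts the child's pid into the tasks file, and then execs the target program, e.g. "./cgrun /cgroup/memory/test ./hog 10".

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Write a short string into a file inside the cgroup directory. */
static int write_str(const char *dir, const char *file, const char *val)
{
    char path[512];
    int fd;

    snprintf(path, sizeof(path), "%s/%s", dir, file);
    fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (write(fd, val, strlen(val)) < 0) {
        perror(path);
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <cgroup dir> <command> [args...]\n", argv[0]);
        return 1;
    }
    const char *cg = argv[1];

    mkdir(cg, 0755);                                       /* create the cgroup (may already exist) */
    if (write_str(cg, "memory.limit_in_bytes", "5242880")) /* 5 MB, as in the earlier test */
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        char buf[32];
        snprintf(buf, sizeof(buf), "%d", (int)getpid());
        write_str(cg, "tasks", buf);                       /* move the child into the cgroup */
        execvp(argv[2], &argv[2]);
        perror("execvp");
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    return 0;
}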
Switching away from a Redhat 6 based operating system is not going to happen for a long time, so we'll continue to use the redhat 2.6.32 kernel line.

I have hardcoded a 90% memory limit factor and disabled memory cgroup enforcement. Slurm now successfully kills the test jobs that previously started swapping before they could exceed the memory limit. I've made this change live on our system. Here is the diff:

diff --git a/src/common/slurm_jobacct_gather.c b/src/common/slurm_jobacct_gather.c
index bc76656..e981a3d 100644
--- a/src/common/slurm_jobacct_gather.c
+++ b/src/common/slurm_jobacct_gather.c
@@ -536,7 +536,7 @@ extern int jobacct_gather_set_mem_limit(uint32_t job_id, uint32_t step_id,
     jobacct_job_id = job_id;
     jobacct_step_id = step_id;
-    jobacct_mem_limit = mem_limit * 1024;    /* MB to KB */
+    jobacct_mem_limit = mem_limit * 1024 * 0.9;    /* MB to KB. Hardcode 90% RSS factor (wettstein) */
     jobacct_vmem_limit = jobacct_mem_limit;
     jobacct_vmem_limit *= (slurm_get_vsize_factor() / 100.0);
     return SLURM_SUCCESS;

diff --git a/src/slurmd/slurmd/req.c b/src/slurmd/slurmd/req.c
index f3ea177..1a0ce87 100644
--- a/src/slurmd/slurmd/req.c
+++ b/src/slurmd/slurmd/req.c
@@ -2046,6 +2046,7 @@ _enforce_job_mem_limit(void)
     vsize_factor = slurm_get_vsize_factor();
     for (i=0; i<job_cnt; i++) {
+        job_mem_info_ptr[i].mem_limit *= 0.9;    /* hardcode 90% RSS factor (wettstein) */
         job_mem_info_ptr[i].vsize_limit = job_mem_info_ptr[i].
             mem_limit;
         job_mem_info_ptr[i].vsize_limit *= (vsize_factor / 100.0);
Thanks for the update. We will close this problem for now; please reopen if necessary.

David