| Summary: | MemSpecLimit behaviour changes | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, brian, paull |
| Version: | 16.05.9 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Bull/Atos Sites: | --- | Confidential Site: | --- |
| Cray Sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| SFW Sites: | --- | SNIC sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.12, 16.05.0-rc3 |
| Target Release: | --- | DevPrio: | --- |

Attachments: conf, conf2, cgroup.conf
Description
Akmal Madzlan
2016-05-08 19:18:31 MDT
Hi Akmal,

In version 15.08.0-pre4, this commit[1] removed the specialized memory from being available for allocation to jobs, and this other commit[2] disabled the OOM Killer in slurmd and slurmstepd's memory cgroup when using MemSpecLimit. As a consequence of commit[1], if you set MemSpecLimit = ~250GB, those 250GB are no longer available for allocation to jobs.

[1] https://github.com/SchedMD/slurm/commit/43849c5ecddf3a3b489e4b408485ed8e6e2e46d6
[2] https://github.com/SchedMD/slurm/commit/bcfcb9d7d2452c12ae219cc42dfb9d25487d7b8f

Hi Alex,

When I set MemSpecLimit to 100MB on one of the nodes:

slurmd-kud13: debug: system cgroup: system memory cgroup initialized
slurmd-kud13: debug3: parameter 'memory.limit_in_bytes' set to '104857600' for '/cgroup/memory/slurm_kud13/system'
slurmd-kud13: debug3: parameter 'memory.oom_control' set to '1' for '/cgroup/memory/slurm_kud13/system'
slurmd-kud13: Resource spec: system cgroup memory limit set to 100 MB

it causes jobs running on that node to only be able to use 100MB of RAM. Is that how MemSpecLimit is supposed to work?

To clarify, I can request more than 100MB of memory, for example:

sbatch -pkud13 --mem=9947 --wrap="/d/home/akmalm/re"

But when I check the memory usage with ps, it seems the job can only use up to 100MB, or whatever value MemSpecLimit is set to.

Akmal,

I've set up slurm-15.08.11 and configured the node compute1 with:

RealMemory=3896 MemSpecLimit=3800

As a consequence, 3800MB of memory are reserved for the combination of the slurmd and slurmstepd daemons. The remaining memory, 3896 (RealMemory) - 3800 (MemSpecLimit) = 96MB, is available for job allocation. Let's see what happens when submitting jobs with different --mem values:

alex@pc:~/t$ srun --mem=3000 hostname
srun: Force Terminated job 20009
srun: error: Unable to allocate resources: Requested node configuration is not available
alex@pc:~/t$ srun --mem=1000 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
alex@pc:~/t$ srun --mem=100 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
alex@pc:~/t$ srun --mem=96 hostname
pc
alex@pc:~/t$

When I started slurmd, this message was logged:

[2016-05-11T13:29:07.804] Resource spec: system cgroup memory limit set to 3800 MB

Does it make sense now?

Alright, let's say I have this node with 16GB of memory and I set aside 1024MB of RAM for slurmd and slurmstepd using MemSpecLimit:

NodeName=kud13 CPUs=8 RealMemory=15947 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Feature=localdisk,intel MemSpecLimit=1024

Now the memory that can be allocated by jobs is 15947 - 1024 = 14923. So I submit a job:

[akmalm@kud13 ~]$ sbatch -pkud13 --mem=14924 --wrap="hostname"
sbatch: error: Batch job submission failed: Requested node configuration is not available
[akmalm@kud13 ~]$ sbatch -pkud13 --mem=14923 --wrap="hostname"
Submitted batch job 880

Yeah, no problem with that, I get it. But let's say I have a job that needs at least 10000MB to run properly. "/d/home/akmalm/re" is a binary that will try to use as much RAM as possible.

[akmalm@kud13 ~]$ sbatch -pkud13 --mem=10000 --wrap="/d/home/akmalm/re"
Submitted batch job 882

Let's see its memory usage.
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
1019.71MB 25633 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
986.234MB 25633 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
997.891MB 25633 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
963.051MB 25633 akmalm /d/home/akmalm/re

where mem is:

[akmalm@kud13 ~]$ type mem
mem is a function
mem ()
{
    ps -eo rss,pid,euser,args:100 --sort %mem | grep -v grep | grep -i $@ | awk '{printf $1/1024 "MB"; $1=""; print }'
}

It seems like the job is limited to 1024MB. It can't fully utilize the available memory. Let's try increasing MemSpecLimit:

NodeName=kud13 CPUs=8 RealMemory=15947 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Feature=localdisk,intel MemSpecLimit=2048

15947 - 2048 = 13899 MB of RAM available for jobs. So I submit the same job:

[akmalm@kud13 ~]$ sbatch -pkud13 --mem=10000 --wrap="/d/home/akmalm/re"
Submitted batch job 884
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
790.355MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2005.84MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2047.06MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2009.95MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2019.21MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2027.43MB 28557 akmalm /d/home/akmalm/re

Yeah, now it is limited to 2048MB of RAM. Then what's the point of using --mem=10000?

Yes, you are right in the sense that even if the job can allocate RealMemory - MemSpecLimit, the process will be constrained to use up to MemSpecLimit, even if the difference RealMemory - MemSpecLimit is higher than MemSpecLimit. Let me discuss internally whether this is the intended design, and what the purpose is of being able to allocate --mem > MemSpecLimit if you can only use up to MemSpecLimit. We'll come back to you with an answer.

Akmal, could I have a quick look at your site's slurm.conf?

Created attachment 3092 [details]
conf
Created attachment 3093 [details]
conf2
Hi Alex,
I've attached our slurm configuration
Hi Akmal. The problem occurs when TaskPlugin=task/cgroup is not enabled and working. The slurmd pid and MemSpecLimit are placed in this part of the cgroup hierarchy:

/path/to/cgroup/memory/slurm_<node>/system/{tasks, memory.limit_in_bytes}

The root cause of the problem was that, because of the inheritance nature of cgroups, all children of slurmd (slurmstepd, slurmstepd's children and so on) were also placed in that tasks file, and thus were constrained to MemSpecLimit too. When TaskPlugin=task/cgroup is enabled, the pids of slurmstepd's children (the launched apps) are placed at a different point in the hierarchy:

/path/to/cgroup/slurm_<node>/uid_<uid>/job_<jobid>/step_<stepid>/task_<taskid>

and thus are not constrained to MemSpecLimit anymore.

We have a patch almost ready which makes MemSpecLimit require TaskPlugin=task/cgroup to be enabled; we'll come back to you by the end of today, or tomorrow at most, with the patch ready. In your slurm.conf, TaskPlugin is configured as task/none. Once the patch is applied, if your slurm.conf keeps task/none, these messages should be logged in your slurmd.log:

slurmd: error: Resource spec: cgroup job confinement not configured. MemSpecLimit requires TaskPlugin=task/cgroup enabled
slurmd: error: Resource spec: system cgroup memory limit disabled

and MemSpecLimit won't take any effect. On the other hand, with task/cgroup something similar to these messages will be logged (depending on your configured MemSpecLimit and cgroup.conf's AllowedRAMSpace numbers):

slurmd: debug: system cgroup: system memory cgroup initialized
slurmd: Resource spec: system cgroup memory limit set to 1000 MB
[2016-05-16T16:57:34.804] [20014.0] task/cgroup: /slurm_compute1/uid_1000/job_20014/step_0: alloc=2896MB mem.limit=1448MB memsw.limit=1448MB

In this case I had RealMemory=3896, MemSpecLimit=1000 and AllowedRAMSpace=50 (50%). So RealMemory - MemSpecLimit was available for the job (see alloc=2896), but since I had AllowedRAMSpace=50 (percent), mem.limit=1448MB and memsw.limit=1448MB (half of alloc=2896). I hope this helps you imagine and understand how this will work once the patch is applied.

We're also thinking that, since we're making MemSpecLimit require task/cgroup and maybe you don't want to enable task/cgroup, you could get a similar behavior from the scheduler by under-specifying the memory in the node config instead and not making use of MemSpecLimit.

All in all, I hope we'll come back with the patch ready, as I said, by the end of today or tomorrow at most. Thanks for your patience and collaboration.

TaskPlugin is set to task/cgroup:
[akmalm@kud13 ~]$ scontrol show config | grep TaskPlugin
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
MemSpecLimit is set to 2048
[akmalm@kud13 ~]$ scontrol show node kud13
NodeName=kud13 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=1.59 Features=localdisk,intel
Gres=(null)
NodeAddr=kud13 NodeHostName=kud13 Version=15.08
OS=Linux RealMemory=15947 AllocMem=0 FreeMem=7118 Sockets=2 Boards=1
MemSpecLimit=2048
State=IDLE ThreadsPerCore=1 TmpDisk=674393 Weight=1 Owner=N/A
BootTime=2016-05-12T18:45:46 SlurmdStartTime=2016-05-17T17:38:10
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Submitted the same job as before
[akmalm@kud13 ~]$ sbatch -pkud13 --mem=1000 --wrap="/d/home/akmalm/re"
Submitted batch job 1111
But memory usage is still constrained to MemSpecLimit:
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2002.44MB 10822 akmalm /d/home/akmalm/re
Do I need any additional param in cgroup.conf for this to work?
> We're also thinking that since we're making MemSpecLimit require task/cgroup
> and maybe you don't want to enable task/cgroup, you could get a similar
> behavior from the scheduler by under-specifying the Memory in the node config
> instead and not making use of MemSpecLimit.
Actually, what we're trying to do is prevent jobs from using too much memory and putting the nodes in a bad state.
We don't mind if a job requests 1GB of memory but ends up using 4GB, as long as it doesn't use all the available memory, causing the node to OOM and need a reboot.
What is your suggestion to achieve that?
Under-specifying the memory probably won't work in our case since we're using FastSchedule=0.
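One quick way to narrow down where the cap comes from while a test job is running (a sketch only; the /cgroup/memory mount point and the node name are taken from the slurmd log quoted earlier in this bug and may differ on your system):

    # Find the running test binary and check which memory cgroup it was placed in.
    pid=$(pgrep -f /d/home/akmalm/re | head -n1)
    grep ':memory:' /proc/$pid/cgroup
    # With working task/cgroup confinement this should point at something like
    #   /slurm_kud13/uid_<uid>/job_<jobid>/step_<stepid>/task_<taskid>
    # If it still shows /slurm_kud13/system, the process has inherited the slurmd
    # system cgroup and is therefore capped at MemSpecLimit:
    cat /cgroup/memory/slurm_kud13/system/memory.limit_in_bytes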
(In reply to Akmal Madzlan from comment #30)
> TaskPlugin is set to task/cgroup
>
> [akmalm@kud13 ~]$ scontrol show config | grep TaskPlugin
> TaskPlugin = task/cgroup
> TaskPluginParam = (null type)

Probably the slurm_common.conf you attached is out of date then, since it states:

alex@pc:~/Downloads$ grep -ri taskplugin slurm_common.conf
TaskPlugin=task/none
alex@pc:~/Downloads$

If you changed it, could you please double-check that you restarted the slurmctld as well as all the slurmd's?

Could you also please double-check that in the slurmd.log of the node where a test job is executed you can find messages like these, which indicate the path in the cgroup hierarchy?

[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023: alloc=500MB mem.limit=500MB memsw.limit=500MB
[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023/step_0: alloc=500MB mem.limit=500MB memsw.limit=500MB

> What is your suggestion to achieve that?

task/cgroup with MemSpecLimit should do this. Let's make it work before looking for workarounds.

Yeah, I changed the config after reading your previous reply. I've restarted both slurmctld and slurmd, but it still doesn't work. I don't think I saw that message in the log.
I'll take another look later.

Can you attach your cgroup.conf? I'm preparing another patch so that MemSpecLimit is not dependent on ConstrainCores=no; if you have ConstrainCores=no, MemSpecLimit won't work until the second patch is applied.

Created attachment 3097 [details]
cgroup.conf
Here is our cgroup.conf
We don't use ConstrainCores in that file.
I don't see any lines like
[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023: alloc=500MB mem.limit=500MB memsw.limit=500MB
[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023/step_0: alloc=500MB mem.limit=500MB memsw.limit=500MB
in my slurmd log. Do I need to set any specific DebugFlags?
Akmal, could you please set ConstrainRAMSpace=yes in your cgroup.conf and try again? You can set SlurmdDebug=7 to increase slurmd logging verbosity for a while; once the problem is solved you can set it back to 3. I think we'll need to modify the second patch again so that MemSpecLimit depends on ConstrainRAMSpace=yes. I see its default value is no. It was working for me because, although I tried with ConstrainRAMSpace=no, cgroups got activated because I had ConstrainSwapSpace=yes. Anyhow, try with ConstrainRAMSpace=yes, let me know the results, and we'll take care of the patch. Thanks!

Hi Alex,

ConstrainRAMSpace works:

sbatch -pkud13 --mem=4000 --wrap="/d/home/akmalm/re"
task/cgroup: /slurm_kud13/uid_1419/job_1121/step_batch: alloc=4000MB mem.limit=4000MB memsw.limit=unlimited

But now the memory is limited to what the job requests. Is there any way to make it use more than that, while still not exceeding (RealMemory - MemSpecLimit)?

It should be limited to:

MIN(--mem, (RealMemory - MemSpecLimit) * AllowedRAMSpace/100)

What happens if you request the job without --mem? It should not exceed RealMemory - MemSpecLimit. At least it doesn't in my testing machine.

> What happens if you request the job without --mem? It should not exceed
> RealMemory - MemSpecLimit. At least it doesn't in my testing machine.
Yeah, it works.
But we usually specify --mem so the job will land on nodes with at least that amount of memory.
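For reference, plugging the node used earlier in this bug into the limit formula above (assuming cgroup.conf leaves AllowedRAMSpace at its default of 100 percent):

    mem.limit = MIN(--mem, (RealMemory - MemSpecLimit) * AllowedRAMSpace / 100)
              = MIN(4000, (15947 - 2048) * 100 / 100)    # the --mem=4000 job above
              = MIN(4000, 13899)
              = 4000 MB
    # Without --mem, the cap instead becomes RealMemory - MemSpecLimit = 13899 MB.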
Well, that's the expected --mem behavior. If instead you use --mem-per-cpu, the job could allocate a minimum of --mem-per-cpu, but it will continue allocating until RealMemory - MemSpecLimit is reached, where it will be top-constrained. --mem is more restrictive and will not let the job allocate more than --mem. So far so good; now it's time for us to make Slurm behave so that MemSpecLimit requires TaskPlugin=task/cgroup AND ConstrainRAMSpace, but does not depend on ConstrainCores. Hopefully tomorrow we'll come back with a patch (or more than one) for that.

Alright, thanks Alex! I tried --mem-per-cpu; I think it behaves similarly to --mem:

[akmalm@kud13 ~]$ sbatch -pkud13 --wrap="/d/home/akmalm/re" --mem-per-cpu=1024
Submitted batch job 1127
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
986.598MB 20119 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
987.086MB 20119 akmalm /d/home/akmalm/re

Akmal, you are right. I had set up a node with 2 CPUs and requested --mem-per-cpu=100. When I ps'ed the program's rss and saw it over 100MB, I assumed --mem-per-cpu was behaving differently than --mem and thought it could go over what I requested, but I've tried again now and realized that rss doesn't go over 200MB, which is 100MB * 2 CPUs. So yes, --mem-per-cpu behaves similarly to --mem in that sense. We'll come back to you with the patch as agreed.

Akmal, the patch will be included in 15.08.12. It's also already available through this link, just proceed as needed:

https://github.com/SchedMD/slurm/commit/f4ebc793.patch

Please let me know if you have any more questions or if we can close this bug. Thank you. Note that the patch only logs an error message and prevents MemSpecLimit from being configured without task/cgroup or without ConstrainRAMSpace. You don't need to apply the patch itself; just do not enable MemSpecLimit without the other two options enabled.

Hi Support,

Was there ever a fix for this? I have:

ConstrainRAMSpace=yes in cgroup.conf
TaskPlugin=task/cgroup

I run a job that sets --mem=1000 on a node that has 50G of free memory, and the job is killed with "Memory cgroup out of memory: Kill process 353 (stress) score 1000 or sacrifice child". The job is constrained to the --mem=1000, which we don't want to happen. We want the job to be killed at the MemSpecLimit threshold.

Thanks,
Paul

Removed from Resolved to Unconfirmed status for the previous message. Also, this is from testing against 16.05.9.1 + commit f6d42fdbb293ca (for powersaving) + commit 0ea581a72ae7c (for SLURM_BITSTR_LEN).

Paul,

In order to configure MemSpecLimit on a node, TaskPlugin=task/cgroup and ConstrainRAMSpace=yes are required. Otherwise, these errors will be logged in that node's slurmd.log:

slurmd: error: Resource spec: cgroup job confinement not configured. MemSpecLimit requires TaskPlugin=task/cgroup and ConstrainRAMSpace=yes in cgroup.conf
slurmd: error: Resource spec: system cgroup memory limit disabled

and the system cgroup memory limit will be disabled.

Now, you should note that MemSpecLimit is a limit on the combined real memory allocation of the compute node daemons (slurmd, slurmstepd), in megabytes. This memory is _not_ available to job allocations. The _daemons_ won't be killed when they exhaust that memory allocation (i.e. the OOM Killer is disabled for the daemons' memory cgroup). If you set MemSpecLimit, the memory left available on the node will be RealMemory - MemSpecLimit. So, for instance, if your node has RealMemory=5000 and MemSpecLimit=1000, the memory available for job allocations on that node will be 4000 (MB).
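A minimal sketch of that combination (the node name is illustrative; the memory figures are the ones from the example above):

    # slurm.conf (fragment)
    TaskPlugin=task/cgroup
    NodeName=nodeX RealMemory=5000 MemSpecLimit=1000   # leaves 4000 MB for job allocations

    # cgroup.conf (fragment)
    ConstrainRAMSpace=yes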
Now, if the job doesn't request memory, meaning it requested neither --mem nor --mem-per-cpu _and_ DefMemPerNode/CPU is not configured, Slurm will try to allocate the node's whole RealMemory to that job. Since 1000MB are reserved for the compute node daemons as MemSpecLimit, you would get this error:

srun: error: Unable to allocate resources: Requested node configuration is not available

If instead memory is requested, for instance with --mem=2000 (or by configuring DefMemPerNode=2000), there will be two different behaviors depending on cgroup.conf. When the job reaches 2000MB, if you have ConstrainSwapSpace=yes and AllowedSwapSpace=0, the job will be OOM-killed. If instead you have ConstrainSwapSpace=no, the job won't allocate more than 2000MB of RAM, but will start using swap space once it reaches 2000MB.

Please let me know if this makes sense. Is there something that isn't working as I describe? I've been playing around and doing some tests this morning on my test machine, and I got the described behavior, which looks correct at first glance to me.

Yes, it all makes sense. Unfortunately, this is preventing us from utilizing the feature. Let me explain what we need and see if there are any suggestions for how to accomplish it:

nodeA information:
RealMemory=100GB
Desired memory limit for a job on nodeA: 95GB

nodeB information:
RealMemory=50GB
Desired memory limit for a job on nodeB: 45GB

A job requests a node with enough resources to handle 70G: --mem=70000

The job runs on nodeA but never on nodeB, due to the resource request.

Job possibility #1: as the job runs, it ends up actually needing 85G, which is more than the requested 70GB (--mem=70000). It is still allowed to continue running and complete.

Job possibility #2: as the job runs, it ends up actually needing 100G, which is more than the 95GB desired limit for nodeA. Once it exceeds that limit, it is OOM-killed.

Any suggested configurations?

Thanks,
Paul

(In reply to paull from comment #58)
> Any suggested configurations?

I'd suggest making a partition for the 100GB nodes and another for the 50GB nodes. Then, in the partition with the 100GB nodes, I'd set DefMemPerNode=97280 (95GB) and MemSpecLimit=5120 (5GB). This way jobs in this partition will be able to allocate up to 95GB (unless the user specifies a smaller amount with --mem), and 5GB will be reserved for the slurmd and slurmstepd daemons; jobs should not be able to use those 5GB. Similarly, I'd configure the 50GB nodes partition with DefMemPerNode=46080 (45GB) and MemSpecLimit=5120 (5GB).
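A slurm.conf sketch of that layout (node and partition names are hypothetical; only the RealMemory, DefMemPerNode and MemSpecLimit figures come from the scenario and suggestion above):

    # Node definitions: MemSpecLimit is a per-node setting.
    NodeName=big[01-02]   RealMemory=102400 MemSpecLimit=5120
    NodeName=small[01-02] RealMemory=51200  MemSpecLimit=5120

    # Partition definitions: DefMemPerNode is a per-partition setting.
    PartitionName=big100g  Nodes=big[01-02]   DefMemPerNode=97280   # 95 GB default per job
    PartitionName=small50g Nodes=small[01-02] DefMemPerNode=46080   # 45 GB default per job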
Note also that at some point in this bug Akmal was concerned that, while monitoring the job's memory usage with ps -eo rss, the rss never got up to RealMemory - MemSpecLimit. This is because virtual memory is decomposed into different parts, and I believe the virtual memory size (vsz) is the sum of total_rss + total_cache (+ total_swap). If you want no swap space to be used by the job, and practically all of the allocated memory to be rss and/or cache, then Slurm's cgroups have to be configured to disable swap usage. To do that, I've been doing some tests and something like this should work:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
MaxSwapPercent=0     <- note that MaxSwapPercent was not hinted at earlier in this bug
TaskAffinity=no

That would explain why Akmal was confused about why jobs could not fully utilize the memory allocatable to them. Please let me know if this config works well for you, or if you notice any behavior that you don't fully understand.

Paul/Akmal, any questions about my last comment? Can we go ahead and close this bug?

Yes, that is fine. Looks like we will have to approach this from a different aspect, as you have pointed out.

(In reply to paull from comment #62)
> Yes that is fine. Looks like we will have to approach this from a different
> aspect as you have pointed out.

Ok. Feel free to reopen if, after the upgrade, there's still something strange or unclear about MemSpecLimit.
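Relating this back to the rss discussion above, a sketch of how to see where a step's memory actually sits, using the step cgroup path logged earlier in this bug (the exact path differs per job, and this assumes cgroup v1 mounted under /sys/fs/cgroup):

    # memory.stat breaks usage down into rss, cache and (if swap accounting is
    # enabled) swap; memory.limit_in_bytes is the enforced cap.
    cgpath=/sys/fs/cgroup/memory/slurm_kud13/uid_1419/job_1121/step_batch
    grep -E '^total_(rss|cache|swap) ' "$cgpath/memory.stat"
    cat "$cgpath/memory.limit_in_bytes"
    cat "$cgpath/memory.usage_in_bytes"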