| Summary: | MemSpecLimit behaviour changes | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, brian, paull |
| Version: | 16.05.9 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Bull/Atos Sites: | --- | Confidential Site: | --- |
| Cray Sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| SFW Sites: | --- | SNIC sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.12, 16.05.0-rc3 |
| Target Release: | --- | DevPrio: | --- |

Attachments: conf, conf2, cgroup.conf
Description
Akmal Madzlan
2016-05-08 19:18:31 MDT
Hi Akmal,

In version 15.08.0-pre4, this commit[1] removed the specialized memory from being available for allocation to jobs, and this other commit[2] disabled the OOM Killer in slurmd and slurmstepd's memory cgroup when using MemSpecLimit. As a consequence of commit[1], if you set MemSpecLimit = ~250GB, those 250GB are no longer available for allocation to jobs.

[1] https://github.com/SchedMD/slurm/commit/43849c5ecddf3a3b489e4b408485ed8e6e2e46d6
[2] https://github.com/SchedMD/slurm/commit/bcfcb9d7d2452c12ae219cc42dfb9d25487d7b8f

Hi Alex,

When I set MemSpecLimit to 100MB on one of the nodes:

slurmd-kud13: debug: system cgroup: system memory cgroup initialized
slurmd-kud13: debug3: parameter 'memory.limit_in_bytes' set to '104857600' for '/cgroup/memory/slurm_kud13/system'
slurmd-kud13: debug3: parameter 'memory.oom_control' set to '1' for '/cgroup/memory/slurm_kud13/system'
slurmd-kud13: Resource spec: system cgroup memory limit set to 100 MB

it causes jobs running on that node to only be able to use 100MB of RAM. Is that how MemSpecLimit is supposed to work?

To clarify, I can request more than 100MB of memory, for example:

sbatch -pkud13 --mem=9947 --wrap="/d/home/akmalm/re"

But when I check the memory usage with ps, it seems the job can only use up to 100MB, or whatever value MemSpecLimit is set to.

Akmal,

I've set up slurm-15.08.11 and configured the node compute1 with:

RealMemory=3896 MemSpecLimit=3800

As a consequence, 3800MB of memory are reserved for the combination of the slurmd and slurmstepd daemons. The remaining memory, 3896 (RealMemory) - 3800 (MemSpecLimit) = 96MB, is available for job allocation. Let's see what happens when submitting jobs with different --mem values:

alex@pc:~/t$ srun --mem=3000 hostname
srun: Force Terminated job 20009
srun: error: Unable to allocate resources: Requested node configuration is not available
alex@pc:~/t$ srun --mem=1000 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
alex@pc:~/t$ srun --mem=100 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
alex@pc:~/t$ srun --mem=96 hostname
pc
alex@pc:~/t$

When I started slurmd, this message was logged:

[2016-05-11T13:29:07.804] Resource spec: system cgroup memory limit set to 3800 MB

Does it make sense now?

Alright, let's say I have this node with 16GB of memory and I set aside 1024MB of RAM for slurmd and slurmstepd using MemSpecLimit:

NodeName=kud13 CPUs=8 RealMemory=15947 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Feature=localdisk,intel MemSpecLimit=1024

Now the memory that can be allocated by jobs is 15947 - 1024 = 14923. So I submit a job:

[akmalm@kud13 ~]$ sbatch -pkud13 --mem=14924 --wrap="hostname"
sbatch: error: Batch job submission failed: Requested node configuration is not available
[akmalm@kud13 ~]$ sbatch -pkud13 --mem=14923 --wrap="hostname"
Submitted batch job 880

Yeah, no problem with that, I get it. But let's say I have a job that needs at least 10000MB to run properly. "/d/home/akmalm/re" is a binary that will try to use as much RAM as possible.

[akmalm@kud13 ~]$ sbatch -pkud13 --mem=10000 --wrap="/d/home/akmalm/re"
Submitted batch job 882

Let's see its memory usage.
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
1019.71MB 25633 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
986.234MB 25633 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
997.891MB 25633 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
963.051MB 25633 akmalm /d/home/akmalm/re

where mem is:

[akmalm@kud13 ~]$ type mem
mem is a function
mem ()
{
    ps -eo rss,pid,euser,args:100 --sort %mem | grep -v grep | grep -i $@ | awk '{printf $1/1024 "MB"; $1=""; print }'
}

It seems like the job is limited to 1024MB. It can't fully utilize the available memory. Let's try increasing MemSpecLimit:

NodeName=kud13 CPUs=8 RealMemory=15947 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN Feature=localdisk,intel MemSpecLimit=2048

15947 - 2048 = 13899 MB of RAM available for jobs. So I submit the same job:

[akmalm@kud13 ~]$ sbatch -pkud13 --mem=10000 --wrap="/d/home/akmalm/re"
Submitted batch job 884
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
790.355MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2005.84MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2047.06MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2009.95MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2019.21MB 28557 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2027.43MB 28557 akmalm /d/home/akmalm/re

Yeah, now it is limited to 2048MB of RAM. Then what's the point of using --mem=10000?

Yes, you are right in the sense that even if the job can allocate RealMemory - MemSpecLimit, the process will be constrained to use up to MemSpecLimit, even if the difference RealMemory - MemSpecLimit is higher than MemSpecLimit. Let me discuss internally whether this is the intended design, and what the purpose is of being able to allocate --mem > MemSpecLimit if you can only use up to MemSpecLimit. We'll come back to you with an answer.

Akmal, could I have a quick look at your site's slurm.conf?

Created attachment 3092 [details]
conf
Created attachment 3093 [details]
conf2
Hi Alex,
I've attached our slurm configuration
Hi Akmal. The problem occurs when TaskPlugin=task/cgroup is not enabled and working. The slurmd pid and MemSpecLimit are placed in this part of the cgroup hierarchy:

/path/to/cgroup/memory/slurm_<node>/system/{tasks, memory.limit_in_bytes}

The root cause of the problem was that, because of the inheritance nature of cgroups, all children of slurmd (slurmstepd, slurmstepd's children and so on) were also placed in that tasks file, and thus were constrained to MemSpecLimit too. When TaskPlugin=task/cgroup is enabled, the pids of slurmstepd's children (the launched apps) are placed at a different point in the hierarchy:

/path/to/cgroup/slurm_<node>/uid_<uid>/job_<jobid>/step_<stepid>/task_<taskid>

and thus are not constrained to MemSpecLimit anymore.

We have a patch almost ready which makes MemSpecLimit require TaskPlugin=task/cgroup to be enabled; we'll come back to you by the end of today, or tomorrow at most, with the patch ready. In your slurm.conf, TaskPlugin is configured as task/none. Once the patch is applied, if your slurm.conf keeps task/none, these messages should be logged in your slurmd.log:

slurmd: error: Resource spec: cgroup job confinement not configured. MemSpecLimit requires TaskPlugin=task/cgroup enabled
slurmd: error: Resource spec: system cgroup memory limit disabled

and MemSpecLimit won't take any effect. On the other hand, with task/cgroup something similar to these messages will be logged (depending on your configured MemSpecLimit and cgroup.conf's AllowedRAMSpace numbers):

slurmd: debug: system cgroup: system memory cgroup initialized
slurmd: Resource spec: system cgroup memory limit set to 1000 MB
[2016-05-16T16:57:34.804] [20014.0] task/cgroup: /slurm_compute1/uid_1000/job_20014/step_0: alloc=2896MB mem.limit=1448MB memsw.limit=1448MB

In this case I had RealMemory=3896, MemSpecLimit=1000 and AllowedRAMSpace=50 (50%). So RealMemory - MemSpecLimit was available for the job (see alloc=2896), but since I had AllowedRAMSpace=50 (percent), mem.limit=1448MB and memsw.limit=1448MB (half of alloc=2896). I hope this helps you imagine and understand how this will work once the patch is applied.

We're also thinking that, since we're making MemSpecLimit require task/cgroup and maybe you don't want to enable task/cgroup, you could get a similar behavior from the scheduler by under-specifying the memory in the node config instead and not making use of MemSpecLimit.

All in all, I hope we'll come back with the patch ready, as I said, by the end of today or tomorrow at most. Thanks for your patience and collaboration.

TaskPlugin is set to task/cgroup:
[akmalm@kud13 ~]$ scontrol show config | grep TaskPlugin
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
MemSpecLimit is set to 2048
[akmalm@kud13 ~]$ scontrol show node kud13
NodeName=kud13 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=1.59 Features=localdisk,intel
Gres=(null)
NodeAddr=kud13 NodeHostName=kud13 Version=15.08
OS=Linux RealMemory=15947 AllocMem=0 FreeMem=7118 Sockets=2 Boards=1
MemSpecLimit=2048
State=IDLE ThreadsPerCore=1 TmpDisk=674393 Weight=1 Owner=N/A
BootTime=2016-05-12T18:45:46 SlurmdStartTime=2016-05-17T17:38:10
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Submitted the same job as before
[akmalm@kud13 ~]$ sbatch -pkud13 --mem=1000 --wrap="/d/home/akmalm/re"
Submitted batch job 1111
But memory usage is still constrained to MemSpecLimit:
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
2002.44MB 10822 akmalm /d/home/akmalm/re
Do I need any additional param in cgroup.conf for this to work?
> We're also thinking that since we're making MemSpecLimit require task/cgroup
> and maybe you don't want to enable task/cgroup, you could get a similar
> behavior from the scheduler by under-specifying the Memory in the node config
> instead and not making use of MemSpecLimit.
Actually, what we're trying to do is prevent jobs from using too much memory and putting the nodes in a bad state.
We don't mind if a job requests 1GB of memory but ends up using 4GB, as long as it doesn't use all the available memory, causing the node to OOM and need a reboot.
What is your suggestion to achieve that?
Under-specifying the memory probably won't work in our case since we're using FastSchedule=0.
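One quick way to narrow down where the cap comes from while a test job is running (a sketch only; the /cgroup/memory mount point and the node name are taken from the slurmd log quoted earlier in this bug and may differ on your system):

    # Find the running test binary and check which memory cgroup it was placed in.
    pid=$(pgrep -f /d/home/akmalm/re | head -n1)
    grep ':memory:' /proc/$pid/cgroup
    # With working task/cgroup confinement this should point at something like
    #   /slurm_kud13/uid_<uid>/job_<jobid>/step_<stepid>/task_<taskid>
    # If it still shows /slurm_kud13/system, the process has inherited the slurmd
    # system cgroup and is therefore capped at MemSpecLimit:
    cat /cgroup/memory/slurm_kud13/system/memory.limit_in_bytes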
(In reply to Akmal Madzlan from comment #30)
> TaskPlugin is set to task/cgroup
>
> [akmalm@kud13 ~]$ scontrol show config | grep TaskPlugin
> TaskPlugin = task/cgroup
> TaskPluginParam = (null type)

Probably the slurm_common.conf you attached is out of date then, since it states:

alex@pc:~/Downloads$ grep -ri taskplugin slurm_common.conf
TaskPlugin=task/none
alex@pc:~/Downloads$

If you changed it, could you please double-check that you restarted the slurmctld as well as all the slurmd's?

Could you also please double-check that in the slurmd.log of the node where a test job is executed you can find messages like these, which indicate the path in the cgroup hierarchy?

[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023: alloc=500MB mem.limit=500MB memsw.limit=500MB
[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023/step_0: alloc=500MB mem.limit=500MB memsw.limit=500MB

> What is your suggestion to achieve that?

task/cgroup with MemSpecLimit should do this. Let's make it work before looking for workarounds.

Yeah, I changed the config after reading your previous reply. I've restarted both slurmctld and slurmd, but it still doesn't work. I don't think I saw that message in the log.
I'll take another look later.

Can you attach your cgroup.conf? I'm preparing another patch so that MemSpecLimit is not dependent on ConstrainCores=no; if you have ConstrainCores=no, MemSpecLimit won't work until the second patch is applied.

Created attachment 3097 [details]
cgroup.conf
Here is our cgroup.conf
We don't use ConstrainCores in that file.
I don't see any lines like
[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023: alloc=500MB mem.limit=500MB memsw.limit=500MB
[2016-05-16T18:41:30.628] [20023.0] task/cgroup: /slurm_compute1/uid_1000/job_20023/step_0: alloc=500MB mem.limit=500MB memsw.limit=500MB
in my slurmd log. Do I need to set any specific DebugFlags?
Akmal, could you please set ConstrainRAMSpace=yes in your cgroup.conf and try again? You can set SlurmdDebug=7 to increase slurmd logging verbosity for a while; once the problem is solved you can set it back to 3. I think we'll need to modify the second patch again so that MemSpecLimit depends on ConstrainRAMSpace=yes. I see its default value is no. It was working for me because, although I tried with ConstrainRAMSpace=no, cgroups got activated because I had ConstrainSwapSpace=yes. Anyhow, try with ConstrainRAMSpace=yes, let me know the results, and we'll take care of the patch. Thanks!

Hi Alex,

ConstrainRAMSpace works:

sbatch -pkud13 --mem=4000 --wrap="/d/home/akmalm/re"
task/cgroup: /slurm_kud13/uid_1419/job_1121/step_batch: alloc=4000MB mem.limit=4000MB memsw.limit=unlimited

But now the memory is limited to what the job requests. Is there any way to make it use more than that, while still not exceeding (RealMemory - MemSpecLimit)?

It should be limited to:

MIN(--mem, (RealMemory - MemSpecLimit) * AllowedRAMSpace/100)

What happens if you request the job without --mem? It should not exceed RealMemory - MemSpecLimit. At least it doesn't in my testing machine.

> What happens if you request the job without --mem? It should not exceed
> RealMemory - MemSpecLimit. At least it doesn't in my testing machine.
Yeah, it works.
But we usually specify --mem so the job will land on nodes with at least that amount of memory.
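For reference, plugging the node used earlier in this bug into the limit formula above (assuming cgroup.conf leaves AllowedRAMSpace at its default of 100 percent):

    mem.limit = MIN(--mem, (RealMemory - MemSpecLimit) * AllowedRAMSpace / 100)
              = MIN(4000, (15947 - 2048) * 100 / 100)    # the --mem=4000 job above
              = MIN(4000, 13899)
              = 4000 MB
    # Without --mem, the cap instead becomes RealMemory - MemSpecLimit = 13899 MB.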
Well, that's the expected --mem behavior. If instead you use --mem-per-cpu, the job could allocate a minimum of --mem-per-cpu, but it will continue allocating until RealMemory - MemSpecLimit is reached, where it will be top-constrained. --mem is more restrictive and will not let the job allocate more than --mem. So far so good; now it's time for us to make Slurm behave so that MemSpecLimit requires TaskPlugin=task/cgroup AND ConstrainRAMSpace, but does not depend on ConstrainCores. Hopefully tomorrow we'll come back with a patch (or more than one) for that.

Alright, thanks Alex! I tried --mem-per-cpu; I think it behaves similarly to --mem:

[akmalm@kud13 ~]$ sbatch -pkud13 --wrap="/d/home/akmalm/re" --mem-per-cpu=1024
Submitted batch job 1127
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
986.598MB 20119 akmalm /d/home/akmalm/re
[akmalm@kud13 ~]$ mem /d/home/akmalm/re
987.086MB 20119 akmalm /d/home/akmalm/re

Akmal, you are right. I had set up a node with 2 CPUs and requested --mem-per-cpu=100. When I ps'ed the program's rss and saw it over 100MB, I assumed --mem-per-cpu was behaving differently than --mem and thought it could go over what I requested, but I've tried again now and realized that rss doesn't go over 200MB, which is 100MB * 2 CPUs. So yes, --mem-per-cpu behaves similarly to --mem in that sense. We'll come back to you with the patch as agreed.

Akmal, the patch will be included in 15.08.12. It's also already available through this link, just proceed as needed:

https://github.com/SchedMD/slurm/commit/f4ebc793.patch

Please let me know if you have any more questions or if we can close this bug. Thank you. Note that the patch only logs an error message and prevents MemSpecLimit from being configured without task/cgroup or without ConstrainRAMSpace. You don't need to apply the patch itself; just do not enable MemSpecLimit without the other two options enabled.

Hi Support,

Was there ever a fix for this? I have:

ConstrainRAMSpace=yes in cgroup.conf
TaskPlugin=task/cgroup

I run a job that sets --mem=1000 on a node that has 50G of free memory, and the job is killed with "Memory cgroup out of memory: Kill process 353 (stress) score 1000 or sacrifice child". The job is constrained to the --mem=1000, which we don't want to happen. We want the job to be killed at the MemSpecLimit threshold.

Thanks,
Paul

Removed from Resolved to Unconfirmed status for the previous message. Also, this is from testing against 16.05.9.1 + commit f6d42fdbb293ca (for powersaving) + commit 0ea581a72ae7c (for SLURM_BITSTR_LEN).

Paul,

In order to configure MemSpecLimit on a node, TaskPlugin=task/cgroup and ConstrainRAMSpace=yes are required. Otherwise, these errors will be logged in that node's slurmd.log:

slurmd: error: Resource spec: cgroup job confinement not configured. MemSpecLimit requires TaskPlugin=task/cgroup and ConstrainRAMSpace=yes in cgroup.conf
slurmd: error: Resource spec: system cgroup memory limit disabled

and the system cgroup memory limit will be disabled.

Now, you should note that MemSpecLimit is a limit on the combined real memory allocation of the compute node daemons (slurmd, slurmstepd), in megabytes. This memory is _not_ available to job allocations. The _daemons_ won't be killed when they exhaust that memory allocation (i.e. the OOM Killer is disabled for the daemons' memory cgroup). If you set MemSpecLimit, the memory left available on the node will be RealMemory - MemSpecLimit. So, for instance, if your node has RealMemory=5000 and MemSpecLimit=1000, the memory available for job allocations on that node will be 4000 (MB).
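A minimal sketch of that combination (the node name is illustrative; the memory figures are the ones from the example above):

    # slurm.conf (fragment)
    TaskPlugin=task/cgroup
    NodeName=nodeX RealMemory=5000 MemSpecLimit=1000   # leaves 4000 MB for job allocations

    # cgroup.conf (fragment)
    ConstrainRAMSpace=yes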
Now, if the job doesn't request memory, meaning it requested neither --mem nor --mem-per-cpu _and_ DefMemPerNode/CPU is not configured, Slurm will try to allocate the node's whole RealMemory to that job. Since 1000MB are reserved for the compute node daemons as MemSpecLimit, you would get this error:

srun: error: Unable to allocate resources: Requested node configuration is not available

If instead memory is requested, for instance with --mem=2000 (or by configuring DefMemPerNode=2000), there will be two different behaviors depending on cgroup.conf. When the job reaches 2000MB, if you have ConstrainSwapSpace=yes and AllowedSwapSpace=0, the job will be OOM-killed. If instead you have ConstrainSwapSpace=no, the job won't allocate more than 2000MB of RAM, but will start using swap space once it reaches 2000MB.

Please let me know if this makes sense. Is there something that isn't working as I describe? I've been playing around and doing some tests this morning on my test machine, and I got the described behavior, which looks correct at first glance to me.

Yes, it all makes sense. Unfortunately, this is preventing us from utilizing the feature. Let me explain what we need and see if there are any suggestions for how to accomplish it:

nodeA information:
RealMemory=100GB
Desired memory limit for a job on nodeA: 95GB

nodeB information:
RealMemory=50GB
Desired memory limit for a job on nodeB: 45GB

A job requests a node with enough resources to handle 70G: --mem=70000

The job runs on nodeA but never on nodeB, due to the resource request.

Job possibility #1: as the job runs, it ends up actually needing 85G, which is more than the requested 70GB (--mem=70000). It is still allowed to continue running and complete.

Job possibility #2: as the job runs, it ends up actually needing 100G, which is more than the 95GB desired limit for nodeA. Once it exceeds that limit, it is OOM-killed.

Any suggested configurations?

Thanks,
Paul

(In reply to paull from comment #58)
> Any suggested configurations?

I'd suggest making a partition for the 100GB nodes and another for the 50GB nodes. Then, in the partition with the 100GB nodes, I'd set DefMemPerNode=97280 (95GB) and MemSpecLimit=5120 (5GB). This way jobs in this partition will be able to allocate up to 95GB (unless the user specifies a smaller amount with --mem), and 5GB will be reserved for the slurmd and slurmstepd daemons; jobs should not be able to use those 5GB. Similarly, I'd configure the 50GB nodes partition with DefMemPerNode=46080 (45GB) and MemSpecLimit=5120 (5GB).
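A slurm.conf sketch of that layout (node and partition names are hypothetical; only the RealMemory, DefMemPerNode and MemSpecLimit figures come from the scenario and suggestion above):

    # Node definitions: MemSpecLimit is a per-node setting.
    NodeName=big[01-02]   RealMemory=102400 MemSpecLimit=5120
    NodeName=small[01-02] RealMemory=51200  MemSpecLimit=5120

    # Partition definitions: DefMemPerNode is a per-partition setting.
    PartitionName=big100g  Nodes=big[01-02]   DefMemPerNode=97280   # 95 GB default per job
    PartitionName=small50g Nodes=small[01-02] DefMemPerNode=46080   # 45 GB default per job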
Note also that at some point in this bug Akmal was concerned that, while monitoring the job's memory usage with ps -eo rss, the rss never got up to RealMemory - MemSpecLimit. This is because virtual memory is decomposed into different parts, and I believe the virtual memory size (vsz) is the sum of total_rss + total_cache (+ total_swap). If you want no swap space to be used by the job, and practically all of the allocated memory to be rss and/or cache, then Slurm's cgroups have to be configured to disable swap usage. To do that, I've been doing some tests and something like this should work:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
MaxSwapPercent=0     <- note that MaxSwapPercent was not hinted at earlier in this bug
TaskAffinity=no

That would explain why Akmal was confused about why jobs could not fully utilize the memory allocatable to them. Please let me know if this config works well for you, or if you notice any behavior that you don't fully understand.

Paul/Akmal, any questions about my last comment? Can we go ahead and close this bug?

Yes, that is fine. Looks like we will have to approach this from a different aspect, as you have pointed out.

(In reply to paull from comment #62)
> Yes that is fine. Looks like we will have to approach this from a different
> aspect as you have pointed out.

Ok. Feel free to reopen if, after the upgrade, there's still something strange or unclear about MemSpecLimit.
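Relating this back to the rss discussion above, a sketch of how to see where a step's memory actually sits, using the step cgroup path logged earlier in this bug (the exact path differs per job, and this assumes cgroup v1 mounted under /sys/fs/cgroup):

    # memory.stat breaks usage down into rss, cache and (if swap accounting is
    # enabled) swap; memory.limit_in_bytes is the enforced cap.
    cgpath=/sys/fs/cgroup/memory/slurm_kud13/uid_1419/job_1121/step_batch
    grep -E '^total_(rss|cache|swap) ' "$cgpath/memory.stat"
    cat "$cgpath/memory.limit_in_bytes"
    cat "$cgpath/memory.usage_in_bytes"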