Bug 3820 - Exceeded step memory limit at some point
Summary: Exceeded step memory limit at some point
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 17.02.3
Hardware: Linux Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Duplicates: 3831
Depends on:
Blocks: 4625
 
Reported: 2017-05-17 16:59 MDT by Jenny Williams
Modified: 2018-02-06 04:10 MST
CC List: 7 users

See Also:
Site: University of North Carolina at Chapel Hill
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.3
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
scontrol show config |sort and cgroups.conf (7.25 KB, text/plain)
2017-05-17 16:59 MDT, Jenny Williams
Details
sacct listing for bwa example (27.53 KB, text/plain)
2017-05-19 09:40 MDT, Jenny Williams
Details
sacct listing for phylogen_tree example (1.26 KB, text/plain)
2017-05-19 09:40 MDT, Jenny Williams
Details
Revert to pre-17.02 behavior (520 bytes, patch)
2017-05-23 16:49 MDT, Tim Wickberg
Details | Diff
Page cache growth (3.90 KB, text/plain)
2017-12-20 08:32 MST, Alejandro Sanchez
Details

Description Jenny Williams 2017-05-17 16:59:16 MDT
Created attachment 4581 [details]
scontrol show config |sort and cgroups.conf

We are having a problem specific to the error message

 "Excceded step memory limit at some point" 

This error appears at the end of the job output.

From our testing, the case where this is triggered (spuriously, in our opinion) is when the VDATA component of VMEM spikes for even a brief moment without being used by the application, where VDATA is defined as the virtual memory size of the private data used by a process, including the heap and shared library data.

In our test case the application is submitted requesting 8GB and never uses more than 5.5GB of real memory. Via atop we see that about 2 minutes into the run VDATA jumps to either 36GB or 28GB (I don't recall which at the moment, but can verify if it is pertinent).
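
For reference, those values can also be sampled directly from /proc while the job runs (a minimal sketch; <pid> is a placeholder for the application's PID, and VmData in /proc/<pid>/status is the value atop shows as VDATA):

# Print the private-data and resident-set sizes once per second while the process lives
while kill -0 <pid> 2>/dev/null; do
    grep -E 'VmData|VmRSS' /proc/<pid>/status
    sleep 1
done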

# sacct -j 2140 --format=MaxRSS,MaxVMSize,MaxDiskRead,MaxDiskWrite,MaxPages,ExitCode,DerivedExitCode,JobID,ReqTRES%30,Start,Elapsed,State   
    MaxRSS  MaxVMSize  MaxDiskRead MaxDiskWrite MaxPages ExitCode DerivedExitCode        JobID                        ReqTRES               Start    Elapsed      State 
---------- ---------- ------------ ------------ -------- -------- --------------- ------------ ------------------------------ ------------------- ---------- ---------- 
                                                              0:0             0:0 2140                    cpu=1,mem=8G,node=1 2017-05-17T11:52:06   00:05:29  COMPLETED 
  5711856K    155516K     3309.39M     1639.43M      10K      0:0                 2140.batch                                  2017-05-17T11:52:06   00:05:29  COMPLETED 

# egrep 2140 /var/log/slurm/slurmd.log
[2017-05-17T11:52:06.911] _run_prolog: prolog with lock for job 2140 ran for 0 seconds
[2017-05-17T11:52:07.063] Launching batch job 2140 for UID 128306
[2017-05-17T11:52:07.089] [2140] task/cgroup: /slurm/uid_128306/job_2140: alloc=8192MB mem.limit=8192MB memsw.limit=9830MB
[2017-05-17T11:52:07.089] [2140] task/cgroup: /slurm/uid_128306/job_2140/step_batch: alloc=8192MB mem.limit=8192MB memsw.limit=9830MB
[2017-05-17T11:57:34.308] [2140] error: Exceeded step memory limit at some point.
[2017-05-17T11:57:35.048] [2140] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2017-05-17T11:57:35.050] [2140] done with job

We see and understand that the application and steps are all completing without error; however, this error message is causing a good bit of unnecessary concern in the user community. Is there a configuration change or code patch that can be recommended to eliminate this message for the user in those cases where the real memory allocated is not exceeded and the application completes without error, but a brief jitter in the VDATA value triggers it? In short, we are not concerned about that specific memory event at all. To the average user, the last line reading "error" carries more weight than an exit code of "0:0".

We are testing using slurm-17.02.3-1 + the patch under attachment 4185 from bug 3531.

RHEL7: 3.10.0-327.36.3.el7.x86_64
Our current configuration slurm.conf and cgroup.conf are attached.  plugstack.conf.d/* files are empty.

Thank you for your time
Jenny
Comment 1 Tim Wickberg 2017-05-18 21:31:36 MDT
There are two different counters that can lead to that, and I'm not sure which one you're seeing.

Either memory.memsw.failcnt and/or memory.failcnt being non-zero will lead to this warning.
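
You can check which of the two counters is non-zero for a given job directly on the compute node, for example (a minimal sketch; the exact path depends on the CgroupMountpoint setting in cgroup.conf and on how the node-level slurm cgroup is named, and <uid>/<jobid> are placeholders):

cg=/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>
cat $cg/memory.failcnt         # times the RAM limit was hit
cat $cg/memory.memsw.failcnt   # times the RAM+swap limit was hit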

While I can give you a local patch to suppress that warning, I don't think it's spurious and wouldn't recommend disabling it throughout your system, although it does sound like the application manages to recover after bouncing off one of those two limits.

I'd be really curious what causes that momentary spike in VDATA. Are there any details about the application you can share that may shed some light?
Comment 2 Jenny Williams 2017-05-19 09:40:13 MDT
Created attachment 4594 [details]
sacct listing for bwa example
Comment 3 Jenny Williams 2017-05-19 09:40:59 MDT
Created attachment 4595 [details]
sacct listing for phylogen_tree example
Comment 4 Jenny Williams 2017-05-19 09:41:21 MDT
Comment on attachment 4594 [details]
sacct listing for bwa example

48G runs see "Memory exceeded at some point ", but 120G jobs do not
Comment 5 Jenny Williams 2017-05-19 09:41:38 MDT
Comment on attachment 4595 [details]
sacct listing for phylogen_tree example

Uses almost no RSS, but required 220GB to run without the "Memory exceeded" error.
Comment 6 Jenny Williams 2017-05-19 09:45:20 MDT
What is consistent amongst the dozen or so users reporting the error is that (a) it is generated when using NFS but not when using GPFS, and (b) the I/O is either for very large files or for very many small files. Typically the field is bioinformatics. While we could test against GPFS, that is not a feasible solution in general given our setup.
Comment 7 Tim Wickberg 2017-05-23 16:49:26 MDT
Created attachment 4629 [details]
Revert to pre-17.02 behavior

I'm attaching a trivial patch which should prevent this while I work to understand the underlying cause.

It does appear that something in the Linux I/O model is counting some of the page cache against the job; one other customer is reporting this problem as well (bug 3831). I'm working on reproducing this and seeing whether there's some other cgroup stat we could switch to.
Comment 8 Jenny Williams 2017-05-23 17:41:35 MDT
We can reliably reproduce the issue if that helps at all.

Thank you for the update and patch 

Jenny
Comment 9 Tim Wickberg 2017-06-21 16:23:07 MDT
*** Bug 3831 has been marked as a duplicate of this bug. ***
Comment 10 Tim Wickberg 2017-06-21 16:26:12 MDT
Unfortunately this is still in progress on my end. We're looking into a number of fixes to the cgroup code before 17.11 is released, although it's unclear to me whether there's an alternative way of gathering the memory usage data that would avoid this issue.

In the meantime, that patch should continue to work throughout the 17.02 release. I do not intend to disable that diagnostic message at this time.

- Tim
Comment 12 Tim Wickberg 2017-09-27 12:21:47 MDT
Revisiting this based on some further discussion yesterday.

Right now, disabling the cgroup memory enforcement and falling back to the periodic memory limit enforcement through jobacct_gather/linux may work around this.

Within the cgroup implementation, there is now a Kmem limit field that may provide a way to tune the amount of reclaimable kernel page caches that are associated with a given job. I'm looking into how that interoperates - the kernel documentation is a little vague on how this interacts with the other limits though.
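
For reference, the kmem side of the cgroup v1 memory controller is exposed through these files (a minimal sketch; whether the kernel-memory counters are actually populated depends on the kernel build, and the job path uses the same placeholder layout as the memory.failcnt example above):

cg=/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>
cat $cg/memory.kmem.limit_in_bytes   # kernel-memory limit (effectively unlimited by default)
cat $cg/memory.kmem.usage_in_bytes   # kernel memory currently charged to the cgroup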

If you have an application that readily triggers this behavior on NFSv3 mounts I would love to have that on hand when testing.

- Tim
Comment 13 David Matthews 2017-09-29 08:13:27 MDT
(In reply to Tim Wickberg from comment #12)
> Right now, disabling the cgroup memory enforcement, and falling back to only
> using the periodic memory limit enforcement through jobacct_gather/linux may
> work around this.

What's wrong with reverting to the pre-17.02 behaviour (i.e. applying the patch)? The new behaviour is clearly wrong and I certainly wouldn't want to disable the cgroup memory enforcement.

> If you have an application that readily triggers this behaviour on NFSv3
> mounts I would love to have that on hand when testing.

I can trigger this problem using a simple dd command, e.g.
dd if=/dev/zero of=/path/to/test/file bs=1MB count=1k conv=fdatasync
I get the same problem writing to local storage or NFSv3.
Comment 14 Tim Wickberg 2017-12-19 20:31:39 MST
I've still been playing with this, but don't have a good overall solution yet. I'm asking one of our other engineers to dig into this in further detail and take a look at what the actual underlying behavior of the Linux cgroup code is.
Comment 16 David Matthews 2017-12-20 01:27:34 MST
Can you explain why you are not prepared to revert to the previous behaviour? Currently slurm will kill off a job which completed successfully because it, incorrectly, thinks it ran out of memory. How is this better than the previous behaviour?

This thread appears to give the key points about the cgroup behaviour:
https://groups.google.com/forum/#!msg/slurm-users/f359KT-3lsU/yRm8vkxeBwAJ
(I included other links in bug 3831).
You cannot use memory.failcnt to determine if oom_killer has been activated.
Comment 17 Alejandro Sanchez 2017-12-20 08:32:51 MST
Created attachment 5795 [details]
Page cache growth

This is my personal interpretation of what I think is happening (so I might be wrong), but the more tests I do, the more convinced I become.

First off, I want to recall a few relevant things; they might be slightly off-topic, but I want to mention them:

1. There are two mechanisms in Slurm to enforce job memory limits. The first interprets the memory samples collected by the JobAcctGatherType plugin; the second uses the TaskPlugin, configured with at least 'task/cgroup', together with ConstrainRAMSpace=yes in cgroup.conf. We recommend enforcing memory only with the TaskPlugin mechanism and using the JobAcctGatherType plugin solely to collect job statistics for accounting purposes, not for memory enforcement. To accomplish that I'd recommend adding 'NoOverMemoryKill' to JobAcctGatherParams (see the sketch below). One disadvantage of using the JobAcctGather plugin to enforce memory is that memory peaks between samples might not be caught, depending on the sampling frequency; the TaskPlugin works like an 'interrupt' rather than a 'poll' mechanism.

2. For accounting purposes, it is better to use jobacct_gather/linux instead of jobacct_gather/cgroup. Both should produce the same results, but the cgroup plugin is slower.
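
A minimal configuration sketch of that recommendation (only the lines relevant to memory enforcement; everything else in the attached slurm.conf/cgroup.conf stays as-is):

# slurm.conf
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=NoOverMemoryKill    # gather stats only; no kills based on sampled usage

# cgroup.conf
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes                  # optional; produces the memsw limit seen in the slurmd log above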

... After this preamble, back on topic. I can reproduce what you mention, using the 'dd' command to write zeros to a local disk (no need for NFS to reproduce, as you said). What I don't agree with, and this can be discussed further, is that the behavior is incorrect. I think it is more a question of what the kernel accounts for in terms of memory when you set memory.limit_in_bytes.

If I submit the following job request:

alex@ibiza:~/t$ srun --mem=990M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 1.33275 s, 768 MB/s
alex@ibiza:~/t$

I don't see any memory limit being enforced. Note I request --mem=990M and the 'dd' command reports copying 977M. If instead I submit a slightly different dd request, now requesting --mem=979M:

alex@ibiza:~/t$ srun --mem=979M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 1.27991 s, 800 MB/s
slurmstepd-compute1: error: Exceeded step memory limit at some point.
srun: error: compute1: task 0: Out Of Memory
alex@ibiza:~/t$

I get the Out Of Memory error. So what I did was, right after submitting the srun, execute the following monitoring script:

alex@ibiza:~/t$ cat monitor_memory.sh 
#!/bin/bash
stats_file=/sys/fs/cgroup/memory/slurm_compute1/uid_1000/job_$1/step_0/memory.stat
while [ -e "$stats_file" ]; do
	echo -e "\n"
	cat $stats_file
	sleep 0.2
done
alex@ibiza:~/t$

alex@ibiza:~/t$ ./monitor_memory.sh 20060 > /home/alex/t/mem.stats.txt

Inspecting the mem.stats.txt (see it attached), you can see that our requested --mem=979M is properly set:

hierarchical_memory_limit 1026555904 // 1026555904 / 1024 / 1024 = --mem=979M
hierarchical_memsw_limit 1026555904

And that the total_rss used by 'dd' stays constant at 1101824 bytes (1101824 / 1024 / 1024 is roughly 1MB):

alex@ibiza:~/t$ grep -w "total_rss" mem.stats.txt 
total_rss 1101824
total_rss 1101824
total_rss 1101824
total_rss 1101824
total_rss 1101824
total_rss 1101824
alex@ibiza:~/t$

BUT I see rapid growth in the total_cache and total_dirty stats:

alex@ibiza:~/t$ egrep "total_cache|total_dirty" mem.stats.txt 
total_cache 203399168
total_dirty 203399168
total_cache 675606528
total_dirty 600109056
total_cache 986255360   <- around here memory.limit_in_bytes is reached
total_dirty 676917248
total_cache 1022705664
total_dirty 503906304
total_cache 1022742528
total_dirty 309915648
total_cache 1022754816
total_dirty 84475904
alex@ibiza:~/t$

Short answer: the page cache is accounted, so even though the 'dd' RSS footprint stays at barely 1MB, you can rapidly get into an Out Of Memory situation depending on your application.

If you look at https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt (2.2.1 Accounting details), you can read:

"All mapped anon pages (RSS) and cache pages (Page Cache) are accounted."

And if you look at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-memory you can read:

memory.limit_in_bytes
sets the maximum amount of user memory (including file cache).

Long answer:

The following answer also describes some of these ideas nicely:
https://serverfault.com/questions/516074/why-are-applications-in-a-memory-limited-lxc-container-writing-large-files-to-di

-----------------------------------------------------
The problem that you're hitting here has to do with how the kernel manages I/O. When you make a write to your filesystem, that write isn't immediately committed to disk; that would be incredibly inefficient. Instead, writes are cached in an area of memory referred to as the page cache, and periodically written in chunks out to disk. The "dirty" section of your log describes the size of this page cache that hasn't been written out to disk yet:

> So what empties this dirty cache? Why isn't it doing its job?

'Flush' on Linux is responsible for writing dirty pages out to disk. It's a daemon that wakes up periodically to determine if writes to disk are required, and, if so, performs them. If you are a C type of guy, start here. Flush is incredibly efficient; it does a great job of flushing stuff to disk when needed. And it's working exactly how it is supposed to.

... A program starts to write a big file, it buffers... buffers... and eventually hits its hard limit, and starts calling the OOM manager. This isn't a failure of any particular component; it's expected behavior. Kind of. This sort of thing should be handled by cgroups, but it doesn't seem like it is.

> Ok, that makes sense. But it's useless. I still need to write a big file.

... There are a few options here for things you can try to tweak:

/proc/sys/vm/dirty_expire_centisecs
This controls how long a page can be held in the dirty cache before being written to disk. By default it's 30 seconds; try setting it lower to start pushing data out faster.

/proc/sys/vm/dirty_background_ratio
This controls what percentage of memory is allowed to fill with dirty pages before background writeback starts forcing them out to disk. There is a bit of fiddling that goes into sorting out the exact total here, but the easiest explanation is to just look at your total memory. By default it's 10% (on some distros it's 5%). Set this lower; it'll force writes out to disk sooner.

> Can't I just screw with the filesystem a bit?

Well, yeah. But make sure you test this out... you could affect performance. On the mounts in /etc/fstab that you'll be writing to, add the 'sync' mount option.

> Original answer:

Try reducing the blocksize used by DD:

dd if=/dev/zero of=test2 bs=512 count=1024000
You can only write one sector at a time (512 bytes on older HDDs, 4096 on newer). If DD is pushing writes to disk faster than the disk can accept them, it will start caching the writes in memory. That's why your file cache is growing.

-----------------------------------------------------
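
For completeness, those vm knobs can be inspected and adjusted with sysctl (a minimal sketch; the values are purely illustrative, not a tuning recommendation):

sysctl vm.dirty_expire_centisecs vm.dirty_background_ratio   # show the current values
sysctl -w vm.dirty_expire_centisecs=1000                     # expire dirty pages after 10 seconds
sysctl -w vm.dirty_background_ratio=5                        # start background writeback at 5% of memory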

... So the page cache contributes to memory.limit_in_bytes; that's how the kernel works, and for now I don't think it is Slurm's fault.

We can discuss how to mitigate this, though, and potentially reclassify this as sev-5 so that new config options can be added to change these /proc/sys/vm/ parameters from Slurm; otherwise, I am open to proposals.

Hope this is helpful; let me know if I am not on the right track with my thinking.
Comment 18 Kaizaad 2017-12-26 13:21:09 MST
(In reply to David Matthews from comment #16)
> Can you explain why you are not prepared to revert to the previous
> behaviour? Currently slurm will kill off a job which completed successfully
> because it, incorrectly, thinks it ran out of memory. How is this better
> than the previous behaviour?
> 
> This thread appears to give the key points about the cgroup behaviour:
> https://groups.google.com/forum/#!msg/slurm-users/f359KT-3lsU/yRm8vkxeBwAJ
> (I included other links in bug 3831).
> You cannot use memory.failcnt to determine if oom_killer has been activated.

I agree, and I think this is the gist of the problem we are facing: using memory.failcnt to determine an OOM condition is incorrect, especially when programs process large files and "bounce up" against the cgroup memory limit _without_ being OOM'd, because the kernel is able to reclaim their page cache. This change seems to have been introduced here:

https://bugs.schedmd.com/show_bug.cgi?id=1963#c2

thanks
-k
Comment 19 Jenny Williams 2017-12-26 13:21:20 MST
 I am out of the office until Jan 2nd, 2018. If you have questions that require a reply please email research@unc.edu.

Regards,
Virginia Williams
Systems Administrator
Research Computing
UNC Chapel Hill
Comment 20 Alejandro Sanchez 2017-12-27 06:12:35 MST
Ok. I see your point now. There are a few points I agree could be improved to provide better feedback on Out Of Memory conditions.

1. Since commit https://github.com/SchedMD/slurm/commit/707268a521ea9beb85 , a failcnt > 0 is logged as an error() instead of info().

2. Since commit https://github.com/SchedMD/slurm/commit/818a09e802587e68ab , a failcnt > 0 makes it so the step/job is recorded as OutOfMemory.

According to the kernel memory documentation[1], "failcnt shows the number of times that a usage counter hit its limit. When a memory cgroup hits a limit, failcnt increases and memory under it will be reclaimed."

... so yes, I agree that failcnt > 0 doesn't necessarily imply that the OOM killer actually killed the process. Under out-of-memory conditions the kernel might reclaim unused pages, and that might resolve the situation without any kill being involved. I think the best course would be to reclassify this as a sev-5 targeted for 18.08 and make some changes:

If failcnt > 0, don't consider it an error; simply log that usage hit the limit failcnt times. Perhaps also report the disaggregated memory.stat values so that users can discern how much RSS, page cache and the other factors contribute to the usage (that could be optional). Then implement a mechanism using memory.oom_control to register notifications, which should be more reliable than the failcnt value. Perhaps something similar to Ryan Cox's oom notifierd daemon.
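
For reference, the cgroup v1 interface such a notification mechanism would build on looks like this (a minimal sketch; it only reads the control file, while an actual notifier registers by writing an eventfd descriptor together with the memory.oom_control fd into cgroup.event_control, and the path follows the same placeholder layout as the monitoring script in comment 17):

cg=/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_<stepid>
cat $cg/memory.oom_control   # reports oom_kill_disable and under_oom (newer kernels also show an oom_kill count)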

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
Comment 23 Alejandro Sanchez 2018-01-04 04:52:40 MST
Hi Jenny. I've committed the following patch to 17.11.2+ :

https://github.com/SchedMD/slurm/commit/38e15bf74b14e8cbfc8b09305a

which will clarify the messages reported under out-of-memory conditions. Basically this will tell users that failcnt > 0 does not necessarily mean that a problem happened, just that the memory[+swap] limit was hit at some point during execution. With this commit I am going ahead and closing this bug, since that was the original concern with the user community.

I've also opened a new sev-5 bug 4578 with some ideas to provide better and more detailed feedback to users/admins under OOM conditions: catching when the oom-killer actually kills a process, potentially appending the kernel syslog info when a process is killed, and reporting the disaggregated memory.stat info when these events happen.
Comment 24 David Matthews 2018-01-05 03:27:21 MST
(In reply to Alejandro Sanchez from comment #23)
> With this commit I am going ahead
> and close this bug since that was the original concern with the user
> community.

Our issue is not just the message; it's the fact that jobs which should have succeeded are failing. Please see bug 3831, which was closed as a duplicate of this bug. This is certainly not a sev-5 issue for us. I suggest 3831 should be re-opened.
Comment 25 Alejandro Sanchez 2018-01-05 03:57:37 MST
(In reply to David Matthews from comment #24)
> (In reply to Alejandro Sanchez from comment #23)
> > With this commit I am going ahead
> > and close this bug since that was the original concern with the user
> > community.
> 
> Our issue is not just the message - it's that fact that jobs are failing
> which should have succeeded. Please see bug 3831 which was closed as a
> duplicate of this bug. This is certainly not a sev 5 issue for us. I suggest
> 3831 should be re-opened.

What is wrong with the updated message?

alex@ibiza:~/t$ srun --mem=900M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 3.40383 s, 301 MB/s
slurmstepd-compute1: error: Step 20002.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
srun: error: compute1: task 0: Out Of Memory
alex@ibiza:~/t$ sacct -j 20002
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
20002                dd         p1      acct1          2 OUT_OF_ME+    0:125 
alex@ibiza:~/t$

It clearly states that the usage hit the limit at some point (which is totally true), but it also states that this may or may _not_ result in some failure (which is also true). It doesn't state that this is 100% an error; that depends on whether the kernel succeeded in reclaiming unused pages so that the OOM killer never actually triggered.

Note that the sacct State OUT_OF_MEMORY doesn't mean that the job failed, just that usage hit the limit at some point during execution; note also that the ExitCode is 0 (success). So I don't understand why you state that the jobs are failing.

I know the feedback could be improved as per the comments in bug 4578, but right now what the message states and what sacct reports in the State and ExitCode fields don't seem incorrect to me. In any case, I am open to suggestions.
Comment 26 David Matthews 2018-01-05 04:09:40 MST
(In reply to Alejandro Sanchez from comment #25)
> Note the sacct State OUT_OF_MEMORY doesn't mean that the job failed, just
> that the usage hit the limit at some point during execution and note also
> that the ExitCode is 0 (sucess). So don't understand why you state that the
> jobs are failing.
> 
> I know the feedback could be improved as per comments in bug 4578, but right
> now what the message states and what sacct reports in the State and ExitCode
> fields doesn't seem incorrect to me. Anyhow I am open to suggestions.

Please refer to bug 3831. The problem I reported results in jobs in the FAILED state. This has not been addressed as far as I can see.
Comment 27 Alejandro Sanchez 2018-01-05 04:58:12 MST
I've read bug 3831. The problem with the current 17.11.2 is that you would like these messages:

slurmstepd-compute1: error: Step 20002.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
srun: error: compute1: task 0: Out Of Memory

to continue being presented in the user output, but shown as an info message rather than prefixed with the word 'error'. Is that correct?

(In reply to David Matthews from comment #26)
> Please refer to bug 3831. The problem I reported results in jobs in the
> FAILED state. This has not been addressed as far as I can see.

With 17.11.2 I don't see the job recorded as FAILED, but instead as OUT_OF_MEMORY with ExitCode 0:125 as I showed in my previous comment.
Comment 28 David Matthews 2018-01-05 05:22:59 MST
(In reply to Alejandro Sanchez from comment #27)
> I've read bug 3831. Problem with current 17.11.2 is that you would like
> these messages:
> 
> slurmstepd-compute1: error: Step 20002.0 hit memory+swap limit at least once
> during execution. This may or may not result in some failure.
> srun: error: compute1: task 0: Out Of Memory
> 
> to continue being presented to the user output, but instead of prepended
> with the word 'error', just show them as a info message. Is that correct?

I would prefer them to be an info message, yes (because they don't necessarily indicate an error and users find this confusing). However, that is not my main concern.

> With 17.11.2 I don't see the job recorded as FAILED, but instead as
> OUT_OF_MEMORY with ExitCode 0:125 as I showed in my previous comment.

I'm not currently in a position to test 17.11. With 17.02 the situation remains unchanged.

For the moment I'll continue with 17.02 (patched) and will look at this again when we upgrade to 17.11.
Comment 29 Alejandro Sanchez 2018-01-05 05:33:25 MST
(In reply to David Matthews from comment #28)
> I would prefer them to be an info message yes (because they don't
> necessarily indicate an error and users find this confusing). However, that
> is not my main concern.

There is one problem, though. If we change these messages from error() to info(), they will not be displayed by default in the output of srun. From the srun man page for the --slurmd-debug[1] option:

"The slurmd debug information is copied onto the stderr of the job. By default only errors are displayed. This option applies to job and step allocations."

[1] https://slurm.schedmd.com/srun.html#OPT_slurmd-debug

Since most users won't even know about this option, I am more inclined to leave the messages as errors() and instruct the user community that the messages don't necessarily indicate an application failure; they may or may not. What do you think?
 
> > With 17.11.2 I don't see the job recorded as FAILED, but instead as
> > OUT_OF_MEMORY with ExitCode 0:125 as I showed in my previous comment.
> 
> I'm not currently in a position to test 17.11. With 17.02 the situation
> remains unchanged.
> 
> For the moment I'll continue with 17.02 (patched) and will look at this
> again when we upgrade to 17.11.

Ok, I still don't understand where you see the job recorded as FAILED and/or in which version. In 17.11.2 it is OUT_OF_MEMORY with ExitCode 0:125.
Comment 30 Alejandro Sanchez 2018-01-05 05:58:07 MST
I think I can come up with an intermediate solution for the info() problem. Let me discuss it internally and I'll come back to you.
Comment 31 David Matthews 2018-01-05 07:26:50 MST
(In reply to Alejandro Sanchez from comment #30)
> I think I can come up with an intermediate solution for the info() problem.
> Let me discuss it internally and I'll come back to you.
Just to note (as discussed in bug 3831): at 15.08 users would receive the memory-exceeded warning message without "error:" preceding it. I'd just like this behaviour restored. Thanks.
Comment 32 Kaizaad 2018-01-05 07:34:07 MST
(In reply to Alejandro Sanchez from comment #25)
> 
> What is wrong with the updated message?
> 
> alex@ibiza:~/t$ srun --mem=900M dd if=/dev/zero of=/home/alex/t/file bs=1MB
> count=1k conv=fdatasync
> 1024+0 records in
> 1024+0 records out
> 1024000000 bytes (1.0 GB, 977 MiB) copied, 3.40383 s, 301 MB/s
> slurmstepd-compute1: error: Step 20002.0 hit memory+swap limit at least once
> during execution. This may or may not result in some failure.
> srun: error: compute1: task 0: Out Of Memory
> alex@ibiza:~/t$ sacct -j 20002
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
> ------------ ---------- ---------- ---------- ---------- ---------- -------- 
> 20002                dd         p1      acct1          2 OUT_OF_ME+    0:125 
> alex@ibiza:~/t$
> 
> It clearly states that the usage hit the limit at some point (which is
> totally true), but it also states that this may or may _not_ result in some
> failure (which is also true). It doesn't state that this is 100% an error,
> and it depends on whether the Kernel succeed to reclaim unused pages and
> OOM-Killer didn't actually trigger. 
> 
> Note the sacct State OUT_OF_MEMORY doesn't mean that the job failed, just
> that the usage hit the limit at some point during execution and note also
> that the ExitCode is 0 (sucess). So don't understand why you state that the
> jobs are failing.
> 
> I know the feedback could be improved as per comments in bug 4578, but right
> now what the message states and what sacct reports in the State and ExitCode
> fields doesn't seem incorrect to me. Anyhow I am open to suggestions.


I'd like to add that even though the exit code is 0, if you have "#SBATCH --mail-type=FAIL" you will still get an e-mail, so slurm still thinks the job has failed somehow.

thanks
-k
Comment 41 Kaizaad 2018-01-08 12:48:24 MST
(In reply to Kaizaad from comment #32)
> I'd like to add that even though the exit code is 0, if you have "#SBATCH
> --mail-type=FAIL" you will still get an e-mail, so slurm still thinks the
> job has failed somehow.


Furthermore, if you also have a job dependency chain that uses "afterok", Slurm will still consider the job failed and will not continue with the chain.
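
For illustration, this is the kind of chain that gets stuck (a minimal sketch; step1.sh and step2.sh are hypothetical job scripts, and step1 is assumed to hit the cgroup limit while still exiting with code 0):

# Submit the first job and capture its job ID
jid=$(sbatch --parsable --mem=1G --mail-type=FAIL step1.sh)
# The second job never starts, because the first is not recorded as having completed OK
sbatch --dependency=afterok:$jid step2.sh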

thanks
-k
Comment 42 Tyson Whitehead 2018-01-09 09:02:19 MST
A quick note that on a Lustre filesystem the job can actually be killed when it bumps into the limit, due to a known bug in Lustre:

https://jira.hpdd.intel.com/browse/LU-5801
Comment 62 Alejandro Sanchez 2018-01-23 07:04:16 MST
This has been fixed in the following commit:

https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb

which will be available starting at 17.11.3.

Here's an example of current behavior and output:

1. Job actually being oom-killed:
alex@ibiza:~/t$ srun --mem=200 ./mem_eater
slurmstepd-compute1: error: Detected 1 oom-kill event(s) in step 20014.0 cgroup.
srun: error: compute1: task 0: Out Of Memory
alex@ibiza:~/t$ sacct -j 20014
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
20014         mem_eater         p2      acct1          2 OUT_OF_ME+    0:125 


2. Job memory usage hitting the limit but not being oom-killed:
alex@ibiza:~/t$ srun --slurmd-debug=info --mem=900M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
...
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 2.01502 s, 508 MB/s
slurmstepd-compute1: task 0 (9852) exited with exit code 0.
slurmstepd-compute1: Step 20016.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
slurmstepd-compute1: debug:  _oom_event_monitor: No oom events detected.
slurmstepd-compute1: debug:  _oom_event_monitor: stopping.
...
alex@ibiza:~/t$ sacct -j 20016
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
20016                dd         p2      acct1          2  COMPLETED      0:0 
alex@ibiza:~/t$

alex@ibiza:~/t$ cat test.bash
#!/bin/bash
#SBATCH --mem=1500
./mem_eater &
./mem_eater &
srun --mem=200 ./mem_eater &
wait
alex@ibiza:~/t$ sbatch test.bash
Submitted batch job 20018
alex@ibiza:~/t$ ls -ltr slurm-20018.out 
-rw-r--r-- 1 alex alex 459 Jan 23 14:47 slurm-20018.out
alex@ibiza:~/t$ cat slurm-20018.out 
srun: error: compute1: task 0: Out Of Memory
slurmstepd-compute1: error: Detected 1 oom-kill event(s) in step 20018.0 cgroup.
/home/alex/slurm/17.11/ibiza/spool/slurmd-compute1/job20018/slurm_script: line 6: 10056 Killed                  ./mem_eater
/home/alex/slurm/17.11/ibiza/spool/slurmd-compute1/job20018/slurm_script: line 6: 10057 Killed                  ./mem_eater
slurmstepd-compute1: error: Detected 2 oom-kill event(s) in step 20018.batch cgroup.
alex@ibiza:~/t$ sacct -j 20018
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
20018         test.bash         p2      acct1          2 OUT_OF_ME+    0:125 
20018.batch       batch                 acct1          2 OUT_OF_ME+    0:125 
20018.0       mem_eater                 acct1          1 OUT_OF_ME+    0:125 
alex@ibiza:~/t$

I think this better reflects what's actually happening under OOM conditions. Note also that this fixes the 'afterok' dependencies (if the job is oom-killed the dependent job won't start, but if it only hits the limit it will; bug 4625). It also addresses the --mail-type=FAIL concern (comment 32).

BYU HPC sites might consider switching from their notifierd daemon to using this mechanism.

I'll leave this bug open for a few days in case you want to add further questions and/or issues. If nothing else is raised, I'll close it and leave the sev-5 bug 4578 for further work on changing the API for 18.08 (for instance adding a SystemComment when the usage hits the limit). Thanks for your patience.
Comment 63 David Matthews 2018-01-23 07:20:01 MST
Thanks - I won't be able to test it for a while but it looks like a significant improvement.
Comment 64 Alejandro Sanchez 2018-02-06 04:10:03 MST
Marking this as resolved/fixed as per the provided patch. Future work will be tracked in bug 4578.