Ticket 3831 - Out of memory failures after upgrade
Summary: Out of memory failures after upgrade
Status: RESOLVED DUPLICATE of ticket 3820
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.3
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-05-23 10:12 MDT by David Matthews
Modified: 2017-06-22 03:34 MDT

See Also:
Site: Met Office
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
15.08.7 output (17.43 KB, text/x-log)
2017-05-23 10:13 MDT, David Matthews
17.02.3 output (20.30 KB, text/x-log)
2017-05-23 10:14 MDT, David Matthews
disable OOM job status (520 bytes, patch)
2017-05-23 15:30 MDT, Tim Wickberg

Description David Matthews 2017-05-23 10:12:49 MDT
I'm currently trying to upgrade from 15.08.7 to 17.02.3. However, I'm finding some jobs are now failing which previously ran OK.

I've run the same job on 15.08.7 & 17.02.3. I've attached 2 log files containing the relevant contents of slurmd.log and the slurm config files for each version. The slurm configurations I'm using are effectively the same except CgroupReleaseAgentDir is disabled in 17.02.3.

In both cases the parallel task (task 0) appears to run correctly and exits with exit code 0. There is an "Exceeded job memory limit at some point" message, but this is not unusual. However, with 17.02.3 the task then fails. Stderr from the jobs contains:

slurmstepd: error: Exceeded job memory limit at some point.
srun: error: expspicesrv071: task 0: Out Of Memory
[mpiexec@expspicesrv071] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@expspicesrv071] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@expspicesrv071] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@expspicesrv071] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

What has changed since 15.08.7 to cause this difference?
Can I configure slurm to behave as it did before (or was the previous behaviour wrong in some way)?
Comment 1 David Matthews 2017-05-23 10:13:33 MDT
Created attachment 4619 [details]
15.08.7 output
Comment 2 David Matthews 2017-05-23 10:14:08 MDT
Created attachment 4620 [details]
17.02.3 output
Comment 3 Tim Wickberg 2017-05-23 10:58:13 MDT
(In reply to David Matthews from comment #0)
> I'm currently trying to upgrade from 15.08.7 to 17.02.3. However, I'm
> finding some jobs are now failing which previously ran OK.
> 
> I've run the same job on 15.08.7 & 17.02.3. I've attached 2 log files
> containing the relevant contents of slurmd.log and the slurm config files
> for each version. The slurm configurations I'm using are effectively the
> same except CgroupReleaseAgentDir is disabled in 17.02.3.
> 
> In both cases the parallel task (task 0) appears to run correctly and exits
> with exit code 0. There is a "Exceeded job memory limit at some point"
> message but this is not unusual". However with 17.02.3 the task then fails.
> Stderr from the jobs contains:
>
> What has changed since 15.08.7 to cause this difference?
> Can I configure slurm to behave as it did before (or was the previous
> behaviour wrong in some way)?

The 17.02 release added this in - it's designed to catch steps that have potentially been killed due to OOM (as indicated by the task/cgroup plugin), and return an error rather than silently continue. The message "Exceeded job memory limit at some point" corresponds to that changed behavior.

There is no way to disable this at the moment; I can give you a trivial patch that disables it in the meantime if you'd like, although the final fix on this will likely be rather different.

The jobs themselves - this should only be triggering if the job actually hit a memory limit. I take it that doesn't appear to have affected the job run in any fashion?

I do have one other open issue related to this, where it seems like the cgroup subsystem has reported the step hitting the memory limit but the application itself seems to complete correctly. I'd like to better understand how that can even happen - my impression was that the values we're checking should only be set in circumstances when processes have been killed, but it appears that may be an incorrect assumption.
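
If you want to look at what the memory cgroup reports for a step while it runs, something like the following will show the relevant counters. This is only a sketch assuming cgroup v1 with the memory controller mounted in the usual place; the uid/job/step components of the path are illustrative and will differ on your system.

# Adjust the path for your cgroup mount point and hierarchy.
STEP_CG=/sys/fs/cgroup/memory/slurm/uid_1234/job_1687827/step_0
cat $STEP_CG/memory.failcnt              # number of times the limit was hit
cat $STEP_CG/memory.max_usage_in_bytes   # peak usage (rss + page cache)
cat $STEP_CG/memory.oom_control          # oom_kill_disable / under_oom flags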

- Tim
Comment 4 David Matthews 2017-05-23 12:13:32 MDT
(In reply to Tim Wickberg from comment #3)
> The 17.02 release added this in - it's designed to catch steps that have
> potentially been killed due to OOM (as indicated by the task/cgroup plugin),
> and return an error rather than silently continue. The message "Exceeded job
> memory limit at some point" corresponds to that changed behavior.

Are you saying this new behaviour will always happen when the "Exceeded job memory limit at some point" message appears? We've always had lots of these messages and I understood this happens with jobs that do a lot of I/O (the message is a significant support annoyance since it, not surprisingly, confuses our users). Does the changed behaviour only apply to srun?

This is how we've previously explained this error message to our users:
"This can be due to use of file cache (I/O) pushing your job right up to the cgroup 'memory' limit (cgroup memory = rss + file cache). This looks a little bit like an error to Slurm, but if the file cache is reclaimable then there is no harm in hitting the limit. If it is not reclaimable, your job would fail."
Have we got this wrong?

I'm not aware we have an issue with jobs silently continuing if OOM killer is activated.

The OOM killer is certainly not being activated in this case so, presumably, slurm shouldn't be indicating a failure.
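
To illustrate what I mean about the file cache (a sketch only, assuming cgroup v1; the path is illustrative), the memory cgroup charges page cache to the same counter as rss, so a job doing heavy I/O can hit the limit without any process being anywhere near it:

# Illustrative path - the step cgroup will differ on your system.
CG=/sys/fs/cgroup/memory/slurm/uid_1234/job_1687827/step_0
grep -E '^(rss|cache) ' $CG/memory.stat   # rss and page cache, both charged
cat $CG/memory.limit_in_bytes             # the limit they are charged against
cat $CG/memory.failcnt                    # increments even if the cache was reclaimable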

> There is no way to disable this at the moment; I can give you a trivial
> patch that disables it in the meantime if you'd like, although the final fix
> on this will likely be rather different.

A patch would be appreciated - I don't think I can go ahead with the upgrade as things stand. When do you anticipate a proper fix? Is there an issue I can track?

> The jobs themselves - this should only be triggering if the job actually hit
> a memory limit. I take it that doesn't appear to have affected the job run
> in any fashion?

The task appears to have run fine and the OOM killer has not been activated.

> I do have one other open issue related to this, where it seems like the
> cgroup subsystem has reported the step hitting the memory limit but the
> application itself seems to complete correctly. I'd like to better
> understand how that can even happen - my impression was that the values
> we're checking should only be set in circumstances when processes have been
> killed, but it appears that may be an incorrect assumption.

Let me know if I can provide any further details which would help.
Comment 5 Tim Wickberg 2017-05-23 15:29:22 MDT
(In reply to David Matthews from comment #4)
> (In reply to Tim Wickberg from comment #3)
> > The 17.02 release added this in - it's designed to catch steps that have
> > potentially been killed due to OOM (as indicated by the task/cgroup plugin),
> > and return an error rather than silently continue. The message "Exceeded job
> > memory limit at some point" corresponds to that changed behavior.
> 
> Are you saying this new behaviour will always happen when the "Exceeded job
> memory limit at some point" message appears? We've always had lots of these
> messages and I understood this happens with jobs that do a lot of I/O (the
> message is a significant support annoyance since it, not surprisingly,
> confuses our users). Does the changed behaviour only apply to srun?

Anytime you see that message, you'll see the step marked as OOM.

This does not apply just to steps launched with srun, but to the batch script as well.

> This is how we've previously explained this error message to our users:
> "This can be due to use of file cache (I/O) pushing your job right up to the
> cgroup 'memory' limit (cgroup memory = rss + file cache). This looks a
> little bit like an error to Slurm, but if the file cache is reclaimable then
> there is no harm in hitting the limit. If it is not reclaimable, your job
> would fail."
> Have we got this wrong?

I think that's accurate.

> I'm not aware we have an issue with jobs silently continuing if OOM killer
> is activated.
> 
> The OOM killer is certainly not being activated in this case so, presumably,
> slurm shouldn't be indicating a failure.



> > There is no way to disable this at the moment; I can give you a trivial
> > patch that disables it in the meantime if you'd like, although the final fix
> > on this will likely be rather different.
> 
> A patch would be appreciated - I don't think I can go ahead with the upgrade
> as things stand. When do you anticipate a proper fix? Is there an issue I
> can track?

This bug and bug 3820 are both tracking this; I may mark this as a duplicate of that one at some point to consolidate our efforts.

I would like to better understand this, and would anticipate at the very least offering some option for how to handle these situations, and whether to flag the job as OOM in such circumstances. But the part that's still not clear to me is why the cgroup seems to indicate the limit was crossed.

> > The jobs themselves - this should only be triggering if the job actually hit
> > a memory limit. I take it that doesn't appear to have affected the job run
> > in any fashion?
> 
> The task appears to have run fine and the OOM killer has not be activated.
> 
> > I do have one other open issue related to this, where it seems like the
> > cgroup subsystem has reported the step hitting the memory limit but the
> > application itself seems to complete correctly. I'd like to better
> > understand how that can even happen - my impression was that the values
> > we're checking should only be set in circumstances when processes have been
> > killed, but it appears that may be an incorrect assumption.
> 
> Let me know if I can provide any further details which would help.

Knowing this seems to be caused by high I/O does help - I believe that is also indicated by bug 3820, so at least the cause seems clear. I'll be working to reproduce this and see what we may be able to do to mitigate it.
Comment 6 Tim Wickberg 2017-05-23 15:30:16 MDT
Created attachment 4626 [details]
disable OOM job status

The attached patch should prevent the JOB_OOM state from being set; the same warning messages you're used to will still be printed though.
Comment 8 David Matthews 2017-05-24 07:52:38 MDT
(In reply to Tim Wickberg from comment #5)
> (In reply to David Matthews from comment #4)
> > (In reply to Tim Wickberg from comment #3)
> > > The 17.02 release added this in - it's designed to catch steps that have
> > > potentially been killed due to OOM (as indicated by the task/cgroup plugin),
> > > and return an error rather than silently continue. The message "Exceeded job
> > > memory limit at some point" corresponds to that changed behavior.
> > 
> > Are you saying this new behaviour will always happen when the "Exceeded job
> > memory limit at some point" message appears? We've always had lots of these
> > messages and I understood this happens with jobs that do a lot of I/O (the
> > message is a significant support annoyance since it, not surprisingly,
> > confuses our users). Does the changed behaviour only apply to srun?
> 
> Anytime you see that message, you'll see the step marked as OOM.

I'm not sure what you mean. There is no indication of OOM that I can see. This is the sacct output for the case I reported:

# SACCT_FORMAT="jobname%15,elapsed,totalcpu,reqmem,maxrss,exitcode,state" sacct -j 1687827
        JobName    Elapsed   TotalCPU     ReqMem     MaxRSS ExitCode      State 
--------------- ---------- ---------- ---------- ---------- -------- ---------- 
um.recon_meto_+   00:00:58  01:07.250        1Gn                 1:0     FAILED 
          batch   00:00:58  00:06.975        1Gn     30984K      1:0     FAILED 
hydra_pmi_proxy   00:00:50  01:00.274        1Gn    139864K      0:0  COMPLETED 

> This does not apply just to steps launched with srun, but to the batch
> script as well.

It doesn't seem to apply to the batch script in my tests. I can easily trigger "Exceeded step memory limit at some point" using a simple dd command writing to a temporary file. If I run this directly in the batch script I get the warning in my output but the job succeeds:

        JobName    Elapsed   TotalCPU     ReqMem     MaxRSS ExitCode      State 
--------------- ---------- ---------- ---------- ---------- -------- ---------- 
     test-dd.sh   00:00:02  00:01.132      500Mn                 0:0  COMPLETED 
          batch   00:00:02  00:01.132      500Mn      1364K      0:0  COMPLETED 

If I add "srun" before the dd command in my batch script it fails:

        JobName    Elapsed   TotalCPU     ReqMem     MaxRSS ExitCode      State 
--------------- ---------- ---------- ---------- ---------- -------- ---------- 
     test-dd.sh   00:00:02  00:01.031      500Mn               125:0     FAILED 
          batch   00:00:02  00:00.133      500Mn      1360K    125:0     FAILED 
             dd   00:00:01  00:00.897      500Mn      1272K      0:0  COMPLETED 
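
In case it helps you reproduce this, the test script is essentially the following (a sketch only - the dd size is illustrative; anything that writes enough through the page cache to reach the memory limit will do):

#!/bin/bash
#SBATCH --mem=500M

# Writing through the page cache pushes the step's cgroup memory usage
# up to the limit even though rss stays tiny.
dd if=/dev/zero of=/tmp/dd-test.$SLURM_JOB_ID bs=1M count=1024
rm -f /tmp/dd-test.$SLURM_JOB_ID

# Failing variant: prefix the dd line with srun, i.e.
#   srun dd if=/dev/zero of=/tmp/dd-test.$SLURM_JOB_ID bs=1M count=1024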

> This bug and bug 3820 are both tracking this; I may mark this as a duplicate
> of that one at some point to consolidate our efforts.

Perhaps I misunderstand, but it appears to me that 3820 is very different. My problem is unexpected failures; 3820 is simply complaining about the error message appearing in the output, isn't it? This message has always appeared, although 17.02 has made things worse by adding "error" before the message, which appears more serious to the users (and is wrong, since it does not necessarily indicate an error).

The message is essential for cases where the OOM killer has activated. I believe disabling it in a previous release caused problems:
https://bugs.schedmd.com/show_bug.cgi?id=1682

> I would like to better understand this, and would anticipate at the very
> least offering some option for how to handle these situations, and if to
> flag the job as OOM in such circumstances. But the part that's still not
> clear to me is why the cgroup seems to indicate the limit was crossed.

I'm new to this but, as I understand it, it is normal for this limit to be reached. The only reliable way to know if OOM has been triggered is to use cgroup notifications. Further info:
https://groups.google.com/d/msg/slurm-devel/f359KT-3lsU/yRm8vkxeBwAJ
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html#memory_example-usage
https://codepoets.co.uk/2016/adventures-with-cgroups-resource-control/
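
In other words (again a sketch, cgroup v1, illustrative path), the state that distinguishes a real OOM event from simply touching the limit looks like this:

CG=/sys/fs/cgroup/memory/slurm/uid_1234/job_1687827/step_0
cat $CG/memory.failcnt      # increments whenever usage merely hits the limit
cat $CG/memory.oom_control  # "under_oom 1" only while the kernel is OOM-killing
# A reliable check registers an eventfd against memory.oom_control via
# cgroup.event_control (see the Red Hat doc above), so it only fires on real
# OOM events rather than on reclaimable cache bumping into the limit.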

I'm aiming to try your patch tomorrow. Let me know if there are any tests you'd like me to try before I do so.
Comment 9 David Matthews 2017-05-31 01:12:53 MDT
The patch is working - thanks.

> This message has always appeared although 17.02 has made things worse by adding
> "error" before the message which appears more serious to the users (and is
> wrong since it does not necessary indicate an error).

Is it possible to do anything about this? It is already difficult to convince users that their job has run OK when they see
"slurmstepd: Exceeded step memory limit at some point."
in their output. It's going to be even harder when it says
"slurmstepd: error: Exceeded step memory limit at some point."
"Error" implies something has gone wrong, whereas this message is simply a warning that something might have gone wrong.
Comment 10 Tim Wickberg 2017-06-21 16:23:07 MDT
I'm closing this as a duplicate of bug 3820 which describes the same behavior.

We're still looking into whether there may be a better way to aggregate this info for the 17.11 release, but in the meantime that patch should continue to work. I do not plan to disable that message in the 17.02 release.

*** This ticket has been marked as a duplicate of ticket 3820 ***
Comment 11 David Matthews 2017-06-22 03:34:10 MDT
I'm rather surprised that this patch has not been applied to the latest 17.02 release. Without it, Slurm causes tasks which ran perfectly correctly to fail. This seems like a serious bug to me.

Please note that I agree the message should not be disabled - it is currently the only indication to users that a task failure MAY have been caused by oom_killer. However, the addition of "error" to the message is clearly wrong - it is not an error, simply a warning that there may have been an error.