Ticket 1682 - No memory exceeded error message
Summary: No memory exceeded error message
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.11.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-05-20 04:54 MDT by Will French
Modified: 2015-06-08 04:10 MDT
CC List: 3 users

See Also:
Site: Vanderbilt
Version Fixed: 14.11.8


Attachments
slurm.conf (6.84 KB, text/plain), 2015-05-20 05:35 MDT, Will French
eat.c (367 bytes, text/x-csrc), 2015-05-20 05:36 MDT, Will French
Makefile (32 bytes, text/plain), 2015-05-20 05:36 MDT, Will French
SLURM batch script (135 bytes, application/x-shellscript), 2015-05-20 05:37 MDT, Will French

Description Will French 2015-05-20 04:54:23 MDT
Hello,

I just noticed that SLURM is no longer giving users information when their jobs are killed due to exceeding memory. I'm not sure when this changed (it may have changed in 14.11.5 but I just never noticed). Before I would get a message like this at the end of my stdout/stderr file:

/usr/spool/slurm/job32105/slurm_script: line 10: 22492 Killed                  ./eat_while
slurmstepd: error: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.

Now all I get is:

/usr/spool/slurm/job1950924/slurm_script: line 9:  5830 Killed                  ./eat_while


Was this change intentional? It's very useful to receive information from SLURM about why a job was killed, especially for memory errors, since MaxRSS is often not super accurate, so running sacct after the fact does not always reveal the cause of the job failure very clearly.
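
For reference, a minimal memory-eater along these lines is enough to reproduce it; this is only a sketch and not necessarily identical to the eat.c I attach later in the ticket:

/* Sketch of a memory eater (illustrative only, not necessarily the attached
 * eat.c): allocate and touch memory in 100 MB chunks until the allocator
 * refuses or the job is killed for exceeding its memory limit. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const size_t chunk = 100UL * 1024 * 1024;	/* 100 MB per step */
	size_t total_mb = 0;

	for (;;) {
		char *p = malloc(chunk);
		if (p == NULL)
			break;			/* malloc refused instead of the job being killed */
		memset(p, 1, chunk);		/* touch the pages so RSS really grows */
		total_mb += 100;
		printf("allocated %zu MB\n", total_mb);
	}
	return 0;
}

Submitted under sbatch with a small memory request (e.g. --mem=100), this gets killed once it crosses the limit, which is exactly when the slurmstepd message used to appear.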

Best,

Will
Comment 1 David Bigagli 2015-05-20 05:14:12 MDT
Hi,
   the messages were logged at the error level and some users complained so we moved them to the info level. Can you try to set your SlurmdDebug to info
and test if you can see the message?

David
Comment 2 Will French 2015-05-20 05:21:20 MDT
(In reply to David Bigagli from comment #1)
> Hi,
>    the messages were logged at the error level and some users complained so
> we moved them to the info level. Can you try to set your SlurmdDebug to info
> and test if you can see the message?
> 
> David

Looks like it's already set to info:

[frenchwr@vmps11 ~]$ scontrol show config | grep -i debug
DebugFlags              = (null)
SlurmctldDebug          = info
SlurmdDebug             = info
Comment 3 David Bigagli 2015-05-20 05:25:18 MDT
Could you attach your program and the slurm.conf so I can try to reproduce it?
Comment 4 Will French 2015-05-20 05:35:17 MDT
Created attachment 1890 [details]
slurm.conf
Comment 5 Will French 2015-05-20 05:36:01 MDT
Created attachment 1891 [details]
eat.c
Comment 6 Will French 2015-05-20 05:36:30 MDT
Created attachment 1892 [details]
Makefile
Comment 7 Will French 2015-05-20 05:37:03 MDT
Created attachment 1893 [details]
SLURM batch script
Comment 8 Will French 2015-05-20 05:38:01 MDT
This is actually an example that someone (maybe you?) at SchedMD shared with me when we initially set up SLURM and were testing cgroups.
Comment 9 David Bigagli 2015-05-20 05:42:48 MDT
I don't think it was me, but I will use it now :-). Thanks for all the data; I will get back to you later.

David
Comment 10 David Bigagli 2015-05-20 08:57:40 MDT
Hi,
   messages logged at the error level are propagated to the user output; messages logged at other levels are not, which is why you don't see the info messages. The info messages are written to the slurmd log file.

The reason we changed it is that the error message was written even if the process was not killed but had merely tried to exceed the RSS limit, as reported by the memory.failcnt file. That's why using info seemed more appropriate.

Is this a serious problem for you? If need be, you could keep a local patch.

David
Comment 11 Will French 2015-05-20 09:18:40 MDT
> 
> Is this a serious problem for you? If need be, you could keep a local patch.
> 
> David

I wouldn't call it a serious problem but it certainly makes troubleshooting for the user more difficult when he/she receives no information about why his/her job was killed. Like I mentioned before, looking at MaxRSS from sacct does not always help, so it's basically a guessing game for users (and administrators) where they resubmit with an increased memory request and hope that the job does not die this time around.

It seems like getting at least some information, assuming it's available, to a user about a job issue would be good practice and would promote user friendliness, even if it's just a warning message about the RSS limit being exceeded. I'm not sure I understand why you'd elect to withhold that information from the user if it's available.

Will
Comment 12 David Bigagli 2015-05-20 09:25:51 MDT
Because in some cases it is not an error but just information about the job's behavior, which users may not care about, and seeing the string "error" causes false alarms. It goes both ways, unfortunately.

David
Comment 13 Moe Jette 2015-05-20 09:35:42 MDT
In fact, the message was removed by user request because they found it misleading.
Comment 14 Will French 2015-05-20 09:48:25 MDT
(In reply to David Bigagli from comment #12)
> Because in some cases it is not an error but just information about the job's
> behavior, which users may not care about, and seeing the string "error"
> causes false alarms. It goes both ways, unfortunately.
> 
> David

Could it be made a warning message instead? Something like:

Warning: job step may have been partially swapped out to disk.

Or what about propagating the information back to the user as a specific exit code?

Why is it that in the past this message was only shown when MaxRSS was exceeded? Why can't SLURM output an error message at the moment it decides to kill a job for exceeding MaxRSS+Swap? It seems like this would avoid the false alarm.

Will
Comment 15 David Bigagli 2015-05-20 09:57:56 MDT
It could be a warning message or an info message, but the problem is that since this message is logged by slurmstepd, all the warning/info messages the daemon writes would then be propagated to the user's output.

An exit code may not work either, because the user program is killed by the OOM killer, which sends it the kill signal, so the exit code is undefined. It is not Slurm that is killing the job.

In your installation it appears in the MaxRSS+Swap context. If you disable ConstrainSwapSpace (set it to "no") then the OOM killer won't be invoked and the cgroup will constrain the application to the amount of memory requested; however, when the application exits the user will still see the message. That was the original complaint that prompted the change.
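
For illustration only, the relevant part of cgroup.conf would look roughly like this (a sketch; the values are examples, not a recommendation):

###
# cgroup.conf (sketch)
###
CgroupAutomount=yes
ConstrainRAMSpace=yes      # enforce the requested RAM via memory.limit_in_bytes
ConstrainSwapSpace=no      # don't cap RAM+swap, so the OOM killer stays out of it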

We could perhaps add a configurable flag to log this message, or create a new logging function that sends a specific warning to the user while the message appears as an informational one.
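
Purely as a sketch of that second idea (the names and plumbing here are hypothetical, this is not actual Slurm code): a helper that always forwards the text to the user's step output without an "error:" prefix, while recording it at the info level in the daemon log.

/* Hypothetical sketch only -- not actual Slurm code. One helper that always
 * forwards a notice to the user's step output (without an "error:" prefix)
 * and records the same text at info level in the daemon log. */
#include <stdarg.h>
#include <stdio.h>

static FILE *step_user_output;	/* stand-in for the step's stderr channel */
static FILE *slurmd_log;	/* stand-in for the slurmd/slurmstepd log  */

static void step_user_notice(const char *fmt, ...)
{
	va_list ap;
	char buf[1024];

	va_start(ap, fmt);
	vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	fprintf(step_user_output, "slurmstepd: %s\n", buf);	/* always reaches the user */
	fprintf(slurmd_log, "info: %s\n", buf);			/* info, not error, in the log */
}

int main(void)
{
	/* demo plumbing only */
	step_user_output = stderr;
	slurmd_log = stdout;
	step_user_notice("Exceeded step memory limit at some point.");
	return 0;
}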

David
Comment 16 Will French 2015-05-21 02:39:00 MDT
> In your installation it appears in the MaxRSS+Swap context. If you disable
> ConstrainSwapSpace (set it to "no") then the OOM killer won't be invoked and
> the cgroup will constrain the application to the amount of memory requested;
> however, when the application exits the user will still see the message.
> That was the original complaint that prompted the change.

So if we toggle ConstrainSwapSpace to "no" then a user would get an error from SLURM about exceeding memory? Am I understanding correctly?
Comment 17 David Bigagli 2015-05-21 04:35:31 MDT
With the original code, yes: it checks the content of memory.failcnt and, if the number in it is greater than zero (it counts the number of times memory usage has hit the limit set in memory.limit_in_bytes), it writes the message at the error level, so it goes to both the slurmd log file and the user output.

With the modified code it writes at the info level, so it only goes to the slurmd log file.
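
Schematically, that check is along these lines; this is a simplified sketch, not the actual slurmstepd source, and the cgroup path handling is illustrative:

/* Simplified sketch of the memory.failcnt check -- not the actual Slurm
 * source. failcnt counts how many times usage hit the limit set in
 * memory.limit_in_bytes, so a non-zero value means the limit was reached. */
#include <stdio.h>

static int step_hit_memory_limit(const char *cgroup_dir)
{
	char path[4096];
	unsigned long long failcnt = 0;
	FILE *fp;

	/* illustrative path; the real location depends on the cgroup hierarchy */
	snprintf(path, sizeof(path), "%s/memory.failcnt", cgroup_dir);
	fp = fopen(path, "r");
	if (fp == NULL)
		return 0;
	if (fscanf(fp, "%llu", &failcnt) != 1)
		failcnt = 0;
	fclose(fp);

	return failcnt > 0;
}

int main(int argc, char **argv)
{
	const char *dir = (argc > 1) ? argv[1] : ".";

	if (step_hit_memory_limit(dir))
		/* original code: error level, also sent to the user output;
		 * modified code: info level, slurmd log file only */
		fprintf(stderr, "Exceeded step memory limit at some point.\n");
	return 0;
}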

Would it be hard for you to maintain a local patch?

David
Comment 18 Will French 2015-05-21 07:19:50 MDT
 
> We could perhaps add a configurable flag to log this message, or create a new
> logging function that sends a specific warning to the user while the message
> appears as an informational one.

This would get my vote. We'd prefer not to maintain a local patch if we can help it. I also imagine that I am not alone in preferring the prior behavior to the current, so I suspect other sites would also find a configurable option to be useful.

FWIW, I understand why others would find this error message to be misleading and why you all made the change. However, rather than silencing the message completely, it seems like the better solution is to create an additional logging function since the root of the issue is that the message does not fit nicely into any of the current logging mechanisms in SLURM.
Comment 19 David Bigagli 2015-06-04 08:31:43 MDT
Well vox populi vox dei. We have reworked the fix to print a less scary message
and put it back at log error. Hopefully this is a good compromise.

$ srun --mem=1 ./eatmem 1024
Jun  4 13:30:53.111396 32372 0x7f012a97f700 slurmstepd-regor01: Exceeded step memory limit at some point.
srun: error: regor01: task 0: Killed
srun: Force Terminated job step 25748.0

Commit 707268a521ea9be.

David
Comment 20 Will French 2015-06-04 09:03:25 MDT
(In reply to David Bigagli from comment #19)
> Well vox populi vox dei. We have reworked the fix to print a less scary
> message
> and put it back at log error. Hopefully this is a good compromise.
> 
> $ srun --mem=1 ./eatmem 1024
> Jun  4 13:30:53.111396 32372 0x7f012a97f700 slurmstepd-regor01: Exceeded
> step memory limit at some point.
> srun: error: regor01: task 0: Killed
> srun: Force Terminated job step 25748.0
> 
> Commit 707268a521ea9be.
> 
> David

Thank you, David. I look forward to seeing this in 14.11.8.

Will
Comment 21 Christopher Coffey 2015-06-08 04:10:21 MDT
+1 from another site not happy with this change. It has caused numerous extra tickets for us, whereas in the past the user would have known why the job did not succeed.

Was this change announced? I don't believe I saw an announcement if so. Thanks for fixing this in 14.11.8.


Chris