Ticket 4578 - Improve feedback on Out Of Memory conditions
Summary: Improve feedback on Out Of Memory conditions
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 18.08.x
Hardware: Linux
Importance: --- 5 - Enhancement
Assignee: Alejandro Sanchez
Duplicates: 6765
 
Reported: 2018-01-04 04:45 MST by Alejandro Sanchez
Modified: 2020-11-13 03:27 MST

Site: SchedMD
DevPrio: 3 - High


Description Alejandro Sanchez 2018-01-04 04:45:54 MST
Currently Slurm interprets memory[+swap].failcnt > 0 as an Out of Memory condition. When this happens, the kernel may still be able to reclaim unused pages, and the application may or may not fail, so it does not necessarily mean that the OOM-killer actually killed any process.

It would be nice, though, to provide feedback to users/admins when the oom-killer kills a process. There is a mechanism to register a notifier through the cgroup.event_control file, so that an application can be notified through an eventfd when the OOM-killer actually kills a process. If we manage to catch that event, we could log it to slurmd.log and to the user's stdout/stderr, and potentially add a new job/step state such as OOM-Killed, separate from the current OutOfMemory, which only means that a spike of memory usage hit the limit.
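
For reference, here is a minimal sketch of that cgroup v1 eventfd registration, in the spirit of the oom_notifierd example in reference 2 below. Error checking is omitted and the cgroup path is hypothetical; a real path would be built from the uid/job/step ids:

#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical step cgroup path; the real one comes from uid/job/step. */
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0";
    char path[PATH_MAX], line[64];
    uint64_t events;
    int efd, ofd, cfd;

    efd = eventfd(0, 0);            /* fd the kernel will signal on OOM */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    ofd = open(path, O_RDONLY);

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    cfd = open(path, O_WRONLY);

    /* Writing "<eventfd> <memory.oom_control fd>" asks the kernel to
     * signal efd whenever this cgroup goes under OOM. */
    snprintf(line, sizeof(line), "%d %d", efd, ofd);
    write(cfd, line, strlen(line));

    /* Blocks until an OOM event occurs in the cgroup. */
    read(efd, &events, sizeof(events));
    fprintf(stderr, "oom event(s) in %s: %llu\n", cg,
            (unsigned long long)events);

    close(cfd);
    close(ofd);
    close(efd);
    return 0;
}

Reading the eventfd blocks until the kernel signals an event, so a long-lived watcher thread could loop on the read and log each event back to slurmd as it arrives.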

There are two more things that could improve the feedback. The first is to somehow capture the syslog messages logged by the kernel when the oom-killer kills a process and report that information back. That was requested in bug 4520 for Cray systems, but it could potentially be considered for vanilla Linux systems too.

The second is to not only report that a job/step hit a memory[+swap] limit at some point, but also to capture the information in memory.stat when that happens and report it back. I think that would let users better analyze the different factors that contributed to the memory footprint (page cache, rss, swap, etc.) under OOM conditions.
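
As a rough sketch of what capturing that could look like (the helper name is made up and the field list is only an example of what might be worth reporting):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: dump the memory.stat counters most relevant to an
 * OOM post-mortem.  Error handling and path handling are simplified. */
static void log_memory_stat(const char *cgroup_path)
{
    char path[4096], line[256];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/memory.stat", cgroup_path);
    fp = fopen(path, "r");
    if (!fp)
        return;

    while (fgets(line, sizeof(line), fp)) {
        if (!strncmp(line, "cache ", 6) ||
            !strncmp(line, "rss ", 4) ||
            !strncmp(line, "mapped_file ", 12) ||
            !strncmp(line, "swap ", 5))
            fprintf(stderr, "memory.stat: %s", line);
    }
    fclose(fp);
}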

Useful references:
1. https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt 
2. https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oom_notifierd.c
3. https://groups.google.com/forum/#!msg/slurm-users/f359KT-3lsU/yRm8vkxeBwAJ

Related bugs: bug 3820, bug 4520.
Comment 1 Alejandro Sanchez 2018-01-23 07:11:43 MST
Part of the enhancement has been solved here:

https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb

Perhaps we should modify the API so that we get rid of SIG_OOM and instead add a new member (or members) to reflect the oom-kill event and/or memory hitting the limit, perhaps displaying the latter as the SystemComment.
Comment 2 Alejandro Sanchez 2018-02-21 09:34:58 MST
Try to detect kernels that expose the different oom counts in the event file:

https://patchwork.kernel.org/patch/9737381/

and use this instead of the manual eventfd() monitoring.
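
A rough sketch of that alternative, assuming the patch above is the one that exposes an "oom_kill" counter in memory.oom_control (the helper name and path handling are made up):

#include <stdio.h>

/* Hypothetical helper: read the oom_kill count for a cgroup, if the kernel
 * provides it.  Returns -1 when the field is absent (older kernels) or the
 * file cannot be read. */
static long read_oom_kill_count(const char *cgroup_path)
{
    char path[4096], line[128];
    long count = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_path);
    fp = fopen(path, "r");
    if (!fp)
        return -1;

    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "oom_kill %ld", &count) == 1)
            break;
    }
    fclose(fp);
    return count;
}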
Comment 3 Felip Moll 2020-03-31 06:08:19 MDT
*** Ticket 6765 has been marked as a duplicate of this ticket. ***