Ticket 4578

Summary: Improve feedback on Out Of Memory conditions
Product: Slurm    Reporter: Alejandro Sanchez <alex>
Component: slurmd    Assignee: Alejandro Sanchez <alex>
Status: OPEN    QA Contact: ---
Severity: 5 - Enhancement    
Priority: --- CC: felip.moll, kaizaad
Version: 18.08.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6765
https://bugs.schedmd.com/show_bug.cgi?id=9737
https://bugs.schedmd.com/show_bug.cgi?id=10122
Site: SchedMD
DevPrio: 3 - High

Description Alejandro Sanchez 2018-01-04 04:45:54 MST
Currently Slurm interprets memory[+swap].failcnt > 0 as an Out Of Memory condition. When this happens the kernel might still be able to reclaim unused pages, and the application may or may not fail; it does not necessarily mean that the OOM killer actually killed any process.
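
For illustration, a minimal sketch of the current detection idea, assuming the cgroup-v1 memory controller (the cgroup path below is a made-up example; a real implementation would use the step's own memory cgroup):

/* Sketch: a non-zero memory[.memsw].failcnt in the step's memory cgroup
 * is treated as an OOM condition.  Path is a hypothetical example. */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0/memory.failcnt";
	unsigned long failcnt = 0;
	FILE *fp = fopen(path, "r");

	if (fp && fscanf(fp, "%lu", &failcnt) == 1 && failcnt > 0)
		printf("memory limit was hit %lu time(s)\n", failcnt);
	if (fp)
		fclose(fp);
	return 0;
}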

It would be nice, though, to give users/admins feedback when the oom-killer actually kills a process. There are mechanisms to register a notifier through the cgroup.event_control file, so that an application can be notified via eventfd when the OOM killer acts on the cgroup. If we manage to catch that event, we could log it to slurmd.log and to the user's stdout/stderr, and potentially add a new Job/Step state such as OOM-Killed, aside from the current OutOfMemory, which only means that a spike in memory usage hit the limit.
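
A minimal sketch of that eventfd registration, assuming cgroup v1 (the cgroup path is a made-up example; slurmstepd would use the step's own memory cgroup, and error handling/cleanup is abbreviated):

/* Register for OOM notifications via cgroup.event_control + eventfd. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0";
	char path[4096], line[64];
	int efd, ofd, cfd;

	/* 1. eventfd the kernel will signal on OOM events in the cgroup. */
	efd = eventfd(0, 0);
	if (efd < 0) { perror("eventfd"); return 1; }

	/* 2. Open memory.oom_control; its fd identifies the event source. */
	snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
	ofd = open(path, O_RDONLY);
	if (ofd < 0) { perror("open oom_control"); return 1; }

	/* 3. Write "<eventfd> <oom_control fd>" into cgroup.event_control. */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
	cfd = open(path, O_WRONLY);
	if (cfd < 0) { perror("open event_control"); return 1; }
	snprintf(line, sizeof(line), "%d %d", efd, ofd);
	if (write(cfd, line, strlen(line)) < 0) { perror("register"); return 1; }

	/* 4. Block until the kernel reports an OOM event, then log it. */
	uint64_t count;
	if (read(efd, &count, sizeof(count)) == sizeof(count))
		fprintf(stderr, "oom event(s) in %s: %llu\n", cg,
			(unsigned long long)count);
	return 0;
}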

There are two more things that could improve the feedback. The first is to somehow capture the syslog messages logged by the kernel when the oom-killer kills a process and report that information back. That was requested in bug 4520 for Cray systems, but could potentially be considered for vanilla Linux systems too.

The other is to not only report that a job/step hit a memory[+swap] limit at some point, but also capture the memory.stat information when this happens and report it back. I think that would let users better analyze the different factors that contributed to the memory footprint (page cache, rss, swap, etc.) under OOM conditions.
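
A sketch of that idea, again assuming cgroup v1 and a hypothetical cgroup path; a real implementation would log the snapshot through slurmd rather than stderr:

/* Dump the memory.stat counters most relevant to an OOM event
 * (cache, rss, swap) so they could be reported alongside it. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path =
		"/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0/memory.stat";
	char line[256];
	FILE *fp = fopen(path, "r");

	if (!fp) { perror("memory.stat"); return 1; }
	while (fgets(line, sizeof(line), fp)) {
		/* Keep only the counters we would report back to the user. */
		if (!strncmp(line, "cache ", 6) ||
		    !strncmp(line, "rss ", 4) ||
		    !strncmp(line, "swap ", 5))
			fprintf(stderr, "OOM snapshot: %s", line);
	}
	fclose(fp);
	return 0;
}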

Useful references:
1. https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt 
2. https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oom_notifierd.c
3. https://groups.google.com/forum/#!msg/slurm-users/f359KT-3lsU/yRm8vkxeBwAJ

Related bugs: bug 3820, bug 4520.
Comment 1 Alejandro Sanchez 2018-01-23 07:11:43 MST
Part of the enhancement has been solved here:

https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb

Perhaps we should modify the API so that we get rid of SIG_OOM and instead add a new member (or members) reflecting the oom-kill event and/or memory hitting the limit, perhaps displaying the latter as the SystemComment.
Comment 2 Alejandro Sanchez 2018-02-21 09:34:58 MST
Try to detect kernels that expose the oom kill count in the oom control file:

https://patchwork.kernel.org/patch/9737381/

and use this instead of the manual eventfd() monitoring.
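
A sketch of that detection, assuming the kernel appends an "oom_kill" counter to memory.oom_control (the path is a made-up example); if the field is absent we would fall back to the eventfd monitoring:

#include <stdio.h>
#include <string.h>

/* Returns the oom_kill count, or -1 if the kernel does not expose it. */
static long read_oom_kill_count(const char *oom_control_path)
{
	char line[256];
	long count = -1;
	FILE *fp = fopen(oom_control_path, "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "oom_kill %ld", &count) == 1)
			break;
	}
	fclose(fp);
	return count;
}

int main(void)
{
	long kills = read_oom_kill_count(
		"/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0/memory.oom_control");

	if (kills >= 0)
		printf("kernel reports %ld oom-killed task(s)\n", kills);
	else
		printf("oom_kill field not available; fall back to eventfd\n");
	return 0;
}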
Comment 3 Felip Moll 2020-03-31 06:08:19 MDT
*** Ticket 6765 has been marked as a duplicate of this ticket. ***