Ticket 1963 - OOM logging in sacct
Summary: OOM logging in sacct
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 14.11.8
Hardware: Linux
Importance: --- 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-09-21 06:02 MDT by Ryan Cox
Modified: 2017-08-04 08:52 MDT
CC List: 1 user

See Also:
Site: BYU - Brigham Young University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.0-pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Ryan Cox 2015-09-21 06:02:45 MDT
When a job dies due to hitting the oom-killer, there is nothing in sacct that informs the user about the out of memory condition.  As I mentioned at SLUG, it would be nice to set the relevant RSS fields to be equal to what the user requested.

Since the accounting information is polling-based, it usually doesn't poll at the correct time to catch a spike in memory usage.  Users then see that they used, say, 5 MB instead of 128 GB.  This happens very frequently to our users who run out of memory.  If they requested 128 GB and hit the oom-killer, we can assume that they did indeed use all 128 GB and account for it accordingly, even if the polling didn't catch them using it.

Furthermore, the exit code that is shown in sacct can also mislead users.  For example:
./trigger-the-oom-killer-because-i-do-not-understand-my-own-program > /tmp/results
cp /tmp/results /somewhere/else/

The oom-killer will kill the first process, but the cp will still succeed since the output file was created, and the job's exit code reflects only that last command. The user therefore sees a success instead of a failure. It would be really nice to have a separate state to represent the OOM condition, maybe MEM_EXCEEDED.

So basically:
1) MaxRSS gets set to the amount of memory they requested (not sure about the VM fields).
2) A new state in sacct that reflects that the oom-killer ran.


Side note: if you ever want to switch to an event-based notification of the oom-killer rather than polling at the end, you can take a look at what we did on our login nodes (https://github.com/BYUHPC/uft/tree/master/oom_notifierd) or see the cgroups/memory.txt file in the kernel documentation.
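
For reference, the event-based mechanism described in cgroups/memory.txt looks roughly like the minimal C sketch below. It assumes a cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory and a hypothetical group name of "mygroup"; it is only an illustration of the kernel interface, not the oom_notifierd code itself.

/* Register an eventfd against a memory cgroup's memory.oom_control via
 * cgroup.event_control, then block until the kernel signals an OOM event
 * in that cgroup. Cgroup path is an assumption; adjust as needed. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/eventfd.h>

int main(int argc, char **argv)
{
    const char *cg = argc > 1 ? argv[1] : "/sys/fs/cgroup/memory/mygroup";
    char path[4096], buf[64];

    /* 1. Create the eventfd the kernel will signal. */
    int efd = eventfd(0, 0);
    if (efd < 0) { perror("eventfd"); return 1; }

    /* 2. Open the cgroup's memory.oom_control file. */
    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int ofd = open(path, O_RDONLY);
    if (ofd < 0) { perror("open memory.oom_control"); return 1; }

    /* 3. Register "<eventfd> <oom_control fd>" with cgroup.event_control. */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);
    if (cfd < 0) { perror("open cgroup.event_control"); return 1; }
    snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
    if (write(cfd, buf, strlen(buf)) < 0) { perror("write event_control"); return 1; }

    /* 4. read() blocks until the kernel records an OOM event in this
     *    cgroup; no polling required. */
    uint64_t count;
    if (read(efd, &count, sizeof(count)) != sizeof(count)) {
        perror("read eventfd");
        return 1;
    }
    printf("OOM event(s) in %s: %llu\n", cg, (unsigned long long)count);
    return 0;
}

Because the read() only returns when the kernel delivers the notification, nothing has to poll for the memory spike.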
Comment 1 florian.pommerening 2017-08-04 03:38:59 MDT
We also face this problem. It would be nice if the job accounting could be triggered one more time just before the process is done. I guess this is not possible if the process finishes normally, but in cases where Slurm kills the job with a signal, it should be possible.

The script looks interesting, but I'm not quite sure how I could use it for array tasks. To be able to distinguish which task ran out of memory, I would have to run the script inside the context of the task, but if that ran out of memory, then the whole task would be killed and the script would have no chance of documenting this, right?
Comment 2 Moe Jette 2017-08-04 08:39:05 MDT
Slurm version 17.11 will have a new job state "OutOfMemory" (which appears in accounting, squeue, etc.). This will be set on job termination if the task/cgroup plugin is configured and the memory cgroup records an OOM condition on any node.

For reference, here is the commit:
https://github.com/SchedMD/slurm/commit/818a09e802587e68abc6a5c06f0be2d4ecfe97e3
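
To illustrate where that information can come from, here is a rough C sketch that inspects a memory cgroup's memory.oom_control file after the fact. This is only an illustration of the cgroup-v1 interface, not the actual task/cgroup plugin code; the cgroup path is an assumption, and the oom_kill counter is only present on newer kernels.

/* Inspect a memory cgroup's memory.oom_control to see whether the kernel
 * recorded an OOM condition there. Illustration only; not Slurm code. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *cg = argc > 1 ? argv[1] : "/sys/fs/cgroup/memory/mygroup";
    char path[4096], key[64];
    unsigned long long val;

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    FILE *fp = fopen(path, "r");
    if (!fp) { perror("fopen memory.oom_control"); return 1; }

    /* memory.oom_control contains lines such as:
     *   oom_kill_disable 0
     *   under_oom 0
     *   oom_kill 3        (newer kernels only)
     */
    while (fscanf(fp, "%63s %llu", key, &val) == 2) {
        if (strcmp(key, "oom_kill") == 0 && val > 0)
            printf("OOM kills recorded in this cgroup: %llu\n", val);
        else if (strcmp(key, "under_oom") == 0 && val > 0)
            printf("cgroup is currently under OOM\n");
    }
    fclose(fp);
    return 0;
}

Once on 17.11, the new state should then show up in the State column of, e.g., sacct -j <jobid> -o JobID,State,ExitCode.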
Comment 3 florian.pommerening 2017-08-04 08:51:12 MDT
That sounds great, thanks for the info.

Just to be sure: will this also work for array tasks? The way you phrased it, it sounded like the whole job would get an OOM status if one of the array tasks runs out of memory. Will it be possible to distinguish which tasks in an array ran out of memory and which of them did not?
Comment 4 Moe Jette 2017-08-04 08:52:14 MDT
Each task of a job array has a separate accounting record (and job record in slurmctld).