Summary: | OOM logging in sacct | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
Component: | Accounting | Assignee: | Unassigned Developer <dev-unassigned> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | florian.pommerening |
Version: | 14.11.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | BYU - Brigham Young University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 17.11.0-pre1 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Ryan Cox
2015-09-21 06:02:45 MDT
We also face this problem. It would be nice if job accounting could be triggered once just before the process is done. I guess this is not possible if the process finishes normally, but in cases where Slurm kills the job with a signal, it should be possible. The script looks interesting, but I'm not quite sure how I could use it for array tasks. To distinguish which task ran out of memory, I would have to run the script inside the context of the task, but if that ran out of memory then the whole task would be killed and the script would have no chance of documenting this, right?

Slurm version 17.11 will have a new job state "OutOfMemory" (which appears in accounting, squeue, etc.). This will be set on job termination if the task/cgroup plugin is configured and the memory cgroup records an OOM condition on any node. For reference, here is the commit: https://github.com/SchedMD/slurm/commit/818a09e802587e68abc6a5c06f0be2d4ecfe97e3

That sounds great, thanks for the info. Just to be sure: will this also work for array tasks? The way you phrased it, it sounded like the whole job would get an OOM status if one of the array tasks runs out of memory. Will it be possible to distinguish which tasks in an array ran out of memory and which did not?

Each task of a job array has a separate accounting record (and a job record in slurmctld).
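Since each array task has its own accounting record, the tasks that hit the OOM condition can in principle be picked out of sacct output. The following is a minimal sketch, not an official tool. It assumes Slurm 17.11 or later, that sacct prints the new state as `OUT_OF_MEMORY`, and that array tasks appear with IDs of the form `<jobid>_<taskid>`. The helper names `oom_tasks` and `query_sacct` and the sample job IDs are made up for illustration.

```python
import subprocess


def oom_tasks(sacct_output):
    """Return the job IDs whose State column is OUT_OF_MEMORY.

    Expects text in the form "JobID|State", one record per line, as
    produced by `sacct --parsable2 --noheader --format=JobID,State`.
    The state spelling OUT_OF_MEMORY is an assumption about sacct output.
    """
    hits = []
    for line in sacct_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        job_id, _, state = line.partition("|")
        # Compare only the first word, in case the state carries extra detail.
        if state.split()[:1] == ["OUT_OF_MEMORY"]:
            hits.append(job_id)
    return hits


def query_sacct(job_id):
    """Run sacct for one job (or job array) and return its raw output.

    -X restricts output to allocations, so job steps such as
    "<jobid>_<taskid>.batch" are not listed separately.
    """
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "--noheader",
         "--parsable2", "--format=JobID,State"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Made-up sample standing in for real sacct output of a three-task array.
sample = "1234_0|COMPLETED\n1234_1|OUT_OF_MEMORY\n1234_2|COMPLETED\n"
print(oom_tasks(sample))
```

On a cluster, `oom_tasks(query_sacct("1234"))` would apply the same filter to live accounting data, with `1234` replaced by a real array job ID.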