Ticket 10255 - Job extern step always ends in OUT_OF_MEMORY in 20.02.6
Summary: Job extern step always ends in OUT_OF_MEMORY in 20.02.6
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.02.5
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Felip Moll
QA Contact: Alejandro Sanchez
URL:
Duplicates: 10122 10380 12122
Depends on:
Blocks:
 
Reported: 2020-11-19 11:29 MST by ARC Admins
Modified: 2021-10-29 07:34 MDT
CC List: 7 users

See Also:
Site: University of Michigan
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description ARC Admins 2020-11-19 11:29:47 MST
Hello,

We recently patched to 20.02.6 (which, by the way, isn't listed in the "Version" drop-down) and are seeing something peculiar with jobs. Every job that is submitted, started, and completed after the patch shows OUT_OF_MEMORY as the state of its extern step. Here's an example:


```
[drhey@glctld ~]$ cat test-for-dan.sbat
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --time=10:00:00
#SBATCH --mail-user=drhey@umich.edu
#SBATCH --mail-type=none
#SBATCH --account=support
#SBATCH --partition=standard
#SBATCH --mem=19g

sleep 60
echo "test"


[drhey@glctld ~]$ sbatch test-for-dan.sbat
Submitted batch job 15345859

[drhey@glctld ~]$ sq
             JOBID PARTITION     NAME     USER  ACCOUNT ST       TIME  NODES NODELIST(REASON)
          15345859  standard hello_wo    drhey  support  R       0:02      1 gl3096

[drhey@glctld ~]$ ssh gl3096
Last login: Thu Nov 19 13:22:41 2020 from 10.164.9.220

[drhey@gl3096 ~]$ cat /sys/fs/cgroup/memory/slurm/uid_228441/job_15345859/step_extern/memory.limit_in_bytes
20401094656
[drhey@gl3096 ~]$ logout
Connection to gl3096 closed.

[drhey@glctld ~]$ sacct -j 15345859 --format=User,JobName,JobID,Account,Partition,AllocTRES%40,Submit,Start,End,Elapsed,TimeLimit,ExitCode,State%25
     User    JobName        JobID    Account  Partition                                AllocTRES              Submit               Start                 End    Elapsed  Timelimit ExitCode                     State
--------- ---------- ------------ ---------- ---------- ---------------------------------------- ------------------- ------------------- ------------------- ---------- ---------- -------- -------------------------
    drhey hello_wor+ 15345859        support   standard         billing=116,cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00   10:00:00      0:0                 COMPLETED
               batch 15345859.ba+    support                                cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00                 0:0                 COMPLETED
              extern 15345859.ex+    support                    billing=116,cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00               0:125             OUT_OF_MEMORY
```
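
For reference, the cgroup limit shown above matches the requested --mem=19g exactly (19 × 1024³ = 20,401,094,656 bytes), so the limit itself is being set correctly even though the extern step is reported as OUT_OF_MEMORY.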


David
Comment 1 Kilian Cavalotti 2020-11-19 15:48:30 MST
I can confirm we're observing the same behavior, FWIW.

Cheers,
-- 
Kilian
Comment 2 Felip Moll 2020-11-20 05:05:39 MST
I am investigating the issue. bug 10122 (Kaust) is also affected.

Could you tell me which kernel version, OS, and systemd version you are running? Can you upload your latest slurm.conf?
Comment 3 Felip Moll 2020-11-20 06:56:42 MST
Ignore my last comment.

I reproduced this on my CentOS 7 system and am investigating the cause:

```
[slurm@moll0 inst]$ sbatch --wrap "sleep 10"
Submitted batch job 34
[slurm@moll0 inst]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                34     debug     wrap    slurm  R       0:01      1 moll1
[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34                 wrap      debug      slurm          1  COMPLETED      0:0
34.batch          batch                 slurm          1  COMPLETED      0:0
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125
[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[slurm@moll0 inst]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
```
Comment 4 Felip Moll 2020-11-20 06:58:36 MST
*** Ticket 10122 has been marked as a duplicate of this ticket. ***
Comment 9 Kilian Cavalotti 2020-11-24 01:44:18 MST
Hi,

I am currently out of office, returning on November 30. 

If you need to
reach Research Computing, please email srcc-support@stanford.edu

Cheers,
Comment 21 Felip Moll 2020-12-01 08:25:21 MST
Hi,

The specific case where the extern step always ended in OOM was happening because the extern step was registered as a listener for events on the memory cgroup in order to count OOMs, and on termination the cgroup directory was deleted before the counter was read.

According to the cgroup v1 API, removing a cgroup directory generates an event notification, so the rmdir was counted as an OOM.

This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

I am closing this bug.

Thanks!
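
For context, the notification mechanism Felip describes is the cgroup v1 eventfd interface: a process creates an eventfd, opens memory.oom_control, and registers the pair via cgroup.event_control, after which the kernel bumps the eventfd counter for each event in that cgroup. Below is a minimal sketch of that pattern, not Slurm's actual implementation; the cgroup path and job/step layout are made up, and error handling is omitted.

```
/*
 * Minimal sketch of the cgroup v1 OOM-notification pattern described in
 * comment 21 -- not Slurm's actual code.  The cgroup path and job/step
 * layout are hypothetical, and error handling is omitted.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical extern-step cgroup */
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_extern";
    char path[256], reg[64];
    uint64_t count;

    /* eventfd the kernel will signal for events in this cgroup */
    int efd = eventfd(0, 0);

    /* file we want to be notified about: memory.oom_control */
    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int ofd = open(path, O_RDONLY);

    /* register "<eventfd> <fd of memory.oom_control>" with cgroup.event_control */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);
    snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
    write(cfd, reg, strlen(reg));
    close(cfd);

    /*
     * read() returns once the event counter is non-zero and yields the
     * number of notifications since the last read.  The catch from
     * comment 21: removing the cgroup directory also generates a
     * notification on this eventfd, so treating every wakeup as an OOM
     * over-counts once the step's cgroup is cleaned up at job end.
     */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("%llu notification(s) -- may include the final rmdir\n",
               (unsigned long long)count);

    close(ofd);
    close(efd);
    return 0;
}
```

The practical takeaway is that raw eventfd wake-ups cannot be equated with OOM kills, because the removal of the step's own cgroup produces one as well, which is exactly what made every extern step look like it had hit its memory limit.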
Comment 22 Kilian Cavalotti 2020-12-01 09:00:36 MST
Hi Felip, 

(In reply to Felip Moll from comment #21)
> The specific case where the extern step always ended in OOM was happening
> because the extern step was registered as a listener for events on the memory
> cgroup in order to count OOMs, and on termination the cgroup directory was
> deleted before the counter was read.
> 
> According to the cgroup v1 API, removing a cgroup directory generates an event
> notification, so the rmdir was counted as an OOM.
> 
> This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

Thanks for the fix and the explanation!
Will the fix also be in 20.11.1?

Cheers,
-- 
Kilian
Comment 23 Felip Moll 2020-12-01 10:03:29 MST
(In reply to Kilian Cavalotti from comment #22)
> Thanks for the fix and the explanation!
> Will the fix also be in 20.11.1?

Yes indeed, this will be merged into every version >= 20.02.7.

Regards
Comment 24 Albert Gil 2020-12-09 07:45:07 MST
*** Ticket 10380 has been marked as a duplicate of this ticket. ***
Comment 25 Albert Gil 2021-10-29 07:34:43 MDT
*** Ticket 12122 has been marked as a duplicate of this ticket. ***