Hello,

We recently patched to 20.02.6 (which, by the way, isn't listed in the "Version" drop-down) and are seeing something peculiar with jobs: every job that is submitted, started, and completed after the patch shows OUT_OF_MEMORY as its extern step's state. Here's an example:

```
[drhey@glctld ~]$ cat test-for-dan.sbat
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --time=10:00:00
#SBATCH --mail-user=drhey@umich.edu
#SBATCH --mail-type=none
#SBATCH --account=support
#SBATCH --partition=standard
#SBATCH --mem=19g

sleep 60
echo "test"

[drhey@glctld ~]$ sbatch test-for-dan.sbat
Submitted batch job 15345859

[drhey@glctld ~]$ sq
   JOBID PARTITION     NAME  USER ACCOUNT ST  TIME NODES NODELIST(REASON)
15345859  standard hello_wo drhey support  R  0:02     1 gl3096

[drhey@glctld ~]$ ssh gl3096
Last login: Thu Nov 19 13:22:41 2020 from 10.164.9.220
[drhey@gl3096 ~]$ cat /sys/fs/cgroup/memory/slurm/uid_228441/job_15345859/step_extern/memory.limit_in_bytes
20401094656
[drhey@gl3096 ~]$ logout
Connection to gl3096 closed.

[drhey@glctld ~]$ sacct -j 15345859 --format=User,JobName,JobID,Account,Partition,AllocTRES%40,Submit,Start,End,Elapsed,TimeLimit,ExitCode,State%25
     User    JobName        JobID    Account  Partition                                AllocTRES              Submit               Start                 End    Elapsed  Timelimit ExitCode                     State
--------- ---------- ------------ ---------- ---------- ---------------------------------------- ------------------- ------------------- ------------------- ---------- ---------- -------- -------------------------
    drhey hello_wor+ 15345859        support   standard         billing=116,cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00   10:00:00      0:0                 COMPLETED
          batch      15345859.ba+    support                                cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00                 0:0                 COMPLETED
          extern     15345859.ex+    support                    billing=116,cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00               0:125             OUT_OF_MEMORY
```

David
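One more data point, for what it's worth: the limit itself checks out. Assuming --mem=19g means 19 GiB, the expected value is 19 × 1024³ = 20401094656 bytes, which is exactly what memory.limit_in_bytes shows above. So the cgroup limit is set correctly, and a 60-second sleep can't plausibly have exceeded it; only the extern step's reported state looks wrong.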
I can confirm we're observing the same behavior, FWIW.

Cheers,
--
Kilian
I am investigating the issue. Bug 10122 (Kaust) is also affected. Could you tell me which kernel version, OS, and systemd version you are running? Could you also upload your latest slurm.conf?
Ignore my last comments. I have reproduced this on my CentOS 7 machine and am investigating the cause:

```
[slurm@moll0 inst]$ sbatch --wrap "sleep 10"
Submitted batch job 34
[slurm@moll0 inst]$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
     34     debug  wrap slurm  R  0:01     1 moll1
[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34                 wrap      debug      slurm          1  COMPLETED      0:0
34.batch          batch                 slurm          1  COMPLETED      0:0
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125
[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[slurm@moll0 inst]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
```
*** Ticket 10122 has been marked as a duplicate of this ticket. ***
Hi,

The specific case where the extern step always ended in OUT_OF_MEMORY was happening because the extern step registers as a listener for events on its memory cgroup in order to count OOMs, and on termination the cgroup directory was deleted before the counter was read.

Under the cgroup v1 API, removing a cgroup generates an event notification to its registered listeners, so the rmdir itself was counted as an OOM.

This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

I am closing this bug. Thanks!
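For those curious about the mechanism, below is a minimal standalone sketch of how a cgroup v1 OOM-event listener is wired up. To be clear, this is illustrative and not the actual slurmstepd code; the cgroup path is made up, and the existence-check suggestion in the comments is one possible approach, not necessarily what the fix does.

```c
/*
 * Minimal sketch of a cgroup v1 OOM-event listener (illustrative only).
 *
 * Registration, per the kernel's cgroup v1 memory documentation:
 *   1. create an eventfd,
 *   2. open the cgroup's memory.oom_control,
 *   3. write "<eventfd fd> <oom_control fd>" to cgroup.event_control.
 * The kernel then bumps the eventfd on every OOM event in the cgroup --
 * and also when the cgroup is removed, which is the notification this
 * bug was miscounting as an OOM.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative step cgroup path (assumption, not a real job). */
    const char *cg =
        "/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_extern";
    char path[256], reg[64];

    int efd = eventfd(0, 0);            /* event counter fd */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int ofd = open(path, O_RDONLY);     /* file to watch */

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);

    if (efd < 0 || ofd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    /* Register the pair: "<eventfd fd> <oom_control fd>" */
    snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
    if (write(cfd, reg, strlen(reg)) < 0) {
        perror("register");
        return 1;
    }

    /*
     * Each read() returns how many events fired since the last read.
     * A naive reader counts every wakeup as an OOM kill; but the final
     * notification generated by rmdir of the cgroup is not one, so a
     * correct reader must distinguish it, e.g. by checking whether the
     * cgroup directory still exists before counting the event.
     */
    uint64_t n;
    while (read(efd, &n, sizeof(n)) == sizeof(n))
        printf("%llu event(s) on memory cgroup\n", (unsigned long long)n);

    return 0;
}
```

In short, the read that happens during step teardown picks up the rmdir notification, which is why every extern step appeared to have OOMed.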
Hi Felip,

(In reply to Felip Moll from comment #21)
> The specific case where the extern step always ended in OUT_OF_MEMORY was
> happening because the extern step registers as a listener for events on its
> memory cgroup in order to count OOMs, and on termination the cgroup
> directory was deleted before the counter was read.
>
> Under the cgroup v1 API, removing a cgroup generates an event notification
> to its registered listeners, so the rmdir itself was counted as an OOM.
>
> This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

Thanks for the fix and the explanation!
Will the fix also be in 20.11.1?

Cheers,
--
Kilian
(In reply to Kilian Cavalotti from comment #22)
> Thanks for the fix and the explanation!
> Will the fix also be in 20.11.1?

Yes indeed, this will be merged into every version >= 20.02.7.

Regards
*** Ticket 10380 has been marked as a duplicate of this ticket. ***
*** Ticket 12122 has been marked as a duplicate of this ticket. ***