Ticket 8140 - OverMemoryKill does not work
Summary: OverMemoryKill does not work
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits (show other tickets)
Version: 19.05.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact: Douglas Wightman
URL:
Depends on:
Blocks:
 
Reported: 2019-11-25 03:36 MST by CSC sysadmins
Modified: 2020-01-09 14:03 MST

See Also:
Site: CSC - IT Center for Science
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02, 19.05-6
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm and cgroup.confs (8.34 KB, text/plain)
2019-11-25 03:36 MST, CSC sysadmins
slurmd log (38.01 KB, text/x-log)
2019-11-26 00:49 MST, CSC sysadmins
patch (800 bytes, patch)
2019-12-10 14:57 MST, Nate Rini
slurmd log with patch 12526 (67.13 KB, text/x-log)
2019-12-12 01:29 MST, CSC sysadmins

Description CSC sysadmins 2019-11-25 03:36:00 MST
Created attachment 12387 [details]
slurm and cgroup.confs

Hi,

I tried to test the old way of setting a memory limit instead of cgroup memory limits (which are a bit problematic for us), but in my test environment I could not get it working.

I've set the OverMemoryKill parameter and accounting, but a simple malloc test program can allocate more memory than the limit.

JobAcctGatherParams     = UsePss,OverMemoryKill
AccountingStorageEnforce = associations,limits,qos


Here is example run:

[ttervo@c1 ~]$ sacct -j 45 -omaxrss,reqmem,elapsed,exitcode
    MaxRSS     ReqMem    Elapsed ExitCode 
---------- ---------- ---------- -------- 
                  1Mc   00:04:25      0:0 
         0        1Mc   00:04:25      0:0 
  1407988K        1Mc   00:04:25      0:0 


There is also outdated information in the slurm.conf man page; it conflicts with the release notes:

man slurm.conf

       MemLimitEnforce
       If set to yes then Slurm will terminate the job if it exceeds the value  requested using  the --mem-per-cpu option of salloc/sbatch/srun.  This is useful in combination with JobAcctGatherParams=OverMemoryKill. 

RELEASE_NOTES:

NOTE: MemLimitEnforce parameter has been removed and the functionality that
      was provided with it has been merged into a JobAcctGatherParams.
Comment 1 Nate Rini 2019-11-25 11:54:14 MST
Tommi,

Can you please attach your slurmd log of the node where you tested this?

Thanks,
--Nate
Comment 2 CSC sysadmins 2019-11-26 00:49:05 MST
Created attachment 12395 [details]
slurmd log

Test which I ran:

[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest 
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
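
For reference, memtest is just a simple interactive allocator; the following is a minimal sketch reconstructed from the output above, not the actual source, so details may differ:

/* memtest.c - minimal sketch (assumed, not the actual memtest source):
 * allocate N ints, touch every element so the pages become resident,
 * sleep so the accounting plugin has time to sample RSS, then free. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long n = 0;

    printf("Enter number of int(4 byte) you want to allocate:");
    fflush(stdout);
    if (scanf("%ld", &n) != 1 || n <= 0)
        return 1;

    printf("Allocating %ld bytes......\n", n * (long) sizeof(int));
    int *buf = malloc(n * sizeof(int));
    if (!buf) {
        perror("malloc");
        return 1;
    }

    printf("Filling int into memory.....\n");
    for (long i = 0; i < n; i++)
        buf[i] = (int) i;       /* fault in every page */

    printf("Sleep 60 seconds......\n");
    sleep(60);                  /* window in which OverMemoryKill should act */

    printf("Free memory.\n");
    free(buf);
    return 0;
}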
Comment 3 Nate Rini 2019-11-26 09:16:55 MST
(In reply to Tommi Tervo from comment #0)
> JobAcctGatherParams     = UsePss,OverMemoryKill
> AccountingStorageEnforce = associations,limits,qos

Please also set "MemLimitEnforce=yes" in your slurm.conf.
Comment 4 CSC sysadmins 2019-11-26 14:03:37 MST
(In reply to Nate Rini from comment #3)
> Please also set "MemLimitEnforce=yes" in your slurm.conf.

Like I wrote, it is deprecated/removed in 19.05. It is in the config file but ignored.
Comment 5 CSC sysadmins 2019-11-28 01:32:32 MST
[root@slurmctl ~]# grep -i memlim /etc/slurm/slurm.conf
MemLimitEnforce=YES
[root@slurmctl ~]# systemctl restart slurmctld
[root@slurmctl ~]# scontrol show config |grep -i memlim
[root@slurmctl ~]# echo $?
1
Comment 7 Nate Rini 2019-12-02 14:40:03 MST
Is this a Cray Aries cluster?
Comment 8 CSC sysadmins 2019-12-02 22:46:50 MST
(In reply to Nate Rini from comment #7)
> Is this a Cray Aries cluster?

No, it's a CentOS 7 cluster.

-Tommi
Comment 9 Nate Rini 2019-12-03 15:02:12 MST
Tommi

I believe I have confirmed the bug. I will work on a patchset.

Thanks,
--Nate
Comment 12 Nate Rini 2019-12-03 16:40:42 MST
Tommi

Does this node have swap enabled?
> cat /proc/swaps 

Thanks,
--Nate
Comment 13 CSC sysadmins 2019-12-04 00:20:43 MST
> Does this node have swap enabled?
> > cat /proc/swaps 

Hi,

Yes it has swap but swappiness seems to be zero:

[ttervo@c1 ~]$ cat /proc/sys/vm/swappiness 
0

[ttervo@c1 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:        2046892      165196     1705348       23160      176348     1706428
Swap:       1048572        2560     1046012
[ttervo@c1 ~]$ cat /proc/swaps 
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       1048572 2560    -2
Comment 14 Nate Rini 2019-12-04 08:55:36 MST
(In reply to Tommi Tervo from comment #13)
> Yes it has swap but swappiness seems to be zero:
> [ttervo@c1 ~]$ cat /proc/swaps 
> Used=2560

Looks like it is still getting used. In my testing, I found that the rlimit was set on the process and all the memory above the requested allocation went to swap.  Since swapped out pages don't count against the memory RSS usage, Slurm was not killing the processes.
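
As a rough illustration only (a sketch, not Slurm's actual accounting code): VmRSS in /proc/<pid>/status excludes pages that have been swapped out; those are reported under VmSwap instead, so a sampler that looks only at RSS never sees the swapped portion.

/* rss_vs_swap.c - rough illustration, not Slurm code: print VmRSS and VmSwap
 * for a given pid (default: self) from /proc/<pid>/status. Pages that have
 * been swapped out appear only under VmSwap, not VmRSS. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *pid = (argc > 1) ? argv[1] : "self";
    char path[64], line[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    fp = fopen(path, "r");
    if (!fp) {
        perror(path);
        return 1;
    }

    while (fgets(line, sizeof(line), fp)) {
        if (!strncmp(line, "VmRSS:", 6) || !strncmp(line, "VmSwap:", 7))
            fputs(line, stdout);    /* values are reported in kB */
    }

    fclose(fp);
    return 0;
}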

Is it possible to call this and try again?
> swapoff /dev/dm-1

Thanks,
--Nate
Comment 15 CSC sysadmins 2019-12-05 00:10:01 MST
> Is it possible to call this and try again?
> > swapoff /dev/dm-1

Hi,

It did not have any effect:

[root@c1 ~]# swapoff -a
[root@c1 ~]# free
              total        used        free      shared  buff/cache   available
Mem:        2046892      173868     1681852       33400      191172     1685220
Swap:             0           0           0
[root@c1 ~]# logout
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL

[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
[ttervo@c1 ~]$ exit
exit
[ttervo@c1 ~]$ sacct -j 53 -oreqmem,maxrss
    ReqMem     MaxRSS 
---------- ---------- 
       1Mc            
       1Mc          0 
       1Mc   1407896K
Comment 16 Nate Rini 2019-12-10 10:44:45 MST
(In reply to Tommi Tervo from comment #15)
> > Is it possible to call this and try again?
> > > swapoff /dev/dm-1
> It did not have any effect:

Thanks for verifying. Working on a patch set now.
Comment 24 Nate Rini 2019-12-10 15:00:15 MST
Tommi,

A patch is undergoing review; please tell me if you need it sooner than the normal review process allows.

Thanks,
--Nate
Comment 25 CSC sysadmins 2019-12-11 00:41:32 MST
(In reply to Nate Rini from comment #24)
> Tommi,
> 
> A patch is undergoing review; please tell me if you need it sooner than the
> normal review process allows.


Hi,

I could apply it to my test environment for additional testing.

Thanks,
Tommi
Comment 26 Nate Rini 2019-12-11 08:37:04 MST
(In reply to Nate Rini from comment #22)
> Created attachment 12526 [details]
> patch

Please give it a try on your test system.
Comment 27 CSC sysadmins 2019-12-12 01:29:40 MST
Created attachment 12544 [details]
slurmd log with patch 12526

Hi,

I have bad news: the patch did not help. I verified that the build is using the patched source code:

[root@slurmctl slurm-19.05.4]# grep -A1 'clone slurmctld config' /root/rpmbuild/BUILD/slurm-19.05.4/src/slurmd/slurmd/slurmd.c
        /* clone slurmctld config into slurmd config */
        conf->job_acct_oom_kill = slurmctld_conf.job_acct_oom_kill;

[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest 
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
[ttervo@c1 ~]$ exit
[ttervo@c1 ~]$ sacct -j 56 -o maxrss,reqmem
    MaxRSS     ReqMem 
---------- ---------- 
                  1Mc 
      394K        1Mc 
  1408230K        1Mc
Comment 28 Nate Rini 2019-12-12 09:19:56 MST
Tommi,

Can you please verify that the slurmd daemon was fully restarted on the test node?
> [2019-12-12T10:15:24.354] error: Error binding slurm stream socket: Address already in use
> [2019-12-12T10:15:24.354] error: Unable to bind listen port (*:6818): Address already in use

Can you please verify that the cgroups are mounted on the node in the expected location?
> [2019-12-12T10:15:47.202] [55.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
> [2019-12-12T10:15:47.202] [55.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory

While the job is sleeping for 60 seconds, can you please call gcore against the attached slurmstepd and provide 't a a bt full' output from gdb?

Thanks,
--Nate
Comment 29 CSC sysadmins 2019-12-13 04:57:15 MST
(In reply to Nate Rini from comment #28)
> Tommi,
> 
> Can you please verify that the slurmd daemon was fully restarted on the test
> node?
> > [2019-12-12T10:15:24.354] error: Error binding slurm stream socket: Address already in use
> > [2019-12-12T10:15:24.354] error: Unable to bind listen port (*:6818): Address already in use


Doh, it seems systemctl could not stop the old slurmd and I did not catch that in the verbose log; I only looked for the updated slurmd version string. After kill -9 `pidof slurmd` and systemctl start slurmd, OverMemoryKill works fine on my test system. Thanks for the fix.

Best Regards,
Tommi Tervo
CSC
Comment 34 Nate Rini 2020-01-09 14:03:13 MST
Tommi,

This is now fixed upstream by commit 4edf4a5898a2944. Please reply if you have any questions or issues.

Thanks,
--Nate