Ticket 7839 - Slurmd defunct processes and many slurmstepd "sleeping"
Summary: Slurmd defunct processes and many slurmstepd "sleeping"
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 19.05.1
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-09-30 15:01 MDT by Anthony DelSorbo
Modified: 2020-04-17 04:54 MDT

See Also:
Site: NOAA
NOAA Site: NESCC


Attachments
slurmd messages (996.27 KB, text/plain)
2019-10-01 06:31 MDT, Anthony DelSorbo
Details
Slurm configuration (9.18 KB, text/plain)
2019-10-01 06:31 MDT, Anthony DelSorbo
Details
mount command output (6.13 KB, text/plain)
2019-10-01 06:41 MDT, Anthony DelSorbo
Details
mem_cg grep of system.map... (5.13 KB, text/plain)
2019-10-01 06:42 MDT, Anthony DelSorbo
Details
dmesg file from h2c07 (530.70 KB, text/x-log)
2019-10-02 05:50 MDT, Anthony DelSorbo
Details
New process table listing for h2c07 (107.67 KB, text/x-log)
2019-10-02 05:53 MDT, Anthony DelSorbo
Details
sysctl - a output on node (192.12 KB, text/plain)
2019-10-03 10:56 MDT, Anthony DelSorbo
Details

Description Anthony DelSorbo 2019-09-30 15:01:01 MDT
We have many nodes winding up in a down or not-responding state, leaving the impression that Slurm has been quite unstable as of late.  Over the past few days, we've needed to reboot hundreds of nodes for this reason.

A very common thread among these nodes is stuck slurmstepd "sleep" processes, defunct processes, stuck user processes, or multiple slurmd processes.  Below is a sample of the process table on one of these nodes.  I've also included a sample lsof output of one specific process (88172).

This issue seems to exhibit many of the symptoms of ticket 4718, which has already been fixed.  Is this perhaps a regression?

Please let me know what additional information I can provide you.

Thanks,

Tony.

==================

[root@h25c19 ~]#  ps --forest -lfe | egrep '[s]leep|[s]lurm'
F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
1 D root     125742      1  0  80   0 - 1020759 mem_cg Sep13 ?      00:00:18 /apps/slurm/d/sbin/slurmd
1 Z root      87619 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      87631 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      87666 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      87677 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      87689 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      87814 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      87954 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      88031 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 Z root      88048 125742  0  80   0 -     0 do_exi Sep27 ?        00:00:00  \_ [slurmd] <defunct>
1 D root      88290 125742  0  80   0 - 1020759 mem_cg Sep27 ?      00:00:00  \_ /apps/slurm/d/sbin/slurmd
4 S root      87605      1  0  80   0 - 84242 do_wai Sep27 ?        00:00:15 slurmstepd: [120564.extern]
0 S root      87977  87605  0  80   0 -  1090 hrtime Sep27 ?        00:00:00  \_ sleep 100000000
4 S root      87608      1  0  80   0 - 84242 do_wai Sep27 ?        00:00:14 slurmstepd: [120570.extern]
...
4 S root      88139      1  0  80   0 - 84320 flock_ Sep27 ?        00:00:00 slurmstepd: [120579.batch]
5 S rtruc     88236  88139  0  80   0 - 84320 pipe_w Sep27 ?        00:00:00  \_ slurmstepd: [120579.batch]
4 D root      88150      1  0  80   0 - 84320 mem_cg Sep27 ?        00:00:00 slurmstepd: [120592.batch]
1 D root      92076  88150  0  80   0 - 84320 mem_cg Sep27 ?        00:00:00  \_ slurmstepd: [120592.batch]
4 S root      88163      1  0  80   0 - 84320 flock_ Sep27 ?        00:00:00 slurmstepd: [120565.batch]
5 S rtruc     88234  88163  0  80   0 - 84320 pipe_w Sep27 ?        00:00:00  \_ slurmstepd: [120565.batch]
4 S root      88106      1  0  80   0 - 84320 flock_ Sep27 ?        00:00:00 slurmstepd: [120564.batch]
5 S rtruc     88172  88106  0  80   0 - 84320 pipe_w Sep27 ?        00:00:00  \_ slurmstepd: [120564.batch]

============

[root@h25c19 ~]# lsof -p 88172
COMMAND     PID  USER   FD      TYPE             DEVICE SIZE/OFF               NODE NAME
slurmstep 88172 rtruc  cwd       DIR               0,52    18432         4090183543 /home/rtruc (10.181.1.1:/home)
slurmstep 88172 rtruc  rtd       DIR               0,38     4096         1745657516 / (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  txt       REG               0,51  1295568         1788599722 /apps/slurm/19.05.1-2/sbin/slurmstepd (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,38   163400          136497372 /usr/lib64/ld-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,51  8190240          899024412 /apps/slurm/19.05.1-2/lib/slurm/libslurmfull.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,38    19288          136498713 /usr/lib64/libdl-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38   254392          136497878 /usr/lib64/libhwloc.so.5.7.5 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    61672          136498092 /usr/lib64/libpam.so.0.83.1 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    15632          136498093 /usr/lib64/libpam_misc.so.0.82.0 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    14496          136498154 /usr/lib64/libutil-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38   141968          136497980 /usr/lib64/libpthread-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38  2151672          136498702 /usr/lib64/libc-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38  1137024          136497722 /usr/lib64/libm-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    50752          136498067 /usr/lib64/libnuma.so.1 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    41272          136497971 /usr/lib64/libltdl.so.7.3.0 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38   127184          136497495 /usr/lib64/libaudit.so.1.0.0 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    88776          136498730 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    23968          136497579 /usr/lib64/libcap-ng.so.0.0.0 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,38    61624          136497795 /usr/lib64/libnss_files-2.17.so (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,51   599536         3637155463 /apps/slurm/19.05.1-2/lib/slurm/select_cons_res.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   212944         4035587692 /apps/slurm/19.05.1-2/lib/slurm/auth_munge.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,38    41768          136498008 /usr/lib64/libmunge.so.2.0.0 (10.184.2.96:/install/netboot/centos7-statelite/x86_64/compute/rootimg)
slurmstep 88172 rtruc  mem       REG               0,51   186784         3819702317 /apps/slurm/19.05.1-2/lib/slurm/acct_gather_energy_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   204464         1844316952 /apps/slurm/19.05.1-2/lib/slurm/acct_gather_profile_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   186296          764328688 /apps/slurm/19.05.1-2/lib/slurm/acct_gather_interconnect_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   186264         2748103109 /apps/slurm/19.05.1-2/lib/slurm/acct_gather_filesystem_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   208656          354388911 /apps/slurm/19.05.1-2/lib/slurm/switch_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   341152          979810419 /apps/slurm/19.05.1-2/lib/slurm/jobacct_gather_cgroup.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   184704         2225469714 /apps/slurm/19.05.1-2/lib/slurm/core_spec_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   200016          118934908 /apps/slurm/19.05.1-2/lib/slurm/proctrack_cgroup.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   409144         2706524128 /apps/slurm/19.05.1-2/lib/slurm/task_affinity.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   418448         4061450313 /apps/slurm/19.05.1-2/lib/slurm/task_cgroup.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   189776         3628401163 /apps/slurm/19.05.1-2/lib/slurm/checkpoint_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   169744          128520314 /apps/slurm/19.05.1-2/lib/slurm/cred_munge.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc  mem       REG               0,51   186320          490713390 /apps/slurm/19.05.1-2/lib/slurm/job_container_none.so (10.181.1.2:/apps_hera)
slurmstep 88172 rtruc    0r      CHR                1,3      0t0               2052 /dev/null
slurmstep 88172 rtruc    1w      REG         312,769232     7050 144115672484532382 /scratch2/BMC/rtruc/RUC13/log/ncl_ruc_skewt2_2019092718_02.log
slurmstep 88172 rtruc    2w      REG         312,769232     7050 144115672484532382 /scratch2/BMC/rtruc/RUC13/log/ncl_ruc_skewt2_2019092718_02.log
slurmstep 88172 rtruc    3w      REG         312,769232     7050 144115672484532382 /scratch2/BMC/rtruc/RUC13/log/ncl_ruc_skewt2_2019092718_02.log
slurmstep 88172 rtruc    4r     FIFO                0,9      0t0           23954255 pipe
slurmstep 88172 rtruc    5w     FIFO                0,9      0t0           23954255 pipe
slurmstep 88172 rtruc    6u     unix 0xffff9c3c08fd1800      0t0           23993368 /var/spool/slurmd/h25c19_120564.4294967294
slurmstep 88172 rtruc    7r     FIFO                0,9      0t0           23993370 pipe
slurmstep 88172 rtruc    8w     FIFO                0,9      0t0           23993370 pipe
slurmstep 88172 rtruc    9r      REG               0,29        0           23898560 /sys/fs/cgroup/memory/slurm/uid_3014/job_120564/step_batch/memory.oom_control
slurmstep 88172 rtruc   10w      REG               0,29        0           23898548 /sys/fs/cgroup/memory/slurm/uid_3014/job_120564/step_batch/cgroup.event_control
slurmstep 88172 rtruc   11u  a_inode               0,10        0              12390 [eventfd]
slurmstep 88172 rtruc   12r     FIFO                0,9      0t0           23898582 pipe
slurmstep 88172 rtruc   13w     FIFO                0,9      0t0           23898582 pipe
slurmstep 88172 rtruc   14r      CHR                1,3      0t0               2052 /dev/null
slurmstep 88172 rtruc   15w      REG         312,769232     7050 144115672484532382 /scratch2/BMC/rtruc/RUC13/log/ncl_ruc_skewt2_2019092718_02.log
slurmstep 88172 rtruc   16r     FIFO                0,9      0t0           23674775 pipe
Comment 1 Felip Moll 2019-10-01 03:56:15 MDT
Anthony,

From your ps output I can see that some processes are waiting in mem_cg* kernel functions.

Can you invoke ps and return the full WCHAN string? In the output you sent it is truncated. For example:

ps -a opid,wchan:42,cmd

You can also check which functions match this substring with "cat /boot/System.map-<yourkernelversion> | grep mem_cg".

I suspect a problem with cgroups being unable to kill the processes. I would need your slurm.conf, the slurmd and slurmctld logs, and the output of a "mount" command issued on a failing node.
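
As a concrete example of that System.map check (a sketch only; substitute the running kernel version of the failing node):

grep mem_cg /boot/System.map-$(uname -r)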
Comment 2 Felip Moll 2019-10-01 04:03:32 MDT
Sorry, I meant something like:

ps -e --forest  opid,ppid,user,wchan:42,cmd
Comment 3 Anthony DelSorbo 2019-10-01 06:24:30 MDT
(In reply to Felip Moll from comment #2)
> Sorry, I meant something like:
> 
> ps -e --forest  opid,ppid,user,wchan:42,cmd

Felip,

See below for the ps output.  Since we send our slurmd logs to syslog, I'm going to prune and attach the messages file as well as the slurm.conf file.


========

125742      1 root     mem_cgroup_oom_synchronize                 /apps/slurm/d/sbin/slurmd
 87619 125742 root     do_exit                                     \_ [slurmd] <defunct>
 87631 125742 root     do_exit                                     \_ [slurmd] <defunct>
 87666 125742 root     do_exit                                     \_ [slurmd] <defunct>
 87677 125742 root     do_exit                                     \_ [slurmd] <defunct>
 87689 125742 root     do_exit                                     \_ [slurmd] <defunct>
 87814 125742 root     do_exit                                     \_ [slurmd] <defunct>
 87954 125742 root     do_exit                                     \_ [slurmd] <defunct>
 88031 125742 root     do_exit                                     \_ [slurmd] <defunct>
 88048 125742 root     do_exit                                     \_ [slurmd] <defunct>
 88290 125742 root     mem_cgroup_oom_synchronize                  \_ /apps/slurm/d/sbin/slurmd
 87605      1 root     do_wait                                    slurmstepd: [120564.extern]
 87977  87605 root     hrtimer_nanosleep                           \_ sleep 100000000
 87608      1 root     do_wait                                    slurmstepd: [120570.extern]
 88062  87608 root     hrtimer_nanosleep                           \_ sleep 100000000
 87613      1 root     do_wait                                    slurmstepd: [120568.extern]
 88128  87613 root     hrtimer_nanosleep                           \_ sleep 100000000
 87626      1 root     flock_lock_inode_wait                      slurmstepd: [120580.extern]
 88240  87626 root     hrtimer_nanosleep                           \_ sleep 100000000
 87636      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120574.extern]
 88182  87636 root     hrtimer_nanosleep                           \_ sleep 100000000
 87645      1 root     do_wait                                    slurmstepd: [120579.extern]
 88115  87645 root     hrtimer_nanosleep                           \_ sleep 100000000
 87676      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120569.extern]
 88287  87676 root     mem_cgroup_oom_synchronize                  \_ sleep 100000000
 87686      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120573.extern]
 88237  87686 root     hrtimer_nanosleep                           \_ sleep 100000000
 87696      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120575.extern]
 88248  87696 root     hrtimer_nanosleep                           \_ sleep 100000000
 87712      1 root     do_wait                                    slurmstepd: [120563.extern]
 88142  87712 root     hrtimer_nanosleep                           \_ sleep 100000000
 87727      1 root     do_wait                                    slurmstepd: [120576.extern]
 88042  87727 root     hrtimer_nanosleep                           \_ sleep 100000000
 87744      1 root     do_wait                                    slurmstepd: [120577.extern]
 88157  87744 root     hrtimer_nanosleep                           \_ sleep 100000000
 87760      1 root     do_wait                                    slurmstepd: [120578.extern]
 87884  87760 root     hrtimer_nanosleep                           \_ sleep 100000000
 87791      1 root     do_wait                                    slurmstepd: [120566.extern]
 88103  87791 root     hrtimer_nanosleep                           \_ sleep 100000000
 87809      1 root     do_wait                                    slurmstepd: [120571.extern]
 88099  87809 root     hrtimer_nanosleep                           \_ sleep 100000000
 87824      1 root     flock_lock_inode_wait                      slurmstepd: [120567.extern]
 87833      1 root     do_wait                                    slurmstepd: [120565.extern]
 88108  87833 root     hrtimer_nanosleep                           \_ sleep 100000000
 87847      1 root     do_wait                                    slurmstepd: [120572.extern]
 88110  87847 root     hrtimer_nanosleep                           \_ sleep 100000000
 87965      1 root     do_wait                                    slurmstepd: [120588.extern]
 88121  87965 root     hrtimer_nanosleep                           \_ sleep 100000000
 87972      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120591.extern]
 88233  87972 root     hrtimer_nanosleep                           \_ sleep 100000000
 88007      1 root     do_wait                                    slurmstepd: [120592.extern]
 88118  88007 root     hrtimer_nanosleep                           \_ sleep 100000000
 88020      1 root     do_wait                                    slurmstepd: [120586.extern]
 88137  88020 root     hrtimer_nanosleep                           \_ sleep 100000000
 88037      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120581.extern]
 88224  88037 root     hrtimer_nanosleep                           \_ sleep 100000000
 88046      1 root     do_wait                                    slurmstepd: [120582.extern]
 88147  88046 root     hrtimer_nanosleep                           \_ sleep 100000000
 88047      1 root     do_wait                                    slurmstepd: [120589.extern]
 88175  88047 root     hrtimer_nanosleep                           \_ sleep 100000000
 88050      1 root     flock_lock_inode_wait                      slurmstepd: [120583.extern]
 88219  88050 root     hrtimer_nanosleep                           \_ sleep 100000000
 88051      1 root     do_wait                                    slurmstepd: [120590.extern]
 88164  88051 root     hrtimer_nanosleep                           \_ sleep 100000000
 88053      1 root     do_wait                                    slurmstepd: [120587.extern]
 88123  88053 root     hrtimer_nanosleep                           \_ sleep 100000000
 88056      1 root     do_wait                                    slurmstepd: [120585.extern]
 88154  88056 root     hrtimer_nanosleep                           \_ sleep 100000000
 88057      1 root     do_wait                                    slurmstepd: [120584.extern]
 88222  88057 root     hrtimer_nanosleep                           \_ sleep 100000000
 88081      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120578.batch]
 92074  88081 root     mem_cgroup_oom_synchronize                  \_ slurmstepd: [120578.batch]
 88106      1 root     flock_lock_inode_wait                      slurmstepd: [120564.batch]
 88172  88106 rtruc    pipe_wait                                   \_ slurmstepd: [120564.batch]
 88119      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120571.batch]
 92070  88119 root     mem_cgroup_oom_synchronize                  \_ slurmstepd: [120571.batch]
 88131      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120572.batch]
 92077  88131 root     mem_cgroup_oom_synchronize                  \_ slurmstepd: [120572.batch]
 88139      1 root     flock_lock_inode_wait                      slurmstepd: [120579.batch]
 88236  88139 rtruc    pipe_wait                                   \_ slurmstepd: [120579.batch]
 88150      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120592.batch]
 92076  88150 root     mem_cgroup_oom_synchronize                  \_ slurmstepd: [120592.batch]
 88163      1 root     flock_lock_inode_wait                      slurmstepd: [120565.batch]
 88234  88163 rtruc    pipe_wait                                   \_ slurmstepd: [120565.batch]
 88171      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120568.batch]
 88266  88171 rtruc    pipe_wait                                   \_ slurmstepd: [120568.batch]
 88217      1 root     futex_wait_queue_me                        slurmstepd: [120586.batch]
 88231      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120566.batch]
 88257      1 root     mem_cgroup_oom_synchronize                 slurmstepd: [120588.batch]
Comment 4 Anthony DelSorbo 2019-10-01 06:31:00 MDT
Created attachment 11758 [details]
slurmd messages
Comment 5 Anthony DelSorbo 2019-10-01 06:31:37 MDT
Created attachment 11759 [details]
Slurm configuration
Comment 6 Anthony DelSorbo 2019-10-01 06:41:29 MDT
Created attachment 11760 [details]
mount command output
Comment 7 Anthony DelSorbo 2019-10-01 06:42:35 MDT
Created attachment 11761 [details]
mem_cg grep of system.map...
Comment 8 Anthony DelSorbo 2019-10-01 06:47:51 MDT
The slurmctld logs are nearly 1GB each - compressed.  Is there another way I can send them up to you?

Tony.
Comment 9 Felip Moll 2019-10-01 08:08:58 MDT
Tony, I will first analyze what you've already uploaded before you send slurmctld logs.

Do you have a dmesg from the failing node?

The mem_cgroup_oom_synchronize() kernel function is normally invoked when there's an out-of-memory event. It also shows up in the logs:

Sep 27 02:57:24 h25c19 slurmstepd[440940]: _oom_event_monitor: oom-kill event count: 1

If the OOM killer killed slurmstepds randomly, it could lead to what you are seeing. I will respond in more detail once I have analyzed everything, but in the meantime can you investigate this possibility and also attach the cgroup.conf from the node?

From linux kernel code source/mm/memcontrol.c

/**
 * mem_cgroup_oom_synchronize - complete memcg OOM handling
 * @handle: actually kill/wait or just clean up the OOM state
 *
 * This has to be called at the end of a page fault if the memcg OOM
 * handler was enabled.
 *
 * Memcg supports userspace OOM handling where failed allocations must
 * sleep on a waitqueue until the userspace task resolves the
 * situation.  Sleeping directly in the charge context with all kinds
 * of locks held is not a good idea, instead we remember an OOM state
 * in the task and mem_cgroup_oom_synchronize() has to be called at
 * the end of the page fault to complete the OOM handling.
 *
 * Returns %true if an ongoing memcg OOM situation was detected and
 * completed, %false otherwise.
 */
bool mem_cgroup_oom_synchronize(bool handle)
{
...
}

I can also see that on some nodes you have MemSpecLimit configured. In particular, on h25c19 you have MemSpecLimit=2000 and RealMemory=95000. This means the slurmd daemons shouldn't be killed if they exhaust their memory allocation (i.e. the out-of-memory killer is disabled for the daemon's memory cgroup), but I am wondering whether this setting could be related to the issue we're seeing.


I see other nodes too with strange MemSpecLimit, you are granting 128GB for slurmd daemon but the system has 192GB, so you're leaving 64GB for jobs. Is this intentional?

NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=192000 MemSpecLimit=128000  



Thanks
Comment 10 Anthony DelSorbo 2019-10-01 12:20:13 MDT
Here is the content of cgroup.conf:

[slurm@bqs1 etc]$ cat cgroup.conf
CgroupAutomount=yes
ConstrainCores=no
# ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
#CgroupMountpoint=/sys/fs/cgroup

It would not be unheard of for me to have done something stupid in configuring Slurm. In your comment:

"I see other nodes too with strange MemSpecLimit, you are granting 128GB for slurmd daemon but the system has 192GB, so you're leaving 64GB for jobs. Is this intentional?

NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=192000 MemSpecLimit=128000  "

These hfe nodes are login nodes.  We have many users. So, I wanted to leave out enough memory for users to compile, edit, etc.  Is there a better way?

For the other compute nodes, is 2000 sufficient?
Comment 11 Felip Moll 2019-10-01 16:05:08 MDT
(In reply to Anthony DelSorbo from comment #10)
> Here is the content of cgroup.conf:
> 
> [slurm@bqs1 etc]$ cat cgroup.conf
> CgroupAutomount=yes
> ConstrainCores=no
> # ConstrainDevices=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> #CgroupMountpoint=/sys/fs/cgroup
> 
> It would not be unheard of for me to have done something stupid in configuring
> Slurm. In your comment:
> 
> "I see other nodes too with strange MemSpecLimit, you are granting 128GB for
> slurmd daemon but the system has 192GB, so you're leaving 64GB for jobs. Is
> this intentional?
> 
> NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=192000
> MemSpecLimit=128000  "
> 
> These hfe nodes are login nodes.  We have many users. So, I wanted to leave
> out enough memory for users to compile, edit, etc.  Is there a better way?

Well, just setting RealMemory to the desired lower value would be better. For example, with RealMemory=66000 you could set MemSpecLimit=2000 and the remaining 64GB would be used for job allocations.
I don't like the idea of setting the slurmd cgroup memory limit to 128GB; you're technically allowing slurmd to grow up to 128GB.
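
For example, a node definition along these lines (illustrative values only, not taken from your slurm.conf) would leave roughly 64GB for jobs and cap slurmd at 2GB:

NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=66000 MemSpecLimit=2000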

> For the other compute nodes, is 2000 sufficient?

I would say this is sufficient, but I encourage you to check your nodes and guide yourself by the largest slurmd RSS value you see.
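
A minimal way to sample slurmd's resident memory on a node (just a sketch; any equivalent command works, and rss is reported in KiB):

ps -C slurmd -o pid,rss,vsz,cmd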

Besides that, can you provide a dmesg from a failing node?
Comment 12 Anthony DelSorbo 2019-10-02 05:50:28 MDT
Created attachment 11776 [details]
dmesg file from h2c07

I lost the previous node we were using for troubleshooting.  Be aware we're now using h2c07.
Comment 13 Anthony DelSorbo 2019-10-02 05:53:10 MDT
Created attachment 11777 [details]
New process table listing for h2c07

Just in case you need it
Comment 14 Anthony DelSorbo 2019-10-02 11:37:41 MDT
Felip,

Here's some additional info that may be relevant:

The following job completed successfully; it is a single-core job.  The user's log does not indicate anything abnormal occurred.
However, as you can see, the slurmstepd ended with an _oom_event_monitor message.  And although it states an oom-kill occurred, there's no evidence as to what was killed.


Oct  2 16:57:05 h25c24 slurmstepd[453775]: job 161901 completed with slurm_rc = 0, job_rc = 0
Oct  2 16:57:05 h25c24 slurmstepd[453775]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
Oct  2 16:57:05 h25c24 slurmstepd[453775]: done with job
Oct  2 16:57:05 h25c24 slurmstepd[453768]: Sent signal 18 to 161901.4294967295
Oct  2 16:57:05 h25c24 slurmstepd[453768]: Sent signal 15 to 161901.4294967295
Oct  2 16:57:05 h25c24 slurmstepd[453768]: _oom_event_monitor: oom-kill event count: 1
Oct  2 16:57:05 h25c24 slurmstepd[453768]: done with job

Oct  2 16:57:05 h25c24 EPILOG-ROOT: JobInfo: JobId=161901 JobName=RAP_ncl_09_00_zip UserId=Eric.James(5133) GroupId=wrfruc(10019) MCS_label=N/A Priority=206024767 Nice=0 Account=wrfruc QOS=batch JobState=COMPLETING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A SubmitTime=2019-10-02T16:57:04 EligibleTime=2019-10-02T16:57:04 AccrueTime=2019-10-02T16:57:04 StartTime=2019-10-02T16:57:04 EndTime=2019-10-02T16:57:05 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-10-02T16:57:04 Partition=hera AllocNode:Sid=hfe04:240106 ReqNodeList=(null) ExcNodeList=(null) NodeList=h25c24 BatchHost=h25c24 NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=1,mem=800M,node=1,billing=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=800M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null) Command=(null) WorkDir=/home/Eric.James Comment=83df185dcab2b6b36ec18d54aee12bd6  StdErr=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/ncl_201902040900_00_zip.log StdIn=/dev/null StdOut=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/ncl_201902040900_00_zip.log Power=
Comment 15 Felip Moll 2019-10-03 04:59:15 MDT
In your dmesg I see that job 89098 is being killed by the OOM killer. Specifically, it kills the ISi.ml60.x command; is that a user executable or some Isilon storage daemon (or other system component)?

[668472.552618] ISi.ml60.x invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[668472.560855] ISi.ml60.x cpuset=/ mems_allowed=0-1
[668472.566094] CPU: 53 PID: 165051 Comm: ISi.ml60.x Tainted: P           OE  ------------   3.10.0-957.21.3.el7.x86_64 #1
[668472.577400] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0016.C0001.020120190930 02/01/2019
[668472.588301] Call Trace:
[668472.591418]  [<ffffffffac363107>] dump_stack+0x19/0x1b
[668472.597187]  [<ffffffffac35db2a>] dump_header+0x90/0x229
[668472.603125]  [<ffffffffabcd6b72>] ? default_wake_function+0x12/0x20
[668472.610014]  [<ffffffffabdba386>] ? find_lock_task_mm+0x56/0xc0
[668472.616562]  [<ffffffffabe317a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[668472.623884]  [<ffffffffabdba834>] oom_kill_process+0x254/0x3d0
[668472.630336]  [<ffffffffabe35586>] mem_cgroup_oom_synchronize+0x546/0x570
[668472.637656]  [<ffffffffabe34a00>] ? mem_cgroup_charge_common+0xc0/0xc0
[668472.644802]  [<ffffffffabdbb0c4>] pagefault_out_of_memory+0x14/0x90
[668472.651680]  [<ffffffffac35c032>] mm_fault_error+0x6a/0x157
[668472.657874]  [<ffffffffac3707c8>] __do_page_fault+0x3c8/0x4f0
[668472.664231]  [<ffffffffac370925>] do_page_fault+0x35/0x90
[668472.670243]  [<ffffffffac36c768>] page_fault+0x28/0x30
[668472.675995] Task in /slurm/uid_16039/job_89098/step_0/task_0 killed as a result of limit of /slurm/uid_16039/job_89098

As you can see, the mem_cgroup_oom_synchronize function shows up here, which is the same one seen in the 'ps' WCHAN field.
In parallel I can see some Lustre errors:

[1095389.926811] Lustre: 7633:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569434150/real 0]  req@ffff8866fb23e880 x1644529215860368/t0(0) o103->scratch1-OST002f-osc-ffff8873e8f93800@10.186.224.28@o2ib:17/18 lens 328/224 e 0 to 1 dl 1569434157 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1095389.958596] Lustre: 7633:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[1095389.969455] Lustre: scratch1-OST002f-osc-ffff8873e8f93800: Connection to scratch1-OST002f (at 10.186.224.28@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1095389.989152] Lustre: Skipped 7 previous similar messages
[1095391.923466] Lustre: 7673:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569434151/real 0]  req@ffff88709d6563c0 x1644529218064592/t0(0) o103->scratch1-OST005b-osc-ffff8873e8f93800@10.186.224.36@o2ib:17/18 lens 328/224 e 0 to 1 dl 1569434159 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1095391.954481] Lustre: 7673:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 9468 previous similar messages
[1095392.445297] drop_caches (192500): drop_caches: 3
[1095392.545268] epilog.root: Wed Sep 25 17:55:59 UTC 2019 Finished Job 105336
[1095392.657636] Lustre: scratch1-MDT0000-mdc-ffff8873e8f93800: Connection to scratch1-MDT0000 (at 10.186.224.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1095392.677498] Lustre: Skipped 60 previous similar messages
[1095392.868931] Lustre: scratch1-OST0073-osc-ffff8873e8f93800: Connection restored to 10.186.224.26@o2ib (at 10.186.224.26@o2ib)
[1095392.881834] Lustre: Skipped 10 previous similar messages
[1095395.421728] Lustre: scratch1-MDT0000-mdc-ffff8873e8f93800: Connection restored to 10.186.224.66@o2ib (at 10.186.224.66@o2ib)
[1095395.915843] Lustre: 7681:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569434151/real 0]  req@ffff8868e147f200 x1644529218676544/t0(0) o103->scratch1-OST0064-osc-ffff8873e8f93800@10.186.224.38@o2ib:17/18 lens 328/224 e 0 to 1 dl 1569434159 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1095395.947253] Lustre: 7681:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 188922 previous similar messages
[1095397.028266] Lustre: scratch1-OST005c-osc-ffff8873e8f93800: Connection restored to 10.186.224.36@o2ib (at 10.186.224.36@o2ib)
[1095468.190495] epilog.root: Wed Sep 25 17:57:15 UTC 2019 Finished Job 106268
[1655908.323319] Lustre: 7627:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569996005/real 0]  req@ffff885c196117c0 x1644529317911328/t0(0) o400->scratch1-OST0007-osc-ffff8873e8f93800@10.186.224.22@o2ib:28/4 lens 224/224 e 0 to 1 dl 1569996012 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[1655908.323689] Lustre: scratch1-OST0008-osc-ffff8873e8f93800: Connection to scratch1-OST0008 (at 10.186.224.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1655908.323690] Lustre: Skipped 1 previous similar message
[1655908.380065] Lustre: 7627:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 36831 previous similar messages
[1655914.385317] Lustre: 7614:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569996012/real 0]  req@ffff885eeaf5aac0 x1644529317913280/t0(0) o8->scratch1-OST0007-osc-ffff8873e8f93800@10.186.224.22@o2ib:28/4 lens 520/544 e 0 to 1 dl 1569996018 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[1656123.723948] Lustre: scratch1-OST006f-osc-ffff8873e8f93800: Connection restored to 10.186.224.24@o2ib (at 10.186.224.24@o2ib)
[1656123.736981] Lustre: Skipped 60 previous similar messages
[1656124.275015] Lustre: scratch1-OST0007-osc-ffff8873e8f93800: Connection restored to 10.186.224.24@o2ib (at 10.186.224.24@o2ib)
[1656136.813117] Lustre: scratch1-OST000a-osc-ffff8873e8f93800: Connection restored to 10.186.224.24@o2ib (at 10.186.224.24@o2ib)
[1656136.826170] Lustre: Skipped 2 previous similar messages

How did you choose the RealMemory setting relative to the real available memory in the system? Do they match exactly? If so, I would recommend reserving a bit more for the OS by decreasing RealMemory.
From the initial 'ps', note that slurmd and the stepds are in D state, which probably means they are waiting on I/O:

F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
1 D root     125742      1  0  80   0 - 1020759 mem_cg Sep13 ?      00:00:18 /apps/slurm/d/sbin/slurmd
...
1 D root      88290 125742  0  80   0 - 1020759 mem_cg Sep27 ?      00:00:00  \_ /apps/slurm/d/sbin/slurmd


So my bet here is that you're having storage issues, either because daemons are dying (probably because of OOMs) or because Lustre is failing at some point.
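
If it helps, one way (a sketch, not something requested in the original exchange) to list D-state processes together with their wait channel is:

ps -eo pid,ppid,user,stat,wchan:42,cmd | awk 'NR==1 || $4 ~ /^D/'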
Comment 16 Felip Moll 2019-10-03 05:12:30 MDT
Another note;

the first failure I see in the logs is in this line:

[496671.216158] epilog.root: Wed Sep 18 19:13:30 UTC 2019 Finished Job 75474
[496719.295392] Lustre: 7620:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1568834051/real 0]  req@ffff886de0f4ed40 x1644525101596384/t0(0) o4->scratch1-OST0026-osc-ffff8873e8f93800@10.186.224.26@o2ib:6/4 lens 608/448 e 0 to 1 dl 1568834058 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
[496719.324924] Lustre: scratch1-OST0026-osc-ffff8873e8f93800: Connection to scratch1-OST0026 (at 10.186.224.26@o2ib) was lost; in progress operations using this service will wait for recovery to complete
.....

Just after that, the 'unzip' command got stuck in D state:

[496940.418823] INFO: task unzip:55703 blocked for more than 120 seconds.
[496940.426272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[496940.435067] unzip           D ffff886aa1088640     0 55703  55680 0x00000000
Comment 17 Felip Moll 2019-10-03 06:00:54 MDT
Can you also provide me with:

'sysctl -a' in a failing node?


Can you also enter the node, go to /sys/fs/cgroup/memory/slurm*/uid_*/job_*/ (for a job and respective user you see stuck in 'ps') and run:

pwd; find . -name memory.oom_control -exec echo '{}' \; -exec cat '{}' \;


Thanks
Comment 20 Anthony DelSorbo 2019-10-03 07:56:29 MDT
(In reply to Felip Moll from comment #17)
> Can you also provide me with:
> 
> 'sysctl -a' in a failing node?
> 
will attach file ....

> 
> Can you also enter the node, go to /sys/fs/cgroup/memory/slurm*/uid_*/job_*/
> (for a job and respective user you see stuck in 'ps') and run:
> 
> pwd; find . -name memory.oom_control -exec echo '{}' \; -exec cat '{}' \;
> 
> 
> Thanks

[root@h2c07 job_75246]# cd /sys/fs/cgroup/memory/slurm/uid_3014/job_106255
[root@h2c07 job_106255]#  pwd; find . -name memory.oom_control -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/memory/slurm/uid_3014/job_106255
./step_batch/task_0/memory.oom_control
oom_kill_disable 0
under_oom 0
./step_batch/memory.oom_control
oom_kill_disable 0
under_oom 0
./step_extern/task_0/memory.oom_control
oom_kill_disable 0
under_oom 0
./step_extern/memory.oom_control
oom_kill_disable 0
under_oom 0
./memory.oom_control
oom_kill_disable 0
under_oom 0
Comment 21 Anthony DelSorbo 2019-10-03 10:56:04 MDT
Created attachment 11800 [details]
sysctl - a output on node
Comment 22 Felip Moll 2019-10-04 04:14:57 MDT
(In reply to Anthony DelSorbo from comment #21)
> Created attachment 11800 [details]
> sysctl - a output on node

Everything seems correct, so please, can you modify RealMemory and MemSpecLimit to better values (RealMemory less than the physical memory on the node as reported by free -m)?

Can you also comment about comment 15 and comment 16?

Thanks
Comment 23 Felip Moll 2019-10-09 05:48:42 MDT
Anthony, there may be some relation between this ticket and ticket 7872.

This:

Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'

is the cause of the symptoms of 7872 and may be part of what leaves these slurmstepds abandoned.

What is happening is that, for some reason, the freezer cgroup path is set to 'null' in your proctrack plugin. The task plugin then asks the proctrack plugin for the pids of the steps to kill, and since there is no cgroup path it fails to find them.


In a compute node:
---
What does 'cat /apps/slurm/d/etc/slurmd.service' show? I want to see if systemd is involved in messing up your cgroups.

Show me also:

gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/slurmd_2eservice

and if it doesn't show anything, the full output of:

gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --recurse
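
While collecting that, it may also help to confirm which freezer cgroup a stuck stepd actually sits in. A rough sketch (the PID below is hypothetical; substitute one of the stuck slurmstepd PIDs from 'ps'):

STEPD_PID=22157
grep freezer /proc/${STEPD_PID}/cgroup
find /sys/fs/cgroup/freezer/slurm -maxdepth 4 -name tasks 2>/dev/null | head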
Comment 24 Felip Moll 2019-10-17 09:50:37 MDT
After more digging this seems to be just a consequence of trying to kill tasks on a cgroup where no more tasks remain:

Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'

so this message is also a symptom, not a cause. It may be that an OOM killed a stepd that shouldn't have been killed.

I encourage you to check my comments 22 and 23, apply the changes and then see what happens.

Will wait for your feedback.
Comment 25 Felip Moll 2019-11-04 09:01:00 MST
Hi Anthony,

Can you respond to my comments 22 and 23?

Thank you.
Comment 26 Anthony DelSorbo 2019-11-04 10:31:12 MST
(In reply to Felip Moll from comment #25)
> Hi Anthony,
> 
> Can you refer to my comments 22 and 23?
> 
> Thank you.

Felip, I'm having a hard time understanding why I need to change the values I currently have and the ramifications thereof.  

(In reply to Felip Moll from comment #22)
> (In reply to Anthony DelSorbo from comment #21)
> > Created attachment 11800 [details]
> > sysctl - a output on node
> 
> Everything seems correct, so please, can you modify RealMemory and
> MemSpecLimit to better values (RealMemory less than the physical memory on
> the node as reported by free -m)?
> 

In our theia system (soon to be decommissioned) we have:

[root@t0268 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          64329        1522       62135         410         671       59828

The node definition in slurm.conf is:

NodeName=t[0001-1012,1027-1198] CPUs=24 CoresPerSocket=12 RealMemory=64000

and 

[root@t0268 ~]# scontrol sho node t0268 | egrep "RealMemory|MemSpecLimit|CfgTRES"
   RealMemory=64000 AllocMem=0 FreeMem=62133 Sockets=2 Boards=1
   MemSpecLimit=2000
   CfgTRES=cpu=24,mem=62.50G,billing=24

The DefMemPerCPU=2500 on partition theia.  Multiply that by 24 cpus = total mem to job = 60000. 

I'm not experiencing issues on that system - but it is running 18.08.  On the new hera system, I have:

[root@h6c34 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          95316        1741       90979         757        2595       91669

The node definition in slurm.conf is: 

NodeName=h1c[01-52],h[2-5]c[01-56],...,h[25]c[01-52] CPUs=40 CoresPerSocket=20 RealMemory=95000

and 

[root@h6c34 ~]# scontrol show node h6c34 | egrep "RealMemory|MemSpecLimit|CfgTRES"
   RealMemory=95000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   MemSpecLimit=2000
   CfgTRES=cpu=40,mem=95000M,billing=40

The DefMemPerCPU=2300 on partition hera x 40 cpus = 92000.  

This system is currently running 19.05.1 and tomorrow it will be 19.05.3.  The configuration is nearly identical.

So according to my aggregate defMemPerCPU I'm already 3GB below the RealMemory value of 95000. 

If I understand you correctly, I should set my RealMemory to less than 90979.  So, let's say 90000.  Then what do I set memSpec to and how does that affect DefMemPerCPU?  Doesn't that imply reducing the total amount of memory to the user as well?

Tony.
Comment 27 Felip Moll 2019-11-05 09:14:36 MST
> Felip, I'm having a hard time understanding why I need to change the values
> I currently have and the ramifications thereof.  

I wanted to set this to safe values to leave the OS enough memory for I/O operations.

If you set RealMemory very close to the available system memory then jobs can allocate all this memory and not leave any for other processes, which can lead to OOMs on the OS side.

That may not be the reason you're seeing the defunct processes, but setting safer values removes one variable from the problem.
 
> In our theia system (soon to be decommissioned) we have:
> 
> [root@t0268 ~]# free -m
>               total        used        free      shared  buff/cache  
> available
> Mem:          64329        1522       62135         410         671      
> 59828
> 
> The node definition in slurm.conf is:
> 
> NodeName=t[0001-1012,1027-1198] CPUs=24 CoresPerSocket=12 RealMemory=64000

Jobs are able to allocate up to 64000MB, leaving only 329MB for the OS.

If I run "srun --mem=64000 -N1 ./my_mem_eater" I may trigger OOM on OS side.

> I'm not experiencing issues on that system - but it is running 18.08.  On
> the new hera system, I have:
> 
> [root@h6c34 ~]# free -m
>               total        used        free      shared  buff/cache  
> available
> Mem:          95316        1741       90979         757        2595      
> 91669
> 
> The node definition in slurm.conf is: 
> 
> NodeName=h1c[01-52],h[2-5]c[01-56],...,h[25]c[01-52] CPUs=40
> CoresPerSocket=20 RealMemory=95000
> 

The same applies in this case if I run 'srun --mem=95000...'.


> The DefMemPerCPU=2300 on partition hera x 40 cpus = 92000.  

DefMemPerCPU can be overridden by the user's command-line --mem or --mem-per-cpu, so it is not a guarantee that jobs won't use more memory.

> So according to my aggregate defMemPerCPU I'm already 3GB below the
> RealMemory value of 95000. 

If users don't specify --mem*, yes. Otherwise they can consume up to 95000MB per node.
 
> If I understand you correctly, I should set my RealMemory to less than
> 90979.  So, let's say 90000.

No, to 95316-[safe_amount_for_OS].

If you consider that Lustre, I/O, system daemons, and so on will consume at most, say, 5GB, then I would set RealMemory to 90000MB for example, leaving 5316MB for the OS.


>  Then what do I set memSpec to and how does
> that affect DefMemPerCPU?

MemSpecLimit=<amount> subtracts 'amount' from what's available for job allocations. For example, if you set RealMemory=90000 and MemSpecLimit=2000, then the memory allocatable to jobs will be 88000MB, and 2000MB will be set as the limit for slurmd's memory cgroup, so the slurmd daemon won't exceed 2000MB of usage (see the worked sketch below).

>  Doesn't that imply reducing the total amount of
> memory to the user as well?

Yes.
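
To make that arithmetic concrete, a worked sketch with purely illustrative numbers for one of the 95316MB nodes (not a recommendation of exact values):

NodeName=h6c34 CPUs=40 CoresPerSocket=20 RealMemory=90000 MemSpecLimit=2000
# allocatable to jobs:        90000 - 2000 = 88000 MB
# slurmd memory cgroup limit: 2000 MB
# left for the OS:            95316 - 90000 = 5316 MB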



In addition to this information, and returning to the error seen in dmesg_h2c07.log, I still have unresolved questions from comment 15.

1. ISi.ml60.x, what is that command? Is this a specific job command, or something related to storage?
2. That command was killed due to running out of memory in the job's cgroup, followed by an issue in the Lustre storage. I was wondering whether killing this command, plus the later Lustre I/O issues, had something to do with your defunct slurmd processes.
3. Comment 16 refers to unzip being in 'D' state after Lustre issues. I wonder if this is related as well.
4. I am still thinking about bug 7872 being a duplicate of this one.

[668484.785112] Memory cgroup out of memory: Kill process 165028 (ISi.ml60.x) score 40 or sacrifice child
[668484.794886] Killed process 165028 (ISi.ml60.x) total-vm:738984kB, anon-rss:208648kB, file-rss:2384kB, shmem-rss:1284kB
[668484.863204] ISi.ml60.x invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[668484.871516] ISi.ml60.x cpuset=/ mems_allowed=0-1
[668484.876797] CPU: 29 PID: 165044 Comm: ISi.ml60.x Tainted: P           OE  ------------   3.10.0-957.21.3.el7.x86_64 #1
[668484.888029] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0016.C0001.020120190930 02/01/2019
[668484.898813] Call Trace:
[668484.901804]  [<ffffffffac363107>] dump_stack+0x19/0x1b
[668484.907458]  [<ffffffffac35db2a>] dump_header+0x90/0x229
[668484.913273]  [<ffffffffabcd6b72>] ? default_wake_function+0x12/0x20
[668484.920037]  [<ffffffffabdba386>] ? find_lock_task_mm+0x56/0xc0
[668484.926430]  [<ffffffffabe317a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[668484.933585]  [<ffffffffabdba834>] oom_kill_process+0x254/0x3d0
[668484.939866]  [<ffffffffabe35586>] mem_cgroup_oom_synchronize+0x546/0x570
[668484.947002]  [<ffffffffabe34a00>] ? mem_cgroup_charge_common+0xc0/0xc0
[668484.953977]  [<ffffffffabdbb0c4>] pagefault_out_of_memory+0x14/0x90
[668484.960694]  [<ffffffffac35c032>] mm_fault_error+0x6a/0x157
[668484.966711]  [<ffffffffac3707c8>] __do_page_fault+0x3c8/0x4f0
[668484.972903]  [<ffffffffac370925>] do_page_fault+0x35/0x90
[668484.978759]  [<ffffffffac36c768>] page_fault+0x28/0x30
[668484.984349] Task in /slurm/uid_16039/job_89098/step_0/task_12 killed as a result of limit of /slurm/uid_16039/job_89098
[668484.995608] memory: usage 5242880kB, limit 5242880kB, failcnt 155852
[668485.002444] memory+swap: usage 5242880kB, limit 5242880kB, failcnt 694
[668485.009453] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0


Actions on your side:
a. Respond to, or make observations on, questions 1, 2, 3 and 4
b. Decide whether to set safe values for RealMemory and optionally MemSpecLimit.
c. Check for Lustre issues
d. Inform if it is still happening frequently
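
For items (c) and (d), one quick way to scan a node (just a sketch; adapt the patterns as needed):

dmesg -T | egrep -i 'out of memory|oom-killer|lustre' | tail -n 50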
Comment 28 Felip Moll 2019-12-06 03:57:51 MST
Hi, do you have any feedback on my last comment?

Thanks!
Comment 29 Felip Moll 2019-12-13 02:48:00 MST
(In reply to Felip Moll from comment #28)
> Hi, do you have any feedback on my last comment?
> 
> Thanks!

Hi Anthony,

Since I haven't received any feedback in the last month, I am going to close this issue as INFOGIVEN. If you consider it still unresolved, please mark the ticket as OPEN again and provide the information requested in my latest comments.

Thanks for your understanding!