Ticket 5130 - Slurmd stops working - unable to create cgroup
Summary: Slurmd stops working - unable to create cgroup
Status: RESOLVED DUPLICATE of ticket 5082
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-05-03 08:59 MDT by GSK-ONYX-SLURM
Modified: 2018-05-11 17:50 MDT

See Also:
Site: GSK
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: ?
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Control daemon log. (26.56 MB, text/plain)
2018-05-04 04:34 MDT, GSK-ONYX-SLURM
Details
Slurm daemon log (68.85 MB, text/plain)
2018-05-04 04:38 MDT, GSK-ONYX-SLURM
Details
Slurm configuration (5.82 KB, text/plain)
2018-05-08 08:33 MDT, GSK-ONYX-SLURM
Details
Sdiag information (8.74 KB, text/plain)
2018-05-08 08:34 MDT, GSK-ONYX-SLURM
Details

Description GSK-ONYX-SLURM 2018-05-03 08:59:29 MDT
Hi.
We have now had a second occurrence of slurmd reporting that it is unable to create a cgroup and then hanging. The log file entries are:

[2018-05-03T13:43:32.128] [157115] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_157115' : No space left on device
[2018-05-03T13:43:32.129] [157119] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_157119' : No space left on device
[2018-05-03T13:43:32.129] [157118] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_157118' : No space left on device

and
[2018-05-03T13:43:32.491] [157115] error: task/cgroup: unable to add task[pid=32382] to memory cg '(null)'
[2018-05-03T13:43:33.724] [157119] error: task/cgroup: unable to add task[pid=32673] to memory cg '(null)'
[2018-05-03T13:43:33.725] [157118] error: task/cgroup: unable to add task[pid=32674] to memory cg '(null)'

and then eventually the slurmd just stops logging output.

We are in this situation right now.
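
One quick way to confirm that the error comes from the kernel's cgroup layer rather than from disk space is to try creating and removing a test cgroup by hand (a sketch; the uid path is taken from the log lines above):

mkdir /sys/fs/cgroup/memory/slurm/uid_62356/test \
  && rmdir /sys/fs/cgroup/memory/slurm/uid_62356/test \
  || echo "memory cgroup creation is failing in the kernel, not for lack of disk space"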
Comment 3 Marshall Garey 2018-05-03 09:24:16 MDT
Hi,

I believe this is a duplicate of bug 5082. There is a known kernel bug that causes this. What is your kernel version? On bug 5082, I'm also investigating more deeply to see if there's something else that may cause the job cgroups to not get cleaned up.

I don't think the logging is related at all - there have been several reports of this and none have mentioned logging. Can you upload the last part of the slurmd log file? Can you check to see if you filled up your filesystem?
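
For the filesystem check, something like the following covers both block and inode exhaustion (a sketch; the paths are assumptions and should match SlurmdLogFile and SlurmdSpoolDir in slurm.conf):

df -h /var/log /var/spool/slurmd      # block usage
df -i /var/log /var/spool/slurmd      # inode usage; running out of inodes is also reported as "No space left on device"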
Comment 5 GSK-ONYX-SLURM 2018-05-03 10:05:24 MDT
[root@uk1salx00553 slurm]# cat /proc/version
Linux version 3.10.0-693.11.6.el7.x86_64 (mockbuild@x86-041.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 28 14:23:39 EST 2017
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# uname -r
3.10.0-693.11.6.el7.x86_64
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[root@uk1salx00553 slurm]#
Comment 6 Marshall Garey 2018-05-03 13:26:03 MDT
Alright, so that kernel version is relatively new - we know certain older kernel versions had this bug, but aren't sure if/when RedHat added the fixes in.

Can you upload the relevant slurmd log file (the one that stopped logging) and check your filesystem space? Also, Alex said that when he did training at your site he suggested setting up logrotate. Have you done that? Maybe that's what you were seeing?
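
A minimal logrotate rule along these lines would do it (a sketch, not a SchedMD-supplied config; the log path is an assumption and should match SlurmdLogFile in slurm.conf):

/var/log/slurm/slurmd.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}

The copytruncate directive avoids having to signal or restart slurmd during rotation.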
Comment 7 GSK-ONYX-SLURM 2018-05-04 04:32:32 MDT
See attached files for the latest incident.

Control daemon log since last restart on Tue 2018-04-24 08:04:40 BST

Slurmd daemon log since last restart on Thu 2018-04-26 08:10:40 BST

See the slurmd log at

[2018-05-03T13:44:19.018] [157205] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-03T16:51:59.325] error: accept: Bad address
[2018-05-03T17:27:03.990] error: Munge decode failed: Expired credential

17:27 BST is when the queue was resumed.

I also have another example of this from another server as well.
Comment 8 GSK-ONYX-SLURM 2018-05-04 04:34:15 MDT
Created attachment 6767 [details]
Control daemon log.

Control daemon log.
Comment 9 GSK-ONYX-SLURM 2018-05-04 04:38:26 MDT
Created attachment 6768 [details]
Slurm daemon log

Slurm daemon log.
Comment 10 Marshall Garey 2018-05-04 09:49:14 MDT
I'm noticing a very large number of authentication errors in the slurmd. For example:

[2018-05-03T17:27:04.139] error: slurm_receive_msg_and_forward: Protocol authentication error
[2018-05-03T17:27:04.139] ENCODED: Thu May 03 13:51:33 2018
[2018-05-03T17:27:04.139] DECODED: Thu May 03 17:27:04 2018

[2018-05-03T17:27:04.139] error: authentication: Expired credential 

I also see this error a few times:

[2018-05-04T04:16:51.781] error: accept: Bad address

Can you check the following:

- Validate the munge credentials using munge tools on the controller and compute nodes.
- Are the clocks in sync?
- Have you updated Slurm recently?
- Have you changed your slurm.conf at all recently? If so, then what changed? Did you just issue a scontrol reconfigure or did you restart the various daemons after the change?


I think the cgroup error is unrelated. However, to assist with that problem, can you also upload the output of lscgroup on the afflicted compute node?
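
Those checks can be run along these lines (a sketch; the hostname comes from this ticket, and MUNGE tools plus ssh access are assumed):

munge -n | unmunge                           # credential round-trip on the local node
munge -n | ssh uk1salx00553 unmunge          # encode on the controller, decode on the compute node
date; ssh uk1salx00553 date                  # rough clock comparison
ssh uk1salx00553 'lscgroup | grep slurm'     # Slurm cgroups on the afflicted node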
Comment 11 GSK-ONYX-SLURM 2018-05-04 10:56:56 MDT
All those munge errors are produced when the slurmd daemon "wakes up" after it's hung and we've resumed it.  I'm assuming they were in-transit comms that have to be discarded because they are no longer valid.  Maybe this is where jobs were queued to uk1salx00553 but then requeued (we see the requeue action in the logs).

To answer your other questions:

Munge appears to be working correctly.

Yes, the clocks are in sync.  We use NTP in our networks.

SLURM version is 17.02.7.  That's what we installed originally.  We haven't yet upgraded to 17.11.5 as we are still testing that in our test / dev environment.

I changed our slurm.conf today, adding an existing known node to an existing queue, and ran a reconfigure.  Any other changes to slurm.conf were made prior to the last full restarts.

I'm assuming you only want lscgroup for slurm.  If you want a full lscgroup then let me know.

uk1salx00553 (The Lion): lscgroup | grep slurm
blkio:/system.slice/slurmd.service
cpu,cpuacct:/system.slice/slurmd.service
freezer:/slurm
freezer:/slurm/uid_62356
freezer:/slurm/uid_62356/job_111280
memory:/slurm
memory:/slurm/system
devices:/system.slice/slurmd.service
devices:/slurm
uk1salx00553 (The Lion):
uk1salx00553 (The Lion):
uk1salx00553 (The Lion): squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            164819    uk_hpc Percolat ll289546  R 1-02:07:54      1 uk1salx00552
            167446    uk_hpc Danirixi ll289546  R      29:14      1 uk1salx00552
uk1salx00553 (The Lion):
uk1salx00553 (The Lion):
Comment 12 Marshall Garey 2018-05-04 17:16:48 MDT
> All those munge errors are produced when the slurmd deamon "wakes up" after
> its hung and we've resumed it.  I'm assuming they were in transit comms that
> have to be discarded because they are no longer valid.  Maybe this is where
> jobs were queued to uk1salx00553 but then requeued (we see the requeue
> action in the logs).

I agree.

> Munge appears to be working correctly.
>
> Yes the clocks are in synch.  We use ntp in our networks.
>
> SLURM version is 17.02.7.  That's what we installed originally.  We haven't yet upgraded to 17.11.5 as we are still testing that in our test / dev environment.


Thanks - that all sounds like it’s working correctly.

> I changed our slurm.conf today, add an existing known node to an existing
> queue and reconfigure.  Any changes to slurm.conf were prior to the last
> full restarts.
Okay - the change to the partition is just fine with a reconfigure. I just wanted to make sure you hadn’t changed a node definition and called reconfigure, since changing a node definition requires a full restart.
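
As an illustration of that distinction (a sketch; the partition and node names come from this ticket, and the systemd service names are assumptions based on the units shown later):

# Adding a node to an existing partition in slurm.conf only needs a reconfigure:
#   PartitionName=uk_hpc Nodes=uk1salx00552,uk1salx00553 ...
scontrol reconfigure
# Changing a NodeName= definition (CPUs, RealMemory, etc.) requires restarting the daemons:
systemctl restart slurmctld     # on the controller
systemctl restart slurmd        # on the compute nodes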

> I'm assuming you only want lscgroup for slurm.  If you want a full lscgroup
> then let me know.

Yes, just for Slurm. I was looking to see if there were lots of leftover cgroups that hadn’t gotten cleaned up properly, since that has happened for others with this problem. However, I only see one job cgroup that hasn’t gotten cleaned up.

Look at bug 5082 to keep track of progress on this cgroup bug - I've been working on it and have posted updates there.


I also noticed quite a few “Socket timed out” error messages, such as this one:
[2018-05-04T04:09:53.977] error: slurm_receive_msgs: Socket timed out on send/recv operation

I’m wondering if there are other issues in addition to the cgroup bug. Have you noticed client commands being unresponsive? Can you also upload a slurm.conf and the output of sdiag?
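
If responsiveness becomes a question, a couple of quick checks along these lines can help (a sketch; the sampling interval is arbitrary):

time squeue -h > /dev/null      # client commands should return almost instantly on a cluster this size
sdiag > sdiag.before
sleep 3600                      # let an hour of scheduling activity accumulate
sdiag > sdiag.after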
Comment 13 GSK-ONYX-SLURM 2018-05-08 08:33:06 MDT
Sorry for the delay.  UK Bank Holiday.

I will attach sdiag and slurm.conf.

I came in this morning to find SLURM queues down on both our main HPC servers, uk1salx00553 and uk1salx00552.  

uk1salx00552      1         uk_hpc*       down*   48   48:1:1 257526     2036      1   (null) Not responding
uk1salx00552      1 uk_columbus_tst       down*   48   48:1:1 257526     2036      1   (null) Not responding
uk1salx00553      1         uk_hpc*       down*   48   48:1:1 257526     2036      1   (null) Not responding
uk1salx00553      1 uk_columbus_tst       down*   48   48:1:1 257526     2036      1   (null) Not responding

Looking at the end of the 552 slurmd log shows:

[2018-05-04T18:19:47.471] [167485] error: task/cgroup: unable to add task[pid=6681] to memory cg '(null)'
[2018-05-04T18:19:47.473] [167485] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.509] [167487] error: task/cgroup: unable to add task[pid=6691] to memory cg '(null)'
[2018-05-04T18:19:47.511] [167487] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.527] [167491] error: task/cgroup: unable to add task[pid=6700] to memory cg '(null)'
[2018-05-04T18:19:47.532] [167491] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.542] [167488] error: task/cgroup: unable to add task[pid=6705] to memory cg '(null)'
[2018-05-04T18:19:47.544] [167488] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.564] [167486] error: task/cgroup: unable to add task[pid=6711] to memory cg '(null)'
[2018-05-04T18:19:47.566] [167486] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.575] [167489] error: task/cgroup: unable to add task[pid=6713] to memory cg '(null)'
[2018-05-04T18:19:47.577] [167489] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-06T21:05:57.891] [164819] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256

and the log on 553 is similar:

[2018-05-04T18:19:47.500] [167538] error: task/cgroup: unable to add task[pid=11468] to memory cg '(null)'
[2018-05-04T18:19:47.502] [167538] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.550] [167535] error: task/cgroup: unable to add task[pid=11471] to memory cg '(null)'
[2018-05-04T18:19:47.552] [167535] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.568] [167527] error: task/cgroup: unable to add task[pid=11474] to memory cg '(null)'
[2018-05-04T18:19:47.571] [167539] error: task/cgroup: unable to add task[pid=11481] to memory cg '(null)'
[2018-05-04T18:19:47.572] [167539] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.573] [167527] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.578] [167533] error: task/cgroup: unable to add task[pid=11484] to memory cg '(null)'
[2018-05-04T18:19:47.579] [167536] error: task/cgroup: unable to add task[pid=11485] to memory cg '(null)'
[2018-05-04T18:19:47.580] [167533] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.580] [167536] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.584] [167540] error: task/cgroup: unable to add task[pid=11488] to memory cg '(null)'
[2018-05-04T18:19:47.585] [167540] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.600] [167537] error: task/cgroup: unable to add task[pid=11489] to memory cg '(null)'
[2018-05-04T18:19:47.601] [167537] task_p_pre_launch: Using sched_affinity for tasks

So we took the decision to rotate the slurmd logs, as they were quite big, and restart the slurmd service.

Bad move.  The slurmd service failed on both 552 and 553 with:

[2018-05-08T12:08:16.049] error: plugin_load_from_file: dlopen(/usr/local/slurm/lib64/slurm/proctrack_cgroup.so): /usr/local/slurm/lib64/slurm/proctrack_cgroup.so: cannot read file data: Cannot allocate memory
[2018-05-08T12:08:16.049] error: Couldn't load specified plugin name for proctrack/cgroup: Dlopen of plugin file failed
[2018-05-08T12:08:16.049] error: cannot create proctrack context for proctrack/cgroup
[2018-05-08T12:08:16.049] error: slurmd initialization failed

We eventually had to do a "cgclear" and were then able to get the slurmd service to restart.

However we still now have:

[2018-05-08T15:17:57.999] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm' : No space left on device
[2018-05-08T15:17:57.999] error: system cgroup: unable to build slurm cgroup for ns memory: No space left on device
[2018-05-08T15:17:58.000] error: Resource spec: unable to initialize system memory cgroup
[2018-05-08T15:17:58.000] error: Resource spec: system cgroup memory limit disabled

occurring in the 553 log file.
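
For future reference, leftover empty Slurm cgroups can usually be removed one by one with rmdir rather than a full cgclear (a sketch; rmdir only succeeds on empty cgroups, and slurmd should be stopped first):

systemctl stop slurmd
for ctrl in freezer memory cpuset devices; do
    find /sys/fs/cgroup/$ctrl/slurm -mindepth 1 -depth -type d -exec rmdir {} \; 2>/dev/null
done
systemctl start slurmd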
Comment 14 GSK-ONYX-SLURM 2018-05-08 08:33:59 MDT
Created attachment 6790 [details]
Slurm configuration

Slurm.conf
Comment 15 GSK-ONYX-SLURM 2018-05-08 08:34:35 MDT
Created attachment 6791 [details]
Sdiag information

Sdiag information
Comment 16 Marshall Garey 2018-05-08 10:15:13 MDT
I know this is a clunky workaround, but can you try rebooting the node? In every other case that has fixed the problem.

The kernel bug known to cause this is basically running out of cgroup IDs - after 65535, no more can be created, despite not actually having that many active cgroups. If you're experiencing this kernel bug, I suspect that

cat /proc/cgroups

will show far fewer than 64k cgroups, and that a node reboot will fix the issue. It would be good to know whether we're leaking 64k cgroups, so if you could get the output of cat /proc/cgroups as well as lscgroup (again) on an afflicted node, that would be helpful - cat /proc/cgroups shows the actual number of cgroups in use.
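
A quick way to sample that (a sketch; in /proc/cgroups the columns are subsys_name, hierarchy, num_cgroups, enabled):

awk '$1 == "memory" {print "memory cgroups in use:", $3}' /proc/cgroups
lscgroup | grep -c '^memory:/slurm'      # count of Slurm memory cgroups still present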

I also have a patch pending review that prevents freezer cgroups from leaking, since we've seen that happen. (see bug 5082) Since I personally haven't been able to reproduce the exact error you're seeing, I don't know if that patch will fix this "Unable to add task to memory cg"/"No space left on device" error, but when it's available I'll let you know.
Comment 17 GSK-ONYX-SLURM 2018-05-08 10:33:57 MDT
Here's the cgroups info.  We'll look at scheduling a reboot... as this is a production server, that might take a while to schedule.  There is some additional info below as well.

[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  0       1       1
cpu     0       1       1
cpuacct 0       1       1
memory  15      1       1
devices 17      3       1
freezer 16      2       1
net_cls 0       1       1
blkio   0       1       1
perf_event      0       1       1
hugetlb 0       1       1
pids    0       1       1
net_prio        0       1       1
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# lscgroup | grep slurm
freezer:/slurm
devices:/slurm
[root@uk1salx00553 slurm]#


[root@uk1salx00552 slurm]# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  0       1       1
cpu     0       1       1
cpuacct 0       1       1
memory  13      4       1
devices 15      2       1
freezer 14      2       1
net_cls 0       1       1
blkio   0       1       1
perf_event      0       1       1
hugetlb 0       1       1
pids    0       1       1
net_prio        0       1       1
[root@uk1salx00552 slurm]#
[root@uk1salx00552 slurm]# lscgroup | grep slurm
memory:/slurm
memory:/slurm/uid_0
memory:/slurm/system
freezer:/slurm
devices:/slurm
[root@uk1salx00552 slurm]#


Comparing these two servers, 552 and 553, on 552

[root@uk1salx00552 slurm]# ls /sys/fs/cgroup/memory/slurm
cgroup.clone_children       memory.kmem.max_usage_in_bytes      memory.limit_in_bytes            memory.numa_stat            memory.use_hierarchy
cgroup.event_control        memory.kmem.slabinfo                memory.max_usage_in_bytes        memory.oom_control          notify_on_release
cgroup.procs                memory.kmem.tcp.failcnt             memory.memsw.failcnt             memory.pressure_level       system
memory.failcnt              memory.kmem.tcp.limit_in_bytes      memory.memsw.limit_in_bytes      memory.soft_limit_in_bytes  tasks
memory.force_empty          memory.kmem.tcp.max_usage_in_bytes  memory.memsw.max_usage_in_bytes  memory.stat                 uid_0
memory.kmem.failcnt         memory.kmem.tcp.usage_in_bytes      memory.memsw.usage_in_bytes      memory.swappiness
memory.kmem.limit_in_bytes  memory.kmem.usage_in_bytes          memory.move_charge_at_immigrate  memory.usage_in_bytes
[root@uk1salx00552 slurm]#

but on 553,

[root@uk1salx00553 slurm]# ls /sys/fs/cgroup/memory/slurm
ls: cannot access /sys/fs/cgroup/memory/slurm: No such file or directory
[root@uk1salx00553 slurm]#

Also, when we restart slurmd, a leftover slurmstepd is reported:

[root@uk1salx00553 slurm]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-05-08 17:12:04 BST; 14min ago
  Process: 5216 ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 5218 (slurmd)
   CGroup: /system.slice/slurmd.service
           ├─ 5218 /usr/local/slurm/sbin/slurmd
           └─30720 slurmstepd: [157108]

May 08 17:12:03 uk1salx00553.corpnet2.com systemd[1]: Starting Slurm node daemon...
May 08 17:12:04 uk1salx00553.corpnet2.com systemd[1]: Started Slurm node daemon.
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# ps -ef | grep slurm
root      5218     1  0 17:12 ?        00:00:00 /usr/local/slurm/sbin/slurmd
root     12715  4921  0 17:27 pts/7    00:00:00 grep slurm
root     30720     1  0 May03 ?        00:00:00 slurmstepd: [157108]
[root@uk1salx00553 slurm]#

but there are no queued jobs

[root@uk1salx00553 slurm]# /usr/local/slurm/bin/squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[root@uk1salx00553 slurm]# 

and jobid 157108 finished days ago...

[root@uk1salx00553 slurm]# /usr/local/slurm/bin/sacct -j 157108
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
153212_201   COL-H004M+ uk_columb+                     1  COMPLETED      0:0
153212_201.+      batch                                1  COMPLETED      0:0
[root@uk1salx00553 slurm]# /usr/local/slurm/bin/sacct -j 157108 -o start,end,nodelist
              Start                 End        NodeList
------------------- ------------------- ---------------
2018-05-03T13:43:20 2018-05-03T13:44:15    uk1salx00553
2018-05-03T13:43:20 2018-05-03T13:44:15    uk1salx00553

Is that job 157108 stuck somehow and preventing the slurmd process starting properly?
Comment 18 GSK-ONYX-SLURM 2018-05-08 10:52:26 MDT
From the original slurmd log it looks like job 157108 was the last successfully scheduled job before we started getting the initial "cannot create cgroup" message at around 13:44 on 3rd May.

Should we try killing the slurmstepd process and restarting the daemon to see if that changes anything?
Comment 19 Marshall Garey 2018-05-08 11:04:33 MDT
(In reply to GSK-EIS-SLURM from comment #13)
> Bad move.  The slurmd service failed on both 552 and 553 with:
Did the slurmd generate a core file? If so, can you get a full backtrace from the slurmd?

thread apply all bt full
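
If a core file was generated, the full backtrace can also be captured non-interactively along these lines (a sketch; the slurmd binary path appears elsewhere in this ticket, and the core file location is an assumption):

gdb /usr/local/slurm/sbin/slurmd /path/to/core -batch -ex 'thread apply all bt full' > slurmd-bt.txt 2>&1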

(In reply to GSK-EIS-SLURM from comment #17)
> Here's the cgroups info.  We'll look at scheduling a reboot... as this is a
> production server that might take a while to schedule.

Yes, that's the problem with a reboot being a workaround.

> There some additional info below as well.
> 
> [root@uk1salx00553 slurm]#
> [root@uk1salx00553 slurm]# cat /proc/cgroups
> #subsys_name    hierarchy       num_cgroups     enabled
> cpuset  0       1       1
> cpu     0       1       1
> cpuacct 0       1       1
> memory  15      1       1
> devices 17      3       1
> freezer 16      2       1
> net_cls 0       1       1
> blkio   0       1       1
> perf_event      0       1       1
> hugetlb 0       1       1
> pids    0       1       1
> net_prio        0       1       1
> [root@uk1salx00553 slurm]#
> [root@uk1salx00553 slurm]# lscgroup | grep slurm
> freezer:/slurm
> devices:/slurm
> [root@uk1salx00553 slurm]#
> 
> 
> [root@uk1salx00552 slurm]# cat /proc/cgroups
> #subsys_name    hierarchy       num_cgroups     enabled
> cpuset  0       1       1
> cpu     0       1       1
> cpuacct 0       1       1
> memory  13      4       1
> devices 15      2       1
> freezer 14      2       1
> net_cls 0       1       1
> blkio   0       1       1
> perf_event      0       1       1
> hugetlb 0       1       1
> pids    0       1       1
> net_prio        0       1       1
> [root@uk1salx00552 slurm]#
> [root@uk1salx00552 slurm]# lscgroup | grep slurm
> memory:/slurm
> memory:/slurm/uid_0
> memory:/slurm/system
> freezer:/slurm
> devices:/slurm
> [root@uk1salx00552 slurm]#

That's not very many active cgroups - it should be able to create more cgroups. (Should being the operative word here.)

Unfortunately, I guess this is after you used cgclear, so I don't know if Slurm leaked cgroups or not. If this happens on another node, get the output of cat /proc/cgroups on that one, too.

> 
> Comparing these two servers, 552 and 553, on 552
> 
> [root@uk1salx00552 slurm]# ls /sys/fs/cgroup/memory/slurm
> cgroup.clone_children       memory.kmem.max_usage_in_bytes     
> memory.limit_in_bytes            memory.numa_stat           
> memory.use_hierarchy
> cgroup.event_control        memory.kmem.slabinfo               
> memory.max_usage_in_bytes        memory.oom_control         
> notify_on_release
> cgroup.procs                memory.kmem.tcp.failcnt            
> memory.memsw.failcnt             memory.pressure_level       system
> memory.failcnt              memory.kmem.tcp.limit_in_bytes     
> memory.memsw.limit_in_bytes      memory.soft_limit_in_bytes  tasks
> memory.force_empty          memory.kmem.tcp.max_usage_in_bytes 
> memory.memsw.max_usage_in_bytes  memory.stat                 uid_0
> memory.kmem.failcnt         memory.kmem.tcp.usage_in_bytes     
> memory.memsw.usage_in_bytes      memory.swappiness
> memory.kmem.limit_in_bytes  memory.kmem.usage_in_bytes         
> memory.move_charge_at_immigrate  memory.usage_in_bytes
> [root@uk1salx00552 slurm]#
> 
> but on 553,
> 
> [root@uk1salx00553 slurm]# ls /sys/fs/cgroup/memory/slurm
> ls: cannot access /sys/fs/cgroup/memory/slurm: No such file or directory
> [root@uk1salx00553 slurm]#

Is that because you deleted the cgroups with cgclear? It should get re-created with a restart of the slurmd.

> Also, when we restart slurmd there is a slurmstepd reported
...
> Is that job 157108 stuck somehow and preventing the slurmd process starting
> properly?

Deadlocked or hanging slurmstepd's shouldn't prevent the slurmd from starting.

Without seeing the slurmd log file of when that job completed, I can only guess what happened. The job clearly isn't stuck, since its state is COMPLETED. If the job was stuck, it would still be in the slurmctld's state, and you could view it with scontrol show job 157108. Before you kill it, can you get a backtrace from it?

gdb attach <slurmstepd pid>
thread apply all bt

There is a known bug that causes a slurmstepd to deadlock. That is fixed in bug 5103 and will be in 17.11.6.
Comment 20 GSK-ONYX-SLURM 2018-05-08 11:15:34 MDT
[root@uk1salx00553 ~]# /usr/local/slurm/bin/scontrol show job 157108
slurm_load_jobs error: Invalid job id specified
[root@uk1salx00553 ~]#

[root@uk1salx00553 ~]# gdb attach 30720
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 30720
Reading symbols from /usr/local/slurm/sbin/slurmstepd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 30723]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_compat.so.2
Reading symbols from /usr/lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnsl.so.1
Reading symbols from /usr/lib64/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_nis.so.2
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/select_cons_res.so
Reading symbols from /usr/local/slurm/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/switch_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/gres_gpu.so
Reading symbols from /usr/local/slurm/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/core_spec_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_affinity.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_affinity.so
Reading symbols from /usr/local/slurm/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/crypto_munge.so
Reading symbols from /usr/local/slurm/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/job_container_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/mpi_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/mpi_none.so
Reading symbols from /usr/lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_dns.so.2
Reading symbols from /usr/lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libresolv.so.2
0x00002b35ef5bdf57 in pthread_join () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-17.02.7-1.el7.x86_64
(gdb) thread apply all bt

Thread 2 (Thread 0x2b35f2567700 (LWP 30723)):
#0  0x00002b35ef8d6eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b35ef93860d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00002b35ef9385bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00002b35ef938662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00002b35ef5bce38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b35ef8c934d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x2b35ee74e5c0 (LWP 30720)):
#0  0x00002b35ef5bdf57 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x000000000042a787 in stepd_cleanup (msg=0xc209f0, job=0xc1fe20, cli=0xc1cab0, self=0x0, rc=0, only_mem=false) at slurmstepd.c:200
#2  0x000000000042a6ce in main (argc=1, argv=0x7fffc2271168) at slurmstepd.c:185
(gdb)
Comment 21 Marshall Garey 2018-05-08 11:18:34 MDT
Great - so we know that job isn't stuck since it isn't in the slurmctld state. That slurmstepd backtrace looks exactly like the deadlock bug that is fixed in bug 5103. Feel free to kill -9 that stepd.
Comment 22 GSK-ONYX-SLURM 2018-05-09 07:01:48 MDT
Both the uk1salx00552 and uk1salx00553 servers have been rebooted and SLURM has started normally.  The missing /sys/fs/cgroup/memory/slurm cgroup for 553 now exists.

Scheduling a couple of test jobs was successful.

Is there any information we should collect at this point in time ready for comparison when the issue returns?
Comment 23 Marshall Garey 2018-05-09 09:45:53 MDT
(In reply to GSK-EIS-SLURM from comment #22)
> Both the uk1saxl00552 and uk1salx00553 servers have been rebooted and SLURM
> has started normally.  The missing /sys/fs/cgroup/memory/slurm cgroup for
> 553 now exists.
> 
> Scheduling a couple of test jobs was successful.
> 
> Is there any information we should collect at this point in time ready for
> comparison when the issue returns?

I don't think we need anything right now. If you do hit this problem again, run cat /proc/cgroups and see how many cgroups there are - if there aren't 64k, you should still be able to create new ones. Also run lscgroup to see if Slurm has leaked any cgroups.

See my latest post on bug 5082 - none of the fixes for this "No space left on device" bug when trying to create memory cgroups have been backported to Linux 3.10 - they're all on Linux 4.<something>. One of the sites has opened a ticket with RedHat. I'd keep bugging RedHat to backport the fixes.

If it's okay with you, can I close this as a duplicate of bug 5082? Feel free to CC yourself on that ticket.
Comment 24 Marshall Garey 2018-05-09 09:53:33 MDT
Also, I took a look at your slurm.conf and sdiag output. The sdiag output was only for a very short amount of time, so it didn't give me very much information. The little information that it did give showed the system working fine. If you find that your system is slow and need tuning recommendations, feel free to open a new ticket, and upload the output of sdiag and your current slurm.conf on that one. One thing to watch out for is the "Socket timed out on send/recv" messages.
Comment 25 GSK-ONYX-SLURM 2018-05-11 05:24:05 MDT
Yes, sure, please go ahead and close.  I couldn't CC myself on that other ticket.  It seemed to want me to complete the Version Fixed field, as it says it is mandatory.  I didn't think it appropriate for me to change any fields other than adding a CC, so I didn't.
Comment 26 Marshall Garey 2018-05-11 17:50:54 MDT
Closing as a duplicate of bug 5082.

I've added the following email to the CC list on bug 5082:

GSK-EIS-SLURM@gsk.com

Please feel free to comment on that bug.

*** This ticket has been marked as a duplicate of ticket 5082 ***