Bug 3890

Summary:	Cgroup ConstrainRAMSpace or ConstrainSwapSpace not enforced
Product:	Slurm	Reporter:	Ole.H.Nielsen <Ole.H.Nielsen>
Component:	Limits	Assignee:	Dominik Bartkiewicz <bart>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	bart, jennyw
Version:	16.05.10
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=5504 https://bugs.schedmd.com/show_bug.cgi?id=5507
Site:	DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2017-06-13 07:47:12 MDT

We constrain user jobs in cgroup.conf:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

A user's job has nevertheless been able to grossly exceed the memory limits, and the compute node is paging wildly.  I logged into the node and captured the user's processes:

# ps -eo "user,pid,vsz,rss,comm"
USER       PID    VSZ   RSS COMMAND
...
tols      1479 113396   316 slurm_script
tols      1580  97016   364 mpiexec
tols      1582 5595692 2696984 gpaw-python
tols      1583 5596696 2696536 gpaw-python
tols      1584 5594084 2596748 gpaw-python
tols      1585 5589220 2579076 gpaw-python
tols      1586 5595720 2674540 gpaw-python
tols      1587 5596612 2514800 gpaw-python
tols      1588 5593968 2768952 gpaw-python
tols      1589 5589160 2641000 gpaw-python

The RSS is correctly limited below the node RAM/core at 2800M, but the VSZ shows that the job uses an additional 3000M per process.  I thought this would be prohibited by the Cgroup setup.
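As a quick cross-check of the figures above, the RSS column (which ps reports in KiB) can be summed over the job's processes and compared with the job's memory allocation. This is only an illustrative sketch: the helper name total_rss_kib is hypothetical, and the 22400M figure is taken from the scontrol output further down (8 CPUs at 2800M each).

```python
# ps output copied from this report (columns: user, pid, vsz, rss, comm).
PS_OUTPUT = """\
tols      1479 113396   316 slurm_script
tols      1580  97016   364 mpiexec
tols      1582 5595692 2696984 gpaw-python
tols      1583 5596696 2696536 gpaw-python
tols      1584 5594084 2596748 gpaw-python
tols      1585 5589220 2579076 gpaw-python
tols      1586 5595720 2674540 gpaw-python
tols      1587 5596612 2514800 gpaw-python
tols      1588 5593968 2768952 gpaw-python
tols      1589 5589160 2641000 gpaw-python
"""


def total_rss_kib(ps_text, command):
    """Sum the RSS column (KiB) over all processes with the given command name."""
    total = 0
    for line in ps_text.splitlines():
        fields = line.split()
        # Only accept well-formed rows: user, pid, vsz, rss, comm.
        if len(fields) == 5 and fields[4] == command:
            total += int(fields[3])
    return total


# Job allocation from scontrol: NumCPUs=8, MinMemoryCPU=2800M.
JOB_LIMIT_MIB = 8 * 2800

rss_mib = total_rss_kib(PS_OUTPUT, "gpaw-python") / 1024
print(f"gpaw-python RSS total: {rss_mib:.0f} MiB, job allocation: {JOB_LIMIT_MIB} MiB")
```

Note that VSZ counts reserved address space, so on its own it does not show how much memory is actually resident or swapped out.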

The slurmd.log shows some information about this job:

[2017-06-13T13:42:56.806] task_p_slurmd_batch_request: 84650
[2017-06-13T13:42:56.806] task/affinity: job 84650 CPU input mask for node: 0xFF
[2017-06-13T13:42:56.806] task/affinity: job 84650 CPU final HW mask for node: 0xFF
[2017-06-13T13:42:56.807] _run_prolog: run job script took usec=3
[2017-06-13T13:42:56.807] _run_prolog: prolog with lock for job 84650 ran for 0 seconds
[2017-06-13T13:42:56.807] Launching batch job 84650 for UID 15265
[2017-06-13T13:42:56.828] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.836] [84650] error: task/cgroup: unable to add task[pid=1479] to memory cg '(null)'
[2017-06-13T13:42:56.837] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.838] [84650] error: jobacct_gather/cgroup: unable to instanciate user 15265 memory cgroup
[2017-06-13T13:42:56.840] [84650] task_p_pre_launch: Using sched_affinity for tasks
[2017-06-13T15:04:08.422] [84650] error: *** JOB 84650 ON b099 CANCELLED AT 2017-06-13T15:04:08 ***
[2017-06-13T15:04:17.927] [84650] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2017-06-13T15:04:18.004] [84650] done with job

I wonder about the meaning of the cgroup error "No space left on device"?
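The "unable to add task[pid=1479] to memory cg '(null)'" line above suggests that the job's processes were never placed in a Slurm memory cgroup, which would explain why no limit was applied. One way to check this on a node, sketched below, is to read /proc/<pid>/cgroup for one of the job's processes. The function names and the sample texts are hypothetical; the line format (hierarchy-ID:controller-list:path) is the cgroup v1 layout, and the /slurm/ prefix is inferred from the path in the log.

```python
def memory_cgroup_path(proc_cgroup_text):
    """Return the memory-controller path from /proc/<pid>/cgroup text.

    Each cgroup v1 line has the form 'hierarchy-ID:controller-list:path'.
    Returns None if no line lists the memory controller.
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) == 3 and "memory" in parts[1].split(","):
            return parts[2]
    return None


def is_in_slurm_memory_cgroup(proc_cgroup_text):
    """True if the process sits somewhere below /slurm/ in the memory hierarchy."""
    path = memory_cgroup_path(proc_cgroup_text)
    return path is not None and path.startswith("/slurm/")


# Hypothetical samples, for illustration only (not captured from the node).
CONFINED = "4:memory:/slurm/uid_15265/job_84650/step_batch\n3:cpuset:/slurm/uid_15265\n"
UNCONFINED = "4:memory:/\n3:cpuset:/slurm/uid_15265\n"

print(is_in_slurm_memory_cgroup(CONFINED))
print(is_in_slurm_memory_cgroup(UNCONFINED))
```

On a live node the input would come from something like open(f"/proc/{pid}/cgroup").read() for a PID belonging to the job.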

Job information (while it was still running):

# scontrol show job 84650
JobId=84650 JobName=sbatch
   UserId=tols(15265) GroupId=camdvip(1250) MCS_label=N/A
   Priority=32577 Nice=0 Account=camdvip QOS=fysik_qos
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:36:19 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2017-06-13T12:52:36 EligibleTime=2017-06-13T12:52:36
   StartTime=2017-06-13T13:42:56 EndTime=2017-06-14T01:42:56 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon8 AllocNode:Sid=sylg:40406
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=b099
   BatchHost=b099
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=22400M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2800M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/tols/cmr
   StdErr=/home/niflheim/tols/cmr/slurm-84650.out
   StdIn=/dev/null
   StdOut=/home/niflheim/tols/cmr/slurm-84650.out
   Power=

Question: Can you help me figure out if I made a configuration error, or if something else is wrong here?

Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2017-06-13 07:48:38 MDT

The process tree looked pretty normal to me:

# pstree -p
systemd(1)─┬─NetworkManager(727)─┬─dhclient(803)
           │                     ├─{NetworkManager}(750)
           │                     └─{NetworkManager}(752)
      ...
           ├─slurmd(20177)
           ├─slurmstepd(1475)─┬─slurm_script(1479)───mpiexec(1580)─┬─gpaw-python(1582)───{gpaw-python}(1594)
           │                  │                                    ├─gpaw-python(1583)───{gpaw-python}(1596)
           │                  │                                    ├─gpaw-python(1584)───{gpaw-python}(1593)
           │                  │                                    ├─gpaw-python(1585)───{gpaw-python}(1598)
           │                  │                                    ├─gpaw-python(1586)───{gpaw-python}(1591)
           │                  │                                    ├─gpaw-python(1587)───{gpaw-python}(1595)
           │                  │                                    ├─gpaw-python(1588)───{gpaw-python}(1592)
           │                  │                                    ├─gpaw-python(1589)───{gpaw-python}(1597)
           │                  │                                    └─{mpiexec}(1581)
           │                  ├─{slurmstepd}(1476)
           │                  ├─{slurmstepd}(1477)
           │                  └─{slurmstepd}(1478)

Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2017-06-15 00:29:29 MDT

In order to fix the error: 
xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
I've rebooted the node.  Jobs have since started on the node without causing such error messages.  I have no idea how the /sys/fs/cgroup filesystem might have filled up.

FYI: Our cgroup.conf file is:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

Comment 3 Dominik Bartkiewicz 2017-06-15 04:31:33 MDT

Hi

This may be one of these kernel bugs, e.g.:
https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

We have seen a similar problem before. Removing the "CgroupReleaseAgentDir" line from the config, and using only Slurm's self-cleaning, masks this behavior.

Dominik
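
A rough way to watch for this condition on a node is to track how many memory cgroups the kernel reports in /proc/cgroups. A count that keeps growing while few jobs are running may hint at cgroups that are not being released. This is only a sketch: the function name and the sample text are hypothetical, and linking a high count to the kernel commits above is an assumption, not something confirmed in this report.

```python
def cgroup_counts(proc_cgroups_text):
    """Parse /proc/cgroups text into {subsystem_name: num_cgroups}.

    The file's columns are: subsys_name, hierarchy, num_cgroups, enabled.
    Header lines start with '#' and are skipped.
    """
    counts = {}
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 4:
            counts[fields[0]] = int(fields[2])
    return counts


# Hypothetical sample, for illustration only (not captured from the node).
SAMPLE = """\
#subsys_name hierarchy num_cgroups enabled
cpuset 5 12 1
memory 4 40 1
"""

print(cgroup_counts(SAMPLE).get("memory"))
```

On a live node the input would be the contents of /proc/cgroups, sampled periodically and compared with the number of running jobs.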

Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2017-06-15 05:33:29 MDT

(In reply to Dominik Bartkiewicz from comment #3)
> This can be one of kernel bug eg.:
> https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
> https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Thanks for identifying this as a Linux kernel bug.  I don't expect the CentOS 7.3 kernel will ever receive a backported patch?  

> We have seen similar problem before, removeing "CgroupReleaseAgentDir" line
> from config, and only using only  slurm self-cleaning  masks this behavior. 

Thanks for the workaround.  However, according to https://bugs.schedmd.com/show_bug.cgi?id=3853#c6 this requires Slurm 17.02.3 or later.  Would you agree?

For older Slurm releases (we run 16.05.10), is the only workaround to reboot the node in order to clear /sys/fs/cgroup?

/Ole

Comment 5 Dominik Bartkiewicz 2017-06-15 06:04:44 MDT

Hi

You are right.
This problem is solved by:
https://github.com/SchedMD/slurm/commit/24e2cb07e8e363f24
This commit is in 17.02.3 and above.

Dominik

Comment 6 Jenny Williams 2017-06-21 10:11:11 MDT

We see the same error here running Slurm 17.02.3.

Comment 7 Dominik Bartkiewicz 2017-06-23 08:19:14 MDT

Hi

Have you removed the "CgroupReleaseAgentDir" line from the config?
Which kernel version do you use?

Dominik

Comment 8 Dominik Bartkiewicz 2017-07-06 05:34:37 MDT

Hi

Any news on this?

Dominik

Comment 9 Jenny Williams 2017-07-06 05:34:44 MDT


I am out of the office.  If you have questions that need a reply the week of July 4th  please cc research@unc.edu.

Regards,
Jenny Williams
Systems Administrator
UNC Chapel Hill

Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2017-07-06 06:01:42 MDT

(In reply to Dominik Bartkiewicz from comment #7)
> Hi
> 
> Have you removed "CgroupReleaseAgentDir" line from config?

No, because we run Slurm 16.05.  Upgrade to 17.02 will be done soon.

> Which kernel version do you use?

CentOS 7.3 with these kernels on compute nodes:
3.10.0-514.2.2.el7.x86_64
3.10.0-514.6.1.el7.x86_64

The "error: xcgroup_instantiate: unable to create cgroup" message occurs with both of these kernels.

Comment 11 Ole.H.Nielsen@fysik.dtu.dk 2017-07-13 06:48:59 MDT

(In reply to Dominik Bartkiewicz from comment #7)
> Have you removed "CgroupReleaseAgentDir" line from config?
> Which kernel version do you use?

Status update: All nodes have now been upgraded to Slurm 17.02.6 and rebooted subsequently with CentOS 7.3 kernel 3.10.0-514.6.1.el7.x86_64.

The CgroupReleaseAgentDir line has been removed from cgroup.conf and the file now contains:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
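
A small check like the following could confirm on each node that the file has the intended settings. This is a sketch under stated assumptions: the function names are hypothetical, only the keys discussed in this report are examined, and keys are compared case-insensitively.

```python
def parse_cgroup_conf(text):
    """Parse 'Key=Value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def check_cgroup_conf(text):
    """Return a list of warnings for the settings discussed in this report."""
    settings = parse_cgroup_conf(text)
    warnings = []
    for key in ("ConstrainCores", "ConstrainRAMSpace", "ConstrainSwapSpace"):
        if settings.get(key.lower(), "no").lower() != "yes":
            warnings.append(f"{key} is not set to yes")
    if "cgroupreleaseagentdir" in settings:
        warnings.append("CgroupReleaseAgentDir is still set")
    return warnings


# The two versions of cgroup.conf quoted in this report.
OLD_CONF = """\
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
"""
NEW_CONF = """\
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
"""

print(check_cgroup_conf(OLD_CONF))
print(check_cgroup_conf(NEW_CONF))
```
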

On all nodes I have now searched for the previously experienced error message:

# grep /sys/fs/cgroup/memory/slurm/ /var/log/slurm/slurmd.log 

The result is zero hits :-)

I don't know whether this shows the absence of Cgroup problems.  We have only been running 17.02 for a few days at this time.

Can you possibly suggest some other search for Cgroup symptoms?  

So perhaps this case has been resolved by the upgrade to 17.02.6.
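
A somewhat broader search than the grep above, sketched here, would count every cgroup-related error line per job rather than matching only the memory cgroup path. The message fragments are taken from the slurmd.log excerpt in the description; the function name, the pattern labels, and the assumption that other cgroup failures use the same wording are all hypothetical.

```python
import re
from collections import Counter

# Message fragments copied from the slurmd.log excerpt in this report.
CGROUP_ERROR_PATTERNS = {
    "create_failed": re.compile(r"xcgroup_instantiate: unable to create cgroup"),
    "task_not_added": re.compile(r"task/cgroup: unable to add task"),
    "acct_failed": re.compile(r"jobacct_gather/cgroup: unable to"),
}

# Matches lines such as: [timestamp] [jobid] error: message
LINE_RE = re.compile(r"^\[(?P<time>[^\]]+)\] \[(?P<job>\d+)\] error: (?P<msg>.*)$")


def scan_slurmd_log(lines):
    """Count cgroup-related error lines, keyed by (job id, error kind)."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue
        for kind, pattern in CGROUP_ERROR_PATTERNS.items():
            if pattern.search(match.group("msg")):
                hits[(match.group("job"), kind)] += 1
    return hits


# Lines copied from the description of this bug.
SAMPLE_LOG = [
    "[2017-06-13T13:42:56.807] Launching batch job 84650 for UID 15265",
    "[2017-06-13T13:42:56.828] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device",
    "[2017-06-13T13:42:56.836] [84650] error: task/cgroup: unable to add task[pid=1479] to memory cg '(null)'",
    "[2017-06-13T13:42:56.837] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device",
    "[2017-06-13T13:42:56.838] [84650] error: jobacct_gather/cgroup: unable to instanciate user 15265 memory cgroup",
    "[2017-06-13T15:04:08.422] [84650] error: *** JOB 84650 ON b099 CANCELLED AT 2017-06-13T15:04:08 ***",
]

for (job, kind), count in sorted(scan_slurmd_log(SAMPLE_LOG).items()):
    print(job, kind, count)
```

On a node, the same function could be fed the open file object for /var/log/slurm/slurmd.log instead of the sample list.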

Comment 12 Dominik Bartkiewicz 2017-07-14 06:27:22 MDT

Hi

Not really, only what I put in comment 3. While searching I noticed that many people have docker/cgroup_memory related problems on different kernels. I can only hope that Slurm will work fine with this config. But I am sure that the "No space left on device" log message is a kernel bug, not a Slurm bug. Let me know if this happens again.

Dominik

Comment 13 Dominik Bartkiewicz 2017-07-27 03:32:03 MDT

Hi

I am closing this as "INFOGIVEN".
Please re-open if you have any questions.

Dominik