Bug 3890

Summary: Cgroup ConstrainRAMSpace or ConstrainSwapSpace not enforced
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Limits    Assignee: Dominik Bartkiewicz <bart>
Severity: 3 - Medium Impact    
Priority: --- CC: bart, jennyw
Version: 16.05.10   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5504
Site: DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2017-06-13 07:47:12 MDT
We constrain user jobs in cgroup.conf:

A user's job has nevertheless been able to grossly exceed the memory limits, and the compute node is paging wildly.  I logged into the node and captured the user's processes:

# ps -eo "user,pid,vsz,rss,comm"
tols      1479 113396   316 slurm_script
tols      1580  97016   364 mpiexec
tols      1582 5595692 2696984 gpaw-python
tols      1583 5596696 2696536 gpaw-python
tols      1584 5594084 2596748 gpaw-python
tols      1585 5589220 2579076 gpaw-python
tols      1586 5595720 2674540 gpaw-python
tols      1587 5596612 2514800 gpaw-python
tols      1588 5593968 2768952 gpaw-python
tols      1589 5589160 2641000 gpaw-python

The RSS is correctly limited to below the per-core RAM limit of 2800M, but the VSZ shows that each process uses an additional ~3000M of virtual memory.  I thought this would be prohibited by the cgroup setup.
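For reference, summing the per-rank RSS from the ps output above confirms that the resident set does stay under the job's aggregate limit of 8 tasks x 2800M (a quick shell check on the captured numbers; the memory cgroup constrains RSS, not VSZ):

```shell
# RSS values in KiB for the eight gpaw-python ranks, copied from ps above
rss="2696984 2696536 2596748 2579076 2674540 2514800 2768952 2641000"
total=0
for r in $rss; do total=$((total + r)); done
limit=$((8 * 2800 * 1024))   # 8 tasks x 2800 MiB, expressed in KiB
echo "total RSS: ${total} KiB, job limit: ${limit} KiB"
# prints: total RSS: 21168636 KiB, job limit: 22937600 KiB
```

So the aggregate RSS (~20.2 GiB) sits just under the 21.9 GiB cgroup limit, which is why the node thrashes without the job being killed.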

The slurmd.log shows some information about this job:

[2017-06-13T13:42:56.806] task_p_slurmd_batch_request: 84650
[2017-06-13T13:42:56.806] task/affinity: job 84650 CPU input mask for node: 0xFF
[2017-06-13T13:42:56.806] task/affinity: job 84650 CPU final HW mask for node: 0xFF
[2017-06-13T13:42:56.807] _run_prolog: run job script took usec=3
[2017-06-13T13:42:56.807] _run_prolog: prolog with lock for job 84650 ran for 0 seconds
[2017-06-13T13:42:56.807] Launching batch job 84650 for UID 15265
[2017-06-13T13:42:56.828] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.836] [84650] error: task/cgroup: unable to add task[pid=1479] to memory cg '(null)'
[2017-06-13T13:42:56.837] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.838] [84650] error: jobacct_gather/cgroup: unable to instanciate user 15265 memory cgroup
[2017-06-13T13:42:56.840] [84650] task_p_pre_launch: Using sched_affinity for tasks
[2017-06-13T15:04:08.422] [84650] error: *** JOB 84650 ON b099 CANCELLED AT 2017-06-13T15:04:08 ***
[2017-06-13T15:04:17.927] [84650] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2017-06-13T15:04:18.004] [84650] done with job

What does the cgroup error "No space left on device" mean here?

Job information (while it was still running):

# scontrol show job 84650
JobId=84650 JobName=sbatch
   UserId=tols(15265) GroupId=camdvip(1250) MCS_label=N/A
   Priority=32577 Nice=0 Account=camdvip QOS=fysik_qos
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:36:19 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2017-06-13T12:52:36 EligibleTime=2017-06-13T12:52:36
   StartTime=2017-06-13T13:42:56 EndTime=2017-06-14T01:42:56 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon8 AllocNode:Sid=sylg:40406
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2800M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

Question: Can you help me figure out if I made a configuration error, or if something else is wrong here?
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2017-06-13 07:48:38 MDT
The process tree looked pretty normal to me:

# pstree -p
           │                     ├─{NetworkManager}(750)
           │                     └─{NetworkManager}(752)
           │                  │                                    ├─gpaw-python(1583)───{gpaw-python}(1596)
           │                  │                                    ├─gpaw-python(1584)───{gpaw-python}(1593)
           │                  │                                    ├─gpaw-python(1585)───{gpaw-python}(1598)
           │                  │                                    ├─gpaw-python(1586)───{gpaw-python}(1591)
           │                  │                                    ├─gpaw-python(1587)───{gpaw-python}(1595)
           │                  │                                    ├─gpaw-python(1588)───{gpaw-python}(1592)
           │                  │                                    ├─gpaw-python(1589)───{gpaw-python}(1597)
           │                  │                                    └─{mpiexec}(1581)
           │                  ├─{slurmstepd}(1476)
           │                  ├─{slurmstepd}(1477)
           │                  └─{slurmstepd}(1478)
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2017-06-15 00:29:29 MDT
In order to fix the error: 
xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
I've rebooted the node.  Jobs have since started on the node without triggering such error messages.  I have no idea how the /sys/fs/cgroup filesystem might have filled up.

FYI: Our cgroup.conf file is:
Comment 3 Dominik Bartkiewicz 2017-06-15 04:31:33 MDT

This may be caused by a kernel bug, e.g.:
https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

We have seen a similar problem before; removing the "CgroupReleaseAgentDir" line from the config and relying only on Slurm's own self-cleaning masks this behavior.
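If the cause is the memory-cgroup ID exhaustion addressed by those kernel commits (the memcg ID space tops out at 65535 on affected kernels), then "No space left on device" means the kernel has run out of free cgroup IDs, not that any disk is full. One way to gauge this on a node is the num_cgroups column of /proc/cgroups; the sketch below parses a hypothetical sample rather than a live node:

```shell
# Hypothetical /proc/cgroups snapshot from a node near exhaustion;
# on a real node, run the awk line against /proc/cgroups directly.
cat <<'EOF' > /tmp/cgroups.sample
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        2          12           1
memory        4          65535        1
EOF
awk '$1 == "memory" { print "memory cgroups in use:", $3 }' /tmp/cgroups.sample
# prints: memory cgroups in use: 65535
```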

Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2017-06-15 05:33:29 MDT
(In reply to Dominik Bartkiewicz from comment #3)
> This can be one of kernel bug eg.:
> https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
> https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Thanks for identifying this as a Linux kernel bug.  I don't expect the CentOS 7.3 kernel will ever receive a backported patch, will it?

> We have seen similar problem before, removeing "CgroupReleaseAgentDir" line
> from config, and only using only  slurm self-cleaning  masks this behavior. 

Thanks for the workaround.  However, according to https://bugs.schedmd.com/show_bug.cgi?id=3853#c6 this requires Slurm 17.02.3 or later.  Would you agree?

For older Slurm releases (we run 16.05.10) the only workaround is to reboot the node in order to clear the /sys/fs/cgroup ?

Comment 5 Dominik Bartkiewicz 2017-06-15 06:04:44 MDT

You are right.
This problem is solved by a fix that is included in 17.02.3 and above.

Comment 6 Jenny Williams 2017-06-21 10:11:11 MDT
We see the same error here running v17.02.3.
Comment 7 Dominik Bartkiewicz 2017-06-23 08:19:14 MDT

Have you removed the "CgroupReleaseAgentDir" line from your config?
Which kernel version do you use?

Comment 8 Dominik Bartkiewicz 2017-07-06 05:34:37 MDT

Any news on this?

Comment 9 Jenny Williams 2017-07-06 05:34:44 MDT

I am out of the office.  If you have questions that need a reply the week of July 4th  please cc research@unc.edu.

Jenny Williams
Systems Administrator
UNC Chapel Hill
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2017-07-06 06:01:42 MDT
(In reply to Dominik Bartkiewicz from comment #7)
> Hi
> Have you removed "CgroupReleaseAgentDir" line from config?

No, because we run Slurm 16.05.  Upgrade to 17.02 will be done soon.

> Which kernel version do you use?

CentOS 7.3 with these kernels on compute nodes:

The "error: xcgroup_instantiate: unable to create cgroup" message occurs with both of these kernels.
Comment 11 Ole.H.Nielsen@fysik.dtu.dk 2017-07-13 06:48:59 MDT
(In reply to Dominik Bartkiewicz from comment #7)
> Have you removed "CgroupReleaseAgentDir" line from config?
> Which kernel version do you use?

Status update: All nodes have now been upgraded to Slurm 17.02.6 and subsequently rebooted with CentOS 7.3 kernel 3.10.0-514.6.1.el7.x86_64.

The CgroupReleaseAgentDir line has been removed from cgroup.conf and the file now contains:


On all nodes I have now searched for the previously experienced error message:

# grep /sys/fs/cgroup/memory/slurm/ /var/log/slurm/slurmd.log 

The result is zero hits :-)
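For what it's worth, a slightly broader pattern would also catch cgroup errors that don't mention that exact path. The sketch below runs the grep against a small hypothetical sample log (/tmp/slurmd.log.sample) rather than the live /var/log/slurm/slurmd.log:

```shell
# Build a two-line sample log, then count lines matching a broader
# cgroup-error pattern (only the xcgroup_instantiate line should match)
cat <<'EOF' > /tmp/slurmd.log.sample
[2017-06-13T13:42:56.828] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.840] [84650] task_p_pre_launch: Using sched_affinity for tasks
EOF
grep -cE 'error:.*(xcgroup|cgroup)' /tmp/slurmd.log.sample
# prints: 1
```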

I don't know whether this proves the absence of cgroup problems; we have only been running 17.02 for a few days at this point.

Can you suggest any other searches for cgroup symptoms?

Perhaps this case has been resolved by the upgrade to 17.02.6.
Comment 12 Dominik Bartkiewicz 2017-07-14 06:27:22 MDT

Not really, only what I put in comment 3. While searching I noticed that many people have docker/cgroup-memory-related problems on various kernels. I can only hope that with this config Slurm will work fine. But I am sure that the "No space left on device" message indicates a kernel bug, not a Slurm bug. Let me know if this happens again.

Comment 13 Dominik Bartkiewicz 2017-07-27 03:32:03 MDT

I am closing this as "INFOGIVEN".
Please re-open if you have any questions.