We constrain user jobs in cgroup.conf:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

A user job has nevertheless been able to grossly exceed the memory limits, and the compute node is paging wildly. I logged into the node and captured the user's processes:

# ps -eo "user,pid,vsz,rss,comm"
USER       PID     VSZ     RSS COMMAND
...
tols      1479  113396     316 slurm_script
tols      1580   97016     364 mpiexec
tols      1582 5595692 2696984 gpaw-python
tols      1583 5596696 2696536 gpaw-python
tols      1584 5594084 2596748 gpaw-python
tols      1585 5589220 2579076 gpaw-python
tols      1586 5595720 2674540 gpaw-python
tols      1587 5596612 2514800 gpaw-python
tols      1588 5593968 2768952 gpaw-python
tols      1589 5589160 2641000 gpaw-python

The RSS is correctly limited below the node RAM/core of 2800M, but the VSZ shows that the job uses an additional 3000M per process. I thought this would be prohibited by the cgroup setup.

The slurmd.log shows some information about this job:

[2017-06-13T13:42:56.806] task_p_slurmd_batch_request: 84650
[2017-06-13T13:42:56.806] task/affinity: job 84650 CPU input mask for node: 0xFF
[2017-06-13T13:42:56.806] task/affinity: job 84650 CPU final HW mask for node: 0xFF
[2017-06-13T13:42:56.807] _run_prolog: run job script took usec=3
[2017-06-13T13:42:56.807] _run_prolog: prolog with lock for job 84650 ran for 0 seconds
[2017-06-13T13:42:56.807] Launching batch job 84650 for UID 15265
[2017-06-13T13:42:56.828] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.836] [84650] error: task/cgroup: unable to add task[pid=1479] to memory cg '(null)'
[2017-06-13T13:42:56.837] [84650] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device
[2017-06-13T13:42:56.838] [84650] error: jobacct_gather/cgroup: unable to instanciate user 15265 memory cgroup
[2017-06-13T13:42:56.840] [84650] task_p_pre_launch: Using sched_affinity for tasks
[2017-06-13T15:04:08.422] [84650] error: *** JOB 84650 ON b099 CANCELLED AT 2017-06-13T15:04:08 ***
[2017-06-13T15:04:17.927] [84650] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2017-06-13T15:04:18.004] [84650] done with job

I wonder about the meaning of the cgroup error "No space left on device"?

Job information (while it was still running):

# scontrol show job 84650
JobId=84650 JobName=sbatch
   UserId=tols(15265) GroupId=camdvip(1250) MCS_label=N/A
   Priority=32577 Nice=0 Account=camdvip QOS=fysik_qos
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:36:19 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2017-06-13T12:52:36 EligibleTime=2017-06-13T12:52:36
   StartTime=2017-06-13T13:42:56 EndTime=2017-06-14T01:42:56 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon8 AllocNode:Sid=sylg:40406
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=b099
   BatchHost=b099
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=22400M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2800M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/niflheim/tols/cmr
   StdErr=/home/niflheim/tols/cmr/slurm-84650.out
   StdIn=/dev/null
   StdOut=/home/niflheim/tols/cmr/slurm-84650.out
   Power=

Question: Can you help me figure out if I made a configuration error, or if something else is wrong here?
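A note on reading that error: when mkdir on a cgroup filesystem fails with ENOSPC ("No space left on device"), it does not refer to disk space; on the kernels affected by memcg ID leaks the memory controller's fixed 16-bit ID space (65535 IDs) can run out, and cgroup creation then fails with exactly this message. The following is only a sketch, run against sample text instead of the live /proc/cgroups; note also that num_cgroups counts live cgroups, so a low count does not rule out exhaustion by zombie cgroups whose IDs are still pinned in the kernel.

```shell
# /proc/cgroups reports, per controller, how many cgroups currently exist.
# Sketch using a sample instead of the live file; on a real node:
#   awk '$1 == "memory" { print $3 }' /proc/cgroups
sample='#subsys_name  hierarchy  num_cgroups  enabled
cpuset  2  14  1
memory  4  65535  1'
echo "$sample" | awk '$1 == "memory" { print $3 }'   # prints 65535
```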
The process tree looked pretty normal to me:

# pstree -p
systemd(1)─┬─NetworkManager(727)─┬─dhclient(803)
           │                     ├─{NetworkManager}(750)
           │                     └─{NetworkManager}(752)
...
           ├─slurmd(20177)
           ├─slurmstepd(1475)─┬─slurm_script(1479)───mpiexec(1580)─┬─gpaw-python(1582)───{gpaw-python}(1594)
           │                  │                                    ├─gpaw-python(1583)───{gpaw-python}(1596)
           │                  │                                    ├─gpaw-python(1584)───{gpaw-python}(1593)
           │                  │                                    ├─gpaw-python(1585)───{gpaw-python}(1598)
           │                  │                                    ├─gpaw-python(1586)───{gpaw-python}(1591)
           │                  │                                    ├─gpaw-python(1587)───{gpaw-python}(1595)
           │                  │                                    ├─gpaw-python(1588)───{gpaw-python}(1592)
           │                  │                                    ├─gpaw-python(1589)───{gpaw-python}(1597)
           │                  │                                    └─{mpiexec}(1581)
           │                  ├─{slurmstepd}(1476)
           │                  ├─{slurmstepd}(1477)
           │                  └─{slurmstepd}(1478)
In order to fix the error:

xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device

I've rebooted the node. Jobs have since been started on the node without causing such error messages. I have no idea how the /sys/fs/cgroup filesystem might have been filled up.

FYI: Our cgroup.conf file is:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
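For completeness, before resorting to a reboot one can try removing stale-but-empty cgroup directories by hand with a depth-first rmdir. This is only a sketch, demonstrated on a throw-away mock tree (on a real node the starting point would be /sys/fs/cgroup/memory/slurm), and on kernels with the memcg ID leak it may not help, because the exhausted IDs are held inside the kernel rather than by the directories themselves:

```shell
# Depth-first rmdir of empty cgroup directories, shown on a mock tree.
# rmdir only removes empty dirs; -depth visits children before parents.
mock=$(mktemp -d)
mkdir -p "$mock/slurm/uid_15265/job_84650"
find "$mock/slurm" -depth -type d -exec rmdir {} \; 2>/dev/null
[ -d "$mock/slurm" ] && echo "slurm dir still present" || echo "mock tree cleaned"
rmdir "$mock" 2>/dev/null
```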
Hi

This may be a kernel bug, e.g.:
https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

We have seen a similar problem before; removing the "CgroupReleaseAgentDir" line from the config and relying only on Slurm's own self-cleaning masks this behavior.

Dominik
(In reply to Dominik Bartkiewicz from comment #3)
> This can be one of kernel bug eg.:
> https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
> https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Thanks for identifying this as a Linux kernel bug. I don't expect the CentOS 7.3 kernel will ever receive a backported patch?

> We have seen similar problem before, removeing "CgroupReleaseAgentDir" line
> from config, and only using only slurm self-cleaning masks this behavior.

Thanks for the workaround. However, according to https://bugs.schedmd.com/show_bug.cgi?id=3853#c6 this requires Slurm 17.02.3 or later. Would you agree? For older Slurm releases (we run 16.05.10), is the only workaround to reboot the node in order to clear /sys/fs/cgroup?

/Ole
Hi

You are right. This problem is solved by:
https://github.com/SchedMD/slurm/commit/24e2cb07e8e363f24
This commit is in 17.02.3 and later.

Dominik
We see the same error here running v.17.02.3
Hi

Have you removed the "CgroupReleaseAgentDir" line from the config? Which kernel version do you use?

Dominik
Hi

Any news on this?

Dominik
(In reply to Dominik Bartkiewicz from comment #7)
> Have you removed "CgroupReleaseAgentDir" line from config?

No, because we run Slurm 16.05. The upgrade to 17.02 will be done soon.

> Witch kernel version do you use?

CentOS 7.3 with these kernels on the compute nodes:

3.10.0-514.2.2.el7.x86_64
3.10.0-514.6.1.el7.x86_64

The "error: xcgroup_instantiate: unable to create cgroup" message occurs with both of these kernels.
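For triage across the cluster, a quick check of whether a node runs one of the kernels where we observed the error can be sketched as below. The list contains only the two releases from this comment (it is not an exhaustive list of affected kernels), and on a live node $(uname -r) would replace the hard-coded sample value:

```shell
# Compare the running kernel release against releases where the error was seen.
seen='3.10.0-514.2.2.el7.x86_64
3.10.0-514.6.1.el7.x86_64'
rel='3.10.0-514.6.1.el7.x86_64'   # on a live node: rel=$(uname -r)
if echo "$seen" | grep -qx "$rel"; then
    echo "kernel matches a release where the error was seen"
else
    echo "kernel not in the observed list"
fi
```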
(In reply to Dominik Bartkiewicz from comment #7)
> Have you removed "CgroupReleaseAgentDir" line from config?
> Witch kernel version do you use?

Status update: All nodes have now been upgraded to Slurm 17.02.6 and subsequently rebooted with CentOS 7.3 kernel 3.10.0-514.6.1.el7.x86_64. The CgroupReleaseAgentDir line has been removed from cgroup.conf, and the file now contains:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

On all nodes I have now searched for the previously experienced error message:

# grep /sys/fs/cgroup/memory/slurm/ /var/log/slurm/slurmd.log

The result is zero hits :-) I don't know whether this proves the absence of cgroup problems; we have only been running 17.02 for a few days at this time. Can you possibly suggest some other search for cgroup symptoms?

So perhaps this case has been resolved by the upgrade to 17.02.6.
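On the "some other search" question: a slightly broader pattern than the exact cgroup path also catches the instantiate failure and the ENOSPC wording from the logs earlier in this thread. A sketch, demonstrated on sample log lines rather than a live /var/log/slurm/slurmd.log:

```shell
# Count cgroup-related failures with a broader pattern than the exact path.
# Sample lines stand in for slurmd.log; on a real node:
#   grep -cE 'xcgroup_instantiate|No space left on device' /var/log/slurm/slurmd.log
log='[2017-06-13T13:42:56.828] [84650] error: xcgroup_instantiate: unable to create cgroup
[2017-07-01T10:00:00.000] [90001] Launching batch job 90001 for UID 100'
echo "$log" | grep -cE 'xcgroup_instantiate|No space left on device'   # prints 1
```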
Hi

Not really, only what I put in comment 3. While searching I noticed that many people have docker/cgroup-memory related problems on various kernels. I can only hope that with this config Slurm will work fine. But I am sure that the "No space left on device" log message is a kernel bug, not a Slurm bug. Let me know if this happens again.

Dominik
Hi

I am closing this as "INFOGIVEN". Please re-open if you have any questions.

Dominik