Summary: | Cgroup ConstrainRAMSpace or ConstrainSwapSpace not enforced | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | Limits | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | bart, jennyw |
Version: | 16.05.10 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5504 https://bugs.schedmd.com/show_bug.cgi?id=5507 | ||
Site: | DTU Physics | Alineos Sites: | --- |
Bull/Atos Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2017-06-13 07:47:12 MDT
The process tree looked pretty normal to me:

# pstree -p
systemd(1)─┬─NetworkManager(727)─┬─dhclient(803)
           │                     ├─{NetworkManager}(750)
           │                     └─{NetworkManager}(752)
           ...
           ├─slurmd(20177)
           ├─slurmstepd(1475)─┬─slurm_script(1479)───mpiexec(1580)─┬─gpaw-python(1582)───{gpaw-python}(1594)
           │                  │                                    ├─gpaw-python(1583)───{gpaw-python}(1596)
           │                  │                                    ├─gpaw-python(1584)───{gpaw-python}(1593)
           │                  │                                    ├─gpaw-python(1585)───{gpaw-python}(1598)
           │                  │                                    ├─gpaw-python(1586)───{gpaw-python}(1591)
           │                  │                                    ├─gpaw-python(1587)───{gpaw-python}(1595)
           │                  │                                    ├─gpaw-python(1588)───{gpaw-python}(1592)
           │                  │                                    ├─gpaw-python(1589)───{gpaw-python}(1597)
           │                  │                                    └─{mpiexec}(1581)
           │                  ├─{slurmstepd}(1476)
           │                  ├─{slurmstepd}(1477)
           │                  └─{slurmstepd}(1478)

In order to fix the error:

xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_15265' : No space left on device

I've rebooted the node. Jobs have since been started on the node without causing such error messages. I have no idea how the /sys/fs/cgroup filesystem might have filled up.

FYI: Our cgroup.conf file is:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

Hi

This may be caused by a kernel bug, e.g.:
https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

We have seen a similar problem before. Removing the "CgroupReleaseAgentDir" line from the config, and using only Slurm's self-cleaning, masks this behavior.

Dominik

(In reply to Dominik Bartkiewicz from comment #3)
> This may be caused by a kernel bug, e.g.:
> https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
> https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Thanks for identifying this as a Linux kernel bug. I don't expect the CentOS 7.3 kernel will ever receive a backported patch?
> We have seen a similar problem before. Removing the "CgroupReleaseAgentDir" line
> from the config, and using only Slurm's self-cleaning, masks this behavior.

Thanks for the workaround. However, according to https://bugs.schedmd.com/show_bug.cgi?id=3853#c6 this requires Slurm 17.02.3 or later. Would you agree? For older Slurm releases (we run 16.05.10), is the only workaround to reboot the node in order to clear /sys/fs/cgroup?

/Ole

Hi

You are right. This problem is solved by:
https://github.com/SchedMD/slurm/commit/24e2cb07e8e363f24
This commit is in 17.02.3 and above.

Dominik

We see the same error here running v.17.02.3

Hi

Have you removed the "CgroupReleaseAgentDir" line from the config? Which kernel version do you use?

Dominik

Hi

Any news on this?

Dominik

I am out of the office. If you have questions that need a reply the week of July 4th please cc research@unc.edu.

Regards,
Jenny Williams
Systems Administrator
UNC Chapel Hill

(In reply to Dominik Bartkiewicz from comment #7)
> Hi
>
> Have you removed the "CgroupReleaseAgentDir" line from the config?

No, because we run Slurm 16.05. The upgrade to 17.02 will be done soon.

> Which kernel version do you use?

CentOS 7.3 with these kernels on compute nodes:

3.10.0-514.2.2.el7.x86_64
3.10.0-514.6.1.el7.x86_64

The "error: xcgroup_instantiate: unable to create cgroup" message occurs with both of these kernels.

(In reply to Dominik Bartkiewicz from comment #7)
> Have you removed the "CgroupReleaseAgentDir" line from the config?
> Which kernel version do you use?

Status update: All nodes have now been upgraded to Slurm 17.02.6 and subsequently rebooted with CentOS 7.3 kernel 3.10.0-514.6.1.el7.x86_64.
The CgroupReleaseAgentDir line has been removed from cgroup.conf and the file now contains:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

On all nodes I have now searched for the previously experienced error message:

# grep /sys/fs/cgroup/memory/slurm/ /var/log/slurm/slurmd.log

The result is zero hits :-)

I don't know whether this shows the absence of Cgroup problems. We have only been running 17.02 for a few days at this time. Can you possibly suggest some other search for Cgroup symptoms?

So perhaps this case has been resolved by the upgrade to 17.02.6.

Hi

Not really, only what I put in comment 3. While searching I noticed that many people have docker/cgroup_memory related problems on different kernels. I can only hope that Slurm will work fine with this config. But I am sure that the "No space left on device" log message is a kernel bug, not a Slurm bug. Let me know if this happens again.

Dominik

Hi

I am closing this as "INFOGIVEN". Please re-open if you have any questions.

Dominik |
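
The grep used in this report matches only lines that contain the memory cgroup path. A slightly broader check could count each symptom separately, so that a "No space left on device" message is still caught if it appears without that path. The following is a minimal Python sketch under stated assumptions: the function names and the pattern list are hypothetical and not part of Slurm, the patterns are taken only from the error text quoted in this report, and other Slurm versions may word the messages differently.

```python
import re
from collections import Counter

# Assumption: these patterns mirror the error text quoted in this report.
# They are illustrative, not an official list of Slurm cgroup errors.
SYMPTOMS = {
    "xcgroup_instantiate": re.compile(r"xcgroup_instantiate: unable to create cgroup"),
    "no_space": re.compile(r"No space left on device"),
}


def count_cgroup_symptoms(lines):
    """Count how many log lines match each symptom pattern (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path):
    """Scan one slurmd log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as log:
        return count_cgroup_symptoms(log)
```

On a compute node this might be called as `scan_log("/var/log/slurm/slurmd.log")`, the log path used in the grep command in this report. An empty result only means that none of these specific messages were logged; it does not prove that the underlying kernel issue is absent.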