Created attachment 7466 [details]
Latest slurm.conf

Hi Team,

We are seeing the error below whenever we invoke srun; could you please assist?

$ srun hostname
srun: job 357102 queued and waiting for resources
srun: error: Lookup failed: Unknown host
srun: job 357102 has been allocated resources
seskscpn084.prim.scp
slurmstepd: error: task/cgroup: unable to add task[pid=85894] to memory cg '(null)'
slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_684277182' : No space left on device
slurmstepd: error: jobacct_gather/cgroup: unable to instanciate user 684277182 memory cgroup

OS Version: CentOS Linux release 7.4.1708 (Core)
Kernel Version: Linux version 4.4.0-124-generic

Please let me know if you need anything from my end to take this further.

Thanks,
Sathish
*** Bug 5504 has been marked as a duplicate of this bug. ***
Sathish, I would also need your cgroup.conf. Thanks.
Created attachment 7467 [details]
cgroup.conf

cgroup.conf is attached for your review.
(In reply to Sathishkumar from comment #3)
> Created attachment 7467 [details]
> cgroup.conf
>
> cgroup.conf is attached for your review.

Thanks.

Your error is probably related to a kernel bug affecting applications that set a kmem limit. Setting a kmem limit in a cgroup causes a number of slab caches to be created that don't go away when the cgroup is removed, eventually filling up an internal 64K cache, which ends up producing the error you see in the Slurm logs.

The following patches would fix the issue, but they do not seem to have been committed to the RHEL kernels because of incompatibilities with OpenShift software:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

There's currently an open bug in RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=1507149

Bug 5082 looks like a possible duplicate of your issue, if you are curious about the full explanation.

The workaround is simply to set "ConstrainKmemSpace=No" in cgroup.conf; after that, you must reboot the affected nodes. This parameter is set to "no" by default in 18.08 (commit 32fabc5e006b8f41).

Tell me if this workaround works for you.
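For reference, the workaround amounts to a one-line change in cgroup.conf. The surrounding lines below are illustrative placeholders only (your existing settings may differ); the only required change is ConstrainKmemSpace:

```ini
# cgroup.conf -- workaround for the RHEL/CentOS kmem slab-cache leak.
# Only ConstrainKmemSpace=No is the required change; the other lines
# are illustrative and should match whatever you already have.
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=No

# Nodes that already hit "No space left on device" must be rebooted
# after this change, since the leaked slab caches persist until reboot.
```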
Sathishkumar, Can you please confirm everything is running after applying the suggested change? Thanks, Felip
Created attachment 7537 [details] slurmd.log
Comment on attachment 7537 [details]
slurmd.log

Felip Moll,

We have set ConstrainKmemSpace=no but are still having this issue. We are also facing this issue after changing TaskPlugin from linux to cgroup, and we are seeing some jobacct_gather_linux errors in the log. Attached for your review.

Best Regards,
Sathish
(In reply to Sathishkumar from comment #9)
> Comment on attachment 7537 [details]
> slurmd.log
>
> Felip Moll, We have set ConstrainKmemSpace=no but still having this issue.
> Also, we are facing this issue after changing TaskPlugin from linux to
> cgroup. We are also seeing some jobacct_gather_linux errors in the log.
> Attached the same for your review.
>
> Best Regards
> Sathish

Well, in this slurmd log I don't see the same error as before.

In any case, have you *rebooted* the nodes after changing ConstrainKmemSpace?

I checked your slurm.conf and cgroup.conf, and I see a couple of issues/improvements:

Issue 1, no task affinity:
--------------------------
Your slurm.conf sets TaskPlugin=task/cgroup, and there is no TaskAffinity=yes setting in cgroup.conf. This means your jobs will not be bound to cores, so you will probably have affinity problems.

I encourage you to set this in slurm.conf:

TaskPlugin=task/cgroup,task/affinity

This uses task/cgroup for constraining memory (ConstrainRAMSpace + ConstrainSwapSpace) and cores (ConstrainCores), and task/affinity for pinning processes to cores. This is the recommended setup.

Issue 2, duplicate mechanisms for enforcing memory limits:
----------------------------------------------------------
This is going to change in Slurm 18.08, but currently there are two mechanisms in Slurm to detect Out-Of-Memory conditions and kill jobs and steps: jobacct_gather and cgroups.

The JobAcctGather mechanism is enabled by default. The cgroup mechanism is enabled when you use task/cgroup with ConstrainRAMSpace=yes or ConstrainSwapSpace=yes.

Having both mechanisms enabled can cause problems like the one you're seeing in the last slurmd log: one mechanism kills the job or step while the other is still reading the task to do the memory calculations.
To disable the JobAcctGather mechanism, set the following in slurm.conf:

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

Issue 3, use jobacct_gather/linux
---------------------------------
We usually recommend jobacct_gather/linux over jobacct_gather/cgroup. The latter adds overhead without providing any important benefit: it first does everything jobacct_gather/linux does, and then just overrides some fields by reading from the cgroup, specifically rss and pages (read from memory.stat) and usec and ssec (read from cpuacct.stat).

So I recommend you set:

JobAcctGatherType=jobacct_gather/linux

Could you try these settings, especially Issue 2? Remember to reboot the nodes that hit the cgroup "No space left on device" error after changing ConstrainKmemSpace=no.
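Pulling the recommendations together, the relevant slurm.conf lines would look roughly like the sketch below. This only shows the lines being changed; everything else in the existing slurm.conf stays untouched:

```ini
# slurm.conf -- sketch of the suggested changes; all other lines in
# the existing slurm.conf remain as they are.

# Issue 1: task/cgroup for constraining, task/affinity for core pinning.
TaskPlugin=task/cgroup,task/affinity

# Issue 2: stop the jobacct_gather polling loop from killing tasks,
# leaving OOM handling to the cgroup memory limits alone.
JobAcctGatherParams=NoOverMemoryKill

# Issue 3: gather accounting data via /proc instead of cgroup files.
JobAcctGatherType=jobacct_gather/linux
```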
(In reply to Felip Moll from comment #10)
> [...]
> Could you try these settings, especially Issue 2? Remember to reboot the
> nodes that hit the cgroup "No space left on device" error after changing
> ConstrainKmemSpace=no.

Hi,

Have you had any chance to try the suggested settings? How is everything going in your cluster?

Thanks
Hi, Any news about this issue?
The suggested changes have been applied to the cluster and we are no longer seeing the reported errors. Thanks for all your assistance on this; you can proceed with closing this ticket.
(In reply to Sathishkumar from comment #13)
> The suggested changes have been applied to the cluster and we are no longer
> seeing the reported errors. Thanks for all your assistance on this; you can
> proceed with closing this ticket.

Thanks, I'm glad it helped. Closing the issue now.

Regards,
Felip