Bug 5507

Summary: slurmstepd: error
Product: Slurm
Reporter: Sathishkumar <sathishkumar.ranganathan>
Component: slurmstepd
Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: felip.moll, justin.lecher, sathish.sathishkumar
Version: 17.11.7
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3890
Site: AstraZeneca
Attachments: Latest slurm.conf
cgroup.conf
slurmd.log

Description Sathishkumar 2018-07-31 08:53:55 MDT
Created attachment 7466 [details]
Latest slurm.conf

Hi Team, 

We are seeing the below error message whenever we invoke srun; could you please assist?


$srun hostname
srun: job 357102 queued and waiting for resources
srun: error: Lookup failed: Unknown host
srun: job 357102 has been allocated resources
seskscpn084.prim.scp
slurmstepd: error: task/cgroup: unable to add task[pid=85894] to memory cg '(null)'
slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_684277182' : No space left on device
slurmstepd: error: jobacct_gather/cgroup: unable to instanciate user 684277182 memory cgroup


OS Version: CentOS Linux release 7.4.1708 (Core)
Kernel Version: Linux version 4.4.0-124-generic


Please do let me know if you need anything from my end to take this further.



Thanks
Sathish
Comment 1 Sathish 2018-07-31 08:58:34 MDT
*** Bug 5504 has been marked as a duplicate of this bug. ***
Comment 2 Felip Moll 2018-07-31 09:34:06 MDT
Sathish, I would also need cgroup.conf, thanks.
Comment 3 Sathishkumar 2018-07-31 09:46:16 MDT
Created attachment 7467 [details]
cgroup.conf

cgroup.conf is attached for your review.
Comment 5 Felip Moll 2018-07-31 10:15:26 MDT
(In reply to Sathishkumar from comment #3)
> Created attachment 7467 [details]
> cgroup.conf
> 
> cgroup.conf is attached for your review.

Thanks,

Your error is probably related to a kernel bug affecting applications that set a kmem limit.

It seems that setting a kmem limit in a cgroup causes a number of slab caches to be created that don't go away when the cgroup is removed, eventually filling up an internal 64K cache and producing the error you see in the Slurm logs.

The following patches, which would fix the issue, do not seem to have been committed to the RHEL kernels because of incompatibilities with OpenShift software:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

There's currently an open bug in RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=1507149

Bug 5082 seems to be a possible duplicate of your issue, if you are curious about the full explanation.

The workaround is simply to set "ConstrainKmemSpace=No" in cgroup.conf; you must then reboot the affected nodes.

This parameter is set to "no" by default in 18.08 (commit 32fabc5e006b8f41).
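As an illustration only (not from the original report), the relevant cgroup.conf line for the workaround would be:

ConstrainKmemSpace=no

Assuming cgroup v1 is mounted at the usual /sys/fs/cgroup path, one rough way to check that Slurm is no longer applying a kmem limit to new job cgroups is to read the kmem limit file of a running job's cgroup (the <uid> and <jobid> placeholders are hypothetical, to be filled in for an actual job):

cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.kmem.limit_in_bytes

After the change (and a reboot of the affected nodes), this should show the kernel's very large "unlimited" default rather than a limit derived from the job's allocated memory.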

Tell me if this workaround works for you.
Comment 7 Felip Moll 2018-08-07 06:46:14 MDT
Sathishkumar,

Can you please confirm everything is running after applying the suggested change?

Thanks,
Felip
Comment 8 Sathishkumar 2018-08-08 05:58:02 MDT
Created attachment 7537 [details]
slurmd.log
Comment 9 Sathishkumar 2018-08-08 05:59:59 MDT
Comment on attachment 7537 [details]
slurmd.log

Felip Moll, we have set ConstrainKmemSpace=no but we are still having this issue. We are also facing this issue after changing TaskPlugin from linux to cgroup, and we are seeing some jobacct_gather_linux errors in the log. The log is attached for your review.

Best Regards
Sathish
Comment 10 Felip Moll 2018-08-08 08:43:43 MDT
(In reply to Sathishkumar from comment #9)
> Comment on attachment 7537 [details]
> slurmd.log
> 
> Felip Moll, We have set ConstrainKmemSpace=no but still having this issue.
> Also, we are facing this issue after changing TaskPlugin from linux to
> cgroup. We are also  seeing some jobacct_gather_linux errors in the log.
> Attached the same for your review. 
> 
> Best Regards
> Sathish

Well, in this slurmd log I don't see the same error as before.

In any case, have you *rebooted* the nodes after changing the ConstrainKmemSpace?

I checked your slurm.conf and cgroup.conf, and I see a couple of issues/improvements to address:

Issue 1, no task affinity:
------------------------
You have TaskPlugin=task/cgroup in slurm.conf, and there is no TaskAffinity=yes setting in cgroup.conf. This means that your jobs will not be bound to cores, so you will probably have affinity problems.

I encourage you to set this in slurm.conf:
TaskPlugin=task/cgroup,task/affinity

This will use task/cgroup for constraining memory (ConstrainRAMSpace + ConstrainSwapSpace) and cores (ConstrainCores), and task/affinity for pinning processes to cores. This is the recommended setup.
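For reference, the cgroup.conf side of this setup would look roughly like the following (a sketch only; whether to constrain swap, and the exact values, depend on your site's policy):

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no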

Issue 2, duplicate mechanisms for enforcing memory limits:
------------------------------------------------

This is going to change in Slurm 18.08, but currently there are two mechanisms in Slurm to detect out-of-memory conditions and kill jobs and steps: jobacct_gather and cgroups.

The jobacct_gather mechanism is enabled by default.
The cgroup mechanism is enabled when you use task/cgroup with ConstrainRAMSpace=yes or ConstrainSwapSpace=yes.

Having both mechanisms enabled can cause problems like the one you're seeing in the latest slurmd log: one mechanism kills the job or steps while the other is still reading the task to do its memory calculations.

To disable the JobAcctGather mechanism, set the following in slurm.conf:

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

Issue 3, use jobacct_gather/linux
-------------------------------------

We usually recommend jobacct_gather/linux over jobacct_gather/cgroup. The latter adds overhead without providing any important benefit: it first does everything jobacct_gather/linux does, and then just overrides some fields by reading from the cgroup, specifically rss and pages (read from memory.stat) and usec and ssec (read from cpuacct.stat).

So I recommend setting:
JobAcctGatherType=jobacct_gather/linux
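
Putting the three recommendations together, the relevant slurm.conf lines would look roughly like this (a sketch; all of your other existing settings are left unchanged):

TaskPlugin=task/cgroup,task/affinity
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill
JobAcctGatherType=jobacct_gather/linux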



Could you try these settings, especially Issue 2?

Remember to reboot the nodes that hit the cgroup "No space left on device" error after changing ConstrainKmemSpace=no.
Comment 11 Felip Moll 2018-08-20 09:27:11 MDT
(In reply to Felip Moll from comment #10)

Hi,

Have you had any chance to try the suggested settings?
How is everything going in your cluster?

Thanks
Comment 12 Felip Moll 2018-08-28 02:10:58 MDT
Hi,

Any news about this issue?
Comment 13 Sathishkumar 2018-08-28 03:14:32 MDT
The suggested changes have already been applied to the cluster and we are no longer seeing the reported errors. Thanks for all your assistance on this; you can proceed with closing this ticket.
Comment 14 Felip Moll 2018-08-28 03:18:35 MDT
(In reply to Sathishkumar from comment #13)
> The suggested changes have already been applied to the cluster and we are no
> longer seeing the reported errors. Thanks for all your assistance on this;
> you can proceed with closing this ticket.

Thanks, I'm glad it helped.

Closing issue now.

Regards,
Felip