Summary: | Objects remaining in numa_policy(374:task_0) on kmem_cache_close() | ||
---|---|---|---|
Product: | Slurm | Reporter: | John Hanks <griznog> |
Component: | Limits | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 17.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Stanford | ||
Description
John Hanks
2018-04-20 20:44:48 MDT
I've set ConstrainKmemSpace=no in my cgroup.conf now, and will follow up if/when the problem reappears.

jbh

Is there any chance you can test a newer kernel version out on some of the nodes and see if this reproduces there? That kernel version, from what I can quickly put together, was the first with RHEL's Meltdown/Spectre mitigations included, and I'm wondering if there may have been an issue that's already been fixed.

Generally speaking, any error that leads to a kernel BUG is by definition a kernel bug - Slurm is using the userspace APIs to set up and manipulate these cgroups, but nothing we do through that interface should trigger a BUG.

- Tim

(In reply to Tim Wickberg from comment #2)
> Is there any chance you can test a newer kernel version out on some of the
> nodes and see if this reproduces there?

We haven't had this occur again since setting ConstrainKmemSpace=no, but we also haven't had another surge of large numbers of short, small jobs with lots of job turnover. If that doesn't happen before our next maintenance then I'll fake it and see if I can reproduce.

jbh

(In reply to John Hanks from comment #3)
> (In reply to Tim Wickberg from comment #2)
> > Is there any chance you can test a newer kernel version out on some of the
> > nodes and see if this reproduces there?
>
> We haven't had this occur again since setting ConstrainKmemSpace=no, but we
> also haven't had another surge of large numbers of short, small jobs with
> lots of job turnover. If that doesn't happen before our next maintenance
> then I'll fake it and see if I can reproduce.
>
> jbh

Any update on this?

We've found some further evidence that there are additional kernel bugs related to the Kmem limits, but they're only fixed in very recent 4.0+ kernels. Bug 5082 has further details on that issue if you're interested.

- Tim

Hi Tim,

I don't see any more messages about this so I think we are good. At least until/unless we need to set ConstrainKmemSpace=yes, maybe you know a reason why we might need to do that? Will follow the other bug for now.

jbh

> I don't see any more messages about this so I think we are good. At least
> until/unless we need to set ConstrainKmemSpace=yes, maybe you know a reason
> why we might need to do that?
Not in most situations. It can help rein in page cache usage, but usually the Linux kernel sorts that out itself.
Marking resolved/infogiven now.
- Tim
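
For anyone who wants to confirm that a node's cgroup.conf carries the workaround discussed in this thread, here is a minimal Python sketch. It is not part of Slurm. The function name `kmem_constraint` and the sample file contents are hypothetical, and the parsing rules (simple `Key=Value` lines, `#` comments, case-insensitive keys, last occurrence wins) are assumptions about the file format rather than a copy of Slurm's own parser.

```python
from typing import Optional


def kmem_constraint(conf_text: str) -> Optional[bool]:
    """Return the explicit ConstrainKmemSpace setting, or None if absent.

    Assumes plain Key=Value lines with '#' comments. None means the option
    is not set in the text, so the Slurm default for that version applies.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip().lower() == "constrainkmemspace":
            # Assumption: a later line overrides an earlier one.
            value = val.strip().lower() == "yes"
    return value


# Hypothetical cgroup.conf contents, for illustration only.
sample = """\
# hypothetical cgroup.conf
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
"""

print(kmem_constraint(sample))
```

On a real node the text would be read from wherever cgroup.conf is installed. Returning `None` when the option is missing is deliberate, since the default value depends on the Slurm release in use.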