Ticket 5092

Summary:	Objects remaining in numa_policy(374:task_0) on kmem_cache_close()
Product:	Slurm	Reporter:	John Hanks <griznog>
Component:	Limits	Assignee:	Tim Wickberg <tim>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	17.11.5
Hardware:	Linux
OS:	Linux
Site:	Stanford	Alineos Sites:	---

Description John Hanks 2018-04-20 20:44:48 MDT

This was moved from bug 3694.

I can't say for sure that this is causing us any problems, just that it leads to verbosity in the kernel log.

My cgroup.conf is:

CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes 
TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=50


Call traces from dmesg:

[Thu Apr 19 08:48:41 2018] =============================================================================
[Thu Apr 19 08:48:41 2018] BUG numa_policy(374:task_0) (Tainted: P           OEL ------------ T): Objects remaining in numa_policy(374:task_0) on kmem_cache_close()
[Thu Apr 19 08:48:41 2018] -----------------------------------------------------------------------------

[Thu Apr 19 08:48:41 2018] INFO: Slab 0xffffea39e8389a00 objects=62 used=1 fp=0xffff967a0e269ce0 flags=0x3afffff00004080
[Thu Apr 19 08:48:41 2018] CPU: 154 PID: 60268 Comm: python Tainted: P    B      OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Thu Apr 19 08:48:41 2018] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[Thu Apr 19 08:48:41 2018] Call Trace:
[Thu Apr 19 08:48:41 2018]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Thu Apr 19 08:48:41 2018]  [<ffffffff8118fef0>] ? __free_memcg_kmem_pages+0x40/0x50
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e2350>] ? kmem_cache_alloc_bulk+0x140/0x140
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Thu Apr 19 08:48:41 2018]  [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Thu Apr 19 08:48:41 2018]  [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffff8120791c>] __fput+0xec/0x260
[Thu Apr 19 08:48:41 2018]  [<ffffffff81207b7e>] ____fput+0xe/0x10
[Thu Apr 19 08:48:41 2018]  [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Thu Apr 19 08:48:41 2018]  [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Thu Apr 19 08:48:41 2018]  [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[Thu Apr 19 08:48:41 2018]  [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff8102a457>] do_signal+0x57/0x6c0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[Thu Apr 19 08:48:41 2018]  [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[Thu Apr 19 08:48:41 2018]  [<ffffffff816c0a5d>] int_signal+0x12/0x17
[Thu Apr 19 08:48:41 2018] INFO: Object 0xffff967a0e26a100 @offset=8448
[Thu Apr 19 08:48:41 2018] =============================================================================
[Thu Apr 19 08:48:41 2018] BUG numa_policy(374:task_0) (Tainted: P    B      OEL ------------ T): Objects remaining in numa_policy(374:task_0) on kmem_cache_close()
[Thu Apr 19 08:48:41 2018] -----------------------------------------------------------------------------

[Thu Apr 19 08:48:41 2018] INFO: Slab 0xffffea39e82b3600 objects=62 used=1 fp=0xffff967a0acd8108 flags=0x3afffff00004080
[Thu Apr 19 08:48:41 2018] CPU: 154 PID: 60268 Comm: python Tainted: P    B      OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Thu Apr 19 08:48:41 2018] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[Thu Apr 19 08:48:41 2018] Call Trace:
[Thu Apr 19 08:48:41 2018]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Thu Apr 19 08:48:41 2018]  [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Thu Apr 19 08:48:41 2018]  [<ffffffff816a87cb>] ? printk+0x60/0x77
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Thu Apr 19 08:48:41 2018]  [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Thu Apr 19 08:48:41 2018]  [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffff8120791c>] __fput+0xec/0x260
[Thu Apr 19 08:48:41 2018]  [<ffffffff81207b7e>] ____fput+0xe/0x10
[Thu Apr 19 08:48:41 2018]  [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Thu Apr 19 08:48:41 2018]  [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Thu Apr 19 08:48:41 2018]  [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[Thu Apr 19 08:48:41 2018]  [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff8102a457>] do_signal+0x57/0x6c0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[Thu Apr 19 08:48:41 2018]  [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[Thu Apr 19 08:48:41 2018]  [<ffffffff816c0a5d>] int_signal+0x12/0x17
[Thu Apr 19 08:48:41 2018] INFO: Object 0xffff967a0acd8f78 @offset=3960
[Thu Apr 19 08:48:41 2018] =============================================================================
[Thu Apr 19 08:48:41 2018] BUG numa_policy(374:task_0) (Tainted: P    B      OEL ------------ T): Objects remaining in numa_policy(374:task_0) on kmem_cache_close()
[Thu Apr 19 08:48:41 2018] -----------------------------------------------------------------------------

[Thu Apr 19 08:48:41 2018] INFO: Slab 0xffffea390e14a100 objects=62 used=2 fp=0xffff964385284210 flags=0x3afffff00004080
[Thu Apr 19 08:48:41 2018] CPU: 154 PID: 60268 Comm: python Tainted: P    B      OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Thu Apr 19 08:48:41 2018] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[Thu Apr 19 08:48:41 2018] Call Trace:
[Thu Apr 19 08:48:41 2018]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Thu Apr 19 08:48:41 2018]  [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Thu Apr 19 08:48:41 2018]  [<ffffffff816a87cb>] ? printk+0x60/0x77
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Thu Apr 19 08:48:41 2018]  [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Thu Apr 19 08:48:41 2018]  [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Thu Apr 19 08:48:41 2018]  [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[Thu Apr 19 08:48:41 2018]  [<ffffffff8120791c>] __fput+0xec/0x260
[Thu Apr 19 08:48:41 2018]  [<ffffffff81207b7e>] ____fput+0xe/0x10
[Thu Apr 19 08:48:41 2018]  [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Thu Apr 19 08:48:41 2018]  [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Thu Apr 19 08:48:41 2018]  [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[Thu Apr 19 08:48:41 2018]  [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[Thu Apr 19 08:48:41 2018]  [<ffffffff8102a457>] do_signal+0x57/0x6c0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[Thu Apr 19 08:48:41 2018]  [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[Thu Apr 19 08:48:41 2018]  [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[Thu Apr 19 08:48:41 2018]  [<ffffffff816c0a5d>] int_signal+0x12/0x17
[Thu Apr 19 08:48:41 2018] INFO: Object 0xffff964385284b58 @offset=2904
[Thu Apr 19 08:48:41 2018] INFO: Object 0xffff964385284c60 @offset=3168
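
For anyone triaging similar logs, here is a minimal sketch (not part of the original report) of how the slab reports above could be tallied per cache. It assumes the dmesg line format shown in this ticket; the function name `summarize_slab_bugs` and the regular expressions are illustrative, and other kernels may format these lines differently.

```python
import re
from collections import defaultdict

# Matches the "BUG <cache> (Tainted: ...): Objects remaining in ..." header line.
BUG_RE = re.compile(r"BUG (?P<cache>\S+) \(.*?\): Objects remaining in")
# Matches the "INFO: Slab 0x... objects=N used=M" line that follows each header.
SLAB_RE = re.compile(r"INFO: Slab 0x[0-9a-f]+ objects=(?P<objects>\d+) used=(?P<used>\d+)")


def summarize_slab_bugs(lines):
    """Count BUG reports and still-allocated objects per slab cache name."""
    summary = defaultdict(lambda: {"reports": 0, "objects_in_use": 0})
    current = None  # cache named by the most recent BUG header
    for line in lines:
        bug = BUG_RE.search(line)
        if bug:
            current = bug.group("cache")
            summary[current]["reports"] += 1
            continue
        slab = SLAB_RE.search(line)
        if slab and current is not None:
            summary[current]["objects_in_use"] += int(slab.group("used"))
    return dict(summary)
```

Feeding it the saved output of `dmesg -T` (for example `summarize_slab_bugs(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file name) gives one entry per cache, which makes it easier to see whether the reports keep growing over time.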

Comment 1 John Hanks 2018-04-20 20:48:58 MDT

I've set ConstrainKmemSpace=no in my cgroup.conf now, and will follow up if/when the problem reappears.

jbh
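
As a rough illustration of checking this setting across nodes, the sketch below parses cgroup.conf text and reports what ConstrainKmemSpace is set to. The helper names are hypothetical, the parser only handles simple `Key=Value` lines with `#` comments, and it returns `None` when the option is absent rather than guessing the default, which may differ between Slurm versions.

```python
def parse_cgroup_conf(text):
    """Parse simple Key=Value lines; keys are lowercased, '#' starts a comment."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def kmem_constraint(text):
    """Return the ConstrainKmemSpace value in lowercase, or None if it is unset."""
    value = parse_cgroup_conf(text).get("constrainkmemspace")
    return value.lower() if value is not None else None
```

A typical use would be `kmem_constraint(open("/etc/slurm/cgroup.conf").read())`, where the path is an assumption and depends on how Slurm was installed at your site.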

Comment 2 Tim Wickberg 2018-04-23 21:55:44 MDT

Is there any chance you can test a newer kernel version out on some of the nodes and see if this reproduces there?

That kernel version, from what I can quickly put together, was the first with RHEL's Meltdown/Spectre mitigations included, and I'm wondering if there may have been an issue that's already been fixed.

Generally speaking, any error that leads to a kernel BUG is by definition a kernel bug. Slurm uses the userspace APIs to set up and manipulate these cgroups, but nothing we do through that interface should trigger a BUG.

- Tim

Comment 3 John Hanks 2018-04-24 09:08:21 MDT

(In reply to Tim Wickberg from comment #2)
> Is there any chance you can test a newer kernel version out on some of the
> nodes and see if this reproduces there?
> 

We haven't had this occur again since setting 

ConstrainKmemSpace=no

but we also haven't had another surge of large numbers of short, small jobs with lots of job turnover. If that doesn't happen before our next maintenance, then I'll fake it and see if I can reproduce.

jbh
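
One possible way to fake such a surge is sketched below. This is only an assumption about how it could be done, not what was actually run at the site: it builds one `sbatch --wrap` command per job, each a tiny single-task job that sleeps briefly. The function names, job names, and the `sleep` payload are hypothetical; the original traces came from a `python` process using `nvidia_uvm`, so a faithful reproduction would likely need a GPU workload instead.

```python
import subprocess


def build_surge_commands(n_jobs, run_seconds=1, mem_mb=100):
    """Return one sbatch argv list per job; nothing is submitted here."""
    if n_jobs < 0:
        raise ValueError("n_jobs must be non-negative")
    return [
        [
            "sbatch",
            "--ntasks=1",
            f"--mem={mem_mb}",          # megabytes by default
            "--time=1",                 # one-minute time limit
            "--output=/dev/null",
            f"--job-name=surge-{i}",
            f"--wrap=sleep {run_seconds}",
        ]
        for i in range(n_jobs)
    ]


def submit_all(commands):
    """Submit each prepared command; raises if sbatch exits non-zero."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `submit_all(build_surge_commands(500))` on a test partition would queue 500 short jobs. Keeping command construction separate from submission makes it possible to inspect the commands before anything reaches the scheduler.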

Comment 4 Tim Wickberg 2018-05-08 22:43:56 MDT

(In reply to John Hanks from comment #3)
> (In reply to Tim Wickberg from comment #2)
> > Is there any chance you can test a newer kernel version out on some of the
> > nodes and see if this reproduces there?
> > 
> 
> We haven't had this occur again since setting 
> 
> ConstrainKmemSpace=no
> 
> but, we also haven't had another surge of large numbers of short, small jobs
> with lots of job turnover. If that doesn't happen before our next
> maintenance then I'll fake it and see if I can reproduce.
> 
> jbh

Any update on this?

We've found some further evidence of additional kernel bugs related to the Kmem limits, but they're only fixed in very recent (4.0+) kernels. Bug 5082 has further details on that issue if you're interested.

- Tim

Comment 5 John Hanks 2018-05-09 07:29:51 MDT

Hi Tim,

I don't see any more messages about this, so I think we are good, at least until/unless we need to set ConstrainKmemSpace=yes. Do you know of a reason why we might need to do that?

Will follow the other bug for now.

jbh

Comment 6 Tim Wickberg 2018-06-27 14:28:59 MDT

> I don't see any more messages about this so I think we are good. At least
> until/unless we need to set ConstrainKmemSpace=yes, maybe you know a reason
> why we might need to do that?

Not in most situations. It can help rein in page cache usage, but usually the Linux kernel sorts that out itself.

Marking resolved/infogiven now.

- Tim