Bug 3694 - task/cgroup causing Wrong slab cache
Summary: task/cgroup causing Wrong slab cache
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 16.05.8
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-04-13 18:50 MDT by OCF Support
Modified: 2018-04-20 20:48 MDT
CC List: 5 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name: Blue Crystal 4
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf file (75.48 KB, text/plain) - 2017-04-13 18:50 MDT, OCF Support
cgroup config (293 bytes, text/x-matlab) - 2017-04-13 18:51 MDT, OCF Support
topology config (1.06 KB, text/plain) - 2017-04-13 18:51 MDT, OCF Support

Description OCF Support 2017-04-13 18:50:27 MDT
Created attachment 4358 [details]
slurm.conf file

Hi,

We have been working with Intel on a problem running a job under Slurm. The slurm.conf is attached. The stack traces we received in syslog, shown below, at first indicated that it might be an OPA issue.

However, after changing TaskPlugin from task/cgroup to task/affinity or task/none, we no longer see the errors shown below. Intel also found that setting ConstrainRAMSpace=no makes the issue go away.

The final conclusion from the Intel OPA developers was:

"This is not a hfi1 driver or OPA issue.  A work around is to NOT use the
ConstrainRAMSpace parameter in the slurm cgroup.conf file.
 
An alternative would be to modify the slurm plugin code to NOT set the
memory.kmem.limit_in_bytes parameter in the cgroup plugin source code."

Would you be able to shed any light on this, and can it be resolved as per their comments? Let me know if there is any other output you require.
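
For reference, a minimal cgroup.conf sketch of the workaround Intel describes; only the constraint lines matter here, and the other values are illustrative rather than taken from our real configuration:
--
CgroupAutomount=yes
ConstrainCores=yes
# Workaround per the Intel analysis above: with the RAM-space constraint off,
# the task/cgroup plugin no longer applies memory.kmem.limit_in_bytes to the
# job's memory cgroup
ConstrainRAMSpace=no
--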


Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr  4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache

and

Apr  4 20:34:13 gpu14 kernel: =============================================================================
Apr  4 20:34:13 gpu14 kernel: BUG numa_policy(223:task_0) (Tainted: P    B   W  OE  ------------  ): Objects remaining in numa_policy(223:task_0) on kmem_cache_close()
Apr  4 20:34:13 gpu14 kernel: -----------------------------------------------------------------------------
Apr  4 20:34:13 gpu14 kernel: INFO: Slab 0xffffea000885f700 objects=31 used=6 fp=0xffff8802217ddef0 flags=0x2fffff00004080
Apr  4 20:34:13 gpu14 kernel: CPU: 14 PID: 25290 Comm: mdrun_mpi Tainted: P    B   W  OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
Apr  4 20:34:13 gpu14 kernel: Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE128H-2.30]- 12/13/2016
Apr  4 20:34:13 gpu14 kernel: ffffea000885f700 000000000cd24b2d ffff881b47287b50 ffffffff816864ef
Apr  4 20:34:13 gpu14 kernel: ffff881b47287c28 ffffffff811da054 ffff881900000020 ffff881b47287c38
Apr  4 20:34:13 gpu14 kernel: ffff881b47287be8 656a624f0000001c 616d657220737463 6e6920676e696e69
Apr  4 20:34:13 gpu14 kernel: Call Trace:
Apr  4 20:34:13 gpu14 kernel: [<ffffffff816864ef>] dump_stack+0x19/0x1b
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811da054>] slab_err+0xb4/0xe0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff81317ca9>] ? free_cpumask_var+0x9/0x10
Apr  4 20:34:13 gpu14 kernel: [<ffffffff810f9d2d>] ? on_each_cpu_cond+0xcd/0x180
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811dba90>] ? kmem_cache_alloc_bulk+0x140/0x140
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811dd353>] ? __kmalloc+0x1f3/0x240
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811dfa2b>] ? kmem_cache_close+0x12b/0x2f0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811dfa4c>] kmem_cache_close+0x14c/0x2f0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811dfc04>] __kmem_cache_shutdown+0x14/0x80
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811a5904>] kmem_cache_destroy+0x44/0xf0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811f3f49>] kmem_cache_destroy_memcg_children+0x89/0xb0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811a58d9>] kmem_cache_destroy+0x19/0xf0
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa1616af7>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa161870e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15edc50>] remove_gpu+0x220/0x300 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15edf71>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15f1458>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15e7501>] uvm_release+0x11/0x20 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffff81200109>] __fput+0xe9/0x260
Apr  4 20:34:13 gpu14 kernel: [<ffffffff812003be>] ____fput+0xe/0x10
Apr  4 20:34:13 gpu14 kernel: [<ffffffff810aceb4>] task_work_run+0xc4/0xe0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8108bdd8>] do_exit+0x2d8/0xa40
Apr  4 20:34:13 gpu14 kernel: [<ffffffff810c5070>] ? wake_up_state+0x10/0x20
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8109aede>] ? signal_wake_up_state+0x1e/0x30
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8109c342>] ? zap_other_threads+0x92/0xc0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8108c5bf>] do_group_exit+0x3f/0xa0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8108c634>] SyS_exit_group+0x14/0x20
Apr  4 20:34:13 gpu14 kernel: [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
Apr  4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dc000 @offset=0
Apr  4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dce70 @offset=3696
Apr  4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd080 @offset=4224
Apr  4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd290 @offset=4752
Apr  4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd4a0 @offset=5280
Apr  4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd6b0 @offset=5808
Apr  4 20:34:13 gpu14 kernel: kmem_cache_destroy numa_policy(223:task_0): Slab cache still has objects
Apr  4 20:34:13 gpu14 kernel: CPU: 14 PID: 25290 Comm: mdrun_mpi Tainted: P    B   W  OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
Apr  4 20:34:13 gpu14 kernel: Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE128H-2.30]- 12/13/2016
Apr  4 20:34:13 gpu14 kernel: ffff88196c2c8200 000000000cd24b2d ffff881b47287ca8 ffffffff816864ef
Apr  4 20:34:13 gpu14 kernel: ffff881b47287cc8 ffffffff811a59a0 00000000000000df ffff88196c2c8200
Apr  4 20:34:13 gpu14 kernel: ffff881b47287cf0 ffffffff811f3f49 ffff88017fc02200 ffff88197ff5a468
Apr  4 20:34:13 gpu14 kernel: Call Trace:
Apr  4 20:34:13 gpu14 kernel: [<ffffffff816864ef>] dump_stack+0x19/0x1b
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811a59a0>] kmem_cache_destroy+0xe0/0xf0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811f3f49>] kmem_cache_destroy_memcg_children+0x89/0xb0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff811a58d9>] kmem_cache_destroy+0x19/0xf0
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa1616af7>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa161870e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15edc50>] remove_gpu+0x220/0x300 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15edf71>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15f1458>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffffa15e7501>] uvm_release+0x11/0x20 [nvidia_uvm]
Apr  4 20:34:13 gpu14 kernel: [<ffffffff81200109>] __fput+0xe9/0x260
Apr  4 20:34:13 gpu14 kernel: [<ffffffff812003be>] ____fput+0xe/0x10
Apr  4 20:34:13 gpu14 kernel: [<ffffffff810aceb4>] task_work_run+0xc4/0xe0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8108bdd8>] do_exit+0x2d8/0xa40
Apr  4 20:34:13 gpu14 kernel: [<ffffffff810c5070>] ? wake_up_state+0x10/0x20
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8109aede>] ? signal_wake_up_state+0x1e/0x30
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8109c342>] ? zap_other_threads+0x92/0xc0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8108c5bf>] do_group_exit+0x3f/0xa0
Apr  4 20:34:13 gpu14 kernel: [<ffffffff8108c634>] SyS_exit_group+0x14/0x20
Apr  4 20:34:13 gpu14 kernel: [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
Comment 1 OCF Support 2017-04-13 18:51:03 MDT
Created attachment 4359 [details]
cgroup config
Comment 2 OCF Support 2017-04-13 18:51:26 MDT
Created attachment 4360 [details]
topology config
Comment 4 Benedikt Schaefer 2017-07-04 07:40:15 MDT
We also saw this with Slurm version 17.02.4-1.el7.centos.

Error messages from /var/log/messages:
Jul  4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Jul  4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Jul  4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Jul  4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Comment 5 Benedikt Schaefer 2017-07-04 07:49:05 MDT
System config:
OS: CentOS-7.3
kernel: 3.10.0-514.10.2.el7.x86_64
slurm: 17.02.4-1.el7.centos
OPA: 10.3.1.0-7.el7
GPFS: 4.2.3

slurm config:
TaskPlugin=task/cgroup

cgroup.conf:
--
CgroupAutomount=no
CgroupReleaseAgentDir="/etc/slurm/cgroup"
CgroupMountpoint=/sys/fs/cgroup

ConstrainCores=no
ConstrainRAMSpace=yes

# Prevent job to use swap
ConstrainSwapSpace=yes
AllowedSwapSpace=0

# Maximum usage of memory if user did not specify value
# Nodes have 128GB, keep 4GB for GPFS pagepool, few GBs for the system
MaxRAMPercent=95
--
Comment 6 John Hanks 2018-04-20 07:50:50 MDT
Just adding a "me too":

[259737.567508] INFO: Object 0xffff961750384a50 @offset=2640
[259737.567512] =============================================================================
[259737.567514] BUG numa_policy(374:task_0) (Tainted: P    B      OEL ------------ T): Objects remaining in numa_policy(374:task_0) on kmem_cache_close()
[259737.567515] -----------------------------------------------------------------------------

[259737.567517] INFO: Slab 0xffffea385ebcff00 objects=62 used=2 fp=0xffff9617af3ffac8 flags=0x3afffff00004080
[259737.567519] CPU: 154 PID: 60268 Comm: python Tainted: P    B      OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[259737.567521] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[259737.567522] Call Trace:
[259737.567525]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[259737.567528]  [<ffffffff811e0904>] slab_err+0xb4/0xe0
[259737.567530]  [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[259737.567533]  [<ffffffff816a87cb>] ? printk+0x60/0x77
[259737.567535]  [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[259737.567544]  [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[259737.567547]  [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[259737.567550]  [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[259737.567553]  [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[259737.567555]  [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[259737.567558]  [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[259737.567570]  [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[259737.567590]  [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[259737.567601]  [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[259737.567611]  [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[259737.567630]  [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[259737.567639]  [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[259737.567642]  [<ffffffff8120791c>] __fput+0xec/0x260
[259737.567646]  [<ffffffff81207b7e>] ____fput+0xe/0x10
[259737.567648]  [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[259737.567659]  [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[259737.567664]  [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[259737.567667]  [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[259737.567671]  [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[259737.567675]  [<ffffffff8102a457>] do_signal+0x57/0x6c0
[259737.567679]  [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[259737.567682]  [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[259737.567686]  [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[259737.567697]  [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[259737.567702]  [<ffffffff816c0a5d>] int_signal+0x12/0x17
[259737.567735] INFO: Object 0xffff9617af3ff180 @offset=12672
[259737.567738] INFO: Object 0xffff9617af3ff5a0 @offset=13728
[259737.567742] kmem_cache_destroy numa_policy(374:task_0): Slab cache still has objects
[259737.567746] CPU: 154 PID: 60268 Comm: python Tainted: P    B      OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[259737.567748] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[259737.567749] Call Trace:
[259737.567753]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[259737.567756]  [<ffffffff811ab000>] kmem_cache_destroy+0xe0/0xf0
[259737.567767]  [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[259737.567770]  [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[259737.567784]  [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[259737.567797]  [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[259737.567817]  [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[259737.567827]  [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[259737.567838]  [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[259737.567854]  [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[259737.567858]  [<ffffffff8120791c>] __fput+0xec/0x260
[259737.567862]  [<ffffffff81207b7e>] ____fput+0xe/0x10
[259737.567864]  [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[259737.567867]  [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[259737.567871]  [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[259737.567874]  [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[259737.567878]  [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[259737.567890]  [<ffffffff8102a457>] do_signal+0x57/0x6c0
[259737.567894]  [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[259737.567896]  [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[259737.567900]  [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[259737.567903]  [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[259737.567907]  [<ffffffff816c0a5d>] int_signal+0x12/0x17

SLURM 17.11.5
CentOS 7.4, kernel 3.10.0-693.21.1.el7.x86_64

jbh
Comment 7 Brian Christiansen 2018-04-20 09:18:02 MDT
It appears that Bug 3874 would resolve the earlier reported issues, as that fix went into 17.02.5.

John, do you have ConstrainKmemSpace=yes set in your cgroup.conf by chance? If so, will you try it with that off? If you are still seeing issues, will you open a separate ticket for this?
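
For reference, the test would be a one-line change in cgroup.conf (a sketch only; the rest of your file stays as it is):
--
# Disable kernel-memory limiting (memory.kmem.limit_in_bytes) while keeping
# ConstrainRAMSpace=yes
ConstrainKmemSpace=no
--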
Comment 8 John Hanks 2018-04-20 20:48:07 MDT
(In reply to Brian Christiansen from comment #7)
> It appears that Bug 3874 would resolve the earlier reported issues, as that
> fix went into 17.02.5.
> 
> John, do you have ConstrainKmemSpace=yes set in your cgroup.conf by chance?
> If so, will you try it with that off? If you are still seeing issues, will
> you open a separate ticket for this?

ConstrainKmemSpace wasn't in my cgroup.conf, so it was yes by default. I've created a new bug, 5092.

jbh