Created attachment 4358 [details]
slurm.conf file

Hi,

We have been working with Intel on a problem running a job under Slurm. The slurm.conf is attached. The stack traces we received in syslog (shown below) initially suggested this might be an OPA issue. However, after changing TaskPlugin from task/cgroup to task/affinity or task/none, we no longer see the errors shown below. Intel also found that setting ConstrainRAMSpace=no makes the issue go away.

The final conclusion from the Intel OPA developers was: "This is not a hfi1 driver or OPA issue. A work around is to NOT use the ConstrainRAMSpace parameter in the slurm cgroup.conf file. An alternative would be to modify the slurm plugin code to NOT set the memory.kmem.limit_in_bytes parameter in the cgroup plugin source code."

Would you be able to shed any light on this, and can it be resolved along the lines of their comments? Let me know if there is any other output you require.

Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(293:task_0) but object is from kmem_cache_node
Apr 4 20:31:13 compute493 kernel: cache_from_obj: Wrong slab cache. kmalloc-256(293:task_0) but object is from kmem_cache

and

Apr 4 20:34:13 gpu14 kernel: =============================================================================
Apr 4 20:34:13 gpu14 kernel: BUG numa_policy(223:task_0) (Tainted: P B W OE ------------ ): Objects remaining in numa_policy(223:task_0) on kmem_cache_close()
Apr 4 20:34:13 gpu14 kernel: -----------------------------------------------------------------------------
Apr 4 20:34:13 gpu14 kernel: INFO: Slab 0xffffea000885f700 objects=31 used=6 fp=0xffff8802217ddef0 flags=0x2fffff00004080
Apr 4 20:34:13 gpu14 kernel: CPU: 14 PID: 25290 Comm: mdrun_mpi Tainted: P B W OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
Apr 4 20:34:13 gpu14 kernel: Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE128H-2.30]- 12/13/2016
Apr 4 20:34:13 gpu14 kernel: ffffea000885f700 000000000cd24b2d ffff881b47287b50 ffffffff816864ef
Apr 4 20:34:13 gpu14 kernel: ffff881b47287c28 ffffffff811da054 ffff881900000020 ffff881b47287c38
Apr 4 20:34:13 gpu14 kernel: ffff881b47287be8 656a624f0000001c 616d657220737463 6e6920676e696e69
Apr 4 20:34:13 gpu14 kernel: Call Trace:
Apr 4 20:34:13 gpu14 kernel: [<ffffffff816864ef>] dump_stack+0x19/0x1b
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811da054>] slab_err+0xb4/0xe0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff81317ca9>] ? free_cpumask_var+0x9/0x10
Apr 4 20:34:13 gpu14 kernel: [<ffffffff810f9d2d>] ? on_each_cpu_cond+0xcd/0x180
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811dba90>] ? kmem_cache_alloc_bulk+0x140/0x140
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811dd353>] ? __kmalloc+0x1f3/0x240
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811dfa2b>] ? kmem_cache_close+0x12b/0x2f0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811dfa4c>] kmem_cache_close+0x14c/0x2f0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811dfc04>] __kmem_cache_shutdown+0x14/0x80
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811a5904>] kmem_cache_destroy+0x44/0xf0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811f3f49>] kmem_cache_destroy_memcg_children+0x89/0xb0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811a58d9>] kmem_cache_destroy+0x19/0xf0
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa1616af7>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa161870e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15edc50>] remove_gpu+0x220/0x300 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15edf71>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15f1458>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15e7501>] uvm_release+0x11/0x20 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffff81200109>] __fput+0xe9/0x260
Apr 4 20:34:13 gpu14 kernel: [<ffffffff812003be>] ____fput+0xe/0x10
Apr 4 20:34:13 gpu14 kernel: [<ffffffff810aceb4>] task_work_run+0xc4/0xe0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8108bdd8>] do_exit+0x2d8/0xa40
Apr 4 20:34:13 gpu14 kernel: [<ffffffff810c5070>] ? wake_up_state+0x10/0x20
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8109aede>] ? signal_wake_up_state+0x1e/0x30
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8109c342>] ? zap_other_threads+0x92/0xc0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8108c5bf>] do_group_exit+0x3f/0xa0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8108c634>] SyS_exit_group+0x14/0x20
Apr 4 20:34:13 gpu14 kernel: [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
Apr 4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dc000 @offset=0
Apr 4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dce70 @offset=3696
Apr 4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd080 @offset=4224
Apr 4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd290 @offset=4752
Apr 4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd4a0 @offset=5280
Apr 4 20:34:13 gpu14 kernel: INFO: Object 0xffff8802217dd6b0 @offset=5808
Apr 4 20:34:13 gpu14 kernel: kmem_cache_destroy numa_policy(223:task_0): Slab cache still has objects
Apr 4 20:34:13 gpu14 kernel: CPU: 14 PID: 25290 Comm: mdrun_mpi Tainted: P B W OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
Apr 4 20:34:13 gpu14 kernel: Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE128H-2.30]- 12/13/2016
Apr 4 20:34:13 gpu14 kernel: ffff88196c2c8200 000000000cd24b2d ffff881b47287ca8 ffffffff816864ef
Apr 4 20:34:13 gpu14 kernel: ffff881b47287cc8 ffffffff811a59a0 00000000000000df ffff88196c2c8200
Apr 4 20:34:13 gpu14 kernel: ffff881b47287cf0 ffffffff811f3f49 ffff88017fc02200 ffff88197ff5a468
Apr 4 20:34:13 gpu14 kernel: Call Trace:
Apr 4 20:34:13 gpu14 kernel: [<ffffffff816864ef>] dump_stack+0x19/0x1b
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811a59a0>] kmem_cache_destroy+0xe0/0xf0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811f3f49>] kmem_cache_destroy_memcg_children+0x89/0xb0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff811a58d9>] kmem_cache_destroy+0x19/0xf0
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa1616af7>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa161870e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15edc50>] remove_gpu+0x220/0x300 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15edf71>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15f1458>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffffa15e7501>] uvm_release+0x11/0x20 [nvidia_uvm]
Apr 4 20:34:13 gpu14 kernel: [<ffffffff81200109>] __fput+0xe9/0x260
Apr 4 20:34:13 gpu14 kernel: [<ffffffff812003be>] ____fput+0xe/0x10
Apr 4 20:34:13 gpu14 kernel: [<ffffffff810aceb4>] task_work_run+0xc4/0xe0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8108bdd8>] do_exit+0x2d8/0xa40
Apr 4 20:34:13 gpu14 kernel: [<ffffffff810c5070>] ? wake_up_state+0x10/0x20
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8109aede>] ? signal_wake_up_state+0x1e/0x30
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8109c342>] ? zap_other_threads+0x92/0xc0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8108c5bf>] do_group_exit+0x3f/0xa0
Apr 4 20:34:13 gpu14 kernel: [<ffffffff8108c634>] SyS_exit_group+0x14/0x20
Apr 4 20:34:13 gpu14 kernel: [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
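For clarity, the workaround Intel describes amounts to turning off RAM confinement in cgroup.conf, which (per their analysis) is what stops the plugin from writing memory.kmem.limit_in_bytes into the job/step/task memory cgroups. A minimal sketch of that change is below; the attached cgroup.conf contains our real settings and only the one parameter would change:

--
# Workaround suggested by Intel: do not constrain RAM, so the kernel
# memory limit (memory.kmem.limit_in_bytes) is never applied by task/cgroup.
ConstrainRAMSpace=no
--

Obviously this gives up per-job memory enforcement, which is why a fix in the plugin itself (not setting the kmem limit) would be preferable.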
Created attachment 4359 [details] cgroup config
Created attachment 4360 [details] topology config
We saw this also with Slurm version 17.02.4-1.el7.centos.

Error messages in /var/log/messages:

Jul 4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Jul 4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Jul 4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
Jul 4 13:29:44 n292 kernel: cache_from_obj: Wrong slab cache. kmalloc-64(1769:task_19) but object is from kmem_cache_node
System config:
OS: CentOS-7.3
kernel: 3.10.0-514.10.2.el7.x86_64
slurm: 17.02.4-1.el7.centos
OPA: 10.3.1.0-7.el7
GPFS: 4.2.3

slurm config:
TaskPlugin=task/cgroup

cgroup.conf:
--
CgroupAutomount=no
CgroupReleaseAgentDir="/etc/slurm/cgroup"
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=no
ConstrainRAMSpace=yes
# Prevent job to use swap
ConstrainSwapSpace=yes
AllowedSwapSpace=0
# Maximum usage of memory if user did not specify value
# Nodes have 128GB, keep 4GB for GPFS pagepool, few GBs for the system
MaxRAMPercent=95
--
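For what it's worth, the limits that task/cgroup actually writes can be read out of the memory cgroup hierarchy while a job is running. The uid/job/step values below are hypothetical, and the exact path layout depends on CgroupMountpoint and the Slurm version, but on our nodes it is roughly:

cd /sys/fs/cgroup/memory/slurm/uid_1000/job_12345/step_0
cat memory.limit_in_bytes          # RAM limit from ConstrainRAMSpace
cat memory.kmem.limit_in_bytes     # kernel-memory limit implicated in the traces above
cat task_0/memory.kmem.limit_in_bytes   # per-task cgroup, matching the "task_0" names in the logs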
Just adding a "me too":

[259737.567508] INFO: Object 0xffff961750384a50 @offset=2640
[259737.567512] =============================================================================
[259737.567514] BUG numa_policy(374:task_0) (Tainted: P B OEL ------------ T): Objects remaining in numa_policy(374:task_0) on kmem_cache_close()
[259737.567515] -----------------------------------------------------------------------------
[259737.567517] INFO: Slab 0xffffea385ebcff00 objects=62 used=2 fp=0xffff9617af3ffac8 flags=0x3afffff00004080
[259737.567519] CPU: 154 PID: 60268 Comm: python Tainted: P B OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[259737.567521] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[259737.567522] Call Trace:
[259737.567525] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[259737.567528] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[259737.567530] [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[259737.567533] [<ffffffff816a87cb>] ? printk+0x60/0x77
[259737.567535] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[259737.567544] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[259737.567547] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[259737.567550] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[259737.567553] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[259737.567555] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[259737.567558] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[259737.567570] [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[259737.567590] [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[259737.567601] [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[259737.567611] [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[259737.567630] [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[259737.567639] [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[259737.567642] [<ffffffff8120791c>] __fput+0xec/0x260
[259737.567646] [<ffffffff81207b7e>] ____fput+0xe/0x10
[259737.567648] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[259737.567659] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[259737.567664] [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[259737.567667] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[259737.567671] [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[259737.567675] [<ffffffff8102a457>] do_signal+0x57/0x6c0
[259737.567679] [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[259737.567682] [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[259737.567686] [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[259737.567697] [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[259737.567702] [<ffffffff816c0a5d>] int_signal+0x12/0x17
[259737.567735] INFO: Object 0xffff9617af3ff180 @offset=12672
[259737.567738] INFO: Object 0xffff9617af3ff5a0 @offset=13728
[259737.567742] kmem_cache_destroy numa_policy(374:task_0): Slab cache still has objects
[259737.567746] CPU: 154 PID: 60268 Comm: python Tainted: P B OEL ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[259737.567748] Hardware name: HPE HPE Integrity MC990 X Server/HPE Integrity MC990 X Server, BIOS Integrity MC990 X BIOS 03/09/2018
[259737.567749] Call Trace:
[259737.567753] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[259737.567756] [<ffffffff811ab000>] kmem_cache_destroy+0xe0/0xf0
[259737.567767] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[259737.567770] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[259737.567784] [<ffffffffc083b377>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[259737.567797] [<ffffffffc083cfd0>] uvm_pmm_gpu_deinit+0x50/0x60 [nvidia_uvm]
[259737.567817] [<ffffffffc080b050>] remove_gpu+0x220/0x2a0 [nvidia_uvm]
[259737.567827] [<ffffffffc080b261>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[259737.567838] [<ffffffffc080fa24>] uvm_va_space_destroy+0x384/0x400 [nvidia_uvm]
[259737.567854] [<ffffffffc0802371>] uvm_release+0x11/0x20 [nvidia_uvm]
[259737.567858] [<ffffffff8120791c>] __fput+0xec/0x260
[259737.567862] [<ffffffff81207b7e>] ____fput+0xe/0x10
[259737.567864] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[259737.567867] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[259737.567871] [<ffffffff810f99bf>] ? futex_wait+0x11f/0x280
[259737.567874] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[259737.567878] [<ffffffff810a18de>] get_signal_to_deliver+0x1ce/0x5e0
[259737.567890] [<ffffffff8102a457>] do_signal+0x57/0x6c0
[259737.567894] [<ffffffff810fb756>] ? do_futex+0x106/0x5a0
[259737.567896] [<ffffffff810cf15f>] ? pick_next_task_fair+0x5f/0x1b0
[259737.567900] [<ffffffff810fbc70>] ? SyS_futex+0x80/0x180
[259737.567903] [<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[259737.567907] [<ffffffff816c0a5d>] int_signal+0x12/0x17

SLURM 17.11.5
CentOS 7.4, kernel 3.10.0-693.21.1.el7.x86_64

jbh
It appears that Bug 3874 would solve the earlier reported issues, as the fix went into 17.02.5.

John, do you have ConstrainKmemSpace=yes set in your cgroup.conf by chance? If so, will you try it with it turned off? If you are still seeing issues, will you open a separate ticket for this?
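In case it helps, a minimal sketch of that change, assuming the rest of cgroup.conf stays as-is; ConstrainKmemSpace is only honored on 17.02.5 and later:

--
# Keep RAM confinement, but leave memory.kmem.limit_in_bytes unmanaged.
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
--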
(In reply to Brian Christiansen from comment #7)
> It appears that Bug 3874 would solve the earlier reported issues, as the fix
> went into 17.02.5.
>
> John, do you have ConstrainKmemSpace=yes set in your cgroup.conf by chance?
> If so, will you try it with it turned off? If you are still seeing issues,
> will you open a separate ticket for this?

ConstrainKmemSpace wasn't in my cgroup.conf, so it was yes by default. Created a new bug, 5092.

jbh