Created attachment 7568 [details]
GDB output

Fresh 17.11.9 install segfaults at controller start:

decay[9429]: segfault at 7f4ebc022066 ip 00007f4ed5d92ea0 sp 00007f4ed5d8ea28 error 4 in priority_multifactor.so[7f4ed5d90000+a000]

(gdb) bt
#0  _sort_part_tier (x=0x7fce10020f58, y=0x7fce10020f60) at priority_multifactor.c:497
#1  0x00007fce34e9ad59 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#2  0x00007fce34e9aac8 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#3  0x00007fce34e9b04c in qsort_r () from /lib64/libc.so.6
#4  0x00007fce356e6496 in list_sort (l=0x1a716f0, f=f@entry=0x7fce30d79ea0 <_sort_part_tier>) at list.c:507
#5  0x00007fce30d7dc22 in _get_priority_internal (start_time=<optimized out>, job_ptr=job_ptr@entry=0x2120d20) at priority_multifactor.c:601
#6  0x00007fce30d7e08f in decay_apply_weighted_factors (job_ptr=0x2120d20, start_time_ptr=<optimized out>) at priority_multifactor.c:2007
#7  0x00007fce356e61a5 in list_for_each (l=l@entry=0x12494a0, f=0x7fce30d7e05a <decay_apply_weighted_factors>, arg=arg@entry=0x7fce317a4d78) at list.c:420
#8  0x00007fce30d7e8e0 in fair_tree_decay (jobs=0x12494a0, start=1533929235) at fair_tree.c:71
#9  0x00007fce30d7c3ec in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
#10 0x00007fce35236e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fce34f60bad in clone () from /lib64/libc.so.6

'thread apply all bt full' output attached.

The controller crashes at every start, so we had to downgrade to 17.11.8.

I have to say, we're not having a great experience with Slurm releases, lately. :(
Hi Kilian,

This is unfortunate, as we do try to mitigate these types of issues. We will look into this. Thank you for reporting this.

-Jason
Hi,

This backtrace looks strange. Are you sure it was generated against the 17.11.9 versions of slurmctld and the plugins?

Dominik
(In reply to Dominik Bartkiewicz from comment #2)
> Hi,
>
> This backtrace looks strange. Are you sure it was generated against the
> 17.11.9 versions of slurmctld and the plugins?

Yes, all the Slurm packages were up-to-date and at version 17.11.9 before the segfault, and the bt has been generated with them:

# rpm -qa slurm\*
slurm-perlapi-17.11.9-1.el7.x86_64
slurm-17.11.9-1.el7.x86_64
slurm-contribs-17.11.9-1.el7.x86_64
slurm-slurmctld-17.11.9-1.el7.x86_64
slurm-slurmdbd-17.11.9-1.el7.x86_64

# gdb /usr/sbin/slurmctld /var/spool/slurm.state/core.98220 -ex 'bt' -ex 'exit'
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 98293]
[New LWP 98226]
[New LWP 98291]
[New LWP 98238]
[New LWP 98296]
[New LWP 98294]
[New LWP 98222]
[New LWP 98295]
[New LWP 98220]
[New LWP 98235]
[New LWP 98297]
[New LWP 98299]
[New LWP 98221]
[New LWP 98239]
[New LWP 98292]
[New LWP 98223]
[New LWP 98276]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _sort_part_tier (x=0x7fce10020f58, y=0x7fce10020f60) at priority_multifactor.c:497
497     priority_multifactor.c: No such file or directory.
#0  _sort_part_tier (x=0x7fce10020f58, y=0x7fce10020f60) at priority_multifactor.c:497
#1  0x00007fce34e9ad59 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#2  0x00007fce34e9aac8 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#3  0x00007fce34e9b04c in qsort_r () from /lib64/libc.so.6
#4  0x00007fce356e6496 in list_sort (l=0x1a716f0, f=f@entry=0x7fce30d79ea0 <_sort_part_tier>) at list.c:507
#5  0x00007fce30d7dc22 in _get_priority_internal (start_time=<optimized out>, job_ptr=job_ptr@entry=0x2120d20) at priority_multifactor.c:601
#6  0x00007fce30d7e08f in decay_apply_weighted_factors (job_ptr=0x2120d20, start_time_ptr=<optimized out>) at priority_multifactor.c:2007
#7  0x00007fce356e61a5 in list_for_each (l=l@entry=0x12494a0, f=0x7fce30d7e05a <decay_apply_weighted_factors>, arg=arg@entry=0x7fce317a4d78) at list.c:420
#8  0x00007fce30d7e8e0 in fair_tree_decay (jobs=0x12494a0, start=1533929235) at fair_tree.c:71
#9  0x00007fce30d7c3ec in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
#10 0x00007fce35236e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fce34f60bad in clone () from /lib64/libc.so.6
Undefined command: "exit".  Try "help".
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.9-1.el7.x86_64
(gdb)

Are there specific things to check in the coredump to verify the version numbers?

Cheers,
--
Kilian
Created attachment 7572 [details]
gdb output

Oh, but I didn't attach the right output file, sorry about that. Here's the right one.
Hi,

Thanks, this one looks much more readable.

Dominik
Hi,

The good news is that Danny has a fix for it, and it should be committed shortly. 17.11.9-2, with this and a few other changes, will be released soon.

Dominik
Hi,

Fixed in commit:
https://github.com/SchedMD/slurm/commit/21d2ab6ed16
17.11.9-2 is already available on the download page.
If you have no additional questions, can we close this ticket?

Dominik
(In reply to Dominik Bartkiewicz from comment #9)
> Hi
>
> Fixed in commit:
> https://github.com/SchedMD/slurm/commit/21d2ab6ed16
> 17.11.9-2 is already available on the download page.
> If you have no additional questions can we close this ticket?

Yep, sounds good, I'll give 17.11.9-2 a try.

Thanks!
--
Kilian
*** Ticket 5615 has been marked as a duplicate of this ticket. ***