Ticket 5552 - slurmctld segfault (_sort_part_tier)
Summary: slurmctld segfault (_sort_part_tier)
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.9
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 5615
Depends on:
Blocks:
 
Reported: 2018-08-10 13:39 MDT by Kilian Cavalotti
Modified: 2018-08-24 02:49 MDT

See Also:
Site: Stanford
Version Fixed: 17.11.9-2


Attachments
GDB output (4.38 KB, application/x-bzip)
2018-08-10 13:39 MDT, Kilian Cavalotti
gdb output (4.21 KB, application/x-bzip)
2018-08-10 15:41 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2018-08-10 13:39:18 MDT
Created attachment 7568
GDB output

Fresh 17.11.9 install segfaults at controller start:

decay[9429]: segfault at 7f4ebc022066 ip 00007f4ed5d92ea0 sp 00007f4ed5d8ea28 error 4 in priority_multifactor.so[7f4ed5d90000+a000]
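
The kernel line above can be decoded a little further. What follows is a minimal sketch, not part of the original report: `decode_segfault` is a hypothetical helper, and it assumes the usual x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode). It subtracts the mapping base from the instruction pointer to get the offset of the faulting instruction inside priority_multifactor.so.

```python
import re

# Kernel log line copied verbatim from the report above.
LINE = (
    "decay[9429]: segfault at 7f4ebc022066 ip 00007f4ed5d92ea0 "
    "sp 00007f4ed5d8ea28 error 4 in priority_multifactor.so[7f4ed5d90000+a000]"
)

# Field layout inferred from the line above; all numbers are hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into useful parts."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    err = int(m["err"], 16)
    offset = ip - base  # position of the faulting instruction in the object
    return {
        "object": m["obj"],
        "offset": hex(offset),
        "inside_mapping": 0 <= offset < size,
        # Assumed x86 page-fault error-code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        print(f"{key}: {value}")
```

Under those assumptions, error 4 reads as a user-mode read of an unmapped page, and the offset comes out as 0x2ea0. With debuginfo installed, that offset could be passed to `addr2line -e` together with the path to priority_multifactor.so, which the ticket does not give.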


(gdb) bt
#0  _sort_part_tier (x=0x7fce10020f58, y=0x7fce10020f60) at priority_multifactor.c:497
#1  0x00007fce34e9ad59 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#2  0x00007fce34e9aac8 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#3  0x00007fce34e9b04c in qsort_r () from /lib64/libc.so.6
#4  0x00007fce356e6496 in list_sort (l=0x1a716f0, f=f@entry=0x7fce30d79ea0 <_sort_part_tier>) at list.c:507
#5  0x00007fce30d7dc22 in _get_priority_internal (start_time=<optimized out>, job_ptr=job_ptr@entry=0x2120d20) at priority_multifactor.c:601
#6  0x00007fce30d7e08f in decay_apply_weighted_factors (job_ptr=0x2120d20, start_time_ptr=<optimized out>) at priority_multifactor.c:2007
#7  0x00007fce356e61a5 in list_for_each (l=l@entry=0x12494a0, f=0x7fce30d7e05a <decay_apply_weighted_factors>, arg=arg@entry=0x7fce317a4d78) at list.c:420
#8  0x00007fce30d7e8e0 in fair_tree_decay (jobs=0x12494a0, start=1533929235) at fair_tree.c:71
#9  0x00007fce30d7c3ec in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
#10 0x00007fce35236e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fce34f60bad in clone () from /lib64/libc.so.6


'thread apply all bt full' output attached.


The controller crashes at every start, so we had to downgrade to 17.11.8.
I have to say, we're not having a great experience with Slurm releases, lately. :(
Comment 1 Jason Booth 2018-08-10 14:01:41 MDT
Hi Kilian,

This is unfortunate, as we do try to mitigate these types of issues. We will look into it. Thank you for reporting this.

-Jason
Comment 2 Dominik Bartkiewicz 2018-08-10 15:05:56 MDT
Hi

This backtrace looks strange. Are you sure it was generated against slurmctld and plugins at version 17.11.9?

Dominik
Comment 3 Kilian Cavalotti 2018-08-10 15:30:57 MDT
(In reply to Dominik Bartkiewicz from comment #2)
> Hi
> 
> This backtrace looks strange. Are you sure it was generated against slurmctld
> and plugins at version 17.11.9?

Yes, all the Slurm packages were up-to-date and at version 17.11.9 before the segfault, and the bt has been generated with them:

# rpm -qa slurm\*
slurm-perlapi-17.11.9-1.el7.x86_64
slurm-17.11.9-1.el7.x86_64
slurm-contribs-17.11.9-1.el7.x86_64
slurm-slurmctld-17.11.9-1.el7.x86_64
slurm-slurmdbd-17.11.9-1.el7.x86_64

# gdb /usr/sbin/slurmctld /var/spool/slurm.state/core.98220 -ex 'bt' -ex 'exit'
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 98293]
[New LWP 98226]
[New LWP 98291]
[New LWP 98238]
[New LWP 98296]
[New LWP 98294]
[New LWP 98222]
[New LWP 98295]
[New LWP 98220]
[New LWP 98235]
[New LWP 98297]
[New LWP 98299]
[New LWP 98221]
[New LWP 98239]
[New LWP 98292]
[New LWP 98223]
[New LWP 98276]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _sort_part_tier (x=0x7fce10020f58, y=0x7fce10020f60) at priority_multifactor.c:497
497     priority_multifactor.c: No such file or directory.
#0  _sort_part_tier (x=0x7fce10020f58, y=0x7fce10020f60) at priority_multifactor.c:497
#1  0x00007fce34e9ad59 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#2  0x00007fce34e9aac8 in msort_with_tmp.part.0 () from /lib64/libc.so.6
#3  0x00007fce34e9b04c in qsort_r () from /lib64/libc.so.6
#4  0x00007fce356e6496 in list_sort (l=0x1a716f0, f=f@entry=0x7fce30d79ea0 <_sort_part_tier>) at list.c:507
#5  0x00007fce30d7dc22 in _get_priority_internal (start_time=<optimized out>, job_ptr=job_ptr@entry=0x2120d20) at priority_multifactor.c:601
#6  0x00007fce30d7e08f in decay_apply_weighted_factors (job_ptr=0x2120d20, start_time_ptr=<optimized out>) at priority_multifactor.c:2007
#7  0x00007fce356e61a5 in list_for_each (l=l@entry=0x12494a0, f=0x7fce30d7e05a <decay_apply_weighted_factors>, arg=arg@entry=0x7fce317a4d78) at list.c:420
#8  0x00007fce30d7e8e0 in fair_tree_decay (jobs=0x12494a0, start=1533929235) at fair_tree.c:71
#9  0x00007fce30d7c3ec in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
#10 0x00007fce35236e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fce34f60bad in clone () from /lib64/libc.so.6
Undefined command: "exit".  Try "help".
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.9-1.el7.x86_64
(gdb)
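
The `Undefined command: "exit"` line in the transcript above shows that this gdb build (7.6.1) did not accept `exit`, which is why the session was left at the `(gdb)` prompt. One way to avoid this is gdb's `-batch` option, which runs the `-ex` commands and then exits on its own. The sketch below only builds the argument list: `gdb_batch_argv` is a hypothetical helper, and the binary and core paths are the ones from the transcript.

```python
import shlex


def gdb_batch_argv(binary, core, commands):
    """Hypothetical helper: argv for a non-interactive gdb run on a core file."""
    argv = ["gdb", "-batch"]  # -batch: run the given commands, then exit
    for command in commands:
        argv += ["-ex", command]  # one -ex per gdb command, run in order
    argv += [binary, core]
    return argv


if __name__ == "__main__":
    argv = gdb_batch_argv(
        "/usr/sbin/slurmctld",
        "/var/spool/slurm.state/core.98220",
        ["bt", "thread apply all bt full"],
    )
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(argv))
```

The list can also be passed directly to `subprocess.run` to capture the backtrace to a file for attaching to a ticket.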


Are there specific things to check in the coredump to verify the version numbers?

Cheers,
--
Kilian
Comment 4 Kilian Cavalotti 2018-08-10 15:41:36 MDT
Created attachment 7572
gdb output

Oh, but I didn't attach the right output file, sorry about that. Here's the right one.
Comment 5 Dominik Bartkiewicz 2018-08-10 15:53:19 MDT
Hi

Thanks, this one looks much more readable.
 

Dominik
Comment 7 Dominik Bartkiewicz 2018-08-10 16:24:55 MDT
Hi

The good news is that Danny has a fix for it, and it should be committed shortly.
17.11.9-2, with this and a few other changes, will be released soon.

Dominik
Comment 9 Dominik Bartkiewicz 2018-08-10 16:46:37 MDT
Hi

Fixed in commit:
https://github.com/SchedMD/slurm/commit/21d2ab6ed16
17.11.9-2 is already available on the download page.
If you have no additional questions, can we close this ticket?

Dominik
Comment 10 Kilian Cavalotti 2018-08-10 17:24:37 MDT
(In reply to Dominik Bartkiewicz from comment #9)
> Hi
> 
> Fixed in commit:
> https://github.com/SchedMD/slurm/commit/21d2ab6ed16
> 17.11.9-2 is already available on the download page.
> If you have no additional questions, can we close this ticket?

Yep, sounds good, I'll give 17.11.9-2 a try.

Thanks!
-- 
Kilian
Comment 13 Alejandro Sanchez 2018-08-24 02:49:25 MDT
*** Ticket 5615 has been marked as a duplicate of this ticket. ***