Ticket 1557 - slurmctld crash assertion failed
Summary: slurmctld crash assertion failed
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.4
Hardware: Linux
Importance: --- 2 - High Impact
Assignee: David Bigagli
 
Reported: 2015-03-24 04:11 MDT by Paul Edmon
Modified: 2015-03-25 06:51 MDT
2 users

See Also:
Site: Harvard University


Attachments
slurm.log (2.28 MB, text/x-log)
2015-03-24 05:31 MDT, Paul Edmon

Description Paul Edmon 2015-03-24 04:11:59 MDT
Our ctld crashed this morning.  It was trying to complete a job; when I ran the ctld in full debug mode, it logged:

slurmctld: cleanup_completing: job 36030552 completion process took 84 seconds
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=36030552 State=0x3 NodeCnt=0 Node=holy2b09106 usec=434173
slurmctld: debug:  sched: schedule() returning, too many RPCs
slurmctld: debug3: Tree sending to eldorado03
slurmctld: debug2: node_did_resp holystat17
slurmctld: debug4: orig_timeout was 100000 we have 0 steps and a timeout of 100000
slurmctld: _pick_step_nodes: some requested nodes holy2a06102 still have memory used by other steps
slurmctld: cleanup_completing: job 36028846 completion process took 2451 seconds
slurmctld: bitstring.c:624: bit_and: Assertion `((b1)[1]) == ((b2)[1])' failed.
Aborted (core dumped)

Seems like job 36028846 has a bad state.  I deleted the job.36028846 folder to see if that would force it to clear, but no dice (I realize now I should have just moved it aside instead; my bad).  I found that if I turned off the host the job was on, sandy-rc01, the ctld would keep running.  However, whenever sandy-rc01 came back on, the ctld would crash with the same error.  I've powered off sandy-rc01 to allow the scheduler to work.  I also have a core file if you want info from that.  Please advise as to the best course of action.
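For context: the assertion that fired, `bit_and: Assertion ((b1)[1]) == ((b2)[1])`, compares element 1 of the two bitmaps, which in Slurm's bitstring implementation holds the bitmap's length in bits; bit_and() requires both bitmaps to be the same size. Here is a minimal Python sketch of that invariant (a simplified, hypothetical model for illustration, not Slurm's actual C code):

```python
# Simplified model of Slurm's bitstr_t: element 0 is a magic value,
# element 1 is the bit count, remaining elements are data words.
BITSTR_MAGIC = 0x42

def bit_alloc(nbits):
    """Allocate a bitmap sized for nbits bits."""
    nwords = (nbits + 63) // 64
    return [BITSTR_MAGIC, nbits] + [0] * nwords

def bit_and(b1, b2):
    """AND b2 into b1 in place; both bitmaps must be the same size."""
    assert b1[0] == BITSTR_MAGIC and b2[0] == BITSTR_MAGIC
    assert b1[1] == b2[1]   # the check that failed at bitstring.c:624
    for i in range(2, len(b1)):
        b1[i] &= b2[i]

# A stale step bitmap built against a different node count than the
# bitmap it is ANDed with trips the assertion, aborting the process.
stale_step_bitmap = bit_alloc(951)
current_node_bitmap = bit_alloc(950)
try:
    bit_and(stale_step_bitmap, current_node_bitmap)
except AssertionError:
    print("bitmap sizes differ")
```

Per the backtraces below, the abort happens in _step_dealloc_lps() while cleaning up a step of the stuck completing job, so a job record carrying bitmaps of inconsistent size can crash the daemon on every cleanup attempt.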
Comment 1 Brian Christiansen 2015-03-24 04:17:53 MDT
Will you send the backtrace from the core dump?

ex.
gdb slurmctld core
bt full
Comment 2 Paul Edmon 2015-03-24 04:29:22 MDT
[root@holy-slurm01 spool]# gdb slurmctld core.58627
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New Thread 16812]
[New Thread 16848]
[New Thread 16823]
[New Thread 16849]
[New Thread 16839]
[New Thread 58635]
[New Thread 58634]
[New Thread 58633]
[New Thread 16822]
[New Thread 16840]
[New Thread 16574]
[New Thread 58630]
[New Thread 58629]
[New Thread 16578]
[New Thread 16576]
[New Thread 16577]
[New Thread 59011]
[New Thread 16582]
[New Thread 16800]
[New Thread 59010]
[New Thread 16797]
[New Thread 16843]
[New Thread 58636]
[New Thread 16844]
[New Thread 58627]
[New Thread 16602]
[New Thread 58637]
[New Thread 16587]
[New Thread 16580]
[New Thread 59009]
Reading symbols from /lib64/libdl.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/lib64/slurm/crypto_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/lib64/slurm/gres_gpu.so
Reading symbols from /usr/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/lib64/slurm/select_cons_res.so
Reading symbols from /usr/lib64/slurm/preempt_partition_prio.so...done.
Loaded symbols for /usr/lib64/slurm/preempt_partition_prio.so
Reading symbols from /usr/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_energy_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_infiniband_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_filesystem_none.so
Reading symbols from /usr/lib64/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /usr/lib64/slurm/jobacct_gather_linux.so
Reading symbols from /usr/lib64/slurm/job_submit_lua.so...done.
Loaded symbols for /usr/lib64/slurm/job_submit_lua.so
Reading symbols from /usr/lib64/liblua-5.1.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/liblua-5.1.so
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/lua/5.1/posix.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/lua/5.1/posix.so
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/librt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /usr/lib64/slurm/ext_sensors_none.so...done.
Loaded symbols for /usr/lib64/slurm/ext_sensors_none.so
Reading symbols from /usr/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/lib64/slurm/switch_none.so
Reading symbols from /usr/lib64/slurm/accounting_storage_slurmdbd.so...done.
Loaded symbols for /usr/lib64/slurm/accounting_storage_slurmdbd.so
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/slurm/topology_none.so...done.
Loaded symbols for /usr/lib64/slurm/topology_none.so
Reading symbols from /usr/lib64/slurm/jobcomp_script.so...done.
Loaded symbols for /usr/lib64/slurm/jobcomp_script.so
Reading symbols from /usr/lib64/slurm/sched_backfill.so...done.
Loaded symbols for /usr/lib64/slurm/sched_backfill.so
Reading symbols from /usr/lib64/slurm/route_default.so...done.
Loaded symbols for /usr/lib64/slurm/route_default.so
Reading symbols from /usr/lib64/slurm/priority_multifactor.so...done.
Loaded symbols for /usr/lib64/slurm/priority_multifactor.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x0000003254a32625 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
slurm-14.11.4-1fasrc01.el6.x86_64
(gdb) bt full
#0  0x0000003254a32625 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000003254a33e05 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000003254a2b74e in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000003254a2b810 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00000000004b3e0d in bit_and (b1=<value optimized out>, b2=<value 
optimized out>) at bitstring.c:624
         bit = <value optimized out>
         __PRETTY_FUNCTION__ = "bit_and"
#5  0x00000000004903ed in _step_dealloc_lps (step_ptr=0x5511c80) at 
step_mgr.c:2034
         i_node = 951
         i_last = 950
         job_node_inx = <value optimized out>
         step_node_inx = <value optimized out>
         job_ptr = 0x5511300
         job_resrcs_ptr = 0x55115f0
         cpus_alloc = <value optimized out>
         i_first = <value optimized out>
#6  post_job_step (step_ptr=0x5511c80) at step_mgr.c:4097
         job_ptr = 0x5511300
         error_code = <value optimized out>
#7  0x00000000004908c7 in delete_step_records (job_ptr=0x5511300) at 
step_mgr.c:256
         cleaning = 0
         step_iterator = 0x2918ae0
         step_ptr = 0x5511c80
#8  0x000000000044d337 in kill_running_job_by_node_name 
(node_name=0x7f074400da70 "sandy-rc01") at job_mgr.c:3060
         suspended = false
         job_iterator = 0x2918a80
         job_ptr = 0x5511300
         node_ptr = 0x2bf9370
         bit_position = 950
         kill_job_cnt = 2
         now = 1427213026
#9  0x000000000046170c in validate_node_specs (reg_msg=0x7f0744011830, 
protocol_version=<value optimized out>, newly_up=0x7f0852cd475f)
     at node_mgr.c:2302
         error_code = 0
         i = <value optimized out>
         node_inx = 950
         config_ptr = <value optimized out>
         node_ptr = 0x2bf9370
         reason_down = 0x0
         node_flags = <value optimized out>
---Type <return> to continue, or q <return> to quit---
         now = 1427213026
         gang_flag = true
         orig_node_avail = false
         cr_flag = 1
         cpu_spec_array = 0x55118ae1
#10 0x0000000000474c9d in _slurm_rpc_node_registration 
(msg=0x7f0744000ad0) at proc_req.c:2417
         tv1 = {tv_sec = 1427213026, tv_usec = 352303}
         tv2 = {tv_sec = 139673725650768, tv_usec = 140734583212052}
         tv_str = '\000' <repeats 19 times>
         delta_t = 139673725650800
         error_code = <value optimized out>
         newly_up = false
         node_reg_stat_msg = 0x7f0744011830
         job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = 
WRITE_LOCK, partition = NO_LOCK}
         uid = <value optimized out>
#11 0x0000000000479fa4 in slurmctld_req (msg=0x7f0744000ad0, 
arg=0x7f0868006540) at proc_req.c:340
         tv1 = {tv_sec = 1427213026, tv_usec = 352296}
         tv2 = {tv_sec = 0, tv_usec = 0}
         tv_str = '\000' <repeats 19 times>
         delta_t = <value optimized out>
         i = <value optimized out>
         rpc_type_index = 9
         rpc_user_index = 0
         rpc_uid = <value optimized out>
         __func__ = "slurmctld_req"
#12 0x0000000000433c5b in _service_connection (arg=0x7f0868006540) at 
controller.c:1070
         conn = 0x7f0868006540
         msg = 0x7f0744000ad0
         __func__ = "_service_connection"
#13 0x00000032552079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x0000003254ae88fd in clone () from /lib64/libc.so.6
No symbol table info available.


Comment 3 Paul Edmon 2015-03-24 04:49:30 MDT
The ctld keeps crashing even with sandy-rc01 down, though it takes longer to crash.  Here is the latest backtrace.

[root@holy-slurm01 spool]# gdb slurmctld core.35554
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New Thread 35554]
[New Thread 35557]
[New Thread 35561]
[New Thread 35642]
[New Thread 35564]
[New Thread 50882]
[New Thread 50837]
[New Thread 50758]
[New Thread 51104]
[New Thread 35560]
[New Thread 50874]
[New Thread 35556]
[New Thread 50967]
[New Thread 50886]
[New Thread 50838]
[New Thread 50873]
[New Thread 51083]
[New Thread 51069]
[New Thread 50894]
[New Thread 51067]
[New Thread 51312]
[New Thread 35562]
[New Thread 51340]
[New Thread 50971]
[New Thread 51337]
[New Thread 51333]
[New Thread 51146]
[New Thread 51316]
[New Thread 51588]
[New Thread 51223]
[New Thread 51348]
[New Thread 51325]
[New Thread 51349]
[New Thread 51659]
[New Thread 51362]
[New Thread 51522]
[New Thread 51663]
[New Thread 51654]
[New Thread 51324]
[New Thread 51662]
[New Thread 51352]
[New Thread 51677]
[New Thread 51378]
[New Thread 51345]
[New Thread 35641]
[New Thread 51665]
[New Thread 51678]
[New Thread 51631]
[New Thread 35640]
[New Thread 51640]
[New Thread 51676]
[New Thread 51344]
[New Thread 51184]
[New Thread 51320]
[New Thread 51353]
[New Thread 51692]
[New Thread 51332]
[New Thread 51649]
[New Thread 51341]
[New Thread 51317]
[New Thread 51063]
[New Thread 51135]
[New Thread 51163]
[New Thread 51143]
[New Thread 51658]
[New Thread 51681]
[New Thread 50964]
[New Thread 51321]
[New Thread 51182]
[New Thread 51361]
[New Thread 51392]
[New Thread 51432]
[New Thread 51666]
[New Thread 51453]
[New Thread 51648]
[New Thread 51356]
[New Thread 51633]
[New Thread 51574]
[New Thread 51336]
[New Thread 51508]
[New Thread 51641]
[New Thread 51639]
[New Thread 51638]
[New Thread 51655]
[New Thread 51652]
[New Thread 51530]
[New Thread 51643]
[New Thread 51066]
[New Thread 51644]
[New Thread 51117]
[New Thread 51313]
[New Thread 51646]
[New Thread 51680]
[New Thread 51377]
[New Thread 51637]
[New Thread 51635]
[New Thread 51407]
[New Thread 51536]
[New Thread 51573]
[New Thread 51433]
[New Thread 51628]
[New Thread 35563]
[New Thread 51357]
[New Thread 51682]
[New Thread 51688]
[New Thread 51634]
[New Thread 51580]
[New Thread 51651]
[New Thread 51690]
[New Thread 50810]
[New Thread 51602]
[New Thread 51389]
[New Thread 51589]
[New Thread 51459]
[New Thread 51642]
[New Thread 51632]
[New Thread 51684]
[New Thread 51394]
[New Thread 51685]
[New Thread 51636]
[New Thread 51691]
[New Thread 51630]
[New Thread 51629]
[New Thread 51521]
[New Thread 51647]
[New Thread 51650]
[New Thread 51645]
[...shared-library symbol loading output identical to comment 2 elided...]
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x0000003254a32625 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
slurm-14.11.4-1fasrc01.el6.x86_64
(gdb) bt full
#0  0x0000003254a32625 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000003254a33e05 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000003254a2b74e in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000003254a2b810 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00000000004b3e0d in bit_and (b1=<value optimized out>, b2=<value 
optimized out>) at bitstring.c:624
         bit = <value optimized out>
         __PRETTY_FUNCTION__ = "bit_and"
#5  0x00000000004903ed in _step_dealloc_lps (step_ptr=0x4696530) at 
step_mgr.c:2034
         i_node = 951
         i_last = 950
         job_node_inx = <value optimized out>
         step_node_inx = <value optimized out>
         job_ptr = 0x4695bf0
         job_resrcs_ptr = 0x4695eb0
         cpus_alloc = <value optimized out>
         i_first = <value optimized out>
#6  post_job_step (step_ptr=0x4696530) at step_mgr.c:4097
         job_ptr = 0x4695bf0
         error_code = <value optimized out>
#7  0x00000000004908c7 in delete_step_records (job_ptr=0x4695bf0) at 
step_mgr.c:256
         cleaning = 0
         step_iterator = 0x1f75a40
         step_ptr = 0x4696530
#8  0x000000000044d337 in kill_running_job_by_node_name 
(node_name=0x21f7e30 "sandy-rc01") at job_mgr.c:3060
         suspended = false
         job_iterator = 0x1f75a60
         job_ptr = 0x4695bf0
         node_ptr = 0x2256370
         bit_position = 950
         kill_job_cnt = 2
         now = 1427215643
#9  0x000000000045ff0d in set_node_down_ptr (node_ptr=0x2256370, 
reason=0x578bb9 "Not responding") at node_mgr.c:2974
         now = 1427215643
#10 0x000000000046ec77 in ping_nodes () at ping_nodes.c:275
         restart_flag = false
         offset = 250
         max_reg_threads = 50
         i = <value optimized out>
         now = <value optimized out>
         still_live_time = 1427215543
---Type <return> to continue, or q <return> to quit---
         node_dead_time = 1427215217
         last_ping_time = 1427215643
         down_hostlist = 0x1f6f840
         host_str = 0x0
         ping_agent_args = 0x3b186d0
         reg_agent_args = 0x3c19ee0
         node_ptr = <value optimized out>
         old_cpu_load_time = 1427215343
#11 0x0000000000432b09 in _slurmctld_background (no_data=<value 
optimized out>) at controller.c:1555
         last_sched_time = 1427215624
         last_full_sched_time = 1427215592
         last_checkpoint_time = 1427215547
         last_group_time = 1427215187
         last_health_check_time = 1427215187
         last_acct_gather_node_time = 1427215187
         last_ext_sensors_time = 1427215187
         last_no_resp_msg_time = 1427215592
         last_ping_node_time = 1427215643
         last_ping_srun_time = 1427215187
         last_purge_job_time = 1427215624
         last_resv_time = 1427215634
         last_timelimit_time = 1427215643
         last_assert_primary_time = 1427215187
         last_trigger = 1427215624
         last_node_acct = 1427215527
         last_ctld_bu_ping = 1427215560
         last_uid_update = 1427215187
         last_reboot_msg_time = 1427215188
         ping_msg_sent = false
         now = 1427215643
         no_resp_msg_interval = <value optimized out>
         ping_interval = 100
         purge_job_interval = 10
         group_time = <value optimized out>
         group_force = <value optimized out>
         i = <value optimized out>
         job_limit = <value optimized out>
         tv1 = {tv_sec = 1427215634, tv_usec = 882079}
         tv2 = {tv_sec = 1427215633, tv_usec = 848431}
         tv_str = "usec=29019169\000\000\000\000\000\000"
         delta_t = 29019169
         config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = 
NO_LOCK, partition = NO_LOCK}
         job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = 
NO_LOCK, partition = NO_LOCK}
         job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = 
WRITE_LOCK, partition = READ_LOCK}
---Type <return> to continue, or q <return> to quit---
         node_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = 
WRITE_LOCK, partition = NO_LOCK}
         node_write_lock2 = {config = NO_LOCK, job = NO_LOCK, node = 
WRITE_LOCK, partition = NO_LOCK}
         part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = 
NO_LOCK, partition = WRITE_LOCK}
         job_node_read_lock = {config = NO_LOCK, job = READ_LOCK, node = 
READ_LOCK, partition = NO_LOCK}
#12 0x00000000004350ff in main (argc=<value optimized out>, argv=<value 
optimized out>) at controller.c:561
         cnt = <value optimized out>
         error_code = <value optimized out>
         i = 3
         thread_attr = {__size = '\000' <repeats 17 times>, "\020", 
'\000' <repeats 16 times>, "\020", '\000' <repeats 20 times>, __align = 0}
         stat_buf = {st_dev = 64768, st_ino = 1839278, st_nlink = 1, 
st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0,
           st_size = 391256, st_blksize = 4096, st_blocks = 768, st_atim 
= {tv_sec = 1427206202, tv_nsec = 96233466}, st_mtim = {tv_sec = 
1282484417,
             tv_nsec = 0}, st_ctim = {tv_sec = 1422523947, tv_nsec = 
288570021}, __unused = {0, 0, 0}}
         config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, 
node = WRITE_LOCK, partition = WRITE_LOCK}
         callbacks = {acct_full = 0x493640 
<trigger_primary_ctld_acct_full>, dbd_fail = 0x493410 
<trigger_primary_dbd_fail>,
           dbd_resumed = 0x493380 <trigger_primary_dbd_res_op>, db_fail 
= 0x4932f0 <trigger_primary_db_fail>,
           db_resumed = 0x493260 <trigger_primary_db_res_op>}
         dir_name = 0x7fff32a8bf48 "[ߨ2\377\177"


Comment 4 David Bigagli 2015-03-24 04:56:07 MDT
Paul, as you said, the node that is causing the core dump is sandy-rc01.
The slurmctld thinks there are still some jobs in completing state on the node,
and the core dump happens when it tries to clean them up.
With sandy-rc01 down, do you see any jobs in CG state with node sandy-rc01 allocated to them?

David
Comment 5 David Bigagli 2015-03-24 04:58:04 MDT
Can you also try to remove the node from slurm.conf?

David
Comment 6 Paul Edmon 2015-03-24 05:05:07 MDT
[root@holy-slurm01 spool]# squeue -w sandy-rc01
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           36029539 serial_re rnaseq_s chevrier CG       0:00      1 sandy-rc01
           36028846 serial_re Y2_jrgca    speer CG       7:43      1 sandy-rc01
      36015384_2279 serial_re Amperean kanasznm CG       0:00      1 sandy-rc01

So 36028846 is the culprit.

-Paul Edmon-

Comment 7 Paul Edmon 2015-03-24 05:05:20 MDT
Sure. Let me try that out.

-Paul Edmon-

Comment 8 Paul Edmon 2015-03-24 05:14:51 MDT
So, I've removed that node and the ctld is running.  We will see if it
crashes again; I'll let you know if it does.

-Paul Edmon-

Comment 9 David Bigagli 2015-03-24 05:20:25 MDT
Do you still see the CG jobs? Can you also send us the log file? Since you ran at a high debug level, we can try to find some clues in it.

David
Comment 10 Paul Edmon 2015-03-24 05:30:55 MDT
Created attachment 1763 [details]
slurm.log

So those jobs look like they no longer exist:

[root@holy-slurm01 spool]# squeue -j 36028846
slurm_load_jobs error: Invalid job id specified
[root@holy-slurm01 spool]# scontrol -dd show job 36028846
slurm_load_jobs error: Invalid job id specified

I've attached the log file from the last time I ran slurmctld -Dvvvvv; this
was while it was crashing.  It has not crashed since I dropped
sandy-rc01 from the conf.

-Paul Edmon-

Comment 11 David Bigagli 2015-03-24 05:54:33 MDT
In gdb, from frame 5, could you please print:

(gdb) p *step_ptr->job_ptr

and from frame 4:

(gdb) p b1[1]
(gdb) p b2[1]

(gdb) p b1[0]
(gdb) p b2[0]

To move to the desired frame you can use the frame command, e.g.

(gdb) frame 4

Thanks

David
Comment 12 Paul Edmon 2015-03-24 05:59:20 MDT
[root@holy-slurm01 spool]# gdb slurmctld core.58627
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[...thread list identical to comment 2 elided...]
Reading symbols from /lib64/libdl.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/lib64/slurm/crypto_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/lib64/slurm/gres_gpu.so
Reading symbols from /usr/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/lib64/slurm/select_cons_res.so
Reading symbols from /usr/lib64/slurm/preempt_partition_prio.so...done.
Loaded symbols for /usr/lib64/slurm/preempt_partition_prio.so
Reading symbols from /usr/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_energy_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_infiniband_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_filesystem_none.so
Reading symbols from /usr/lib64/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /usr/lib64/slurm/jobacct_gather_linux.so
Reading symbols from /usr/lib64/slurm/job_submit_lua.so...done.
Loaded symbols for /usr/lib64/slurm/job_submit_lua.so
Reading symbols from /usr/lib64/liblua-5.1.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/liblua-5.1.so
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/lua/5.1/posix.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/lua/5.1/posix.so
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/librt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /usr/lib64/slurm/ext_sensors_none.so...done.
Loaded symbols for /usr/lib64/slurm/ext_sensors_none.so
Reading symbols from /usr/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/lib64/slurm/switch_none.so
Reading symbols from /usr/lib64/slurm/accounting_storage_slurmdbd.so...done.
Loaded symbols for /usr/lib64/slurm/accounting_storage_slurmdbd.so
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/slurm/topology_none.so...done.
Loaded symbols for /usr/lib64/slurm/topology_none.so
Reading symbols from /usr/lib64/slurm/jobcomp_script.so...done.
Loaded symbols for /usr/lib64/slurm/jobcomp_script.so
Reading symbols from /usr/lib64/slurm/sched_backfill.so...done.
Loaded symbols for /usr/lib64/slurm/sched_backfill.so
Reading symbols from /usr/lib64/slurm/route_default.so...done.
Loaded symbols for /usr/lib64/slurm/route_default.so
Reading symbols from /usr/lib64/slurm/priority_multifactor.so...done.
Loaded symbols for /usr/lib64/slurm/priority_multifactor.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x0000003254a32625 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
slurm-14.11.4-1fasrc01.el6.x86_64
(gdb) frame 5
#5  0x00000000004903ed in _step_dealloc_lps (step_ptr=0x5511c80) at 
step_mgr.c:2034
2034    step_mgr.c: No such file or directory.
     in step_mgr.c
(gdb) p * step_ptr->job_ptr
$1 = {account = 0x550f430 "zhuang_lab", alias_list = 0x0, alloc_node = 
0x5510fb0 "rcnx01", alloc_resp_port = 0, alloc_sid = 16944, array_job_id 
= 0,
   array_task_id = 4294967294, array_recs = 0x0, assoc_id = 4502, 
assoc_ptr = 0x2a206e0, batch_flag = 1, batch_host = 0x5cdb660 "sandy-rc01",
   check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, 
cpu_cnt = 0, cr_enabled = 0, db_index = 38992161, derived_ec = 4294967294,
   details = 0x5511030, direct_set_prio = 0, end_time = 1427210653, 
epilog_running = false, exit_code = 256, front_end_ptr = 0x0, gres = 0x0,
   gres_list = 0x0, gres_alloc = 0x5510780 "", gres_req = 0x55107a0 "", 
gres_used = 0x0, group_id = 40189, job_id = 36028846, job_next = 0x0,
   job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 
0x55115f0, job_state = 32773, kill_on_node_fail = 1, licenses = 0x0,
   license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, 
limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory 
= 0,
   limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 
0x0, magic = 4038539564, name = 0x550f400 "Y2_jrgcalpha5_133_1", network 
= 0x0,
   next_step_id = 2, nodes = 0x5511220 "sandy-rc01", node_addr = 
0x55109d0, node_bitmap = 0x5cdb690, node_bitmap_cg = 0x5cdb5b0, node_cnt 
= 0,
   node_cnt_wag = 0, nodes_completing = 0x55111f0 "sandy-rc01", 
other_port = 0, partition = 0x5511250 "serial_requeue,general,general",
   part_ptr_list = 0x54e45a0, part_nodes_missing = false, part_ptr = 
0x2ba9080, pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
   priority = 28831531, priority_array = 0x0, prio_factors = 0x55111a0, 
profile = 0, qos_id = 1, qos_ptr = 0x0, reboot = 0 '\000', restart_cnt = 2,
   resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid 
= 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 
0x55115c0,
   spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 
7168, start_time = 1427210190, state_desc = 0x0, state_reason = 23,
   step_list = 0x54e45f0, suspend_time = 0, time_last_active = 
1427212824, time_limit = 100, time_min = 0, tot_sus_time = 0, total_cpus 
= 2,
   total_nodes = 1, user_id = 40862, wait_all_nodes = 0, warn_flags = 0, 
warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch 
= 0,
   best_switch = true, wait4switch_start = 0}
(gdb) frame 4
#4  0x00000000004b3e0d in bit_and (b1=<value optimized out>, b2=<value 
optimized out>) at bitstring.c:624
624    bitstring.c: No such file or directory.
     in bitstring.c
(gdb) p b1[1]
value has been optimized out
(gdb) p b2[1]
value has been optimized out
(gdb) p b1[0]
value has been optimized out
(gdb) p b2[0]
value has been optimized out
(gdb) quit
[root@holy-slurm01 spool]#

-Paul Edmon-

Comment 13 David Bigagli 2015-03-24 06:07:58 MDT
Ah, the bitmasks were optimized out... :-( Next time you rebuild the controller it would be good to set CFLAGS="-ggdb -O0" so we get all the symbols; the binary you are running is slightly optimized, with -O2 I believe.

David
Comment 14 Paul Edmon 2015-03-24 06:12:23 MDT
That may be true.  Does using -O2 actually get you anything with the 
controller?  I'd rather not slow down the scheduler if we don't have to.

-Paul Edmon-

Comment 15 David Bigagli 2015-03-24 06:26:59 MDT
-O2 is the default automake uses, so unless explicitly overridden via
CFLAGS it is on. I don't think it does much for a program like the
controller, which does mostly integer and string operations.

David
Comment 16 Paul Edmon 2015-03-24 06:28:46 MDT
Yeah, I guessed that would be true. I would rather have the debugging 
flags so long as they don't adversely impact performance.  I will make 
sure to get those into the spec file we use to build Slurm.

-Paul Edmon-

Comment 17 David Bigagli 2015-03-24 06:51:46 MDT
OK, I think at this point you can put the host back in since the
'funny' job was cleaned up. I went through the logs but didn't find much
information... do you recall anything that could have triggered it, or
did it just happen out of the blue?

David
Comment 18 Paul Edmon 2015-03-24 06:53:25 MDT
It just happened out of the blue.  I'm betting something got corrupted 
or put into an odd state, causing the crash.

-Paul Edmon-

Comment 19 David Bigagli 2015-03-24 07:53:59 MDT
Have you tried to configure the host back into the cluster?
Comment 20 Paul Edmon 2015-03-24 08:02:23 MDT
Yes, that fixed it.  The node is back in and everything is running fine.

-Paul Edmon-

Comment 21 David Bigagli 2015-03-25 06:10:49 MDT
Is everything running fine now? Just want to follow up with you.

David
Comment 22 Paul Edmon 2015-03-25 06:13:10 MDT
Yeah, things are running just fine.  Our next release of Slurm will be 
14.11.5, built unoptimized with the specific gdb flags to make this 
easier to debug in the future.  That will be April 6th.

-Paul Edmon-

Comment 23 David Bigagli 2015-03-25 06:14:27 MDT
So can we close this for now and eventually reopen later?

David
Comment 24 Paul Edmon 2015-03-25 06:38:15 MDT
Yeah, sounds good.  I think this is a transient issue, so if you come 
up with a bug fix I would just roll it into the next minor release.

-Paul Edmon-

Comment 25 David Bigagli 2015-03-25 06:51:50 MDT
Closing for now as we cannot reproduce it; reopen later if necessary.

David