Hi,

Just got a segfault during production that took down both controllers:

(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x5587520) at step_mgr.c:2105
#1  post_job_step (step_ptr=step_ptr@entry=0x5587520) at step_mgr.c:4782
#2  0x00000000004b510b in _post_job_step (step_ptr=0x5587520) at step_mgr.c:270
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x5586a90, step_ptr=step_ptr@entry=0x5587520) at step_mgr.c:317
#4  0x00000000004b52c3 in job_step_complete (job_id=40338813, step_id=4294967295, uid=uid@entry=0, requeue=requeue@entry=false, job_return_code=<optimized out>) at step_mgr.c:884
#5  0x0000000000496b23 in _slurm_rpc_step_complete (running_composite=false, msg=0x7f1670ce0e50) at proc_req.c:3902
#6  slurmctld_req (msg=msg@entry=0x7f1670ce0e50, arg=arg@entry=0x7f16500128e0) at proc_req.c:512
#7  0x0000000000428a41 in _service_connection (arg=0x7f16500128e0) at controller.c:1282
#8  0x00007f168731ddd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1687046ead in clone () from /lib64/libc.so.6

The first two restarts generated the same segfault, but a 4th one eventually worked and slurmctld is running again now.

-rw------- 1 slurm slurm 866541568 Apr 10 15:48 /var/spool/slurm.state/core.100920
-rw------- 1 slurm slurm 594264064 Apr 10 15:53 /var/spool/slurm.state/core.51713
-rw------- 1 slurm slurm 785895424 Apr 10 15:57 /var/spool/slurm.state/core.54759

"thread apply all bt full" to follow.

Cheers,
--
Kilian
Created attachment 9853 [details] gdb output for segfault #1
Created attachment 9854 [details] gdb output for segfault #2
Created attachment 9855 [details] gdb output for segfault #3
Mmmh, correction, slurmctld just crashed again, so we're basically down. I'm raising the severity to 1 :(
Created attachment 9856 [details] avoid segfault

Can you apply this patch? It simply returns before dereferencing that NULL pointer. I believe we saw this exact same segfault in 17.11, but thought it was resolved. Apparently not.

Let's get the system up and running, then we'll work on fixing the actual problem. I'll also see if we can include this patch in a future tag: even though it masks a problem, it at least prints a nasty warning instead of segfaulting, at the cost of maybe leaking a little memory. (There's been some disagreement in the past about this, as you may recall.)
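For readers following along, the early-return guard described above would look roughly like the following. This is a minimal illustrative sketch, not the attached patch or Slurm's actual code: the struct layouts, the function name `step_dealloc_lps_guarded`, and the return values are all hypothetical stand-ins.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal stand-ins for the real Slurm job/step records; these field
 * names are illustrative only, not the actual struct layouts. */
typedef struct {
	void *job_resrcs;	/* allocation bookkeeping; NULL here triggered the crash */
	int job_id;
} job_record_t;

typedef struct {
	job_record_t *job_ptr;
} step_record_t;

/* Sketch of the guard: log loudly and bail out instead of dereferencing
 * a NULL job_resrcs. This masks the underlying corruption (and may leak
 * the step's allocation accounting), but it keeps the controller alive. */
static int step_dealloc_lps_guarded(step_record_t *step_ptr)
{
	job_record_t *job_ptr = step_ptr->job_ptr;

	if (job_ptr->job_resrcs == NULL) {
		fprintf(stderr, "error: job %d has no job_resrcs, "
			"skipping step deallocation\n", job_ptr->job_id);
		return -1;	/* early return instead of a segfault */
	}
	/* ... the normal per-step resource deallocation would go here ... */
	return 0;
}
```

The trade-off described above is visible in the sketch: the early return trades a possible small leak for a controller that stays up and a log line pointing at the corrupted job.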
This is probably the same thing as bug 5438, which we thought was resolved but apparently wasn't. Also bug 5487 comment 4: https://bugs.schedmd.com/show_bug.cgi?id=5487#c4

> Thanks! I re-applied the patch from 5438 on our 17.11.8 installation.
> On a side note, what about including that patch in future releases anyway, even
> without a complete understanding of how the job structure got corrupted?
> I mean logging an error or even generating an exception via an assert would be
> better than a segfault, wouldn't it?

I certainly think so. Looking closer at the code, that function actually frees up resources, so returning early would mean those resources don't get freed. But since job_resrcs_ptr is already NULL at this point, I'm guessing they already got freed and we're in double-free territory. But this is all guesswork.
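The double-free theory above is also why C codebases typically NULL a pointer out as part of freeing it: a second deallocation pass then sees NULL and can be reported cheaply instead of corrupting the heap. A minimal sketch of that idiom, with a hypothetical safe_free() helper and an illustrative field, not Slurm's actual code:

```c
#include <stdlib.h>

/* Free *pp and clear it, so a later pass sees NULL rather than a
 * dangling pointer. */
#define safe_free(pp) do { free(*(pp)); *(pp) = NULL; } while (0)

typedef struct {
	int *job_resrcs;	/* illustrative field only */
} job_record_t;

/* The first call frees and clears the pointer; a second call becomes a
 * detectable no-op instead of a double free. */
static int dealloc_job_resrcs(job_record_t *job_ptr)
{
	if (job_ptr->job_resrcs == NULL)
		return -1;	/* already freed: report it, don't crash */
	safe_free(&job_ptr->job_resrcs);
	return 0;
}
```

Under this convention, job_resrcs being NULL by the time the step-deallocation code runs is exactly the signature you would expect if the resources had already been released once, which matches the double-free guess.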
Thanks! I'll try your patch from #c5 right away and report back shortly.
Ok, it looks like we're back in business, thanks! Yes, I'm a bit surprised to see this popping again, I assumed it had been included in 18.08. So it would definitely be great to make sure that it doesn't happen again in future versions. Let me know if you need additional debug information, I have plenty of core dumps :) Cheers, -- Kilian
Back in business is fantastic. And yeah, that's definitely a mistake on our part. We'll probably ask for some info from those core dumps and a current slurm.conf. I'm going to ask someone else to take this ticket, since I couldn't find what was wrong last time; I hope a fresh set of eyes can spot what I missed.
Hi

Could you send slurmctld.log?

Dominik
Created attachment 9862 [details] slurmctld logs Absolutely! Here it is. First segfault happened at 15:45, Apr 10. Thanks, -- Kilian
Hi

Could you also send us the gdb output for:

t 1
p *job_ptr

Dominik
(In reply to Dominik Bartkiewicz from comment #12)
> Could you also send us gdb output for?
> t 1
> p *job_ptr

Yup, here it is:

(gdb) t 1
[Switching to thread 1 (Thread 0x7f312c5c5700 (LWP 51365))]
#0  _step_dealloc_lps (step_ptr=0x7f30c0764a30) at step_mgr.c:2105
2105	in step_mgr.c
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x5bc7e30 "dpetrov", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x5bc7e10 "sh-ln07", alloc_resp_port = 0, alloc_sid = 49087, array_job_id = 40338813, array_task_id = 500, array_recs = 0x5bc83f0, assoc_id = 5362, assoc_ptr = 0xf895f0, batch_features = 0x0, batch_flag = 1, batch_host = 0x7f31b5300520 "sh-106-68", billable_tres = 16, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 16, cpus_per_tres = 0x0, cr_enabled = 1, db_index = 709147756, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x5bc76e0, direct_set_prio = 0, end_time = 1554936276, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x7f30c80741f0, gres_list = 0x0, gres_alloc = 0x5bc7e50 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x5bc7e70 "", gres_used = 0x0, group_id = 37123, job_id = 40338813, job_next = 0x558abf0, job_array_next_j = 0x5ddf210, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 32768, kill_on_node_fail = 1, last_sched_eval = 1554935188, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x5bc78b0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x5bc7db0 "LN_sweep.py", network = 0x0, next_step_id = 0, ngids = 4, nodes = 0x7f31b5651d40 "sh-106-68", node_addr = 0x7f31b4869c20, node_bitmap = 0x7f31b5695ab0, node_bitmap_cg = 0x7f31b40a6510, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x7f3140094130 "sh-106-68", origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 
0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7f31b40a9060 "owners,hns", part_ptr_list = 0x5e082a0, part_nodes_missing = false, part_ptr = 0x7f30f8068180, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 70, priority_array = 0x7f3160342640, prio_factors = 0x5bc7cf0, profile = 0, qos_id = 3, qos_ptr = 0xbd1f80, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 7, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x5bc7ec0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8448, start_time = 0, state_desc = 0x0, state_reason = 0, state_reason_prev = 35, step_list = 0x5e08430, suspend_time = 0, system_comment = 0x0, time_last_active = 1554936276, time_limit = 2880, time_min = 0, tot_sus_time = 0, total_cpus = 16, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x5bc8500, tres_req_str = 0x5bc8380 "1=16,2=65536,4=1,5=16", tres_fmt_req_str = 0x5bc83b0 "cpu=16,mem=64G,node=1,billing=16", tres_alloc_cnt = 0x7f31b5364150, tres_alloc_str = 0x7f31b637d1c0 "1=16,2=65536,3=18446744073709551614,4=1,5=16", tres_fmt_alloc_str = 0x7f31b4014890 "cpu=16,mem=64G,node=1,billing=16", user_id = 269223, user_name = 0x5bc7de0 "cmcfarl2", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} (gdb) Cheers, -- Kilian
Hi

I have an idea. Could you also send me the gdb output for:

t 1
p *(job_ptr->array_recs)

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #15)
> I have some idea
> Could you also send me gdb output for?
> t 1
> p *(job_ptr->array_recs)

Sure!

(gdb) t 1
[Switching to thread 1 (Thread 0x7f312c5c5700 (LWP 51365))]
#0  _step_dealloc_lps (step_ptr=0x7f30c0764a30) at step_mgr.c:2105
2105	in step_mgr.c
(gdb) p *(job_ptr->array_recs)
$1 = {task_cnt = 0, task_id_bitmap = 0x0, task_id_str = 0x0, array_flags = 1, max_run_tasks = 0, tot_run_tasks = 119, min_exit_code = 9, max_exit_code = 9, pend_run_tasks = 0, tot_comp_tasks = 312}

Cheers,
--
Kilian
Hi

I assume that you are running slurmctld with Marshall's patch from comment 5. Could you send me the current slurmctld.log?

Dominik
Created attachment 9912 [details] current slurmctld logs

(In reply to Dominik Bartkiewicz from comment #18)
> I assume that you are running slurmctld with Marshall patch from comment 5.

We do, yes.

> Could you send me current slurmctld.log?

Here is the latest one. No mention of XXXBUG6837XXX, though.

Cheers,
--
Kilian
Hi

I have finally reproduced this issue. I now know where the problem is located, and I am working on a fix.

Dominik
Hi Moe,

Thanks a lot for the patch and for the explanation, this is very much appreciated!

On Thu, Apr 18, 2019 at 12:51 PM <bugs@schedmd.com> wrote:
> I don't know how you could have found this, but great job!

I guess we have a certain knack for getting ourselves into corner cases these days. :)

Cheers,
(In reply to Kilian Cavalotti from comment #26)
> Hi Moe,
>
> Thanks a lot for the patch and for the explanation, this is very much
> appreciated!
>
> On Thu, Apr 18, 2019 at 12:51 PM <bugs@schedmd.com> wrote:
> > I don't know how you could have found this, but great job!
>
> I guess we have a certain knack for getting ourselves in corner cases,
> these days. :)
>
> Cheers,

Kilian,

The patch and comments were meant for Dominik to review before sending them on to you. (I messed up by not selecting the "Private" button first.) I think this fixes your problem, but I would recommend giving Dominik time to review and test before installing it on your systems.
No worries, we'll wait for final review before deploying it. And thanks again, as always. Cheers,
Created attachment 9994 [details] patch proposal

Hi

This patch is a slightly modified version of the one Moe attached previously. It should fix this issue.

Dominik
Hi Dominik,

Thank you! Will it be in the next 18.08 release, or only in 19.05?

Cheers,
--
Kilian
Hi

I expect that this will also go into 18.08.

Dominik
Hi

We added this fix as:
https://github.com/SchedMD/slurm/commit/70d12f070908c33

It will be in the next 18.08 and 19.05 releases. If you are OK with this, I would like to close this ticket.

Dominik
(In reply to Dominik Bartkiewicz from comment #35)
> We added this fix as:
> https://github.com/SchedMD/slurm/commit/70d12f070908c33
> It will be in next 18.08 and 19.05
> If you are ok with this I would like to close this ticket.

Perfect, yes, thank you!

Cheers,
--
Kilian
*** Ticket 6386 has been marked as a duplicate of this ticket. ***
*** Ticket 7253 has been marked as a duplicate of this ticket. ***
*** Ticket 8322 has been marked as a duplicate of this ticket. ***