Ticket 6837 - slurmctld segfault (_step_dealloc_lps)
Summary: slurmctld segfault (_step_dealloc_lps)
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.6
Hardware: Linux
Severity: 2 - High Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 6386 7253 8322
Depends on:
Blocks:
 
Reported: 2019-04-10 16:58 MDT by Kilian Cavalotti
Modified: 2021-04-08 11:04 MDT
CC List: 5 users

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name: Sherlock
CLE Version:
Version Fixed: 18.08.8
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
gdb output for segfault #1 (273.81 KB, text/x-log)
2019-04-10 17:00 MDT, Kilian Cavalotti
gdb output for segfault #2 (75.97 KB, text/x-log)
2019-04-10 17:00 MDT, Kilian Cavalotti
gdb output for segfault #3 (132.76 KB, text/x-log)
2019-04-10 17:01 MDT, Kilian Cavalotti
avoid segfault (605 bytes, patch)
2019-04-10 17:13 MDT, Marshall Garey
slurmctld logs (3.58 MB, application/x-bzip)
2019-04-11 08:44 MDT, Kilian Cavalotti
current slurmctld logs (1.11 MB, application/x-bzip)
2019-04-15 11:18 MDT, Kilian Cavalotti
proposed fix (1.45 KB, patch)
2019-04-18 13:51 MDT, Moe Jette
patch proposal (1.41 KB, patch)
2019-04-23 08:55 MDT, Dominik Bartkiewicz

Description Kilian Cavalotti 2019-04-10 16:58:57 MDT
Hi,

Just got a segfault during production that took down both controllers:

(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x5587520) at step_mgr.c:2105
#1  post_job_step (step_ptr=step_ptr@entry=0x5587520) at step_mgr.c:4782
#2  0x00000000004b510b in _post_job_step (step_ptr=0x5587520) at step_mgr.c:270
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x5586a90, step_ptr=step_ptr@entry=0x5587520) at step_mgr.c:317
#4  0x00000000004b52c3 in job_step_complete (job_id=40338813, step_id=4294967295, uid=uid@entry=0, requeue=requeue@entry=false, job_return_code=<optimized out>) at step_mgr.c:884
#5  0x0000000000496b23 in _slurm_rpc_step_complete (running_composite=false, msg=0x7f1670ce0e50) at proc_req.c:3902
#6  slurmctld_req (msg=msg@entry=0x7f1670ce0e50, arg=arg@entry=0x7f16500128e0) at proc_req.c:512
#7  0x0000000000428a41 in _service_connection (arg=0x7f16500128e0) at controller.c:1282
#8  0x00007f168731ddd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1687046ead in clone () from /lib64/libc.so.6

The first two restarts generated the same segfault, but a 4th one eventually worked and slurmctld is running again now.

-rw------- 1 slurm slurm   866541568 Apr 10 15:48 /var/spool/slurm.state/core.100920
-rw------- 1 slurm slurm   594264064 Apr 10 15:53 /var/spool/slurm.state/core.51713
-rw------- 1 slurm slurm   785895424 Apr 10 15:57 /var/spool/slurm.state/core.54759

"thread apply all bt full" to follow.

Cheers,
-- 
Kilian
Comment 1 Kilian Cavalotti 2019-04-10 17:00:41 MDT
Created attachment 9853 [details]
gdb output for segfault #1
Comment 2 Kilian Cavalotti 2019-04-10 17:00:59 MDT
Created attachment 9854 [details]
gdb output for segfault #2
Comment 3 Kilian Cavalotti 2019-04-10 17:01:37 MDT
Created attachment 9855 [details]
gdb output for segfault #3
Comment 4 Kilian Cavalotti 2019-04-10 17:03:23 MDT
Mmmh, correction, slurmctld just crashed again, so we're basically down.
I'm raising the severity to 1 :(
Comment 5 Marshall Garey 2019-04-10 17:13:15 MDT
Created attachment 9856 [details]
avoid segfault

Can you apply this patch? It simply returns before dereferencing that NULL pointer.

I believe we saw this exact same segfault in 17.11, but thought it was resolved. Apparently not. Let's get the system up and running, then we'll work on fixing the actual problem. And I'll see if we can include this patch in a future tag - even though it masks a problem, at least it can print a nasty warning instead of segfaulting, at the cost of maybe leaking a little memory. (There's been some disagreement about this in the past, as you may recall.)
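
For reference, the guard described above would look roughly like the following inside _step_dealloc_lps(). This is a minimal sketch based on the backtrace and gdb dump in this ticket, not the actual 605-byte attachment; the exact message text and placement are assumptions.

static void _step_dealloc_lps(struct step_record *step_ptr)
{
	struct job_record *job_ptr = step_ptr->job_ptr;
	job_resources_t *job_resrcs_ptr = job_ptr->job_resrcs;

	if (!job_resrcs_ptr) {
		/* Dereferencing a NULL job_resrcs is what segfaults at
		 * step_mgr.c:2105, so log loudly and return early instead.
		 * This may leak whatever the step still holds, but it
		 * keeps slurmctld up. (Illustrative message text only.) */
		error("%s: job %u has no job_resources, skipping step dealloc",
		      __func__, job_ptr->job_id);
		return;
	}

	/* ... existing deallocation logic continues unchanged ... */
}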
Comment 6 Marshall Garey 2019-04-10 17:19:43 MDT
This is probably the same thing as bug 5438, which we thought was resolved but apparently wasn't. See also bug 5487 comment 4:

https://bugs.schedmd.com/show_bug.cgi?id=5487#c4

> Thanks! I re-applied the patch from 5438 on our 17.11.8 installation.

> On a side note, what about including that patch in future releases anyway, even 
> without a complete understanding of how the job structure got corrupted? 

> I mean logging an error or even generating an exception via an assert would be 
> better than a segfault, wouldn't it?

I certainly think so. Looking closer at the code, that function appears to actually free up resources, so returning early might mean those resources don't get freed. But since job_resrcs_ptr is already NULL at this point, I'm guessing they were already freed and we're in double-free territory. But this is all guesswork.
Comment 7 Kilian Cavalotti 2019-04-10 17:27:00 MDT
Thanks!
I'll try your patch from #c5 right away and report back shortly.
Comment 8 Kilian Cavalotti 2019-04-10 17:36:18 MDT
Ok, it looks like we're back in business, thanks!

Yes, I'm a bit surprised to see this popping up again; I assumed the fix had been included in 18.08. So it would definitely be great to make sure it doesn't happen again in future versions.

Let me know if you need additional debug information, I have plenty of core dumps :)

Cheers,
-- 
Kilian
Comment 9 Marshall Garey 2019-04-10 18:03:53 MDT
Back in business is fantastic. And yeah, that's definitely a mistake on our part. We'll probably ask for some info from those core dumps and a current slurm.conf. I'm going to ask someone else to take it because I couldn't find what was wrong last time, so I hope a fresh set of eyes can spot what I missed.
Comment 10 Dominik Bartkiewicz 2019-04-10 23:58:04 MDT
Hi

Could you send slurmctld.log?

Dominik
Comment 11 Kilian Cavalotti 2019-04-11 08:44:18 MDT
Created attachment 9862 [details]
slurmctld logs

Absolutely! Here it is.
First segfault happened at 15:45, Apr 10.

Thanks,
-- 
Kilian
Comment 12 Dominik Bartkiewicz 2019-04-11 09:16:03 MDT
Hi

Could you also send us the gdb output for the following commands?
t 1
p *job_ptr

Dominik
Comment 13 Kilian Cavalotti 2019-04-11 09:19:29 MDT
(In reply to Dominik Bartkiewicz from comment #12)
> Could you also send us gdb output for?
> t 1
> p *job_ptr

Yup, here it is:


(gdb) t 1
[Switching to thread 1 (Thread 0x7f312c5c5700 (LWP 51365))]
#0  _step_dealloc_lps (step_ptr=0x7f30c0764a30) at step_mgr.c:2105
2105    in step_mgr.c
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x5bc7e30 "dpetrov", admin_comment = 0x0,
  alias_list = 0x0, alloc_node = 0x5bc7e10 "sh-ln07", alloc_resp_port = 0,
  alloc_sid = 49087, array_job_id = 40338813, array_task_id = 500,
  array_recs = 0x5bc83f0, assoc_id = 5362, assoc_ptr = 0xf895f0,
  batch_features = 0x0, batch_flag = 1, batch_host = 0x7f31b5300520 "sh-106-68",
  billable_tres = 16, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0,
  check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0,
  comment = 0x0, cpu_cnt = 16, cpus_per_tres = 0x0, cr_enabled = 1,
  db_index = 709147756, deadline = 0, delay_boot = 0, derived_ec = 0,
  details = 0x5bc76e0, direct_set_prio = 0, end_time = 1554936276,
  end_time_exp = 4294967294, epilog_running = false, exit_code = 0,
  fed_details = 0x0, front_end_ptr = 0x0, gids = 0x7f30c80741f0, gres_list = 0x0,
  gres_alloc = 0x5bc7e50 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
  gres_req = 0x5bc7e70 "", gres_used = 0x0, group_id = 37123, job_id = 40338813,
  job_next = 0x558abf0, job_array_next_j = 0x5ddf210, job_array_next_t = 0x0,
  job_resrcs = 0x0, job_state = 32768, kill_on_node_fail = 1,
  last_sched_eval = 1554935188, licenses = 0x0, license_list = 0x0, limit_set = {
    qos = 0, time = 0, tres = 0x5bc78b0}, mail_type = 0, mail_user = 0x0,
  mem_per_tres = 0x0, mcs_label = 0x0, name = 0x5bc7db0 "LN_sweep.py",
  network = 0x0, next_step_id = 0, ngids = 4, nodes = 0x7f31b5651d40 "sh-106-68",
  node_addr = 0x7f31b4869c20, node_bitmap = 0x7f31b5695ab0,
  node_bitmap_cg = 0x7f31b40a6510, node_cnt = 1, node_cnt_wag = 1,
  nodes_completing = 0x7f3140094130 "sh-106-68", origin_cluster = 0x0,
  other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0,
  pack_job_offset = 0, pack_job_list = 0x0,
  partition = 0x7f31b40a9060 "owners,hns", part_ptr_list = 0x5e082a0,
  part_nodes_missing = false, part_ptr = 0x7f30f8068180, power_flags = 0 '\000',
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 70,
  priority_array = 0x7f3160342640, prio_factors = 0x5bc7cf0, profile = 0,
  qos_id = 3, qos_ptr = 0xbd1f80, qos_blocking_ptr = 0x0, reboot = 0 '\000',
  restart_cnt = 7, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0,
  requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0,
  select_jobinfo = 0x5bc7ec0, spank_job_env = 0x0, spank_job_env_size = 0,
  start_protocol_ver = 8448, start_time = 0, state_desc = 0x0, state_reason = 0,
  state_reason_prev = 35, step_list = 0x5e08430, suspend_time = 0,
  system_comment = 0x0, time_last_active = 1554936276, time_limit = 2880,
  time_min = 0, tot_sus_time = 0, total_cpus = 16, total_nodes = 1,
  tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0,
  tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x5bc8500,
  tres_req_str = 0x5bc8380 "1=16,2=65536,4=1,5=16",
  tres_fmt_req_str = 0x5bc83b0 "cpu=16,mem=64G,node=1,billing=16",
  tres_alloc_cnt = 0x7f31b5364150,
  tres_alloc_str = 0x7f31b637d1c0 "1=16,2=65536,3=18446744073709551614,4=1,5=16",
  tres_fmt_alloc_str = 0x7f31b4014890 "cpu=16,mem=64G,node=1,billing=16",
  user_id = 269223, user_name = 0x5bc7de0 "cmcfarl2", wait_all_nodes = 0,
  warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0,
  wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb)

Cheers,
--
Kilian
Comment 15 Dominik Bartkiewicz 2019-04-12 05:14:29 MDT
Hi

I have an idea.
Could you also send me the gdb output for the following?
t 1
p *(job_ptr->array_recs)

Dominik
Comment 16 Kilian Cavalotti 2019-04-12 09:28:34 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #15)
> I have some idea
> Could you also send me gdb output for?
> t 1
> p *(job_ptr->array_recs)

Sure!

(gdb) t 1
[Switching to thread 1 (Thread 0x7f312c5c5700 (LWP 51365))]
#0  _step_dealloc_lps (step_ptr=0x7f30c0764a30) at step_mgr.c:2105
2105    in step_mgr.c
(gdb) p *(job_ptr->array_recs)
$1 = {task_cnt = 0, task_id_bitmap = 0x0, task_id_str = 0x0, array_flags = 1,
  max_run_tasks = 0, tot_run_tasks = 119, min_exit_code = 9,
  max_exit_code = 9, pend_run_tasks = 0, tot_comp_tasks = 312}

Cheers,
-- 
Kilian
Comment 18 Dominik Bartkiewicz 2019-04-15 07:16:31 MDT
Hi

I assume that you are running slurmctld with Marshall's patch from comment 5.
Could you send me the current slurmctld.log?

Dominik
Comment 19 Kilian Cavalotti 2019-04-15 11:18:26 MDT
Created attachment 9912 [details]
current slurmctld logs

(In reply to Dominik Bartkiewicz from comment #18)
> Hi
> 
> I assume that you are running slurmctld with Marshall patch from comment 5.

We do, yes.

> Could you send me current slurmctld.log?

Here is the latest one. No mention of XXXBUG6837XXX, though.

Cheers,
-- 
Kilian
Comment 22 Dominik Bartkiewicz 2019-04-17 12:43:55 MDT
Hi

I have finally reproduced this issue.
I now know where the problem is located, and I am working on a fix.

Dominik
Comment 26 Kilian Cavalotti 2019-04-18 14:39:20 MDT
Hi Moe,

Thanks a lot for the patch and for the explanation, this is very much
appreciated!

On Thu, Apr 18, 2019 at 12:51 PM <bugs@schedmd.com> wrote:
> I don't know how you could have found this, but great job!

I guess we have a certain knack for getting ourselves in corner cases,
these days. :)

Cheers,
Comment 27 Moe Jette 2019-04-18 14:42:11 MDT
(In reply to Kilian Cavalotti from comment #26)
> Hi Moe,
> 
> Thanks a lot for the patch and for the explanation, this is very much
> appreciated!
> 
> On Thu, Apr 18, 2019 at 12:51 PM <bugs@schedmd.com> wrote:
> > I don't know how you could have found this, but great job!
> 
> I guess we have a certain knack for getting ourselves in corner cases,
> these days. :)
> 
> Cheers,

Kilian,
The patch and comments were meant for Dominik to review before sending them on to you (I messed up by not selecting the "Private" button first). I think this fixes your problem, but I would recommend giving Dominik time to review and test it before installing it on your systems.
Comment 28 Kilian Cavalotti 2019-04-18 14:46:39 MDT
No worries, we'll wait for final review before deploying it.
And thanks again, as always.

Cheers,
Comment 31 Dominik Bartkiewicz 2019-04-23 08:55:53 MDT
Created attachment 9994 [details]
patch proposal

Hi

This patch is a slightly modified version of the patch previously attached by Moe.
It should fix this issue.

Dominik
Comment 32 Kilian Cavalotti 2019-04-23 11:22:55 MDT
Hi Dominik, 

Thank you! Will it be in the next 18.08, or only in 19.05?

Cheers,
-- 
Kilian
Comment 33 Dominik Bartkiewicz 2019-04-24 00:47:05 MDT
Hi

I expect that this will also go to 18.08.

Dominik
Comment 35 Dominik Bartkiewicz 2019-04-25 03:30:15 MDT
Hi

We added this fix as:
https://github.com/SchedMD/slurm/commit/70d12f070908c33
It will be in the next 18.08 and 19.05 releases.
If you are OK with this, I would like to close this ticket.

Dominik
Comment 36 Kilian Cavalotti 2019-04-25 08:46:30 MDT
(In reply to Dominik Bartkiewicz from comment #35)
> Hi
> 
> We added this fix as:
> https://github.com/SchedMD/slurm/commit/70d12f070908c33
> It will be in the next 18.08 and 19.05 releases.
> If you are OK with this, I would like to close this ticket.

Perfect, yes, thank you!

Cheers,
-- 
Kilian
Comment 38 Dominik Bartkiewicz 2019-05-16 16:02:38 MDT
*** Ticket 6386 has been marked as a duplicate of this ticket. ***
Comment 39 Dominik Bartkiewicz 2019-06-25 04:30:37 MDT
*** Ticket 7253 has been marked as a duplicate of this ticket. ***
Comment 40 Dominik Bartkiewicz 2020-01-20 09:04:34 MST
*** Ticket 8322 has been marked as a duplicate of this ticket. ***