Ticket 9474 - After adding a node to slurm.conf and restarting, slurmctld always segfaults
Summary: After adding a node to slurm.conf and restarting, slurmctld always segfaults
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-07-27 17:47 MDT by john.blaas
Modified: 2021-03-30 04:25 MDT
CC List: 3 users

See Also:
Site: University of Colorado
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: RHEL
Machine Name:
CLE Version:
Version Fixed: 20.02.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (8.61 KB, text/plain)
2020-07-27 17:52 MDT, john.blaas
Details
topology.conf (657 bytes, text/plain)
2020-07-27 17:52 MDT, john.blaas
Details
full bt from our last core file (3.94 KB, text/plain)
2020-07-27 17:57 MDT, Jonathon Anderson
Details
slurmctld.log (33.52 MB, text/plain)
2020-07-27 18:33 MDT, john.blaas
Details
v1 19.05.5 skip segfault patch (616 bytes, patch)
2020-07-27 19:35 MDT, Broderick Gardner
Details | Diff
core file (1) (1.30 MB, application/x-gzip)
2020-07-29 12:12 MDT, Jonathon Anderson
Details
core file (2) (1.30 MB, application/x-gzip)
2020-07-29 12:12 MDT, Jonathon Anderson
Details
core file (3) (1.30 MB, application/x-gzip)
2020-07-29 12:13 MDT, Jonathon Anderson
Details
Job 9623092 slurmctld log (5.60 MB, text/plain)
2020-09-03 08:55 MDT, john.blaas
Details

Description john.blaas 2020-07-27 17:47:04 MDT
We recently tried to add a node to one of our clusters. After adding that node we were unable to restart slurmctld. Even after removing the node from the configuration we are still unable to start slurmctld successfully; it segfaults every time it tries to start.

A sample of gdb output from one of the core files is below:

gdb $(which slurmctld) core-slurmctld.29235.slurm3.rc.int.colorado.edu.1595892178 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 29235]
[New LWP 29236]
[New LWP 29239]
[New LWP 29238]
[New LWP 29240]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld -vvv -D'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0xea0e90) at step_mgr.c:2107
2107	step_mgr.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.5-1.el7.x86_64
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0xea0e90) at step_mgr.c:2107
#1  post_job_step (step_ptr=step_ptr@entry=0xea0e90) at step_mgr.c:4822
#2  0x00000000004b813b in _internal_step_complete (job_ptr=job_ptr@entry=0xea02b0, step_ptr=step_ptr@entry=0xea0e90) at step_mgr.c:302
#3  0x00000000004b81b9 in delete_step_records (job_ptr=job_ptr@entry=0xea02b0) at step_mgr.c:331
#4  0x000000000046d00c in cleanup_completing (job_ptr=job_ptr@entry=0xea02b0) at job_scheduler.c:4899
#5  0x000000000047a24c in deallocate_nodes (job_ptr=job_ptr@entry=0xea02b0, timeout=timeout@entry=false, suspended=suspended@entry=false, preempted=preempted@entry=false) at node_scheduler.c:626
#6  0x00000000004a0362 in _sync_nodes_to_comp_job () at read_config.c:2625
#7  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1425
#8  0x000000000042d729 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:662
Comment 1 john.blaas 2020-07-27 17:52:08 MDT
Created attachment 15187 [details]
slurm.conf
Comment 2 john.blaas 2020-07-27 17:52:30 MDT
Created attachment 15188 [details]
topology.conf
Comment 3 Jonathon Anderson 2020-07-27 17:57:56 MDT
Created attachment 15189 [details]
full bt from our last core file
Comment 4 Jason Booth 2020-07-27 18:19:01 MDT
Can you please create a tarball of your StateSaveLocation and upload it to this bug for analysis?
Comment 5 Jason Booth 2020-07-27 18:21:23 MDT
Please also attach your slurmctld.log.
Comment 6 john.blaas 2020-07-27 18:28:04 MDT
Created attachment 15191 [details]
state directory for the cluster in question

state directory
Comment 7 john.blaas 2020-07-27 18:33:54 MDT
Created attachment 15192 [details]
slurmctld.log
Comment 8 john.blaas 2020-07-27 18:34:22 MDT
State directory and slurmctld log added
Comment 9 Jason Booth 2020-07-27 19:03:25 MDT
Please go to frame 1 and type:

 p *job_ptr
 p *job_ptr->job_resrcs->node_bitmap

It looks like job_resrcs is NULL.
Comment 10 Jonathon Anderson 2020-07-27 19:10:20 MDT
(gdb) frame 1
#1  post_job_step (step_ptr=step_ptr@entry=0x19ffa60) at step_mgr.c:4822
4822	in step_mgr.c
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x19ff580 "blanca-ibg", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x19ff550 "bmem-ibg1", alloc_resp_port = 0, alloc_sid = 44914, array_job_id = 0, array_task_id = 4294967294, 
  array_recs = 0x0, assoc_id = 451, assoc_ptr = 0x179c210, batch_features = 0x0, batch_flag = 2, batch_host = 0x28949d0 "bnode0219", billable_tres = 6, bit_flags = 344195072, burst_buffer = 0x0, burst_buffer_state = 0x0, 
  check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 6, cpus_per_tres = 0x0, cr_enabled = 0, db_flags = 4, db_index = 80965110, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x19ff150, direct_set_prio = 0, end_time = 1594292425, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gres_list = 0x0, gres_alloc = 0x1806d60 "", 
  gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x17e4270 "", gres_used = 0x0, group_id = 588305, job_id = 9093798, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 32772, 
  kill_on_node_fail = 1, last_sched_eval = 1594292425, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x19fecb0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, 
  name = 0x19ff4f0 "ersachunkUKB1.sh", network = 0x0, next_step_id = 0, nodes = 0x19ff3a0 "bnode0219", node_addr = 0x2890f20, node_bitmap = 0x2894800, node_bitmap_cg = 0x1850a70, node_cnt = 0, node_cnt_wag = 0, 
  nodes_completing = 0x183d5e0 "", origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x19ff3d0 "blanca-ccn,blanca,login,blanca-curc,blanca-nso,blanca-ibg,blanca-ics,blanca-igg,blanca-mrg,blanca-el,blanca-sha,blanca-ceae,blanca-dhl,blanca-pccs,blanca-csdms,blanca-sol,blanca-rittger,blanca-mktg,bl"..., part_ptr_list = 0x2855eb0, part_nodes_missing = false, part_ptr = 0x18975f0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 9063, priority_array = 0x0, 
  prio_factors = 0x19fec00, profile = 0, qos_id = 63, qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, 
  resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x19ff5e0, site_factor = 2147483648, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8704, start_time = 0, 
  state_desc = 0x19ff320 "Job's QOS not permitted to use this partition (blanca-bortz allows blanca-bortz not preemptable)", state_reason = 36, state_reason_prev = 0, state_reason_prev_db = 36, step_list = 0x185e4d0, 
  suspend_time = 0, system_comment = 0x0, time_last_active = 1595893037, time_limit = 60, time_min = 0, tot_sus_time = 0, total_cpus = 6, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, 
  tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x1a04f30, tres_req_str = 0x1a00d20 "1=6,2=20480,4=1,5=6", tres_fmt_req_str = 0x1a00d50 "cpu=6,mem=20G,node=1,billing=6", 
  tres_alloc_cnt = 0x1a00c20, tres_alloc_str = 0x1a00cd0 "1=6,2=20480,3=18446744073709551614,4=1,5=6", tres_fmt_alloc_str = 0x1a00c90 "cpu=6,mem=20G,node=1,billing=6", user_id = 588305, user_name = 0x19ff520 "emsa1620", 
  wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb)  p *job_ptr->job_resrcs->node_bitmap
Cannot access memory at address 0x58
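For context on the output above: the address gdb cannot access (0x58) is simply node_bitmap's offset within the job resources structure added to the NULL base pointer, since the dump shows job_resrcs = 0x0. A standalone C illustration of that arithmetic (not Slurm source; the layout and the 0x58 offset are assumptions chosen to match the observed address):

/* offset_demo.c -- why dereferencing through a NULL job_resrcs
 * pointer faults at a small address such as 0x58 */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    char other_members[0x58];    /* hypothetical members before node_bitmap */
    unsigned long *node_bitmap;  /* the field gdb tried to print above */
} job_resources_demo_t;

int main(void)
{
    /* With a NULL base pointer, reading node_bitmap touches
     * address 0x0 + offsetof(node_bitmap) == 0x58. */
    printf("node_bitmap offset: %#zx\n",
           offsetof(job_resources_demo_t, node_bitmap));
    return 0;
}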
Comment 11 Broderick Gardner 2020-07-27 19:35:58 MDT
Created attachment 15193 [details]
v1 19.05.5 skip segfault patch

Please apply this patch to Slurm and restart slurmctld. It should skip over the bad state that is causing the segfault. Please make sure to save a core dump in case we need more information from it.

Let me know when you have restarted with the patch, as well as the status afterwards.

Thanks
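
The v1 patch itself is attached rather than quoted here, but judging by the error lines it later produces (see comment 25), it presumably adds a NULL check of roughly this shape before the step deallocation. A standalone sketch, not the actual patch or Slurm source; the *_demo types and names are simplified for illustration:

/* guard_demo.c -- skip a step whose job record has lost its job_resrcs
 * instead of dereferencing the NULL pointer and segfaulting */
#include <stdio.h>

typedef struct { unsigned long *node_bitmap; } job_resources_demo_t;
typedef struct { job_resources_demo_t *job_resrcs; } job_record_demo_t;
typedef struct { job_record_demo_t *job_ptr; const char *step_name; } step_record_demo_t;

static void post_job_step_demo(step_record_demo_t *step_ptr)
{
    job_record_demo_t *job_ptr = step_ptr->job_ptr;

    /* A job recovered from corrupt state can have job_resrcs == NULL;
     * log an error and return instead of crashing the controller. */
    if (!job_ptr->job_resrcs) {
        fprintf(stderr,
                "error: post_job_step: job resources for step %s is NULL\n",
                step_ptr->step_name);
        return;
    }
    /* ... the normal path would release the resources tracked in
     * job_resrcs (cf. _step_dealloc_lps() in step_mgr.c) ... */
}

int main(void)
{
    job_record_demo_t job = { .job_resrcs = NULL };   /* the bad state */
    step_record_demo_t step = { .job_ptr = &job, .step_name = "Extern" };

    post_job_step_demo(&step);   /* logs an error instead of segfaulting */
    return 0;
}

A guard like this only avoids the crash so slurmctld can finish starting; the underlying state corruption was investigated separately (comments 31 and 35).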
Comment 12 john.blaas 2020-07-27 20:03:25 MDT
The Slurm controller started up again with the patch applied. Things seem to be normal now after the restart.
Comment 13 Jason Booth 2020-07-27 20:04:38 MDT
Downgrading to severity 3 now that you are up and running.
Comment 14 Jonathon Anderson 2020-07-29 12:12:12 MDT
Created attachment 15230 [details]
core file (1)
Comment 15 Jonathon Anderson 2020-07-29 12:12:36 MDT
Created attachment 15231 [details]
core file (2)
Comment 16 Jonathon Anderson 2020-07-29 12:13:04 MDT
Created attachment 15232 [details]
core file (3)
Comment 17 Jonathon Anderson 2020-07-29 12:13:40 MDT
I attached three core dumps to the case. I expect they are the same as each other.
Comment 18 Broderick Gardner 2020-07-29 12:23:14 MDT
Can you describe exactly what was done prior to the first crash of slurmctld? As far as you are able, I would like to know which lines in slurm.conf were changed, and what they were changed to, when adding the node.

There have been a few instances of crashes like this over the years, and we have never been able to reproduce them. Any information you can give about the circumstances around the crash might be helpful.

Thanks
Comment 19 john.blaas 2020-07-29 12:30:48 MDT
Initially we had just wanted to add a single node, which is currently commented out and only partially defined in the attached slurm.conf, as well as adding that node to the dummy switch in topology.conf.

The lines that are now missing, which we removed after experiencing the crashes in an attempt to revert the change, are the following:

NodeName=bgpu-papp1 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=191000 Feature=skylake,rhel7,Tesla,V100,avx2,avx512
PartitionName=blanca-papp Nodes=bgpu-papp1 MaxTime=7-0 DefMemPerCPU=5800 Shared=no AllowQOS=blanca-papp
Comment 20 Jonathon Anderson 2020-07-29 13:03:33 MDT
Potentially worth knowing is that the final line was initially added to slurm.conf without a trailing newline.
Comment 24 Dominik Bartkiewicz 2020-09-02 09:16:56 MDT
Hi

Are you still using 19.05.5 with the patch from comment 11?
If so, have you noticed any errors like "job resources for step %pS is NULL" in the slurmctld log?

Unfortunately, I can't recreate this issue.
We have observed similar issues on 19.05 from time to time (e.g. bug 7641), but we have never tracked down the root cause.

Dominik
Comment 25 Jonathon Anderson 2020-09-02 16:09:14 MDT
I believe that we are still using 19.05.5 with the patch from comment 11. We are planning to upgrade to a slurm 20.x.y (John can specify) soon.

Our logs show these:

Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Extern is NULL
Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Batch is NULL
Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Extern is NULL
Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Batch is NULL

These appear to have occurred after our initial problem on July 27.
Comment 26 john.blaas 2020-09-02 16:39:24 MDT
Dominik,

We are still using the patched version but are planning to upgrade to 20.02 next Wednesday during our planned maintenance period.

Thanks,
John
Comment 27 Dominik Bartkiewicz 2020-09-03 02:46:13 MDT
Hi

Could you send me the slurmctld log covering the whole life of job 9623092?

Dominik
Comment 28 john.blaas 2020-09-03 08:55:52 MDT
Created attachment 15725 [details]
Job 9623092 slurmctld log
Comment 29 john.blaas 2020-09-03 08:58:55 MDT
Dominik,

I have added the slurmctld log covering the lifetime of job 9623092.

Thanks,
John
Comment 31 Dominik Bartkiewicz 2020-10-05 06:20:16 MDT
Hi

I don't see any code path related to adding nodes that could lead to job_resrcs being set to NULL.
It looks like a coincidence.

In the logs, I found multiple errors that occurred while writing state files to disk.
Aug  5 05:00:21 slurm3 slurmctld[733]: error: Could not open job state file /curc/slurm/blanca/state/job_state: Permission denied
Aug  5 05:00:21 slurm3 slurmctld[733]: error: NOTE: Trying backup state save file. Jobs may be lost!
Aug  5 05:00:21 slurm3 slurmctld[733]: error: Can't save state, create file /curc/slurm/blanca/state/job_state.new error Permission denied

I think this is the most probable reason for the corrupt state (missing job_resrcs) of job 9623092.

Do you think we can close this bug?
If a similar problem occurs again on 20.02 you can reopen this bug or create a new one.

Dominik
Comment 32 Jonathon Anderson 2020-10-05 11:20:14 MDT
I agree that adding nodes is likely a coincidence. That just led to us restarting slurmctld.

Regardless of why Slurm got into this state, it seems that it should be able to handle it on bring-up rather than segfault. Is that something that has been, or could be, fixed?
Comment 33 Dominik Bartkiewicz 2021-03-10 07:39:59 MST
Hi

Sorry for the delay.
I made a few attempts, but I am still not able to reproduce this issue.
It looks like this issue still exists in 20.11 -- bug 10980.
I hope that with the new data from bug 10980, we will be able to track it down and fix it.

Dominik
Comment 35 Dominik Bartkiewicz 2021-03-26 11:12:51 MDT
Hi

Finally, we can reproduce this issue.
The bug you hit is already fixed (in Slurm 20.02 and above) by this commit:
https://github.com/SchedMD/slurm/commit/2cb65cb2b2e
We still have one code path that can lead to a similar issue.
It is tracked in bug 10980.
Let me know if we can close this ticket.

Dominik
Comment 36 Jonathon Anderson 2021-03-29 10:22:05 MDT
Thanks for letting us know.
Comment 37 Dominik Bartkiewicz 2021-03-30 04:25:42 MDT
Hi

I'll go ahead and close this out. Feel free to reopen if needed.

Dominik