We recently tried to add a node to one of our clusters. After adding that node we were unable to restart slurmctld. Even after removing the node from the configuration we are still unable to start slurmctld successfully; it segfaults each time it tries to start. A sample of gdb output from one of the core files is below:

gdb $(which slurmctld) core-slurmctld.29235.slurm3.rc.int.colorado.edu.1595892178
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 29235]
[New LWP 29236]
[New LWP 29239]
[New LWP 29238]
[New LWP 29240]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld -vvv -D'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0xea0e90) at step_mgr.c:2107
2107	step_mgr.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.5-1.el7.x86_64
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0xea0e90) at step_mgr.c:2107
#1  post_job_step (step_ptr=step_ptr@entry=0xea0e90) at step_mgr.c:4822
#2  0x00000000004b813b in _internal_step_complete (job_ptr=job_ptr@entry=0xea02b0, step_ptr=step_ptr@entry=0xea0e90) at step_mgr.c:302
#3  0x00000000004b81b9 in delete_step_records (job_ptr=job_ptr@entry=0xea02b0) at step_mgr.c:331
#4  0x000000000046d00c in cleanup_completing (job_ptr=job_ptr@entry=0xea02b0) at job_scheduler.c:4899
#5  0x000000000047a24c in deallocate_nodes (job_ptr=job_ptr@entry=0xea02b0, timeout=timeout@entry=false, suspended=suspended@entry=false, preempted=preempted@entry=false) at node_scheduler.c:626
#6  0x00000000004a0362 in _sync_nodes_to_comp_job () at read_config.c:2625
#7  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1425
#8  0x000000000042d729 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:662
Created attachment 15187 [details] slurm.conf
Created attachment 15188 [details] topology.conf
Created attachment 15189 [details] full bt from our last core file
Can you please take a tarball of your StateSaveLocation and upload it to this bug for analysis?
Please also attach your slurmctld.log.
Created attachment 15191 [details] state directory for the cluster in question
Created attachment 15192 [details] slurmctld.log
State directory and slurmctld log added
Please go to frame 1 and type:

p *job_ptr
p *job_ptr->job_resrcs->node_bitmap

It looks like job_resrcs is NULL.
(gdb) frame 1
#1  post_job_step (step_ptr=step_ptr@entry=0x19ffa60) at step_mgr.c:4822
4822	in step_mgr.c
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x19ff580 "blanca-ibg", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x19ff550 "bmem-ibg1", alloc_resp_port = 0, alloc_sid = 44914, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 451, assoc_ptr = 0x179c210, batch_features = 0x0, batch_flag = 2, batch_host = 0x28949d0 "bnode0219", billable_tres = 6, bit_flags = 344195072, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 6, cpus_per_tres = 0x0, cr_enabled = 0, db_flags = 4, db_index = 80965110, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x19ff150, direct_set_prio = 0, end_time = 1594292425, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gres_list = 0x0, gres_alloc = 0x1806d60 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x17e4270 "", gres_used = 0x0, group_id = 588305, job_id = 9093798, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 32772, kill_on_node_fail = 1, last_sched_eval = 1594292425, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x19fecb0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x19ff4f0 "ersachunkUKB1.sh", network = 0x0, next_step_id = 0, nodes = 0x19ff3a0 "bnode0219", node_addr = 0x2890f20, node_bitmap = 0x2894800, node_bitmap_cg = 0x1850a70, node_cnt = 0, node_cnt_wag = 0, nodes_completing = 0x183d5e0 "", origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x19ff3d0 "blanca-ccn,blanca,login,blanca-curc,blanca-nso,blanca-ibg,blanca-ics,blanca-igg,blanca-mrg,blanca-el,blanca-sha,blanca-ceae,blanca-dhl,blanca-pccs,blanca-csdms,blanca-sol,blanca-rittger,blanca-mktg,bl"..., part_ptr_list = 0x2855eb0, part_nodes_missing = false, part_ptr = 0x18975f0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 9063, priority_array = 0x0, prio_factors = 0x19fec00, profile = 0, qos_id = 63, qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x19ff5e0, site_factor = 2147483648, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8704, start_time = 0, state_desc = 0x19ff320 "Job's QOS not permitted to use this partition (blanca-bortz allows blanca-bortz not preemptable)", state_reason = 36, state_reason_prev = 0, state_reason_prev_db = 36, step_list = 0x185e4d0, suspend_time = 0, system_comment = 0x0, time_last_active = 1595893037, time_limit = 60, time_min = 0, tot_sus_time = 0, total_cpus = 6, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x1a04f30, tres_req_str = 0x1a00d20 "1=6,2=20480,4=1,5=6", tres_fmt_req_str = 0x1a00d50 "cpu=6,mem=20G,node=1,billing=6", tres_alloc_cnt = 0x1a00c20, tres_alloc_str = 0x1a00cd0 "1=6,2=20480,3=18446744073709551614,4=1,5=6", tres_fmt_alloc_str = 0x1a00c90 "cpu=6,mem=20G,node=1,billing=6", user_id = 588305, user_name = 0x19ff520 "emsa1620", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) p *job_ptr->job_resrcs->node_bitmap
Cannot access memory at address 0x58
Created attachment 15193 [details] v1 19.05.5 skip segfault patch

Please apply this patch to Slurm and restart slurmctld. This should skip over the bad state causing the segfault. Please make sure to save a core dump in case we need more information from it. Let me know when you have restarted with the patch, as well as the status afterwards. Thanks
Slurm controller started up again with the patch applied. Things seem to be normal now after the restart.
Downgrading to severity 3 now that you are up and running.
Created attachment 15230 [details] core file (1)
Created attachment 15231 [details] core file (2)
Created attachment 15232 [details] core file (3)
I attached three core dumps to the case. I expect they are the same as each other.
Can you describe exactly what was done prior to the first crash of the slurmctld? As far as you are able, I would like to know exactly which lines in slurm.conf were changed when adding the node, and what they were changed to. There have been a few instances of crashes like this over the years, and we have never been able to reproduce it, so any information you can give about the circumstances around the crash might be helpful. Thanks
Initially we had wanted to add just a single node, which is currently commented out and partially present in the attached slurm.conf, as well as to add the node to the dummy switch in topology.conf. The lines now missing, which we removed after experiencing the crashes in an attempt to revert the changes, are the following:

NodeName=bgpu-papp1 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=191000 Feature=skylake,rhel7,Tesla,V100,avx2,avx512
PartitionName=blanca-papp Nodes=bgpu-papp1 MaxTime=7-0 DefMemPerCPU=5800 Shared=no AllowQOS=blanca-papp
Potentially worth knowing: that final PartitionName line was initially added to slurm.conf without a trailing newline.
Hi

Are you still using 19.05.5 with the patch from comment 11? If so, have you noticed any errors like "job resources for step %pS is NULL" in the slurmctld log?

Unfortunately, I can't recreate this issue. We have observed a similar issue on 19.05 from time to time (e.g. bug 7641), but we have never tracked down its root cause.

Dominik
I believe that we are still using 19.05.5 with the patch from comment 11. We are planning to upgrade to Slurm 20.x.y (John can specify) soon. Our logs show these:

Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Extern is NULL
Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Batch is NULL
Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Extern is NULL
Aug 13 13:40:17 slurm3 slurmctld[7155]: error: post_job_step: job resources for step JobId=9623092_230(9623092) StepId=Batch is NULL

These appear to have occurred after our initial problem on July 27.
Dominik, We are still using the patched version but are planning to upgrade to 20.02 next Wednesday during our planned maintenance period. Thanks, John
Hi

Could you send me the slurmctld log covering the whole life of job 9623092?

Dominik
Created attachment 15725 [details] Job 9623092 slurmctld log
Dominik, I have added the slurmctld log covering the lifetime of the job 9623092. Thanks, John
Hi

I don't see any code path related to adding nodes that could lead to job_resrcs being nulled; it looks like a coincidence. In the logs, I have found multiple errors that occurred while writing state files to disk:

Aug 5 05:00:21 slurm3 slurmctld[733]: error: Could not open job state file /curc/slurm/blanca/state/job_state: Permission denied
Aug 5 05:00:21 slurm3 slurmctld[733]: error: NOTE: Trying backup state save file. Jobs may be lost!
Aug 5 05:00:21 slurm3 slurmctld[733]: error: Can't save state, create file /curc/slurm/blanca/state/job_state.new error Permission denied

I think this is the most probable cause of the corruption (missing job_resrcs) of job 9623092.

Do you think we can close this bug? If a similar problem occurs again on 20.02, you can reopen this bug or create a new one.

Dominik
I agree that adding nodes is likely a coincidence; that just led to us restarting slurmctld. Regardless of why Slurm got into this state, it seems that it should be able to handle it on bring-up rather than segfault. Is that something that has been, or could be, fixed?
Hi

Sorry for the delay. I made a few attempts, but I am still not able to reproduce this issue. It looks like this issue still exists in 20.11 -- bug 10980. I hope that with the new data from 10980 we will be able to track this down and fix it.

Dominik
Hi

Finally, we can reproduce this issue. The bug you hit is already fixed (Slurm 20.02 and above) by the commit https://github.com/SchedMD/slurm/commit/2cb65cb2b2e

We still have one code path which can lead to a similar issue. It is tracked in bug 10980.

Let me know if we can close this ticket.

Dominik
Thanks for letting us know.
Hi

I'll go ahead and close this out. Feel free to reopen if needed.

Dominik