Ticket 5320 - Job stuck completing
Summary: Job stuck completing
Status: RESOLVED DUPLICATE of ticket 5103
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.02.7
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Alejandro Sanchez
Reported: 2018-06-15 03:31 MDT by GSK-ONYX-SLURM
Modified: 2018-07-02 04:32 MDT
Site: GSK


Description GSK-ONYX-SLURM 2018-06-15 03:31:38 MDT

    
Comment 1 GSK-ONYX-SLURM 2018-06-15 04:01:02 MDT
We have a node stuck in the completing state.  One job on it has been completing since yesterday.  I've tried to scancel it a couple of times, but nothing happened.

[root@uk1salx00552 slurm]# /usr/local/slurm/bin/sinfo -Nl --node=uk1salx00552
Fri Jun 15 10:25:03 2018
NODELIST      NODES       PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
uk1salx00552      1         uk_hpc*  completing   48   48:1:1 257526     2036      1   (null) none
uk1salx00552      1 uk_columbus_tst  completing   48   48:1:1 257526     2036      1   (null) none
[root@uk1salx00552 slurm]#
[root@uk1salx00552 slurm]# /usr/local/slurm/bin/sinfo --version
slurm 17.02.7

The slurmd log just kept repeating the same messages relating to 274943 over and over.

[root@uk1salx00552 slurm]# tail slurmd-uk1salx00552.log
[2018-06-15T10:37:38.656] debug:  _rpc_terminate_job, uid = 63124
[2018-06-15T10:37:38.656] debug:  task_p_slurmd_release_resources: affinity jobid 274943
[2018-06-15T10:38:38.710] debug2: got this type of message 6011
[2018-06-15T10:38:38.710] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2018-06-15T10:38:38.710] debug:  _rpc_terminate_job, uid = 63124
[2018-06-15T10:38:38.710] debug:  task_p_slurmd_release_resources: affinity jobid 274943
[2018-06-15T10:39:38.768] debug2: got this type of message 6011
[2018-06-15T10:39:38.768] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2018-06-15T10:39:38.768] debug:  _rpc_terminate_job, uid = 63124
[2018-06-15T10:39:38.768] debug:  task_p_slurmd_release_resources: affinity jobid 274943

The slurmctld log only had messages related to this job from the previous day:

[root@uk1sxlx00087 slurm]# grep 274943 slurmctld.log
[2018-06-14T14:56:21.977] sched: Allocate JobID=270430_274(274943) NodeList=uk1salx00552 #CPUs=1 Partition=uk_columbus_tst
[2018-06-14T17:26:45.037] Time limit exhausted for JobId=274943
[2018-06-14T17:27:36.077] Resending TERMINATE_JOB request JobId=274943 Nodelist=uk1salx00552


So I then killed the job
[root@uk1salx00552 slurm]# ps -ef | grep 274943
root     12473     1  0 Jun14 ?        00:00:00 slurmstepd: [274943]
root     17410 30988  0 10:42 pts/6    00:00:00 grep 274943
[root@uk1salx00552 slurm]#
[root@uk1salx00552 slurm]#
[root@uk1salx00552 slurm]# kill -9 12473
[root@uk1salx00552 slurm]# ps -ef | grep 274943
root     19072 30988  0 10:46 pts/6    00:00:00 grep 274943
[root@uk1salx00552 slurm]#

which seemed to free things up, but I got an awful lot of messages in the slurmctld and slurmd log files, e.g.:

[2018-06-15T10:46:37.750] error: Orphan job 275396.batch reported on node uk1salx00552
[2018-06-15T10:46:37.750] error: Orphan job 275387.batch reported on node uk1salx00552
[2018-06-15T10:46:37.750] error: Orphan job 275316.batch reported on node uk1salx00552
[2018-06-15T10:46:37.750] error: Orphan job 275326.batch reported on node uk1salx00552
[2018-06-15T10:46:37.750] error: Orphan job 275315.batch reported on node uk1salx00552
[2018-06-15T10:46:37.750] error: Orphan job 275388.batch reported on node uk1salx00552
[2018-06-15T10:46:37.750] error: Orphan job 275398.batch reported on node uk1salx00552


Did I do a bad thing using "kill -9" on the slurmstepd process?

Was there a better way to handle this situation?

What do you need to identify why this situation occurred?  I've not seen this before.

Thanks.
Mark.
Comment 2 Alejandro Sanchez 2018-06-15 06:16:35 MDT
Hey Mark -

Usually the cause for a node stuck in a completing state is either:

a) Epilog script doing weird stuff and/or running indefinitely
b) slurmstepd not exiting, which in turn could be triggered by a slurmstepd deadlock for instance.

One of the causes of a very subtle deadlock was addressed in this bug:

https://bugs.schedmd.com/show_bug.cgi?id=5103

which was fixed starting with 17.11.6. If you hadn't killed the stepd, we could have attached gdb to it to see whether the backtrace matched those reported in that bug.

The logged message:

[2018-06-15T10:46:37.750] error: Orphan job 275396.batch reported on node uk1salx00552 

is logged by slurmctld when a node registers back with it. slurmctld then validates that the jobs slurmd reports as running on the node are really supposed to be there. If they aren't, this error is logged and the job is aborted on the node.
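
As a rough illustration of the check described above (a simplified sketch only, not slurmctld's actual code), the orphan test amounts to a set difference between what the node reports and what the controller expects. The function name and the job ID 275400 are hypothetical; the other IDs are taken from the log lines in this ticket.

```python
from __future__ import annotations


def find_orphans(expected: set[str], reported: set[str]) -> set[str]:
    """Return the job steps the node reports that the controller does not expect.

    expected: steps the controller believes should be on the node (assumed input).
    reported: steps the node's slurmd says are running when it registers.
    """
    return reported - expected


if __name__ == "__main__":
    # 275400.batch is a made-up ID standing in for a job that really belongs on the node.
    expected = {"275400.batch"}
    reported = {"275396.batch", "275387.batch", "275400.batch"}
    for step in sorted(find_orphans(expected, reported)):
        print(f"error: Orphan job {step} reported on node uk1salx00552")
```

Every step in the resulting set would get its own "Orphan job" line, which is consistent with the burst of messages seen after the node re-registered.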

What I'd do now is:

1. See if you have any kind of epilog script configured and double-check it will always finish execution.

2. DRAIN uk1salx00552 and, once there are no more jobs on it, reboot it.

3. Whenever possible, we encourage sites to stay on the latest stable release. Many fixes have been made since 17.02.7, including stepd deadlock fixes.
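
Steps 1 and 2 above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the drain reason text is only an example, and the configuration text is assumed to be the `Key = Value` output of `scontrol show config`.

```python
from __future__ import annotations


def epilog_settings(config_text: str) -> dict[str, str]:
    """Pick the Epilog-related entries out of `scontrol show config` style text."""
    settings: dict[str, str] = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and "epilog" in key.lower():
            settings[key.strip()] = value.strip()
    return settings


def drain_node_cmd(node: str, reason: str) -> list[str]:
    """Build the scontrol command that drains a node (a Reason is required when draining)."""
    if not node or not reason:
        raise ValueError("node and reason must be non-empty")
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    # Illustrative config excerpt, not taken from the site in this ticket.
    sample = "Epilog                  = (null)\nProlog                  = (null)\n"
    print(epilog_settings(sample))
    print(drain_node_cmd("uk1salx00552", "stuck completing, reboot pending"))
```

The command list can be passed to `subprocess.run` as is. Keeping it as a list avoids shell quoting problems when the reason contains spaces.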

I'm gonna go ahead and close the bug. If you encounter this again after the reboot, please re-open and we will investigate the stuck stepd.

Thanks.

- Alex
Comment 3 GSK-ONYX-SLURM 2018-06-28 10:46:33 MDT
I'd like to re-open this bug please.  I have multiple slurmstepd processes that we know have finished running and done their work, but are not exiting.  Eventually they go to a completing state.  We've also found the problem is worse when we give each process two cores rather than one.

Please can you provide the instructions to get the backtrace info you need.

Thanks.
Mark.
Comment 4 Alejandro Sanchez 2018-07-02 03:30:32 MDT
Can you please attach gdb to one of those stuck stepd processes and attach the output of 'thread apply all bt'? Thanks.
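
One possible way to collect that output non-interactively is sketched below. It assumes gdb is installed on the node and that the caller has permission to attach to the process; the helper name is hypothetical. The options used (`-p`, `-batch`, `-ex`) are standard gdb command-line options. Note that `gdb attach <pid>` makes gdb look for a program file named `attach`, which is why the transcript in this ticket prints "attach: No such file or directory." before attaching.

```python
from __future__ import annotations

import shlex


def gdb_backtrace_cmd(pid: int) -> list[str]:
    """Build a gdb invocation that prints every thread's backtrace and then exits."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),               # attach to the running process by PID
        "-batch",                     # run the -ex commands, then detach and exit
        "-ex", "set pagination off",  # avoid "Type <return> to continue" prompts
        "-ex", "thread apply all bt",
    ]


if __name__ == "__main__":
    # 11805 is one of the stuck slurmstepd PIDs shown in this ticket.
    print(shlex.join(gdb_backtrace_cmd(11805)))
```

Redirecting the command's output to a file gives a single clean attachment without pager prompts in the middle of stack frames.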
Comment 5 GSK-ONYX-SLURM 2018-07-02 04:23:22 MDT
Hi Alex.

Here's the output from one process, which is representative of all but one of the CG slurmstepd processes:

[root@uk1salx00552 slurm]# gdb attach 11805
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 11805
Reading symbols from /usr/local/slurm/sbin/slurmstepd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 18769]
[New LWP 17527]
[New LWP 11808]
[New LWP 11807]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_compat.so.2
Reading symbols from /usr/lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnsl.so.1
Reading symbols from /usr/lib64/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_nis.so.2
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/select_cons_res.so
Reading symbols from /usr/local/slurm/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/switch_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/gres_gpu.so
Reading symbols from /usr/local/slurm/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/core_spec_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_affinity.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_affinity.so
Reading symbols from /usr/local/slurm/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/crypto_munge.so
Reading symbols from /usr/local/slurm/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/job_container_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/mpi_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/mpi_none.so
0x00002b1ad06e8f57 in pthread_join () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-17.02.7-1.el7.x86_64
(gdb) thread apply all bt

Thread 5 (Thread 0x2b1ad3591700 (LWP 11807)):
#0  0x00002b1ad0a01eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b1ad0a6360d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00002b1ad0a635bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00002b1ad0a63662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00002b1ad06e7e38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b1ad09f434d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x2b1ad3692700 (LWP 11808)):
#0  0x00002b1ad09e9a3d in poll () from /usr/lib64/libc.so.6
#1  0x0000000000453a7a in _poll_internal (pfds=0x2b1ad80009f0, nfds=2, shutdown_time=0) at eio.c:362
#2  0x0000000000453847 in eio_handle_mainloop (eio=0x1397fd0) at eio.c:326
#3  0x000000000043d2c7 in _msg_thr_internal (job_arg=0x138e820) at req.c:242
#4  0x00002b1ad06e7e25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b1ad09f434d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x2b1acf97c700 (LWP 17527)):
#0  0x00002b1ad0a01eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b1ad097e8c8 in _L_lock_2251 () from /usr/lib64/libc.so.6
#2  0x00002b1ad09770ef in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b1ad097c0fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b1ad06e7e25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b1ad09f434d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x2b1ad2d84700 (LWP 18769)):
#0  0x00002b1ad0a01eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b1ad097e8c8 in _L_lock_2251 () from /usr/lib64/libc.so.6
#2  0x00002b1ad09770ef in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b1ad097c0fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b1ad06e7e25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b1ad09f434d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x2b1acf8795c0 (LWP 11805)):
#0  0x00002b1ad06e8f57 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x000000000054dc32 in acct_gather_profile_fini () at slurm_acct_gather_profile.c:267
#2  0x000000000042f745 in job_manager (job=0x138e820) at mgr.c:1309
#3  0x000000000042a6a7 in main (argc=1, argv=0x7fff78f50808) at slurmstepd.c:183
(gdb)

But... there is one process that had a thread count of

[root@uk1salx00552 slurm]# gdb attach 28568
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 28568
Reading symbols from /usr/local/slurm/sbin/slurmstepd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 29527]
[New LWP 37659]
[New LWP 14855]
[New LWP 39666]
[New LWP 16219]
[New LWP 41466]
[New LWP 18536]
[New LWP 43760]
[New LWP 20707]
[New LWP 46302]
[New LWP 23056]
[New LWP 48477]
[New LWP 18219]
[New LWP 43465]
[New LWP 20486]
[New LWP 45874]
[New LWP 22592]
[New LWP 48026]
[New LWP 24824]
[New LWP 1115]
[New LWP 26884]
[New LWP 2602]
[New LWP 28249]
[New LWP 5322]
[New LWP 30510]
[New LWP 7465]
[New LWP 32962]
[New LWP 9787]
[New LWP 35043]
[New LWP 12189]
[New LWP 35575]
[New LWP 12477]
[New LWP 37931]
[New LWP 14802]
[New LWP 40066]
[New LWP 17182]
[New LWP 42513]
[New LWP 19435]
[New LWP 45093]
[New LWP 22072]
[New LWP 47436]
[New LWP 24158]
[New LWP 937]
[New LWP 26639]
[New LWP 3511]
[New LWP 29085]
[New LWP 6026]
[New LWP 31334]
[New LWP 8464]
[New LWP 33944]
[New LWP 10694]
[New LWP 36185]
[New LWP 13296]
[New LWP 38683]
[New LWP 15565]
[New LWP 28924]
[New LWP 5919]
[New LWP 31205]
[New LWP 8204]
[New LWP 33522]
[New LWP 10455]
[New LWP 35920]
[New LWP 12758]
[New LWP 37892]
[New LWP 14037]
[New LWP 39609]
[New LWP 16505]
[New LWP 41813]
[New LWP 18926]
[New LWP 44405]
[New LWP 21153]
[New LWP 46825]
[New LWP 23839]
[New LWP 43973]
[New LWP 21205]
[New LWP 46829]
[New LWP 23391]
[New LWP 48962]
[New LWP 26088]
[New LWP 2898]
[New LWP 28462]
[New LWP 5614]
[New LWP 30939]
[New LWP 7898]
[New LWP 33400]
[New LWP 10555]
[New LWP 35690]
[New LWP 12841]
[New LWP 38274]
[New LWP 15091]
[New LWP 40572]
[New LWP 17631]
[New LWP 43144]
[New LWP 20009]
[New LWP 45647]
[New LWP 22650]
[New LWP 47944]
[New LWP 24967]
[New LWP 43355]
[New LWP 20098]
[New LWP 45716]
[New LWP 22694]
[New LWP 47993]
[New LWP 25017]
[New LWP 1772]
[New LWP 27119]
[New LWP 3892]
[New LWP 28655]
[New LWP 5796]
[New LWP 30881]
[New LWP 8110]
[New LWP 33663]
[New LWP 10396]
[New LWP 35915]
[New LWP 13091]
[New LWP 38463]
[New LWP 10505]
[New LWP 35925]
[New LWP 12824]
[New LWP 38075]
[New LWP 13403]
[New LWP 14321]
[New LWP 38725]
[New LWP 15216]
[New LWP 40464]
[New LWP 17131]
[New LWP 42443]
[New LWP 19294]
[New LWP 44672]
[New LWP 20832]
[New LWP 45784]
[New LWP 22416]
[New LWP 47573]
[New LWP 23958]
[New LWP 827]
[New LWP 26532]
[New LWP 3688]
[New LWP 29469]
[New LWP 6554]
[New LWP 31975]
[New LWP 9222]
[New LWP 27539]
[New LWP 4603]
[New LWP 30138]
[New LWP 7076]
[New LWP 32490]
[New LWP 9846]
[New LWP 35281]
[New LWP 12082]
[New LWP 37452]
[New LWP 13736]
[New LWP 39094]
[New LWP 16541]
[New LWP 42731]
[New LWP 20481]
[New LWP 46324]
[New LWP 20686]
[New LWP 40907]
[New LWP 12789]
[New LWP 30815]
[New LWP 5531]
[New LWP 29800]
[New LWP 27733]
[New LWP 32786]
[New LWP 42913]
[New LWP 20158]
[New LWP 44773]
[New LWP 28571]
[New LWP 28570]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_compat.so.2
Reading symbols from /usr/lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnsl.so.1
Reading symbols from /usr/lib64/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_nis.so.2
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/select_cons_res.so
Reading symbols from /usr/local/slurm/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/switch_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/gres_gpu.so
Reading symbols from /usr/local/slurm/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/core_spec_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_affinity.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_affinity.so
Reading symbols from /usr/local/slurm/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/crypto_munge.so
Reading symbols from /usr/local/slurm/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/job_container_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/mpi_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/mpi_none.so
0x00002b26b14aef57 in pthread_join () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-17.02.7-1.el7.x86_64
(gdb) thread apply all bt

Thread 170 (Thread 0x2b26b4357700 (LWP 28570)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b182960d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00002b26b18295bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00002b26b1829662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00002b26b14ade38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 169 (Thread 0x2b26b4458700 (LWP 28571)):
#0  0x00002b26b17afa3d in poll () from /usr/lib64/libc.so.6
#1  0x0000000000453a7a in _poll_internal (pfds=0x2b26bc0009f0, nfds=2, shutdown_time=0) at eio.c:362
#2  0x0000000000453847 in eio_handle_mainloop (eio=0xdcdf90) at eio.c:326
#3  0x000000000043d2c7 in _msg_thr_internal (job_arg=0xdc4820) at req.c:242
#4  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 168 (Thread 0x2b26b0742700 (LWP 44773)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

.
.
.

Thread 7 (Thread 0x2b26cee6d700 (LWP 41466)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 6 (Thread 0x2b26cef6e700 (LWP 16219)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 5 (Thread 0x2b26cf06f700 (LWP 39666)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x2b26cf170700 (LWP 14855)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x2b26cf271700 (LWP 37659)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x2b26cf372700 (LWP 29527)):
#0  0x00002b26b17c7eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002b26b17447d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b26b173cca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b26b17420fe in malloc () from /usr/lib64/libc.so.6
#4  0x0000000000470f41 in slurm_xmalloc (size=24, clear=false, file=0x5d34a9 "pack.c", line=150, func=0x5d3536 <__func__.4517> "init_buf")
    at xmalloc.c:83
#5  0x0000000000487a08 in init_buf (size=16384) at pack.c:150
#6  0x000000000043dd4b in _handle_accept (arg=0x0) at req.c:421
#7  0x00002b26b14ade25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b26b17ba34d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x2b26b063f5c0 (LWP 28568)):
#0  0x00002b26b14aef57 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x000000000054dc32 in acct_gather_profile_fini () at slurm_acct_gather_profile.c:267
#2  0x000000000042f745 in job_manager (job=0xdc4820) at mgr.c:1309
#3  0x000000000042a6a7 in main (argc=1, argv=0x7ffc856a74b8) at slurmstepd.c:183
(gdb)
(gdb) q
A debugging session is active.

        Inferior 1 [process 28568] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/local/slurm/sbin/slurmstepd, process 28568
[root@uk1salx00552 slurm]#

Do you want the intervening thread info?

Thanks.
Mark.
Comment 6 Alejandro Sanchez 2018-07-02 04:32:07 MDT
Hi Mark,

Unfortunately, as I anticipated in my first comment, the backtrace clearly shows you are hitting the stepd deadlock which was fixed in this bug:

https://bugs.schedmd.com/show_bug.cgi?id=5103

so we strongly encourage you to upgrade to the latest 17.11 stable release, which incorporates a handful of fixes for this issue.
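
For sites that want to triage many stuck stepds before upgrading, the pattern visible in the backtraces above can be checked mechanically. The following is a heuristic sketch, not an official diagnostic: it assumes gdb's `thread apply all bt` text format as shown in this ticket, and it flags threads whose top frame is `__lll_lock_wait_private` with `malloc` or `arena_thread_freeres` further down the stack. The function names are hypothetical; the sample frames are abbreviated from the first backtrace in this ticket.

```python
from __future__ import annotations

import re

THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):")
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?([A-Za-z_][\w.]*)")


def parse_threads(bt_text: str) -> dict[int, list[str]]:
    """Map gdb thread number -> list of function names, innermost frame first."""
    threads: dict[int, list[str]] = {}
    current = None
    for line in bt_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            threads[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(1))
    return threads


def blocked_on_malloc_lock(frames: list[str]) -> bool:
    """True if the thread waits on a libc-internal lock inside malloc or arena cleanup."""
    return (
        bool(frames)
        and frames[0] == "__lll_lock_wait_private"
        and any(fn in ("malloc", "arena_thread_freeres") for fn in frames)
    )


SAMPLE = """\
Thread 5 (Thread 0x2b1ad3591700 (LWP 11807)):
#0  0x00002b1ad0a01eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#2  0x00002b1ad0a635bd in arena_thread_freeres () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b1ad3692700 (LWP 11808)):
#0  0x00002b1ad09e9a3d in poll () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b1acf97c700 (LWP 17527)):
#0  0x00002b1ad0a01eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#2  0x00002b1ad09770ef in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b1ad097c0fe in malloc () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b1acf8795c0 (LWP 11805)):
#0  0x00002b1ad06e8f57 in pthread_join () from /usr/lib64/libpthread.so.0
"""

if __name__ == "__main__":
    blocked = sorted(t for t, f in parse_threads(SAMPLE).items() if blocked_on_malloc_lock(f))
    print("threads waiting on the malloc arena lock:", blocked)
```

A match only suggests the same signature as the backtraces in this ticket. Confirming that a given process is hitting the bug fixed in 5103 still requires comparing the full backtrace against that report.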

I'm gonna go ahead and mark this as a duplicate of 5103.

Thanks.

*** This ticket has been marked as a duplicate of ticket 5103 ***