Ticket 5177 - Jobs in CG state block nodes
Summary: Jobs in CG state block nodes
Status: RESOLVED DUPLICATE of ticket 5103
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-05-15 11:29 MDT by Alex
Modified: 2018-05-16 09:23 MDT
CC: 2 users

See Also:
Site: Columbia University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Alex 2018-05-15 11:29:56 MDT
SchedMD support,

We're running Slurm 17.11.2 and have been seeing jobs stuck indefinitely in the CG (completing) state. The affected nodes are then blocked from running queued jobs, with unpleasant consequences such as node owners being unable to use their own servers for job processing.
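
For reference, the stuck jobs and the nodes they are holding can be listed with something like the following (a sketch; the format string and exact flags are only illustrative):

squeue --states=CG --format="%.10i %.12T %.20N"   # list jobs in COMPLETING state and their nodes
sinfo --states=completing                         # show the nodes still held by completing jobs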

The way to release the nodes is to restart the slurmd service (via systemctl), but often we must also kill the (multiple) slurmstepd processes on the affected node, and occasionally even reboot it before it returns to normal operation in the cluster.
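
In practice the manual recovery on an affected node looks roughly like this (a sketch of the steps described above, not an exact procedure):

systemctl restart slurmd    # usually enough to release the node
pkill -9 slurmstepd         # sometimes the stuck slurmstepd processes must be killed as well
reboot                      # occasionally required as a last resort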

I've looked at tickets 1639 and 3637 without finding a simple resolution. Ticket 4733 is much more recent and suggests that a patch might help. Please advise.

Alex
Comment 1 Marshall Garey 2018-05-15 12:22:47 MDT
Yes, 4733 was resolved as a duplicate of bug 5103, which has been fixed; the fix is included in 17.11.6. You can either upgrade or apply the referenced commit.

Can you upload a backtrace of all threads of a stuck slurmstepd so I can verify it's the same cause?

gdb attach <stepd pid>
thread apply all bt
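
If it's easier to capture non-interactively, something like the following should produce the same output (a sketch; substitute the real slurmstepd PID):

gdb -p <stepd pid> -batch -ex 'thread apply all bt' > stepd-backtrace.txt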
Comment 2 Alex 2018-05-15 12:39:35 MDT
Is this sufficient?:

(gdb) thread apply all bt

Thread 2 (Thread 0x2aaaaf87b700 (LWP 6108)):
#0  0x00002aaaabdff0fc in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabe5fb5d in _L_lock_183 () from /lib64/libc.so.6
#2  0x00002aaaabe5fb0d in arena_thread_freeres () from /lib64/libc.so.6
#3  0x00002aaaabe5fbb2 in __libc_thread_freeres () from /lib64/libc.so.6
#4  0x00002aaaabae5dd8 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaabdf173d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaab0b500 (LWP 6099)):
#0  0x00002aaaabae6ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000040aa6d in stepd_cleanup (msg=msg@entry=0x63fbd0, job=job@entry=0x659630, 
    cli=cli@entry=0x654eb0, self=self@entry=0x0, rc=0, only_mem=only_mem@entry=false) at slurmstepd.c:184
#2  0x000000000040c133 in main (argc=1, argv=0x7fffffffedb8) at slurmstepd.c:169
(gdb)
Comment 3 Marshall Garey 2018-05-15 13:18:13 MDT
(In reply to Alex from comment #2)
> Is this sufficient?:
> [backtrace quoted in full in comment #2]

Yes, that's the same bug. It's been fixed in 17.11.6 - see bug 5103. You can upgrade or apply the patches from there.
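
If you apply the patches rather than upgrade, the usual flow against a 17.11.2 source tree is roughly the following (the patch file name is only a placeholder for the patches attached to bug 5103):

cd slurm-17.11.2
patch -p1 < bug5103.patch               # placeholder name; apply each patch from bug 5103 in order
./configure && make && make install     # rebuild and reinstall so slurmd/slurmstepd include the fix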

There's a related bug (it also causes deadlock) in the slurmstepd that isn't in 17.11.6, but the fix will be in 17.11.7. We hope to release .7 in just a few weeks. If you'd like that patch before .7 is released, please let us know. It would need to be applied on top of the patches in bug 5103.

Does that answer your question? If you want the patch for the related slurmstepd bug, I can upload it here for you.
Comment 4 Marshall Garey 2018-05-16 09:23:29 MDT
(In reply to Marshall Garey from comment #3)
> There's a related bug (it also causes deadlock) in the slurmstepd that isn't
> in 17.11.6, but the fix will be in 17.11.7.

I realized this sentence is a bit confusing. To clarify: the related bug is present in 17.11.6, and the fix for it will be in 17.11.7.

I'm closing as resolved/duplicate. Please let us know if you'd like the patch before .7 is released.

*** This ticket has been marked as a duplicate of ticket 5103 ***