Ticket 5091 - Hanging slurmstepd
Summary: Hanging slurmstepd
Status: RESOLVED DUPLICATE of ticket 4733
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.5
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-04-20 14:35 MDT by Josko Plazonic
Modified: 2018-04-20 16:18 MDT

See Also:
Site: Princeton (PICSciE)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Job info (17.11 KB, application/x-bzip)
2018-04-20 14:35 MDT, Josko Plazonic

Description Josko Plazonic 2018-04-20 14:35:36 MDT
Created attachment 6672
Job info

It looks like we have another PMI2-related problem. We have a slurmstepd that hung, apparently while trying to set up a PMI job. The executable in question uses Intel MPI, which we configure with:

setenv		 I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so 

though the user runs it with:

srun --mpi=pmi2 /tigress/slflam/BlockIntel/parallel/block.spin_adapted dmrg.conf > dmrg.out 2>&1
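
For context, a minimal sketch of how the two pieces above typically fit together in a batch script (the #SBATCH values are placeholders, not taken from the attached job script):

#!/bin/bash
#SBATCH --ntasks=28                                   # placeholder task count
# equivalent of the modulefile setenv line above, for a bash job script
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun --mpi=pmi2 /tigress/slflam/BlockIntel/parallel/block.spin_adapted dmrg.conf > dmrg.out 2>&1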

Note that this failed on the 107th step:

557782.106   block.spi+                  chem         28  COMPLETED      0:0 
557782.107   block.spi+                  chem         28  CANCELLED      0:0 

and node is now offlined with:

   Reason=Not responding [slurm@2018-04-19T22:53:36]
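
For reference, a sketch of how we check and later clear that state once the hung slurmstepd is dealt with (the node name is a placeholder):

scontrol show node <nodename> | grep -i Reason       # shows the Reason= string above
scontrol update NodeName=<nodename> State=RESUME     # return the node to service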

I'm attaching a tar file which contains the ps output, the job script, and backtraces from all of the slurmstepd threads (they are all still there; the thing is thoroughly hung). I still have gdb attached, so I can do more with this if it would help (though probably not before Monday).
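
In case it helps: I attached gdb interactively, but a non-interactive equivalent for capturing backtraces like these would be roughly the following (the PID and output filename are placeholders):

pgrep -af slurmstepd                                  # find the hung step daemon's PID
gdb -p <pid> -batch -ex 'thread apply all bt' > slurmstepd-backtraces.txt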

Finally, there is nothing else in the logs - not on the node, not on slurmctld.
Comment 1 Tim Wickberg 2018-04-20 14:42:58 MDT
Lemme guess... you're running RHEL 7 with glibc at 2.17-(some RHEL release number)?
Comment 2 Josko Plazonic 2018-04-20 14:45:35 MDT
Yes sir:

rpm -q glibc
glibc-2.17-196.el7_4.2.x86_64
glibc-2.17-196.el7_4.2.i686
Comment 3 Tim Wickberg 2018-04-20 14:59:16 MDT
Good news: we know about this - which is why I knew to ask about glibc and the distro - and have a way to semi-reliably reproduce it locally. I'm going to close this issue as a duplicate of the earliest report - bug 4733 - which we've been updating as we continue to investigate it.

Bad news: we still don't have a fix. As best we can tell, this is one of two things. Either it's a very specific memory corruption bug in the slurmstepd code where we step on some internal malloc() data structures (unlikely at this point, since we're only seeing this issue on RHEL glibc 2.17 builds), or - and this is what we currently believe and are trying to prove - glibc (or Red Hat's extensive patches on top of the original 2.17) has some subtle race condition that the slurmstepd code can trigger, leading to a deadlock in parts of the malloc() code. Usually this shows up as a hung external step, but we have seen user steps hit it as well, which is what you're reporting here.

I would usually ask you to look at upgrading, but as best we can tell this issue still shows up even with the latest RHEL 2.17-222.

Do you happen to have an active RHEL support contract on your cluster? It may be worth getting them in the loop at this point. We've gone through and still been able to reproduce this locally even with all of the "scary" stuff we do within slurmstepd disabled (you'll see references to people blaming pthread_atfork() or signal handlers for doing unsafe things, but we can reproduce this with all of our pthread_atfork() handlers removed, and slurmstepd does not have any signal handlers registered), and I'm now convinced the problem lies outside of our code.

- Tim

*** This ticket has been marked as a duplicate of ticket 4733 ***
Comment 4 Josko Plazonic 2018-04-20 15:09:59 MDT
Sadly, no support. We will update to 2.17-222, but that's part of RHEL 7.5, so it will take at best 3-4 weeks or at worst a couple of months (it's sort of a major update).

I'll take a look at ticket 4733 later. If you haven't submitted this to RH yet, you can - support or no support, their bugzilla is fully open (though having support helps a LOT in getting attention).

Thanks!
Comment 5 Tim Wickberg 2018-04-20 16:18:57 MDT
(In reply to Josko Plazonic from comment #4)
> Sadly no support.  We will update to 2.17-222 but that's part of RHEL7.5 so
> we will need at best a 3-4 weeks or at worst a couple of months (sort of a
> major update).

Just to reiterate - 2.17-222 does *not* seem to help; we can still reproduce the problem with that version installed.

> I'll take a look at the 4733 later.  If you haven't submitted this yet to RH
> you can - support or no support, their bugzilla is fully open (though having
> support helps a LOT in getting attention).

I'm familiar with that approach. :)

I'm now trying to find a number of affected folks who do have RHEL support, to see how best to get their feedback. We will probably be filing some part of this on their public bugzilla, although that's been hampered by the fact that asking them to "just run slurmstepd" as a reproducer is likely to get us ignored.