Ticket 4333

Summary: srun: fatal: step_launch.c:1036 step_launch_state_destroy
Product: Slurm Reporter: Dorian Krause <d.krause>
Component: User Commands Assignee: Moe Jette <jette>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: Jülich Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Ticket Depends on:    
Ticket Blocks: 4596    
Attachments: Possible fix
Debugging patch

Description Dorian Krause 2017-11-05 09:36:51 MST
Testing job launches with 17.11-rc2 occasionally gives:

{{{
# salloc -N 1570 -p batch : -N 66 -p gpus --gres=gpu:4 : -N 1630 -p booster /bin/bash -c "time srun --mpi=none -n 1570 hostname : --mpi=none -n 66 hostname : --mpi=none -n 1630 hostname | wc -l"
salloc: Pending job allocation 22
salloc: job 22 queued and waiting for resources
salloc: job 22 has been allocated resources
salloc: Granted job allocation 22
salloc: Waiting for resource configuration
salloc: Nodes jrc[0056-0090,0092-0228,0230-0245,0247-0385,0387-0455,0491-0496,0500-0766,0768-0793,0795-0940,1138-1253,1255-1382,1395-1404,1406-1500,1502-1548,1550-1566,1568-1839,1841-1884] are ready for job
srun: fatal: step_launch.c:1036 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy
3266

real    0m7.267s
user    0m3.533s
sys     0m2.227s
salloc: Relinquishing job allocation 22
}}}
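
For context, that fatal message is srun treating a non-zero return from pthread_mutex_destroy() as fatal; the call fails with EBUSY ("Device or resource busy") when the mutex is still locked or still in use by another thread at destroy time. A minimal sketch, not Slurm code, assuming glibc behavior:

{{{
/* Minimal illustration (not Slurm code): destroying a mutex that is still
 * locked fails.  POSIX leaves this undefined, but glibc reports EBUSY,
 * which strerror() renders as "Device or resource busy". */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_lock(&lock);                 /* mutex is now "in use" */
    int rc = pthread_mutex_destroy(&lock);
    printf("destroy while locked:  %s\n", rc ? strerror(rc) : "OK");

    pthread_mutex_unlock(&lock);               /* release it first ...   */
    rc = pthread_mutex_destroy(&lock);         /* ... then destroy works */
    printf("destroy after unlock: %s\n", rc ? strerror(rc) : "OK");
    return 0;
}
}}}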
Comment 2 Isaac Hartung 2017-11-06 13:28:37 MST
Hi Dorian,

We are looking into this and will keep you updated.

--Isaac
Comment 4 Moe Jette 2017-11-09 14:54:58 MST
Created attachment 5539 [details]
Possible fix

I have not been able to reproduce the failure that you reported, but I did find a bug in which the use of a global variable can result in memory corruption for heterogeneous job steps due to a race condition (multiple job steps re-using the variable). This patch fixes that problem, and I hope it also fixes the problem that you are reporting.
The attached patch has already been committed to Slurm here:
https://github.com/SchedMD/slurm/commit/a23c1032e9d3fe58319d09ab8a596e0efe415417
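
As an illustration of the race described above (a sketch only, not the Slurm source; `step_opts` and `launch_step` are invented names): each component of a heterogeneous job is launched from its own thread, so launch state staged in a single global can be overwritten by one step while another step is still reading it; keeping the state per step avoids that:

{{{
/* Illustrative sketch only -- not Slurm code.  One launch thread per
 * component of the heterogeneous job.  If every thread staged its launch
 * parameters in one shared global, a step could see values that another
 * step had already overwritten; giving each step its own copy removes
 * the race. */
#include <pthread.h>
#include <stdio.h>

struct step_opts {            /* per-step launch parameters (hypothetical) */
    int ntasks;
};

static void *launch_step(void *arg)
{
    /* Each thread works on its own step_opts rather than a shared global,
     * so concurrent steps cannot corrupt each other's parameters. */
    struct step_opts *opts = arg;
    printf("launching step with %d tasks\n", opts->ntasks);
    /* ... build and send the launch RPC from `opts` ... */
    return NULL;
}

int main(void)
{
    /* The three components of the job in the description above. */
    struct step_opts steps[] = { { 1570 }, { 66 }, { 1630 } };
    pthread_t tid[3];

    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, launch_step, &steps[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
}}}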

I have also found and fixed two problems related to the use of reserved ports. Reserved ports are only used by old versions of Open MPI, so these changes should not affect you. Those changes are in these commits:
https://github.com/SchedMD/slurm/commit/4dcd139daf7dd20f614470e0a9039378fcc64bbd
https://github.com/SchedMD/slurm/commit/d64a5f675dcb46e2c954bb01ceefd44d3c0cef6b
Comment 5 Moe Jette 2017-11-09 14:58:34 MST
Created attachment 5540 [details]
Debugging patch

If the previously attached patch does not fix the problem, this patch will make srun generate additional output that should help me determine what is happening. Because it produces extra logging, you will probably not want to use it outside of a limited test environment. In addition to this patch, I would recommend configuring the following in slurm.conf:
DebugFlags=step
SlurmctldDebugLevel=debug
SlurmdDebugLevel=debug

Execute srun with the "-vvv" option, then save the log files from srun, slurmctld, and slurmd for the time period of the failure.
Comment 6 Moe Jette 2017-12-08 09:34:46 MST
There has been no response to this bug in the past month. Please re-open the bug as needed.