Bug 4333 - srun: fatal: step_launch.c:1036 step_launch_state_destroy
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.11.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
Blocks: 4596
Reported: 2017-11-05 09:36 MST by Dorian Krause
Modified: 2018-01-08 14:18 MST

Site: Jülich


Attachments
Possible fix (1.44 KB, patch)
2017-11-09 14:54 MST, Moe Jette
Debugging patch (2.34 KB, patch)
2017-11-09 14:58 MST, Moe Jette

Description Dorian Krause 2017-11-05 09:36:51 MST
Testing job launches with 17.11-rc2 gives from time to time:

{{{
# salloc -N 1570 -p batch : -N 66 -p gpus --gres=gpu:4 : -N 1630 -p booster /bin/bash -c "time srun --mpi=none -n 1570 hostname : --mpi=none -n 66 hostname : --mpi=none -n 1630 hostname | wc -l"
salloc: Pending job allocation 22
salloc: job 22 queued and waiting for resources
salloc: job 22 has been allocated resources
salloc: Granted job allocation 22
salloc: Waiting for resource configuration
salloc: Nodes jrc[0056-0090,0092-0228,0230-0245,0247-0385,0387-0455,0491-0496,0500-0766,0768-0793,0795-0940,1138-1253,1255-1382,1395-1404,1406-1500,1502-1548,1550-1566,1568-1839,1841-1884] are ready for job
srun: fatal: step_launch.c:1036 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy
3266

real    0m7.267s
user    0m3.533s
sys     0m2.227s
salloc: Relinquishing job allocation 22
}}}
Comment 2 Isaac Hartung 2017-11-06 13:28:37 MST
Hi Dorian,

We are looking into this and will keep you updated.

--Isaac
Comment 4 Moe Jette 2017-11-09 14:54:58 MST
Created attachment 5539 [details]
Possible fix

I have not been able to reproduce the failure you reported, but I did find a bug in which the use of a global variable can cause memory corruption for heterogeneous job steps due to a race condition (multiple job steps re-using the same variable). This patch fixes that problem, and I hope it also fixes the problem you are reporting.
The attached patch has already been committed to Slurm here:
https://github.com/SchedMD/slurm/commit/a23c1032e9d3fe58319d09ab8a596e0efe415417

I have also found and fixed two problems related to the use of reserved ports. Reserved ports are only used by old versions of Open MPI, so these changes should not affect you. Those changes are in these commits:
https://github.com/SchedMD/slurm/commit/4dcd139daf7dd20f614470e0a9039378fcc64bbd
https://github.com/SchedMD/slurm/commit/d64a5f675dcb46e2c954bb01ceefd44d3c0cef6b
Comment 5 Moe Jette 2017-11-09 14:58:34 MST
Created attachment 5540 [details]
Debugging patch

If the previously attached patch does not fix the problem, this patch will make srun generate additional output that would help me determine what is happening. The extra logging is verbose, so you will probably want to use it only in a limited test environment. In addition to applying this patch, I recommend configuring in slurm.conf:
DebugFlags=Steps
SlurmctldDebug=debug
SlurmdDebug=debug

Execute srun with the "-vvv" option, then save the log files from srun, slurmctld and slurmd covering the time period of the failure.
Comment 6 Moe Jette 2017-12-08 09:34:46 MST
There has been no response to this bug in the past month. Please re-open the bug as needed.