Testing job launches with 17.11-rc2 gives from time to time:
{{{
# salloc -N 1570 -p batch : -N 66 -p gpus --gres=gpu:4 : -N 1630 -p booster /bin/bash -c "time srun --mpi=none -n 1570 hostname : --mpi=none -n 66 hostname : --mpi=none -n 1630 hostname | wc -l"
salloc: Pending job allocation 22
salloc: job 22 queued and waiting for resources
salloc: job 22 has been allocated resources
salloc: Granted job allocation 22
salloc: Waiting for resource configuration
salloc: Nodes jrc[0056-0090,0092-0228,0230-0245,0247-0385,0387-0455,0491-0496,0500-0766,0768-0793,0795-0940,1138-1253,1255-1382,1395-1404,1406-1500,1502-1548,1550-1566,1568-1839,1841-1884] are ready for job
srun: fatal: step_launch.c:1036 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy
3266

real	0m7.267s
user	0m3.533s
sys	0m2.227s
salloc: Relinquishing job allocation 22
}}}
Hi Dorian,

We are looking into this and will keep you updated.

--Isaac
Created attachment 5539 [details]
Possible fix

I have not been able to reproduce the failure you reported, but I did find a bug in which the use of a global variable can result in memory corruption for heterogeneous job steps due to a race condition (multiple job steps reusing the variable). This patch fixes that problem, and I hope it also fixes the problem you are reporting. The attached patch has already been committed to Slurm here:

https://github.com/SchedMD/slurm/commit/a23c1032e9d3fe58319d09ab8a596e0efe415417

I have also found and fixed two problems related to the use of reserved ports. Reserved ports are only used by old versions of Open MPI, so these changes should not affect you. Those changes are in these commits:

https://github.com/SchedMD/slurm/commit/4dcd139daf7dd20f614470e0a9039378fcc64bbd
https://github.com/SchedMD/slurm/commit/d64a5f675dcb46e2c954bb01ceefd44d3c0cef6b
Created attachment 5540 [details]
Debugging patch

If the previously attached patch does not fix the problem, this patch will make srun generate additional output that would help me determine what is happening. It produces a lot of extra logging, so you will probably not want to use it outside a limited test environment.

In addition to applying this patch, I would recommend configuring in slurm.conf:

DebugFlags=step
SlurmctldDebugLevel=debug
SlurmdDebugLevel=debug

Execute srun with the "-vvv" option, then save the log files from srun, slurmctld, and slurmd for the time period of the failure.
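A sketch of the suggested debugging session (the slurm.conf lines are copied verbatim from the comment above; the log paths and the small `-n 4` reproducer are placeholders for your site's actual configuration and the real failing job):

```shell
# slurm.conf additions from above; reconfigure/restart the daemons after editing:
#   DebugFlags=step
#   SlurmctldDebugLevel=debug
#   SlurmdDebugLevel=debug

# Re-run the failing step with verbose srun output captured to a file:
srun -vvv --mpi=none -n 4 hostname 2> srun-vvv.log

# Then collect the daemon logs covering the failure window
# (paths are site-specific placeholders):
#   /var/log/slurmctld.log
#   /var/log/slurmd.log
```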
There has been no response to this bug in the past month. Please re-open the bug as needed.