Summary: | srun: fatal: step_launch.c:1036 step_launch_state_destroy | ||
---|---|---|---|
Product: | Slurm | Reporter: | Dorian Krause <d.krause> |
Component: | User Commands | Assignee: | Moe Jette <jette> |
Status: | RESOLVED TIMEDOUT | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 17.11.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Jülich | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Ticket Depends on: | |||
Ticket Blocks: | 4596 | ||
Attachments: |
Possible fix
Debugging patch |
Description
Dorian Krause
2017-11-05 09:36:51 MST
Hi Dorian,
We are looking into this and will keep you updated.
--Isaac

Created attachment 5539 [details]
Possible fix
I have not been able to reproduce the failure that you reported, but I did find a bug in which the use of a global variable can result in memory corruption for heterogeneous job steps due to a race condition (multiple job steps re-using the variable). This patch fixes that problem, and I hope that it also fixes the problem you are reporting. The attached patch has already been committed to Slurm here:
https://github.com/SchedMD/slurm/commit/a23c1032e9d3fe58319d09ab8a596e0efe415417

I have also found and fixed two problems related to the use of reserved ports. Reserved ports are only used by old versions of Open MPI, so these changes should not affect you. Those changes are in these commits:
https://github.com/SchedMD/slurm/commit/4dcd139daf7dd20f614470e0a9039378fcc64bbd
https://github.com/SchedMD/slurm/commit/d64a5f675dcb46e2c954bb01ceefd44d3c0cef6b

Created attachment 5540 [details]
Debugging patch
If the previously attached patch does not fix the problem, this patch will make srun generate additional output that would help me determine what is happening. It produces additional logging, so you will probably want to use it only in a limited test environment. In addition to applying this patch, I would recommend configuring the following in slurm.conf:
DebugFlags=Steps
SlurmctldDebug=debug
SlurmdDebug=debug
Execute srun with the "-vvv" option. Then save the log files from srun, slurmctld and slurmd for the time period of the failure.
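The srun step above could be scripted roughly as follows. This is only a sketch: `capture_srun_log` is a hypothetical helper name (not part of Slurm), the log file name and test command are placeholders, and it assumes srun writes its verbose messages to stderr, which means any stderr output from the launched tasks will end up in the same file.

```shell
#!/bin/sh
# capture_srun_log is a hypothetical helper, not a Slurm command.
# Usage: capture_srun_log LOG_FILE COMMAND [ARGS...]
capture_srun_log() {
    log_file=$1
    shift
    # -vvv raises srun's client-side verbosity; 2> sends stderr
    # (assumed to carry srun's diagnostics) to the log file.
    srun -vvv "$@" 2> "$log_file"
}

# Example invocation (placeholder log name and command):
# capture_srun_log srun_debug.log hostname
```

The slurmctld and slurmd logs still have to be collected separately; their locations are whatever `SlurmctldLogFile` and `SlurmdLogFile` are set to in slurm.conf at your site.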
There has been no response to this bug in the past month. Please re-open the bug as needed.