Ticket 7863 - srun fails to initiate step in remote allocation
Summary: srun fails to initiate step in remote allocation
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 19.05.2
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-04 05:44 MDT by Pär Lindfors
Modified: 2020-01-29 00:34 MST
CC: 3 users

See Also:
Site: SNIC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: UPPMAX
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
add -M support for existing allocation. (v1) (1.32 KB, patch)
2019-10-15 07:16 MDT, Marcin Stolarek

Description Pär Lindfors 2019-10-04 05:44:01 MDT
srun fails to launch a job step in an existing allocation on a remote cluster.

Working example, on the local cluster:

Create allocation:

  $ salloc -A staff -p devel -N1 -t 1:0:0 --no-shell -J local
  salloc: Pending job allocation 10221359
  salloc: job 10221359 queued and waiting for resources
  salloc: job 10221359 has been allocated resources
  salloc: Granted job allocation 10221359

Check status:

  $ squeue -j 10221359
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            10221359     devel    local    paran  R       0:56      1 r483

Run a job step:

  $ srun --jobid=10221359 /usr/bin/hostname -s
  r483
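
The allocate-then-step pattern above can be scripted. The helper below is a minimal sketch (the function name `job_id_from_salloc` is hypothetical, not part of Slurm) that extracts the job id from salloc's "Granted job allocation" message, so a follow-up `srun --jobid=...` can target the allocation:

```shell
# Hypothetical helper: pull the job id out of salloc's output so a later
# `srun --jobid=...` can use it. Assumes the English
# "salloc: Granted job allocation N" message format shown in this ticket.
job_id_from_salloc() {
    printf '%s\n' "$1" | sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p'
}

# Example with the line from the transcript above:
job_id_from_salloc 'salloc: Granted job allocation 10221359'
```

In a real script one would capture salloc's output, e.g. `jobid=$(job_id_from_salloc "$(salloc ... --no-shell 2>&1 | tail -n1)")`.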


The same example fails when using a remote cluster:


Create allocation:

  $ salloc -A staff -p devel -N1 -t 1:0:0 --no-shell -J remote -M snowy
  salloc: Granted job allocation 721331

Check status:

  $ squeue -j 721331 -M snowy
  CLUSTER: snowy
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              721331     devel   remote    paran  R       0:31      1 s1

Run a job step:

  $ srun --jobid=721331 -M snowy /usr/bin/hostname -s
  srun: error: Unable to confirm allocation for job 721331: Invalid job id specified
  srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 721331

I can work around the error by manually setting
SLURM_WORKING_CLUSTER. This works even with '-M snowy' removed
from the command line:

  $ SLURM_WORKING_CLUSTER=snowy:snowy-lr1:6817:8192 srun --jobid=721331 -M snowy /usr/bin/hostname -s
  s1
  $ SLURM_WORKING_CLUSTER=snowy:snowy-lr1:6817:8192 srun --jobid=721331 /usr/bin/hostname -s
  s1
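
Judging from the example value, SLURM_WORKING_CLUSTER appears to take the form `<cluster>:<control_host>:<port>:<rpc_version>`. As a sketch (assuming `sacctmgr show cluster` with the format fields `Cluster,ControlHost,ControlPort,RPC` reports these values on your installation), the string could be derived rather than hard-coded, since `sacctmgr -nP` prints a pipe-delimited line:

```shell
# Convert one pipe-delimited line, as printed by e.g.
#   sacctmgr -nP show cluster snowy format=Cluster,ControlHost,ControlPort,RPC
# into the colon-delimited SLURM_WORKING_CLUSTER value. The sacctmgr field
# list above is an assumption; only the delimiter conversion is shown here.
to_working_cluster() {
    printf '%s' "$1" | tr '|' ':'
}

# Example with the values from this ticket:
to_working_cluster 'snowy|snowy-lr1|6817|8192'
```

A wrapper like `SLURM_WORKING_CLUSTER=$(to_working_cluster "$(sacctmgr -nP show cluster snowy format=Cluster,ControlHost,ControlPort,RPC)") srun ...` would then avoid baking the server name and port into user commands.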


The affected users have been given this workaround, but I do not really like it, as they now hard-code the server name and port in their commands.


This issue is not new: we discovered it when running 17.11, but have recently upgraded and verified that the problem remains in 19.05. All output above is from 19.05.
Comment 4 Marcin Stolarek 2019-10-07 08:37:07 MDT
Pär,

I just wanted to let you know that we're working on the ticket. I think I'll be able to share a patch with you within a week.

cheers,
Marcin
Comment 9 Marcin Stolarek 2019-10-15 07:16:10 MDT
Created attachment 11954 [details]
add -M support for existing allocation. (v1)

Pär,

The attached patch adds support for srun -M creating a step in an existing allocation on a remote cluster without the SLURM_WORKING_CLUSTER variable set.

It's currently in the SchedMD QA process, which means it has passed a set of automatic tests but has not yet been reviewed. It may not be the final version and is not yet scheduled for a release.

If you're able to apply it and verify it locally, we'd appreciate any feedback.

cheers,
Marcin
Comment 11 Pär Lindfors 2019-11-08 04:22:19 MST
I have done very limited testing with the provided patch. Small tests like running "hostname" or "echo 'hello world'" now work fine, in both local and remote allocations.

I have not had time to test anything more advanced, like launching an MPI binary or similar.
Comment 12 Pär Lindfors 2020-01-27 07:28:29 MST
(In reply to Marcin Stolarek from comment #9)
> It's currently under SchedMD QA process, which means that it passed a set of
> automatic tests, but it wasn't yet reviewed. It may not be a final version
> and is not yet scheduled for the release. 

Any progress on this?
Comment 13 Marcin Stolarek 2020-01-27 07:39:58 MST
Pär,

I'll ask the reviewer if we can prioritize this ticket.

cheers,
Marcin
Comment 14 Pär Lindfors 2020-01-28 04:37:48 MST
I would primarily like to know whether this will be fixed in 19.05.x, in 20.02, later, or not at all.

If the official fix will be significantly delayed, we might decide to include the patch in our local builds.
Comment 16 Marcin Stolarek 2020-01-29 00:34:53 MST
Pär, 

I have good news: the patch is now merged as 095653d24b510 [1] and will be released in Slurm 20.02.

I'm going to close this ticket as fixed now. If you have any questions, feel free to reopen it.

cheers,
Marcin


[1] https://github.com/SchedMD/slurm/commit/095653d24b510aa38e57753ec071e98d9dbe463b