srun fails to launch a job step in an existing allocation on a remote cluster.

Working example, on the local cluster:

Create allocation:

$ salloc -A staff -p devel -N1 -t 1:0:0 --no-shell -J local
salloc: Pending job allocation 10221359
salloc: job 10221359 queued and waiting for resources
salloc: job 10221359 has been allocated resources
salloc: Granted job allocation 10221359

Check status:

$ squeue -j 10221359
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  10221359     devel    local    paran  R       0:56      1 r483

Run a job step:

$ srun --jobid=10221359 /usr/bin/hostname -s
r483

The same example fails when using a remote cluster:

Create allocation:

$ salloc -A staff -p devel -N1 -t 1:0:0 --no-shell -J remote -M snowy
salloc: Granted job allocation 721331

Check status:

$ squeue -j 721331 -M snowy
CLUSTER: snowy
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    721331     devel   remote    paran  R       0:31      1 s1

Run a job step:

$ srun --jobid=721331 -M snowy /usr/bin/hostname -s
srun: error: Unable to confirm allocation for job 721331: Invalid job id specified
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 721331

I can work around the error by manually setting SLURM_WORKING_CLUSTER. This works even with '-M snowy' removed from the command line:

$ SLURM_WORKING_CLUSTER=snowy:snowy-lr1:6817:8192 srun --jobid=721331 -M snowy /usr/bin/hostname -s
s1
$ SLURM_WORKING_CLUSTER=snowy:snowy-lr1:6817:8192 srun --jobid=721331 /usr/bin/hostname -s
s1

The affected users have been given this work-around, but I do not really like it, as they now hard-code the server name and port in their commands.

This issue is not new. We discovered it when running 17.11, but we have recently upgraded and verified that the problem remains on 19.05. All output above is from 19.05.
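To avoid hard-coding the server name and port, the work-around could derive SLURM_WORKING_CLUSTER from the accounting database at run time. Below is a minimal sketch, assuming sacctmgr can reach slurmdbd, that the Cluster, ControlHost, ControlPort and RPC format fields are available, and that this Slurm version expects the four-field name:host:port:rpc_version layout shown above; the cluster name and job ID are the ones from this report.

```shell
#!/bin/bash
# Hedged sketch: build SLURM_WORKING_CLUSTER from sacctmgr output instead of
# hard-coding "snowy:snowy-lr1:6817:8192". Assumes the four-field
# name:host:port:rpc_version layout and that sacctmgr's Cluster, ControlHost,
# ControlPort and RPC format fields are available.
cluster=snowy   # illustrative cluster name, taken from this report

# -n: suppress header, -P: pipe-separated parsable output
line=$(sacctmgr -nP show cluster "$cluster" \
       format=Cluster,ControlHost,ControlPort,RPC)

# Reassemble the pipe-separated fields as a colon-separated string
SLURM_WORKING_CLUSTER=$(printf '%s\n' "$line" |
    awk -F'|' '{printf "%s:%s:%s:%s", $1, $2, $3, $4}')
export SLURM_WORKING_CLUSTER

# Launch the step in the existing remote allocation (job ID from this report)
srun --jobid=721331 -M "$cluster" /usr/bin/hostname -s
```

This keeps only the cluster name in the users' commands; if the controller host or port changes, the variable is rebuilt from the accounting records rather than edited by hand.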
Pär,

I just wanted to let you know that we're working on the ticket. I think I'll be able to share a patch with you within a week.

cheers,
Marcin
Created attachment 11954 [details]
add -M support for existing allocation. (v1)

Pär,

The attached patch adds support for srun -M to create a step in an existing allocation on a remote cluster without the SLURM_WORKING_CLUSTER variable set. It is currently under SchedMD's QA process, which means it has passed a set of automatic tests but has not yet been reviewed. It may not be the final version and is not yet scheduled for release. If you are able to apply it and verify locally, we would appreciate any feedback.

cheers,
Marcin
I have done very limited testing using the provided patch. Small tests like running "hostname" or "echo 'hello world'" now work fine, both in local and remote allocations. I have not had time to test anything more advanced, like launching an MPI binary or similar.
(In reply to Marcin Stolarek from comment #9)
> It's currently under SchedMD QA process, which means that it passed a set of
> automatic tests, but it wasn't yet reviewed. It may not be a final version
> and is not yet scheduled for the release.

Any progress on this?
Pär,

I'll ask the reviewer if we can prioritize this ticket.

cheers,
Marcin
I would primarily like to know whether this will be fixed in 19.05.x, in 20.02, later, or not at all. If the official fix is significantly delayed, we might decide to include the patch in our local builds.
Pär,

I have a piece of good news - the patch has now been merged as 095653d24b510 [1] and will be released in Slurm 20.02. I'm going to close this bug report as fixed. If you have any questions, feel free to reopen.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/095653d24b510aa38e57753ec071e98d9dbe463b