Ticket 1420

Summary: srun hangs when run from interactive job
Product: Slurm Reporter: Will French <will>
Component: Other Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: brian, da
Version: 14.11.3   
Hardware: Linux   
OS: Linux   
Site: Vanderbilt
Attachments: slurmctld log
slurmd log
slurm.conf

Description Will French 2015-02-02 07:43:37 MST
I've tried this multiple times over the last week or so, and even for a single-task allocation srun hangs with "srun: Job step creation temporarily disabled, retrying". I don't recall this ever happening before the upgrade to 14.11.3. (We were using 14.11.0 before.)

In the example below I've waited about 10 minutes and the srun command is still hanging...

[frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
salloc: Granted job allocation 145061
[frenchwr@vmp509 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying
Comment 1 David Bigagli 2015-02-02 08:46:32 MST
The slurmctld is unable to create the job step for some reason. Could you please send us the slurmctld log file?

David
Comment 2 David Bigagli 2015-02-02 10:16:59 MST
Also please send your current slurm.conf.

David
Comment 3 Will French 2015-02-02 23:10:13 MST
Created attachment 1599 [details]
slurmctld log
Comment 4 Will French 2015-02-02 23:10:40 MST
Created attachment 1600 [details]
slurmd log
Comment 5 Will French 2015-02-02 23:11:06 MST
Created attachment 1601 [details]
slurm.conf
Comment 6 Will French 2015-02-02 23:14:13 MST
I've attached both logs with SlurmctldDebug=9 and SlurmdDebug=7 (I've since turned these back down to 3 in slurm.conf). The controller log doesn't show anything at all. In job 149534 I did the same thing as in my original message, i.e.:

[frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
salloc: Granted job allocation 149534
[frenchwr@vmp506 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying
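
For reference, the debug levels mentioned above correspond to slurm.conf settings roughly like the following (a sketch; 9 and 7 are the values quoted in this comment, and 'scontrol reconfigure' is one way to make the running daemons re-read them):

SlurmctldDebug=9    # temporarily raised from 3 to capture the attached controller log
SlurmdDebug=7       # temporarily raised from 3 to capture the attached slurmd log

$ scontrol reconfigure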
Comment 7 David Bigagli 2015-02-03 03:26:40 MST
If the controller does not log anything, it means the step is pending for a legitimate reason.
Could you please try running salloc like this:

$ salloc srun --pty --mem-per-cpu=0 /bin/bash

Since you schedule with SelectTypeParameters=CR_Core_Memory and have DefMemPerCPU=1000, the bash step started by 'salloc srun --pty /bin/bash' consumes all of the memory allocated to the job, so the 'srun hostname' step has to pend. Using --mem-per-cpu=0 still grants the allocation its memory, but it tells Slurm that the step running bash is not consuming any of it.
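
For illustration, a minimal sketch of the slurm.conf settings implied by this explanation (SelectType=select/cons_res is an assumption here, since CR_Core_Memory is a cons_res parameter; only SelectTypeParameters=CR_Core_Memory and DefMemPerCPU=1000 are confirmed in this ticket):

SelectType=select/cons_res            # assumed; consumable-resource scheduling
SelectTypeParameters=CR_Core_Memory   # cores and memory are both tracked per step
DefMemPerCPU=1000                     # default of 1000 MB per allocated CPU

With these settings a one-CPU allocation carries 1000 MB, the bash step takes all of it, and the nested 'srun hostname' step has to wait, which is exactly the retry message shown above.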

You may also want to have a look at the SallocDefaultCommand parameter in
slurm.conf. http://slurm.schedmd.com/slurm.conf.html
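
A hedged sketch of what such a setting could look like in slurm.conf (the exact flag set here is illustrative, not quoted from the man page; --mem-per-cpu=0 is the piece relevant to this ticket):

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"

With a default command like this, a bare 'salloc' starts an interactive shell step that consumes no job memory, so later srun steps inside the allocation are not blocked.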

David
Comment 8 Will French 2015-02-03 08:51:14 MST
That did it! I'm relieved this wasn't a bug that would have required an upgrade. Thanks for the quick reply.

Will
Comment 9 David Bigagli 2015-02-03 08:55:58 MST
Very good.

David