Bug 1420 - srun hangs when run from interactive job
Summary: srun hangs when run from interactive job
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.11.3
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-02-02 07:43 MST by Will French
Modified: 2015-02-03 08:55 MST
2 users

See Also:
Site: Vanderbilt
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log (7.94 KB, text/plain)
2015-02-02 23:10 MST, Will French
slurmd log (13.31 KB, text/plain)
2015-02-02 23:10 MST, Will French
slurm.conf (7.01 KB, text/plain)
2015-02-02 23:11 MST, Will French

Description Will French 2015-02-02 07:43:37 MST
I've tried this multiple times over the last week or so and even for a single-task allocation, srun hangs with "srun: Job step creation temporarily disabled, retrying". I don't recall this ever happening before the upgrade to 14.11.3. (We were using 14.11.0 before.) 

In the example below I've waited about 10 minutes and the srun command is still hanging...

[frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
salloc: Granted job allocation 145061
[frenchwr@vmp509 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying
Comment 1 David Bigagli 2015-02-02 08:46:32 MST
The slurmctld is unable to create the job step for some reason. Could you please send us the slurmctld log file?

David
Comment 2 David Bigagli 2015-02-02 10:16:59 MST
Also please send your current slurm.conf.

David
Comment 3 Will French 2015-02-02 23:10:13 MST
Created attachment 1599 [details]
slurmctld log
Comment 4 Will French 2015-02-02 23:10:40 MST
Created attachment 1600 [details]
slurmd log
Comment 5 Will French 2015-02-02 23:11:06 MST
Created attachment 1601 [details]
slurm.conf
Comment 6 Will French 2015-02-02 23:14:13 MST
I've attached both logs, captured with SlurmctldDebug=9 and SlurmdDebug=7 (I've since turned these back down to 3 in slurm.conf). The controller log doesn't show anything at all. I ran the same sequence as in my original message, this time in job 149534, i.e.:

[frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
salloc: Granted job allocation 149534
[frenchwr@vmp506 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying
Comment 7 David Bigagli 2015-02-03 03:26:40 MST
If the controller does not log anything, it means the step is pending for a legitimate reason.
Could you please try running salloc like this:

$ salloc srun --pty --mem-per-cpu=0 /bin/bash

Since you schedule with SelectTypeParameters=CR_Core_Memory and have
DefMemPerCPU=1000, the 'salloc srun --pty /bin/bash' step consumes all of the
memory allocated to the job, so the 'srun hostname' step has to pend. Using
--mem-per-cpu=0 still grants the allocation its memory, but it tells Slurm
that the step running bash is not consuming any of it.

You may also want to have a look at the SallocDefaultCommand parameter in
slurm.conf. http://slurm.schedmd.com/slurm.conf.html
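
For example, a line along these lines in slurm.conf (an illustrative sketch
based on the slurm.conf man page, not your exact setting) makes every salloc
spawn its interactive shell as a step that requests no memory, so later sruns
inside the allocation are free to run:

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env $SHELL"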

David
Comment 8 Will French 2015-02-03 08:51:14 MST
That did it! I'm relieved this wasn't a bug that would have required an upgrade. Thanks for the quick reply.

Will
Comment 9 David Bigagli 2015-02-03 08:55:58 MST
Very good.

David