I've tried this multiple times over the last week or so, and even for a single-task allocation, srun hangs with "srun: Job step creation temporarily disabled, retrying". I don't recall this ever happening before the upgrade to 14.11.3. (We were using 14.11.0 before.) In the example below I've waited about 10 minutes and the srun command is still hanging...

[frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
salloc: Granted job allocation 145061
[frenchwr@vmp509 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying
The slurmctld is unable to create the job step for some reason. Could you please send us the slurmctld log file? David
Also please send your current slurm.conf. David
Created attachment 1599: slurmctld log
Created attachment 1600: slurmd log
Created attachment 1601: slurm.conf
I've attached both logs, captured with SlurmctldDebug=9 and SlurmdDebug=7 (I've since turned these back down to 3 in slurm.conf). The controller log doesn't show anything at all. I did the same thing as in my original message, this time in job 149534:

[frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
salloc: Granted job allocation 149534
[frenchwr@vmp506 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying
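For reference, the debug settings mentioned above are ordinary slurm.conf parameters; a sketch of the relevant lines (the log file paths here are illustrative, not taken from our actual config):

    # Raise logging verbosity while debugging; we normally run both at 3.
    SlurmctldDebug=9
    SlurmdDebug=7
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log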
If the controller does not log anything, it means the step is pending for a legitimate reason. Could you please try running salloc like this:

$ salloc srun --pty --mem-per-cpu=0 /bin/bash

Since you schedule with SelectTypeParameters=CR_Core_Memory and have DefMemPerCPU=1000, the step started by 'salloc srun --pty /bin/bash' consumes all of the memory allocated to the job, so the 'srun hostname' step has to pend until that memory is released. Using --mem-per-cpu=0 still grants the allocation its memory, but tells Slurm that the step running bash is not consuming any of it. You may also want to have a look at the SallocDefaultCommand parameter in slurm.conf: http://slurm.schedmd.com/slurm.conf.html

David
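To illustrate David's suggestion, here is a sketch of the relevant slurm.conf pieces. The SelectType line is inferred from the CR_Core_Memory setting mentioned above, and the SallocDefaultCommand value is adapted from the example in the slurm.conf documentation, so treat it as a starting point rather than a drop-in setting:

    # Cores and memory are consumable resources, so each step's memory
    # request counts against the job's allocation.
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    DefMemPerCPU=1000

    # Have salloc start its interactive shell as a step that requests
    # zero memory, leaving the job's full allocation free for later steps.
    SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env $SHELL"

With SallocDefaultCommand set this way, a plain 'salloc' drops the user into a shell on the allocated node, and a subsequent 'srun hostname' inside that session no longer pends on memory.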
That did it! I'm relieved this wasn't a bug that would have required an upgrade. Thanks for the quick reply. Will
Very good. David