Summary: | srun hangs when run from interactive job | ||
---|---|---|---|
Product: | Slurm | Reporter: | Will French <will> |
Component: | Other | Assignee: | David Bigagli <david> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | brian, da |
Version: | 14.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Vanderbilt | ||
Attachments: | slurmctld log, slurmd log, slurm.conf | ||
Description
Will French
2015-02-02 07:43:37 MST
The slurmctld is unable to create the job step for some reason. Could you please send us the slurmctld log file?

David

Also, please send your current slurm.conf.

David

Created attachment 1599: slurmctld log
Created attachment 1600: slurmd log
Created attachment 1601: slurm.conf
I've attached both logs with SlurmctldDebug=9 and SlurmdDebug=7 (I turned these back down to 3 in slurm.conf). The controller log doesn't show anything at all. I did the same thing in job 149534 that I listed in my original message, i.e.:

    [frenchwr@vmps60 ~]$ salloc srun --pty /bin/bash
    salloc: Granted job allocation 149534
    [frenchwr@vmp506 ~]$ srun hostname
    srun: Job step creation temporarily disabled, retrying

If the controller does not log anything, it means the job is pending for a legitimate reason. Could you please try to run salloc like this:

    $ salloc srun --pty --mem-per-cpu=0 /bin/bash

Since you schedule using SelectTypeParameters=CR_Core_Memory and have DefMemPerCPU=1000, the 'salloc srun --pty /bin/bash' step consumes all the memory allocated to the job, so the 'srun hostname' step has to pend. Using --mem-per-cpu=0 still grants the allocation and its memory, but it tells Slurm that the step running bash is not consuming any of it.

You may also want to have a look at the SallocDefaultCommand parameter in slurm.conf:
http://slurm.schedmd.com/slurm.conf.html

David

That did it! I'm relieved this wasn't a bug that would have required an upgrade. Thanks for the quick reply.

Will

Very good.

David
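
For readers who land here with the same symptom, the SallocDefaultCommand suggestion above could look roughly like the following slurm.conf fragment. This is a sketch, not the configuration used at the reporting site: the srun options other than --pty and --mem-per-cpu=0 follow the example in the slurm.conf man page of that era as best recalled and should be treated as assumptions to check against the documentation for your release. The parameter is not present in every Slurm release, so confirm that your version supports it.

```conf
# Sketch of a slurm.conf fragment (option values are assumptions, verify locally).
# Memory settings discussed in this thread:
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=1000

# Have a bare "salloc" start an interactive shell step that reserves no memory,
# so later "srun" steps inside the allocation are not starved of memory.
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
```

With a setting like this, running plain `salloc` would behave like the `salloc srun --pty --mem-per-cpu=0 /bin/bash` command that resolved the issue, without users having to remember the extra option.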