Summary: | srun: Job step creation temporarily disabled, retrying | ||
---|---|---|---|
Product: | Slurm | Reporter: | Don Lipari <lipari1> |
Component: | nss_slurm | Assignee: | Danny Auble <da> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | mahdim4, sts, testiiit0, tim |
Version: | - Unsupported Older Versions | ||
Hardware: | IBM BlueGene | ||
OS: | Linux | ||
Site: | LLNL | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | RHEL |
Machine Name: | CLE Version: | ||
Version Fixed: | N/A | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Don Lipari
2012-09-04 08:53:39 MDT
If she is trying to run multiple steps per block then yes. This will happen the same as on any other cluster.

If she gets this on the first step I would say it is strange. It could mean the previous job somehow hadn't finished yet and its resources somehow got back into the mix.

For this particular job you can look through the log and see step 0 wasn't finished until 2012-09-04T12:28:48, and that is when the success happened. Since she was asking for all 512 cnodes on both steps, she had to wait for the first one before the second one could run.

(In reply to comment #1)
> If she is trying to run multiple steps per block then yes. This will happen
> the same as on any other cluster.
>
> If she gets this on the first step I would say it is strange. It could mean
> somehow the previous job didn't finish yet and the resources somehow got
> back into the mix.
>
> For this particular job you can look through the log and see step 0 wasn't
> finished until 2012-09-04T12:28:48 and that is when the success happened.
> Since she was asking for all 512 cnodes on both steps she had to wait for
> the first one before the second one could run.

After further conversation with Sheila, she has found that her job scripts do indeed invoke two sruns in succession, and that yes, the second srun emits the warning. Sorry for the false alarm. Thanks for looking into this.

The solution to this problem in my case was simply to remove the repeated "srun"! Originally I wrote:

    srun srun sss2 -omp 7 Cf101010NF

Then I corrected it to:

    srun sss2 -omp 7 Cf101010NF

And the error message disappeared and the program ran normally. Such a silly mistake, I know :P

Hi, I am getting the error below.

    srun: Job 1723246 step creation temporarily disabled, retrying (Requested nodes are busy)

Can anybody help me out with this?

Please don't alter the old bug header :).
Please open a new ticket as this one deals specifically with Bluegene systems.
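
For anyone landing here from the warning text, the waiting behaviour described in the first comment can be illustrated with a toy model. This is only a sketch and not Slurm's scheduler: the function name `schedule_steps`, the time units, and the durations are all made up for illustration. The only figure taken from this thread is the 512 cnodes. Each step starts as soon as enough cnodes are free, so a second step that asks for the whole allocation cannot start until the first one ends.

```python
import heapq


def schedule_steps(total_cnodes, steps):
    """Toy model of job steps competing for the cnodes of one allocation.

    steps: list of (submit_time, cnodes, duration), handled in list order.
    Returns a list of (start_time, end_time), one entry per step.
    This is an illustration only; it does not reproduce Slurm's logic.
    """
    running = []          # min-heap of (end_time, cnodes) for active steps
    free = total_cnodes   # cnodes not held by any running step
    results = []
    for submit, cnodes, duration in steps:
        if cnodes > total_cnodes:
            raise ValueError("step requests more cnodes than the allocation has")
        now = submit
        # Release cnodes from steps that had already ended at submit time.
        while running and running[0][0] <= now:
            _, released = heapq.heappop(running)
            free += released
        # "Retrying": wait for running steps to end until the request fits.
        while free < cnodes:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        free -= cnodes
        heapq.heappush(running, (now + duration, cnodes))
        results.append((now, now + duration))
    return results


# Two steps that each want all 512 cnodes: the second waits for the first.
print(schedule_steps(512, [(0, 512, 100), (10, 512, 50)]))
# -> [(0, 100), (100, 150)]

# Two steps that each want half the cnodes: they can overlap.
print(schedule_steps(512, [(0, 256, 100), (10, 256, 50)]))
# -> [(0, 100), (10, 60)]
```

In the first call the second step is submitted at time 10 but only starts at time 100, which corresponds to the period during which srun would keep printing the "retrying" message in the scenario described above.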