Ticket 119 - srun: Job step creation temporarily disabled, retrying
Summary: srun: Job step creation temporarily disabled, retrying
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: nss_slurm
Version: - Unsupported Older Versions
Hardware: IBM BlueGene Linux
Importance: --- 4 - Minor Issue
Assignee: Danny Auble
Reported: 2012-09-04 08:53 MDT by Don Lipari
Modified: 2021-06-03 11:07 MDT

Site: LLNL
Linux Distro: RHEL
Version Fixed: N/A


Description Don Lipari 2012-09-04 08:53:39 MDT
Sheila reports that she consistently sees the following in her job output on the vulcan machine:

srun: Job step creation temporarily disabled, retrying
srun: Job step created

I tracked this down to a for loop in create_job_step() which repeatedly invokes slurm_step_ctx_create() until it succeeds.
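
For readers following along, the retry pattern described above can be sketched roughly as follows. This is a minimal Python model, not Slurm's actual C code: `StepBusy`, `create_step_with_retry`, `make_fake_create`, the retry limit, and the delay are all hypothetical names and values chosen for illustration. Printing the "retrying" message only once is an assumption made to match the output shown above, where a single warning covers several failed attempts.

```python
import time


class StepBusy(Exception):
    """Hypothetical error: the requested nodes are busy right now."""


def create_step_with_retry(try_create, max_retries=10, delay=0.0, log=print):
    """Call try_create() until it succeeds or the retry budget runs out."""
    retried = False
    for _ in range(max_retries + 1):
        try:
            step = try_create()
        except StepBusy:
            if not retried:
                # Assumption: warn once, then keep retrying quietly.
                log("srun: Job step creation temporarily disabled, retrying")
                retried = True
            time.sleep(delay)
            continue
        if retried:
            log("srun: Job step created")
        return step
    raise TimeoutError("step creation still busy after all retries")


def make_fake_create(failures):
    """Build a stand-in for the controller that is busy `failures` times."""
    state = {"left": failures}

    def try_create():
        if state["left"] > 0:
            state["left"] -= 1
            raise StepBusy()
        return "11405.1"

    return try_create


messages = []
step_id = create_step_with_retry(make_fake_create(3), log=messages.append)
```

With three simulated busy replies, `messages` ends up holding the same two lines Sheila sees in her job output.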

The logs tell me that the first 3 or 4 of these invocations fail here:

[2012-09-04T12:28:47] select_p_step_pick_nodes: Looking for more than one midplane of block RMP04Se122726811 for job 11405, but some of it is used.
[2012-09-04T12:28:47] _slurm_rpc_job_step_create for job 11405: Requested nodes are busy

The slurm_step_ctx_create() invocation that eventually succeeds looks like this:

[2012-09-04T12:28:49] select_p_step_pick_nodes: new step for job 11405 will be running on RMP04Se122726811(vulcan0100)
[2012-09-04T12:28:49] sched: _slurm_rpc_job_step_create: StepId=11405.1 vulcan0100 usec=26231

Is this normal?
Comment 1 Danny Auble 2012-09-04 09:57:53 MDT
If she is trying to run multiple steps per block then yes.  This will happen the same as on any other cluster.

If she gets this on the first step I would say it is strange.  It could mean somehow the previous job didn't finish yet and the resources somehow got back into the mix.

For this particular job you can look through the log and see step 0 wasn't finished until 2012-09-04T12:28:48 and that is when the success happened.  Since she was asking for all 512 cnodes on both steps she had to wait for the first one before the second one could run.
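
The reasoning above can be illustrated with a toy model of one block. This is only a sketch under simplifying assumptions: the `Block` class and its methods are hypothetical, and the real select plugin tracks far more than a cnode count. It shows why a second step asking for all 512 cnodes is refused until the first step releases them.

```python
class Block:
    """Toy model of a block with a fixed number of cnodes (hypothetical)."""

    def __init__(self, total_cnodes):
        self.total = total_cnodes
        self.in_use = {}  # step id -> cnodes currently held by that step

    def free(self):
        """Number of cnodes not held by any running step."""
        return self.total - sum(self.in_use.values())

    def try_start(self, step_id, cnodes):
        """Start a step only if enough cnodes are free; report success."""
        if cnodes > self.free():
            return False  # corresponds to "Requested nodes are busy"
        self.in_use[step_id] = cnodes
        return True

    def finish(self, step_id):
        """Release the cnodes held by a finished step."""
        self.in_use.pop(step_id, None)


block = Block(512)
first = block.try_start("11405.0", 512)              # step 0 takes everything
second_while_busy = block.try_start("11405.1", 512)  # refused, nothing free
block.finish("11405.0")                              # step 0 completes
second_after = block.try_start("11405.1", 512)       # now it fits
```

If each step asked for fewer cnodes than the block holds (say 256 each), both calls to `try_start` would succeed in this model and no retry would be needed.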
Comment 2 Don Lipari 2012-09-05 03:53:42 MDT
(In reply to comment #1)
> If she is trying to run multiple steps per block then yes.  This will happen
> the same as on any other cluster.
> 
> If she gets this on the first step I would say it is strange.  It could mean
> somehow the previous job didn't finish yet and the resources somehow got
> back into the mix.
> 
> For this particular job you can look through the log and see step 0 wasn't
> finished until 2012-09-04T12:28:48 and that is when the success happened. 
> Since she was asking for all 512 cnodes on both steps she had to wait for
> the first one before the second one could run.

After further conversation with Sheila, she has found that indeed her job scripts invoke two sruns in succession, and that yes, the second srun emits the warning.  Sorry for the false alarm.  Thanks for looking into this.
Comment 3 Mohammed Mahdi 2019-05-27 14:15:08 MDT
The solution to this problem in my case was simply to remove the repeated "srun"! Originally I wrote:
srun srun sss2 -omp 7 Cf101010NF
Then I corrected it to:
srun sss2 -omp 7 Cf101010NF
And the error message disappeared and the program ran normally. Such a silly mistake I know :P
Comment 4 Dharmendra 2021-05-11 06:56:45 MDT
Hi,
I am getting the error below.

srun: Job 1723246 step creation temporarily disabled, retrying (Requested nodes are busy)


Can anybody help me out on this?
Comment 5 Dharmendra 2021-05-11 06:57:39 MDT
Hi,
I am getting the error below.

srun: Job 1723246 step creation temporarily disabled, retrying (Requested nodes are busy)


Can anybody help me out on this?
Comment 6 Danny Auble 2021-06-03 11:07:42 MDT
Please don't alter the old bug header :).

Please open a new ticket as this one deals specifically with Bluegene systems.