Sheila reports that she consistently sees the following in her job output on the vulcan machine:

srun: Job step creation temporarily disabled, retrying
srun: Job step created

I tracked this down to a for loop in create_job_step() which repeatedly invokes slurm_step_ctx_create() until it succeeds. The logs tell me that the first 3 or 4 of these invocations fail here:

[2012-09-04T12:28:47] select_p_step_pick_nodes: Looking for more than one midplane of block RMP04Se122726811 for job 11405, but some of it is used.
[2012-09-04T12:28:47] _slurm_rpc_job_step_create for job 11405: Requested nodes are busy

The slurm_step_ctx_create() invocation that eventually succeeds looks like this:

[2012-09-04T12:28:49] select_p_step_pick_nodes: new step for job 11405 will be running on RMP04Se122726811(vulcan0100)
[2012-09-04T12:28:49] sched: _slurm_rpc_job_step_create: StepId=11405.1 vulcan0100 usec=26231

Is this normal?
If she is trying to run multiple steps per block, then yes. This happens the same way on any other cluster.

If she gets this on the first step, I would say it is strange. It could mean the previous job had not finished yet and its resources somehow got back into the mix.

For this particular job you can look through the log and see that step 0 wasn't finished until 2012-09-04T12:28:48, and that is when the success happened. Since she was asking for all 512 cnodes on both steps, she had to wait for the first one before the second one could run.
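To make the wait concrete, here is a minimal sketch of the behaviour described above. It is a toy model, not Slurm source: the function names, the tick-based clock, and the choice to print the retry message only once are all assumptions made for illustration. Only the 512-cnode block size and the two srun messages come from this thread.

```python
# Toy model of step creation being retried until the block is free.
# Hypothetical names throughout; this is not how Slurm is implemented.

TOTAL_CNODES = 512  # block size mentioned in this thread


def try_create_step(requested, in_use):
    """Return True if the requested cnodes fit in what is currently free."""
    return requested <= TOTAL_CNODES - in_use


def create_step_with_retry(requested, step0_ends_at, max_attempts=10):
    """Retry once per simulated tick until the step can be created.

    step0_ends_at is the tick at which the earlier step releases the
    whole block (assumption: step 0 holds all cnodes until then).
    """
    messages = []
    for tick in range(max_attempts):
        in_use = TOTAL_CNODES if tick < step0_ends_at else 0
        if try_create_step(requested, in_use):
            if messages:
                # Only announce success if a retry was reported first.
                messages.append("srun: Job step created")
            return tick, messages
        if not messages:
            messages.append(
                "srun: Job step creation temporarily disabled, retrying"
            )
    raise TimeoutError("step was never created")


tick, messages = create_step_with_retry(requested=512, step0_ends_at=3)
print(tick)
print("\n".join(messages))
```

With step 0 holding the block for three ticks, the second step fails three times and succeeds on the fourth attempt, which mirrors the "first 3 or 4 invocations fail" pattern in the logs. If the block is already free (step0_ends_at=0), no message is produced at all.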
(In reply to comment #1)
> If she is trying to run multiple steps per block then yes. This will happen
> the same as on any other cluster.
>
> If she gets this on the first step I would say it is strange. It could mean
> somehow the previous job didn't finish yet and the resources somehow got
> back into the mix.
>
> For this particular job you can look through the log and see step 0 wasn't
> finished until 2012-09-04T12:28:48 and that is when the success happened.
> Since she was asking for all 512 cnodes on both steps she had to wait for
> the first one before the second one could run.

After further conversation with Sheila, she has found that her job scripts do indeed invoke two sruns in succession, and that the second srun is the one that emits the warning. Sorry for the false alarm. Thanks for looking into this.
The solution to this problem in my case was simply to remove the repeated "srun"! Originally I wrote:

    srun srun sss2 -omp 7 Cf101010NF

Then I corrected it to:

    srun sss2 -omp 7 Cf101010NF

And the error message disappeared and the program ran normally. Such a silly mistake, I know :P
Hi, I am getting below error. srun: Job 1723246 step creation temporarily disabled, retrying (Requested nodes are busy) Can anybody help me out on this?
Please don't alter the old bug header :). Please open a new ticket, as this one deals specifically with Bluegene systems.