Hi,

When an AFTERCORR dependency is used between 2 job arrays and some tasks of the first array fail, some tasks of the second array are never scheduled and are left pending with the reason "DependencyNeverSatisfied".

Below you can find the scripts used for both array jobs, the commands to submit them, the output of sacct, the output of scontrol show job (for one of the pending jobs), and the output of squeue.

There seems to be a bug when updating the dependencies after MinJobAge.

Regards,
Carlos

[fenoyc@rkalbhpc014 array_dep]$ cat my_job1.sh
#!/bin/bash
#SBATCH --qos=short
if [ $SLURM_ARRAY_TASK_ID -eq 4 ];then
    exit -1
fi
sleep 10

[fenoyc@rkalbhpc014 array_dep]$ cat my_job2.sh
#!/bin/bash
#SBATCH --qos=long
sleep 60

[fenoyc@rkalbhpc014 array_dep]$ sbatch -a 1-30%5 my_job1.sh
[fenoyc@rkalbhpc014 array_dep]$ sbatch -a 1-30%5 --dependency=aftercorr:8835280 my_job2.sh

[fenoyc@rkalbhpc014 array_dep]$ sacct -j 8835280 -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8835280_30   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_1    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_2    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_3    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_4    my_job1.sh       defq        itu          1     FAILED    127:0
8835280_5    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_6    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_7    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_8    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_9    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_10   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_11   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_12   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_13   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_14   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_15   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_16   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_17   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_18   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_19   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_20   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_21   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_22   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_23   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_24   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_25   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_26   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_27   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_28   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_29   my_job1.sh       defq        itu          1  COMPLETED      0:0

[fenoyc@rkalbhpc014 array_dep]$ sacct -j 8835286 -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8835286_[30+ my_job2.sh       defq        itu          1    PENDING      0:0
8835286_1    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_2    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_3    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_4    my_job2.sh       defq        itu          1    PENDING      0:0
8835286_5    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_6    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_7    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_8    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_9    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_10   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_11   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_12   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_13   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_14   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_15   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_16   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_17   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_18   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_19   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_20   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_21   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_22   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_23   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_24   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_25   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_26   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_27   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_28   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_29   my_job2.sh       defq        itu          1    PENDING      0:0

[fenoyc@rkalbhpc014 array_dep]$ scontrol show job -d 8835286_21
JobId=8835333 ArrayJobId=8835286 ArrayTaskId=21 JobName=my_job2.sh
   UserId=fenoyc(82718) GroupId=itu(20356) MCS_label=N/A
   Priority=43037 Nice=0 Account=itu QOS=long
   JobState=PENDING Reason=DependencyNeverSatisfied Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2018-01-08T14:32:03 EligibleTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=rkalbhpc014:143042
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8800,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=8800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pstore/scratch/u/fenoyc/tests/slurm/array_dep/my_job2.sh
   WorkDir=/pstore/scratch/u/fenoyc/tests/slurm/array_dep
   StdErr=/pstore/scratch/u/fenoyc/tests/slurm/array_dep/slurm-8835286_21.out
   StdIn=/dev/null
   StdOut=/pstore/scratch/u/fenoyc/tests/slurm/array_dep/slurm-8835286_21.out
   Power=

[fenoyc@rkalbhpc014 array_dep]$ squeue -u fenoyc
JOBID              PARTITION  QOS   NAME        USER    STATE    TIME  TIME_LIMIT   CPUS  NODES  NODELIST(REASON)
8835286_[4,21-30]  defq       long  my_job2.sh  fenoyc  PENDING  0:00  15-00:00:00  1     1      (DependencyNeverSatisfied)

[fenoyc@rkalbhpc014 array_dep]$ scontrol -V
slurm 17.02.7
Hi,

This seems to work as designed. I agree that the behavior in this case is neither predictable nor well documented. If any task of the parent array fails and the whole parent array has completed, all still-pending tasks of the dependent array are marked 'DependencyNeverSatisfied'. Only an admin can clear this state.

Dominik
If it works as designed, the documentation should mention this case. In any case, could this be fixed so that only the tasks that depend on failed tasks are not run?
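To make the requested behavior concrete, here is a minimal sketch of the per-task mapping one would expect from aftercorr. It is only an illustration, not how slurmctld evaluates dependencies. The function name aftercorr_expected and the labels "eligible", "waiting" and "never" are made up for this example. It assumes input lines of the form JobID|State, such as `sacct -j 8835280 -X -n -P -o JobID,State` should produce for the parent array.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): reads "JobID|State" lines for the
# parent array on stdin and prints, per array task ID, what the corresponding
# task of an aftercorr-dependent array would be expected to do.
aftercorr_expected() {
    awk -F'|' '{
        task = $1
        sub(/^[0-9]+_/, "", task)          # strip "<array_job_id>_" prefix
        if ($2 == "COMPLETED")
            print task, "eligible"         # parent task succeeded
        else if ($2 == "PENDING" || $2 == "RUNNING")
            print task, "waiting"          # parent task not finished yet
        else
            print task, "never"            # parent task failed, was cancelled, etc.
    }'
}
```

With the parent array from this report, only task 4 would come out as "never"; tasks 21-30 would be "eligible" once their parent tasks complete, which is the behavior being asked for.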
Hi,

We fixed this in https://github.com/SchedMD/slurm/commit/7b5a36740a38979a2bf66a5c54afb5438b175b59. This commit was applied to the 17.02 branch, but I am not sure whether there will be another 17.02 release.

Dominik