Ticket 4590 - Dependencies between array tasks not working properly
Summary: Dependencies between array tasks not working properly
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.02.3
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Reported: 2018-01-08 07:57 MST by Carlos Fenoy
Modified: 2018-01-16 10:06 MST
Site: Roche
Version Fixed: 17.02.10


Description Carlos Fenoy 2018-01-08 07:57:13 MST
Hi,

When using the AFTERCORR dependency between 2 arrays, and some of the tasks of the first array fail, some of the tasks of the second array are never scheduled and are left pending with the reason "DependencyNeverSatisfied".

Below you can find the scripts used for both array jobs, the commands to submit them, and the output of sacct and scontrol show job (for one of the pending jobs) and the output of squeue.

There seems to be a bug when updating the dependencies after MinJobAge.

Regards,
Carlos

[fenoyc@rkalbhpc014 array_dep]$ cat my_job1.sh
#!/bin/bash

#SBATCH --qos=short

if [ $SLURM_ARRAY_TASK_ID -eq 4 ];then
	exit -1
fi

sleep 10
[fenoyc@rkalbhpc014 array_dep]$ cat my_job2.sh
#!/bin/bash

#SBATCH --qos=long

sleep 60
[fenoyc@rkalbhpc014 array_dep]$ sbatch -a 1-30%5 my_job1.sh
[fenoyc@rkalbhpc014 array_dep]$ sbatch -a 1-30%5 --dependency=aftercorr:8835280 my_job2.sh


[fenoyc@rkalbhpc014 array_dep]$ sacct -j 8835280 -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8835280_30   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_1    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_2    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_3    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_4    my_job1.sh       defq        itu          1     FAILED    127:0
8835280_5    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_6    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_7    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_8    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_9    my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_10   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_11   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_12   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_13   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_14   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_15   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_16   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_17   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_18   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_19   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_20   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_21   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_22   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_23   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_24   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_25   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_26   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_27   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_28   my_job1.sh       defq        itu          1  COMPLETED      0:0
8835280_29   my_job1.sh       defq        itu          1  COMPLETED      0:0

[fenoyc@rkalbhpc014 array_dep]$ sacct -j 8835286 -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8835286_[30+ my_job2.sh       defq        itu          1    PENDING      0:0
8835286_1    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_2    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_3    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_4    my_job2.sh       defq        itu          1    PENDING      0:0
8835286_5    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_6    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_7    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_8    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_9    my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_10   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_11   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_12   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_13   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_14   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_15   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_16   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_17   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_18   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_19   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_20   my_job2.sh       defq        itu          1  COMPLETED      0:0
8835286_21   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_22   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_23   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_24   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_25   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_26   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_27   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_28   my_job2.sh       defq        itu          1    PENDING      0:0
8835286_29   my_job2.sh       defq        itu          1    PENDING      0:0

[fenoyc@rkalbhpc014 array_dep]$ scontrol show job -d 8835286_21
JobId=8835333 ArrayJobId=8835286 ArrayTaskId=21 JobName=my_job2.sh
   UserId=fenoyc(82718) GroupId=itu(20356) MCS_label=N/A
   Priority=43037 Nice=0 Account=itu QOS=long
   JobState=PENDING Reason=DependencyNeverSatisfied Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2018-01-08T14:32:03 EligibleTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=rkalbhpc014:143042
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8800,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=8800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pstore/scratch/u/fenoyc/tests/slurm/array_dep/my_job2.sh
   WorkDir=/pstore/scratch/u/fenoyc/tests/slurm/array_dep
   StdErr=/pstore/scratch/u/fenoyc/tests/slurm/array_dep/slurm-8835286_21.out
   StdIn=/dev/null
   StdOut=/pstore/scratch/u/fenoyc/tests/slurm/array_dep/slurm-8835286_21.out
   Power=

[fenoyc@rkalbhpc014 array_dep]$ squeue -u fenoyc
             JOBID PARTITION          QOS         NAME     USER    STATE         TIME   TIME_LIMIT   CPUS  NODES NODELIST(REASON)
 8835286_[4,21-30]      defq         long   my_job2.sh   fenoyc  PENDING         0:00  15-00:00:00      1      1 (DependencyNeverSatisfied)

[fenoyc@rkalbhpc014 array_dep]$ scontrol -V
slurm 17.02.7
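
To pick the affected tasks out of squeue output like the listing above, a small filter can help. This is only a sketch: the function name never_satisfied is made up for illustration, it expects "JOBID REASON" pairs on stdin, and the sample input is modelled on this report (the "Resources" line is an invented non-matching entry).

```shell
#!/bin/bash
# Hypothetical helper: reads "JOBID REASON" pairs on stdin and prints the
# JOBID of every entry whose reason is DependencyNeverSatisfied.
never_satisfied() {
    awk '$2 == "DependencyNeverSatisfied" { print $1 }'
}

# Sample input modelled on the squeue listing in this report; the
# "Resources" line is invented to show a non-matching entry.
printf '%s\n' \
    '8835286_4 DependencyNeverSatisfied' \
    '8835286_5 Resources' \
    '8835286_21 DependencyNeverSatisfied' | never_satisfied
```

On a live cluster the input could presumably come from something like squeue -u "$USER" -r -h -t PENDING -o '%i %r', assuming those options behave as documented for the installed version (-r lists one array element per line, %i is the job ID and %r the pending reason).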
Comment 1 Dominik Bartkiewicz 2018-01-09 09:35:16 MST
Hi

It seems to work as designed.
I agree that in this case the behavior is not predictable or well documented.
If any task of the parent array fails and the whole array has completed, all tasks of the dependent array are marked 'DependencyNeverSatisfied'. This can only be cleaned up by an admin.

Dominik
Comment 2 Carlos Fenoy 2018-01-09 09:58:34 MST
If it works as designed, the documentation should mention this case. Anyway, could this be fixed so that only the tasks depending on failed ones are not run?
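
The behaviour requested here can be illustrated with a toy model. It is a sketch only: it makes no Slurm calls and is not the scheduler's implementation, and the function name runnable_tasks is hypothetical. It assumes per-task aftercorr semantics, where a task of the dependent array may run only if the parent task with the same ID succeeded.

```shell
#!/bin/bash
# Toy model (no Slurm calls): which tasks of a dependent array are
# runnable under per-task aftercorr semantics?
# Arguments: first task ID, last task ID, then the IDs of failed parent tasks.
runnable_tasks() {
    local first=$1 last=$2
    shift 2
    local failed=" $* "   # space-padded so each ID can be matched whole
    local id out=()
    for ((id = first; id <= last; id++)); do
        # A child task is runnable only if its own parent task succeeded.
        if [[ $failed != *" $id "* ]]; then
            out+=("$id")
        fi
    done
    echo "${out[*]}"
}

# Scaled-down version of the report: parent array 1-6, only task 4 failed.
runnable_tasks 1 6 4   # prints: 1 2 3 5 6
```

Applied to the arrays in this ticket (tasks 1-30 with task 4 failed), the same rule would leave only task 4 of the second array blocked, rather than tasks 4 and 21-30.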
Comment 5 Dominik Bartkiewicz 2018-01-15 02:47:49 MST
Hi

We fixed this in https://github.com/SchedMD/slurm/commit/7b5a36740a38979a2bf66a5c54afb5438b175b59.
This commit was applied to the 17.02 branch, but I am not sure whether there will be another 17.02 release.

Dominik