Bug 5740 - Dependent job arrays created via aftercorr getting states confused
Summary: Dependent job arrays created via aftercorr getting states confused
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.0
Hardware: Linux
OS: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2018-09-17 08:40 MDT by eli_venter
Modified: 2018-09-17 08:56 MDT

Site: -Other-


Description eli_venter 2018-09-17 08:40:30 MDT
I recently installed Slurm 18.08.0 on my Debian test cluster and started playing with dependent job arrays, i.e. using -d aftercorr on the sbatch submissions. I'm seeing weird behavior when one of the jobs in an early array fails: the corresponding job in the next array still gets started, while other jobs in that array end up being cancelled (a sketch of the submission pattern is included after the log below).

For example, I ran a test with 4 dependent job arrays, each with 96 jobs. The first array was 3153448 and all of its jobs finished successfully. The first dependent array was 3153449, and it had one job failure, index 94. The second dependent array is 3153450, so I'd expect its _94 job to be cancelled due to the failure of the corresponding job in the array it depends on. Instead I see the 3153450_94 job run, and several other jobs in the 3153450 array get cancelled. Here's an egrep of the slurmctld log showing the sequence for just 2 arrays and 2 jobs with odd behavior:

$ zegrep '31534(49|50)_21|31534(49|50)_94' slurmctld.log.1

[2018-09-14T15:00:02.877] build_job_queue: Split out JobId=3153449_21(3153687) for SLURM_DEPEND_AFTER_CORRESPOND use
[2018-09-14T15:09:43.165] sched: Allocate JobId=3153449_21(3153687) NodeList=r4-16 #CPUs=2 Partition=low
[2018-09-14T15:41:54.735] sched: Allocate JobId=3153449_94(3153762) NodeList=r1-02 #CPUs=2 Partition=low
[2018-09-14T16:18:46.376] _job_complete: JobId=3153449_21(3153687) WEXITSTATUS 0
[2018-09-14T16:18:46.376] _job_complete: JobId=3153449_21(3153687) done
[2018-09-14T16:25:08.352] build_job_queue: Split out JobId=3153450_21(3153780) for SLURM_DEPEND_AFTER_CORRESPOND use
[2018-09-14T16:28:18.951] _job_complete: JobId=3153449_94(3153762) WEXITSTATUS 1
[2018-09-14T16:28:18.952] _job_complete: JobId=3153449_94(3153762) done
[2018-09-14T17:29:35.858] _kill_dependent: Job dependency can't be satisfied, cancelling JobId=3153450_21(3153780)
[2018-09-14T17:29:39.765] sched: Allocate JobId=3153450_94(3153889) NodeList=r1-03 #CPUs=2 Partition=low
[2018-09-14T17:41:26.529] _job_complete: JobId=3153450_94(3153889) WEXITSTATUS 0
[2018-09-14T17:41:26.530] _job_complete: JobId=3153450_94(3153889) done
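
For context, the chain of arrays described above can be reproduced with submissions along these lines (a rough sketch only; the script name step.sh, the 0-95 array range, and the shell variables are illustrative, not the exact commands used):

$ jid1=$(sbatch --parsable --array=0-95 step.sh)                  # base array, no dependency
$ jid2=$(sbatch --parsable --array=0-95 -d aftercorr:$jid1 step.sh)
$ jid3=$(sbatch --parsable --array=0-95 -d aftercorr:$jid2 step.sh)
$ jid4=$(sbatch --parsable --array=0-95 -d aftercorr:$jid3 step.sh)

My understanding of aftercorr is that a failure of task N in one array should only affect task N of the next array, i.e. in the run above only 3153450_94 should have been cancelled. The per-task states can be cross-checked against accounting with something like:

$ sacct -j 3153449,3153450 --format=JobID,State,ExitCode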