Ticket 1685

Summary: Add an sbatch option to block until the batch has finished executing
Product: Slurm
Reporter: Marios Hadjieleftheriou <marioh>
Component: Other
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: brian, da, rod
Version: 14.11.0   
Hardware: Linux   
OS: Linux   
Site: Lion Cave Capital
Version Fixed: 16.05.0-pre1
Attachments: sbatch --wait patch for Slurm v14.08

Description Marios Hadjieleftheriou 2015-05-21 07:42:25 MDT
By default sbatch submits a script and returns immediately.

Currently, there are two simple ways to know when a job has finished:
1. Notify by email.
2. Run a dependent job with an afterany, afternotok, or afterok dependency.

#1 is mostly useful when humans run a batch script.

#2 could be used by a workflow manager, but it has issues. First, afterok and afternotok cannot be used in practice: if the batch finishes successfully, the afternotok job blocks indefinitely, and if it finishes with an error, the afterok job does. afterany works well, but the command it invokes does not know the exit status of the batch.
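To make the dependency problem concrete, here is a minimal sketch of the behaviour described above. It is a simplified model for illustration only, not Slurm's scheduling code, and the function names are hypothetical.

```python
from typing import List


def dependent_job_runs(dependency: str, batch_exit_code: int) -> bool:
    """Simplified model: does a follow-up job with this dependency ever start?"""
    if dependency == "afterany":
        return True                      # runs regardless of the outcome
    if dependency == "afterok":
        return batch_exit_code == 0      # runs only after success
    if dependency == "afternotok":
        return batch_exit_code != 0      # runs only after failure
    raise ValueError(f"unknown dependency type: {dependency}")


def stuck_jobs(batch_exit_code: int) -> List[str]:
    """Dependency types whose follow-up job never starts for this outcome."""
    return [d for d in ("afterany", "afterok", "afternotok")
            if not dependent_job_runs(d, batch_exit_code)]
```

Whatever the outcome, exactly one of the afterok/afternotok pair is left waiting, and the afterany job runs without being told which case occurred.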

There are other ways to detect that a batch has finished, and all of them rely on some sort of polling: polling the job's output files for a signal that each job has completed, calling scontrol show job repeatedly until the job state changes, and so on. These approaches are error-prone and impose a heavy burden on the user.
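For reference, a polling workaround of the kind described above might look roughly like the sketch below. It assumes the text returned by scontrol show job contains a JobState=... field. The set of terminal states, the function names, and the fetch callback (which a caller would implement by invoking scontrol) are assumptions made for illustration.

```python
import re
import time
from typing import Callable, Optional

# States treated as "finished" in this sketch. The list is an assumption and
# may not cover every terminal state a given Slurm version reports.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"}


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the JobState=... value from `scontrol show job` text, if any."""
    match = re.search(r"\bJobState=(\S+)", scontrol_output)
    return match.group(1) if match else None


def wait_by_polling(fetch: Callable[[], str], interval: float = 30.0,
                    max_polls: int = 1000) -> str:
    """Call fetch() repeatedly until the reported job state is terminal."""
    for _ in range(max_polls):
        state = parse_job_state(fetch())
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)             # wait before asking again
    raise TimeoutError("job did not reach a terminal state")
```

Even this small loop has to pick a poll interval, an upper bound, and a list of terminal states, and it still has to recover the exit status separately, which is the burden the ticket refers to.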

But there is a simple way to fix this problem: introduce an sbatch parameter that instructs sbatch to block until the batch has finished. The exit code of sbatch can then be the exit code of the batch script or, for job arrays, the largest exit code across all jobs.
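With such an option, a workflow manager could reduce the whole wait to one blocking call. The sketch below assumes the option is spelled --wait, as in the patch attached to this ticket, and that sbatch exits with the status proposed above. The helper names and the injectable run parameter are hypothetical and exist only to keep the example testable without a cluster.

```python
import subprocess
from typing import Callable, Iterable, List


def build_wait_command(script_path: str) -> List[str]:
    """Command line that submits the script and blocks until it finishes."""
    return ["sbatch", "--wait", script_path]


def submit_and_wait(
    script_path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Submit the batch script, block, and return sbatch's exit code."""
    completed = run(build_wait_command(script_path))
    return completed.returncode


def array_exit_code(task_exit_codes: Iterable[int]) -> int:
    """Proposed rule for job arrays: report the largest task exit code."""
    return max(task_exit_codes)
```

A caller would then branch on the return value of submit_and_wait("job.sh") directly, with no polling loop and no follow-up dependency job.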
Comment 1 Moe Jette 2015-09-04 09:51:18 MDT
It was too late to get this into version 15.08, but it will be in the next major release (16.05, May 2016) and this patch will apply cleanly to version 15.08 if you care to use it.
https://github.com/SchedMD/slurm/commit/6638cafa62ab93e92eb9623449999d55a160bddc
Comment 2 Rodney Mach 2015-09-24 06:28:57 MDT
Could I get a patch against 14.11.9? We also need this.
Comment 3 Moe Jette 2015-09-24 09:06:37 MDT
Created attachment 2246 [details]
sbatch --wait patch for Slurm v14.08
Comment 4 Moe Jette 2015-09-24 09:07:14 MDT
(In reply to Rodney Mach from comment #2)
> Could I get a patch against 14.11.9? We also need this.

Done, see attachment