Bug 3216 - Option to kill batch job when job step fails
Summary: Option to kill batch job when job step fails
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.02.x
Hardware: Linux
Importance: --- 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-10-27 12:53 MDT by Davide Vanzo
Modified: 2020-08-21 08:38 MDT
CC List: 1 user

See Also:
Site: Vanderbilt
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Davide Vanzo 2016-10-27 12:53:15 MDT
In certain use cases it would be helpful to have an srun/sbatch option that instructs Slurm to kill a job if any process within a job step gets killed for any reason (e.g. cgroup memory overflow).
Two examples to better clarify:

- Running a shell script with srun. In this case, if one process launched by the shell script uses more memory than allocated, that process gets killed but the script keeps running. With the proposed option, the whole job step would be killed.

- Multiple job steps within a batch job. In this case, if a job step fails because of a memory overflow, Slurm kills that job step and proceeds with the next one. With the proposed option, Slurm would kill the whole batch job as soon as any job step fails (see the sketch after these examples).
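
To illustrate the second case, here is a minimal sketch of a batch script (the step commands and the 4G limit are made up). Today, if step_a is OOM-killed the script simply moves on to step_b; with the proposed option the whole batch job would stop instead:

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --mem=4G

  # May be OOM-killed if it exceeds the per-step/cgroup memory limit.
  srun ./step_a

  # Still runs today even if step_a was killed.
  srun ./step_b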

You may correctly argue that in the first example a more refined error-catching structure should be implemented in the script. Unfortunately, most of the time our users run big scripts written by developers they do not know, and retrofitting such error handling is extremely hard. Hence, having an option to control the script behavior "from above" would help a lot.

Davide
Comment 1 Tim Wickberg 2016-10-27 13:02:44 MDT
(In reply to Davide Vanzo from comment #0)
> In certain use cases it would be helpful to have a srun/sbatch option that
> instruct Slurm to kill a job if any process within a job step gets killed
> for any reason (e.g. cgroup memory overflow).
> Two examples to better clarify:
> 
> - Running a shell script with srun. In this case if one process launched by
> the shell script uses more memory than allocated, the process gets killed
> but the script keeps running. With the proposed option the whole job step
> would be killed.

Everything within a single step should be getting killed on OOM or similar.

> - Multiple job steps within a batch job. In this case if a job step fails
> because of memory overflow Slurm kills the job step and proceeds with the
> next. With the proposed option Slurm kills the whole batch job as soon as a
> job step fails.

This is a bit trickier, although adding a 'set -e' at the start of the script covers the most common use case of chaining discrete steps together one-after-another.
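
As a rough sketch of that approach (the step commands are placeholders), 'set -e' makes the batch script exit, and therefore the job end, as soon as any srun returns a non-zero exit code:

  #!/bin/bash
  #SBATCH --ntasks=1

  set -e          # abort the batch script on the first failing command

  srun ./step_a   # if this step fails (non-zero exit), the script exits here
  srun ./step_b   # so this step is never launched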

> You may correctly argue that in the first example a more refined error
> catching structure should be implemented in the script. Unfortunately most
> of the times our users run big scripts written by unknown developers and it
> is extremely hard to do so. Hence having the option of controlling the
> script behavior "from above" would help a lot.

I could see some use for such an "all or nothing" option; I'll look into how that might work. I'm assuming you'd want this as a per-job option, and if you wanted to force it onto users you'd just do that through a job submit plugin?
Comment 2 Davide Vanzo 2016-10-27 13:12:58 MDT
Hello Tim,
yes, it does not make sense to enforce it on all jobs, but the user should be able to choose whether or not to enable it for each individual job.

As for the first example, could you suggest where I can find more information about OOM settings?

Thanks

Davide
Comment 3 Tim Wickberg 2016-10-27 13:20:58 MDT
(In reply to Davide Vanzo from comment #2)
> Hello Tim,
> yes, it does not make sense to enforce it on all jobs but the user can
> choose if he/she wants it or not for every single job.

If you're expecting users to request this manually, then 'set -e', or changing the first line of their script to "#!/bin/bash -e", should do most of this.
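
The shebang form is the same idea without touching the script body (step commands are placeholders):

  #!/bin/bash -e
  #SBATCH --ntasks=1

  srun ./step_a   # a failing step terminates the script here
  srun ./step_b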

> As for the first example, could you suggest where I can find more
> information about OOM settings?

http://slurm.schedmd.com/cgroups.html would be the place to start, although I'm seeing that it doesn't elaborate much on how the enforcement mechanisms work.

Everything within that step should be getting killed if the step goes over its limit.

I'll note this assumes the step hits the cgroup limits first - if you're hitting the system OOM-killer somehow, the results can be unpredictable, which is why we usually suggest setting the available memory on the node lower than the actual physical memory to leave some space for the kernel / OS daemons.
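
For reference, a sketch of the configuration pieces involved (the node name and sizes are made up): cgroup enforcement via task/cgroup, memory constraints in cgroup.conf, and RealMemory advertised below the physical RAM so the kernel and system daemons keep some headroom:

  # slurm.conf (fragment)
  TaskPlugin=task/cgroup
  # Node has 128 GB of physical RAM; advertise a bit less to Slurm.
  NodeName=node001 CPUs=16 RealMemory=120000

  # cgroup.conf (fragment)
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=yes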
Comment 4 Davide Vanzo 2016-10-27 13:51:38 MDT
Tim,
I apologize. I forgot to mention that we are already using cgroups. 
This request arises from a user who is trying to adapt a series of Torque scripts to work with Slurm. In Torque the default behavior was to kill the whole job whenever a process used more memory than what was allocated for the job. That is because Torque was not using cgroups but simply polling the system to check memory usage and killing the job if needed. With Slurm and cgroups, only the offending process gets killed by the OOM killer while the job stays alive. Surely this could be solved by using "set -e", but that will not discriminate between processes killed because of insufficient resources and ordinary runtime errors.
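
In the meantime, one workaround we could use instead of a blanket "set -e" is to check each step's exit status: srun typically returns 128 plus the signal number when a task is killed by a signal (e.g. 137 for the SIGKILL sent by the OOM killer), while an ordinary runtime error returns the program's own exit code. A rough sketch (the step command is a placeholder):

  srun ./step_a
  rc=$?
  if [ "$rc" -gt 128 ]; then
      # Killed by a signal (128 + signal number), e.g. 137 = SIGKILL from the OOM killer.
      echo "step_a killed by signal $((rc - 128)); aborting job" >&2
      exit "$rc"
  elif [ "$rc" -ne 0 ]; then
      # Ordinary runtime failure; handle or continue as appropriate.
      echo "step_a exited with code $rc" >&2
  fi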

Davide