Ticket 6911 - user-provided epilog does not always run
Summary: user-provided epilog does not always run
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other (show other tickets)
Version: 18.08.7
Hardware: Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-04-25 15:11 MDT by Alex Chekholko
Modified: 2019-04-25 15:29 MDT (History)
0 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Alex Chekholko 2019-04-25 15:11:49 MDT
First, I would like to confirm whether we correctly understand the intended behavior.  My expectation is that the epilog script gets run no matter what happens to the job (fails, canceled, timeout, etc). Is that the case, or are there corner cases?

Here is my specific issue. I considered writing to the mailing list, but since we expect the obvious behavior and something else happens, this may be a bug rather than just a misunderstanding on our side.

My OS is Ubuntu 18.04.2 LTS and my Slurm is 18.08.7, built by me with no modifications.

The end goal is to have the user clean up his own temp files in his own epilog script, to make the job more portable between clusters. In his case, the epilog contains some "rm -rf" commands, which can be slow.
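As a sketch, a cleanup epilog of that kind might look like the following. The $SCRATCH_BASE layout and the job-$SLURM_JOB_ID directory naming are assumptions for illustration, not anything Slurm defines; SLURM_JOB_ID itself is set by Slurm in the epilog's environment:

```shell
#!/bin/bash
# Hypothetical cleanup epilog. Assumes per-job temp files live under
# $SCRATCH_BASE/job-$SLURM_JOB_ID; adjust both to your cluster's layout.
SCRATCH_BASE="${SCRATCH_BASE:-/tmp/scratch}"
job_tmp="$SCRATCH_BASE/job-${SLURM_JOB_ID:-unknown}"

# Guard: never rm -rf anything outside the scratch base.
case "$job_tmp" in
  "$SCRATCH_BASE"/job-*)
    rm -rf -- "$job_tmp"
    echo "removed $job_tmp"
    ;;
  *)
    echo "refusing to remove $job_tmp" >&2
    exit 1
    ;;
esac
```

The guard matters because the epilog runs with the job's environment, and an unset or mangled variable should not turn the cleanup into an rm -rf of something important.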

Here is my simple user epilog script:


$ cat example-user-epilog.sh 
#!/bin/bash

# table is https://slurm.schedmd.com/prolog_epilog.html
echo "inside my own epilog"
printenv | grep SLURM


Here is my simple user job script:


$ cat example-user-sbatch.sh 
#!/bin/bash

echo "starting my job"

#first and only task inside my job
srun --epilog=/home/alex/example-user-epilog.sh sleep 600


What would be the cases where this epilog script would not run?

I tried just running the job so it completes normally; I tried running it with a short time limit so it gets killed by timeout; I tried scancel to cancel the job; and I also tried killing my sleep command on the node so the job fails. That's four distinct cases, and it seemed to work OK in all of them:

the SLURM env vars printed by the epilog script show up in the job's output file.


However, if I now amend the epilog script to include a sleep command, it seems to get killed halfway through.


$ cat example-user-epilog.sh 
#!/bin/bash

# table is https://slurm.schedmd.com/prolog_epilog.html
echo "inside my own epilog"
sleep 10
printenv | grep SLURM


I thought maybe there was some epilog timeout, but PrologEpilogTimeout is at its default in my case.

alex@cb-login:~/test-nate-srun2$ scontrol show config |grep -i time
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2019-04-25T09:03:20
EioTimeout              = 60
EpilogMsgTime           = 2000 usec
GetEnvTimeout           = 2 sec
GroupUpdateTime         = 600 sec
KeepAliveTime           = SYSTEM_DEFAULT
LogTimeFormat           = iso8601_ms
MessageTimeout          = 10 sec
OverTimeLimit           = 0 min
PrologEpilogTimeout     = 65534
ResumeTimeout           = 60 sec
SchedulerTimeSlice      = 30 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 300 sec
SuspendTime             = NONE
SuspendTimeout          = 30 sec
TCPTimeout              = 2 sec
UnkillableStepTimeout   = 60 sec
WaitTime                = 0 sec


Looking through the slurmd logs on the compute nodes, I sometimes see a message like
[2019-04-25T05:54:52.241] epilog for job 17280 ran for 12 seconds
so I'm guessing that gets reported for epilogs that run "long".


But with the sleep added, the epilog seems to get killed partway through, or not run at all.

How can I troubleshoot further?
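One way I could narrow this down (a sketch; the log path and the steps are assumptions) is an epilog that appends a timestamped line after each step to a node-local file, so a killed epilog at least leaves a record of how far it got:

```shell
#!/bin/bash
# Instrumented epilog: log each step with a timestamp to a node-local file,
# so if the epilog is killed mid-run the log shows the last step reached.
# The log path is an assumption; use any location writable on the node.
LOGFILE="${LOGFILE:-/tmp/epilog-trace.log}"

log() { echo "$(date '+%F %T') job=${SLURM_JOB_ID:-?} $*" >>"$LOGFILE"; }

log "epilog start"
sleep 1            # stand-in for the slow cleanup work
log "after sleep"
log "epilog end"
```

Comparing the last line logged against the slurmd log timestamps should show whether the epilog is being killed, and roughly when.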

In case this is some kind of cgroups thing, I do have 
Delegate=yes
in my slurmd systemd unit file.
Comment 1 Jacob Jenson 2019-04-25 15:19:40 MDT
Alex,

These types of questions are handled by the SchedMD Slurm support team. Before this question can be routed to the support team we need to verify you work at a site with a support contract. Can you please let me know which site you work for? 

Thanks,
Jacob
Comment 2 Alex Chekholko 2019-04-25 15:23:53 MDT
Hi,

I'm at Calico Labs (calicolabs.com) and we do not have a support contract.

Should I direct my question to the users mailing list instead?

Regards,
Alex

Comment 3 Jacob Jenson 2019-04-25 15:29:49 MDT
Alex,

If you would like SchedMD to help with this question then a support contract is required. 

Thanks,
Jacob