First, I would like to confirm whether we correctly understand the intended behavior. My expectation is that the epilog script gets run no matter what happens to the job (fails, canceled, timeout, etc). Is that the case, or are there corner cases?

Here is my specific issue; I thought about writing to the mailing list but it seems like we expect the obvious behavior but something else happens, so maybe it's a bug and not just a misunderstanding on our side.

My OS is Ubuntu 18.04.2 LTS and my SLURM is 18.08.7 built by me with no modifications.

The end goal is to get the user to clean up his own temp files in his own epilog script to make the job more portable between clusters. So in his case, the epilog has some "rm -rf" in it which can be slow.

Here is my simple user epilog script:

$ cat example-user-epilog.sh
#!/bin/bash
# table is https://slurm.schedmd.com/prolog_epilog.html
echo "inside my own epilog"
printenv | grep SLURM

Here is my simple user job script:

$ cat example-user-sbatch.sh
#!/bin/bash
echo "starting my job"
#first and only task inside my job
srun --epilog=/home/alex/example-user-epilog.sh sleep 600

What would be the cases where this epilog script would not run?

I tried just running the job so it completes normally; I tried running it with a short timelimit so it gets canceled by timeout, I tried scancel to cancel the job, and I also tried just killing my sleep command on the node so the job fails. So that's four distinct cases.

It seemed to work OK: You can see the SLURM env vars in your output file, from the epilog script.

However, if I now amend the epilog script to include a sleep command, it seems to get killed half-way through.

$ cat example-user-epilog.sh
#!/bin/bash
# table is https://slurm.schedmd.com/prolog_epilog.html
echo "inside my own epilog"
sleep 10
printenv | grep SLURM

I thought maybe there was some epilog timeout but specifically the PrologEpilogTimeout is set to the default in my case.
alex@cb-login:~/test-nate-srun2$ scontrol show config |grep -i time
BatchStartTimeout = 10 sec
BOOT_TIME = 2019-04-25T09:03:20
EioTimeout = 60
EpilogMsgTime = 2000 usec
GetEnvTimeout = 2 sec
GroupUpdateTime = 600 sec
KeepAliveTime = SYSTEM_DEFAULT
LogTimeFormat = iso8601_ms
MessageTimeout = 10 sec
OverTimeLimit = 0 min
PrologEpilogTimeout = 65534
ResumeTimeout = 60 sec
SchedulerTimeSlice = 30 sec
SlurmctldTimeout = 300 sec
SlurmdTimeout = 300 sec
SuspendTime = NONE
SuspendTimeout = 30 sec
TCPTimeout = 2 sec
UnkillableStepTimeout = 60 sec
WaitTime = 0 sec

Looking through the slurmd logs on the compute nodes, I sometimes see a message like

[2019-04-25T05:54:52.241] epilog for job 17280 ran for 12 seconds

so I'm guessing that gets reported for epilogs which run "long". But in this case they seem to get killed or not run at all.

How can I troubleshoot further?

In case this is some kind of cgroups thing, I do have Delegate=yes in my slurmd systemd unit file.
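
One possible way to narrow this down, offered only as a sketch: an epilog variant that writes timestamped progress to a log file and records any trappable signal it receives. The log path and the EPILOG_LOG and EPILOG_SLEEP variables are assumptions made up for illustration, not SLURM settings. The delay defaults to 1 second so the script finishes quickly when tried by hand; set EPILOG_SLEEP=10 to mimic the sleep in the amended epilog above.

```shell
#!/bin/bash
# Hypothetical diagnostic epilog. EPILOG_LOG and EPILOG_SLEEP are
# illustrative names, not variables that SLURM sets or reads.
LOG="${EPILOG_LOG:-${TMPDIR:-/tmp}/epilog-debug.log}"

log() {
    # One timestamped line per event; SLURM_JOB_ID is unset outside a job.
    printf '%s job=%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "${SLURM_JOB_ID:-none}" "$1" >> "$LOG"
}

# Record trappable signals before exiting, so a kill shows up in the log.
trap 'log "caught SIGTERM"; exit 143' TERM
trap 'log "caught SIGHUP"; exit 129' HUP
trap 'log "caught SIGINT"; exit 130' INT

log "epilog start"
# Sleep in the background and wait on it, so a trap fires immediately
# instead of only after the sleep has finished.
sleep "${EPILOG_SLEEP:-1}" &
wait $!
log "epilog end"
```

If the log shows "epilog start" followed by a "caught" line, the script was signaled part-way through. If it shows "epilog start" and nothing else, one possibility is SIGKILL, which cannot be trapped. No log entry at all would suggest the script was never started.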
Alex,

These types of questions are handled by the SchedMD Slurm support team. Before this question can be routed to the support team we need to verify you work at a site with a support contract. Can you please let me know which site you work for?

Thanks,
Jacob
Hi,

I'm at Calico Labs (calicolabs.com) and we do not have a support contract. Should I direct my question to the users mailing list instead?

Regards,
Alex
Alex,

If you would like SchedMD to help with this question then a support contract is required.

Thanks,
Jacob