Bug 9715 - signal delivery to batch step not working
Summary: signal delivery to batch step not working
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.02.4
Hardware: Linux
OS: Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-09-01 11:31 MDT by Troy Baer
Modified: 2020-09-03 15:57 MDT
CC List: 2 users

See Also:
Site: Ohio State OSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Troy Baer 2020-09-01 11:31:09 MDT
Some of our support staff reported that setting traps in Slurm jobs was not working for them.  In the course of debugging, we discovered that any sort of signal sent to the batch step of a job, using either --signal=B:[...] or scancel --batch --signal=[...], never seems to be delivered.

For instance, consider the following job script:
-----
#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
  exit
}

trap my_handler USR1
trap my_handler TERM

sleep 3600
-----

If I submit and run this job script, it only produces the following output:

troy@pitzer-login04:~/Beowulf/slurm/signal-delivery$ more minimal_trap.18491.log
slurmstepd: error: *** JOB 18491 ON p0614 CANCELLED AT 2020-09-01T12:55:14 DUE TO TIME LIMIT ***

Note that the echo from the signal handler does not appear in the output.  The file created by the touch command in the signal handler does not exist either, leading me to suspect that the handler is never invoked.

Similarly, if I submit this job script, wait for it to start running, and then immediately do an scancel --batch --signal=USR1 <jobid>, there is no evidence that the signal handler fired:

troy@pitzer-login04:~/Beowulf/slurm/signal-delivery$ sbatch minimal_trap.job
Submitted batch job 18517

troy@pitzer-login04:~/Beowulf/slurm/signal-delivery$ scancel --batch --signal=USR1 18517

troy@pitzer-login04:~/Beowulf/slurm/signal-delivery$ more minimal_trap.18517.log
slurmstepd: error: *** JOB 18517 ON p0614 CANCELLED AT 2020-09-01T13:19:14 DUE TO TIME LIMIT ***

Other tests we have done have shown that #SBATCH --signal=<spec> directives that do not include B: work as expected, so this appears to be specific to the batch steps of jobs.
Comment 2 Michael Hinton 2020-09-01 16:25:38 MDT
Hi Troy,

You need to do the following:

    sleep 3600 &
    wait

From https://slurm.schedmd.com/scancel.html:

"Note that most shells cannot handle signals while a command is running (child process of the batch step), the shell [needs to] use `wait` [to] wait until the command ends to then handle the signal."

Thanks,
-Michael
Comment 3 Troy Baer 2020-09-02 09:26:58 MDT
Thanks Michael, I've verified that the background+wait trick works.

The whole reason this came up is that "trap 'handler_cmd' TERM" is a common job script pattern that our users have been using for years in TORQUE, and we were a bit surprised to discover that it doesn't seem to work in Slurm.  Do you know why there is a difference in behavior with that?
Comment 4 Michael Hinton 2020-09-02 10:13:51 MDT
(In reply to Troy Baer from comment #3)
> The whole reason this came up is that "trap 'handler_cmd' TERM" is a common
> job script pattern that our users have been using for years in TORQUE, and
> we were a bit surprised to discover that it doesn't seem to work in Slurm. 
> Do you know why there is a difference in behavior with that?
I'm not entirely sure. But I do notice that if I do this:

    scancel --full --signal=USR1 <jobid>

then the batch script indeed does catch the signal, even when executing a blocking `sleep 3600`. I haven't looked at the code, but my guess is that --batch doesn't know how to route the signal to just the batch script process when there is a blocking child process in the foreground, whereas --full sends a signal to the batch process and all its children, so it doesn't have to be selective about it. This might be an area where we could improve Slurm, if it's possible.
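For anyone reproducing this, the side-by-side test (reusing the minimal_trap.job script from the description) would look something like:

    sbatch minimal_trap.job                  # note the job id
    scancel --batch --signal=USR1 <jobid>    # only the batch shell is signaled; nothing visible happens
    scancel --full --signal=USR1 <jobid>     # the shell and its children are signaled; the trap fires

A plausible reason --full works even with a blocking sleep: the foreground sleep receives the USR1 too and exits (its default disposition is to terminate), which hands control back to the shell, and the shell then runs its pending trap handler.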
Comment 5 Troy Baer 2020-09-02 10:35:12 MDT
Having a way to specify the equivalent behavior of scancel --full --signal=<signal> with sbatch --signal would be quite useful IMHO.
Comment 6 Michael Hinton 2020-09-02 10:39:41 MDT
(In reply to Troy Baer from comment #5)
> Having a way to specify the equivalent behavior of scancel --full
> -signal=<signal> with sbatch --signal would be quite useful IMHO.
I'm surprised by that asymmetry. That does seem useful. Feel free to open an enhancement ticket to address this.
Comment 7 Michael Hinton 2020-09-02 16:39:29 MDT
TLDR:
====================
So after playing around with Slurm and looking at the code, I think I understand what is going on. Your users should probably be using `scancel --full` instead of `scancel --batch` to get the same behavior as TORQUE.

Explanation:
====================
Let's say I put this in my batch script:

sleep 3600 &
sleep 3600 &
sleep 3600
wait

Here's what the process tree looks like:

$ pstree -p | grep "step\|sleep\|slurm_script"
           |-slurmstepd(947812)-+-slurm_script(947817)-+-sleep(947818)
           |                    |                      |-sleep(947819)
           |                    |                      `-sleep(947820)
           |                    |-{slurmstepd}(947813)
           |                    |-{slurmstepd}(947814)
           |                    |-{slurmstepd}(947815)
           |                    `-{slurmstepd}(947816)

--batch tells the stepd to kill() slurm_script(947817), whereas --full tells the stepd to killpg() slurm_script(947817), which also signals the 3 child sleeps.
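Outside of Slurm, you can reproduce that distinction with the shell's own kill builtin: a negative PID argument signals an entire process group, which is roughly what killpg() does. A rough sketch, using the PIDs from the pstree above and assuming slurm_script(947817) leads its own process group (which the killpg() call implies):

    # Signal only the batch script process (roughly what `scancel --batch` asks for):
    kill -USR1 947817

    # Signal the whole process group (the script plus its three sleeps),
    # roughly what `scancel --full` asks for; note the leading dash.
    kill -USR1 -- -947817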

The first two `sleep 3600 &` calls are likely implemented by the shell as fork()s, which return control back to slurm_script immediately after forking.

The foreground `sleep 3600` is also forked as a child, but the difference is that the shell then blocks and waits for it in the foreground. Bash does not run trap handlers while it is waiting for a foreground command; the handler is deferred until that command completes. So a signal sent only to the shell process (which is what --batch does) just sits there as a pending trap, and since the foreground sleep never finishes before the job's time limit, the handler never gets a chance to run.

If I changed my script to this:

sleep 3600 &
sleep 3600 &
sleep 3600 &
wait

Then the shell process is sitting in the `wait` builtin rather than waiting on a foreground command. Unlike a foreground command, `wait` is interrupted when a trapped signal arrives, so the shell is free to run the custom signal handler right away (though pstree will look the same).

Here’s another example of this shell quirk, outside of Slurm (in Bash on Ubuntu). Let’s say I execute this shell script in my terminal:

$ cat ./9715-2.sh  
#!/bin/bash
function my_handler() {
       echo "Catching signal USR1"
}
trap my_handler USR1
sleep 3600

$ ./9715-2.sh

In another terminal, I do this:

$ pstree -p | grep sleep
           |                |               |-bash(2279)---bash(949501)---sleep(949502)
$ kill -s SIGUSR1 949501

What do you think will happen?

It turns out that nothing happens:
$ pstree -p | grep sleep
           |                |               |-bash(2279)---bash(949501)---sleep(949502)

What about this? 
$ kill -s SIGUSR2 949501

It terminates the parent bash process (949501), and the child sleep (949502) gets orphaned and put under the init process (systemd):

$ pstree -p | grep sleep
          |-sleep(949502)

The difference between these two scenarios is that a trapped signal has to be handled by the shell itself, and bash defers trap handlers while a foreground command is running. If no handler is registered, the default Linux disposition applies, and the kernel acts on it immediately, regardless of whether the shell is busy with a foreground command or not. So the trapped SIGUSR1 just sat pending (it would have fired once the sleep finished), while the untrapped SIGUSR2 made the OS kill the shell process on the spot.
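To see that "pending" behavior directly, here is a hypothetical variant of 9715-2.sh with a short foreground sleep, so the deferred trap becomes visible once the sleep ends:

#!/bin/bash
# Variant of 9715-2.sh with a short foreground sleep, so the deferred
# trap can be observed running after the foreground command finishes.
function my_handler() {
    echo "Catching signal USR1"
}
trap my_handler USR1
echo "my PID is $$"
sleep 30

Send `kill -s SIGUSR1 <that PID>` from another terminal while the sleep is still running: nothing is printed at first, but once the 30 seconds are up bash runs the pending handler and "Catching signal USR1" appears. In other words, the signal is deferred by the shell, not thrown away.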
Comment 9 Michael Hinton 2020-09-03 15:57:04 MDT
Hopefully that satisfies you. I'll go ahead and close this out as info given.

Thanks!
-Michael