Ticket 10196 - Some preempted jobs show up as failed
Summary: Some preempted jobs show up as failed
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.4
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-11-11 09:09 MST by Michael Robbert
Modified: 2021-03-16 13:39 MDT
CC List: 5 users

See Also:
Site: Colorado School of Mines
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Submit script from LAMMPS example (807 bytes, text/plain) - 2020-11-11 10:08 MST, Michael Robbert
Submit script for mt3d-usgs example (429 bytes, text/plain) - 2020-11-11 10:10 MST, Michael Robbert
slurm.conf (3.77 KB, text/plain) - 2020-11-17 12:36 MST, Michael Robbert
QoS configuration (1.83 KB, text/plain) - 2020-11-17 12:37 MST, Michael Robbert
show nodes output (63.86 KB, text/plain) - 2020-11-17 12:39 MST, Michael Robbert
Accounting data for JobID 4874366 (5.00 KB, text/plain) - 2020-11-17 12:39 MST, Michael Robbert
slurmd.log from preemption that got logged as cancelled (256.39 KB, text/plain) - 2020-11-23 13:15 MST, Michael Robbert
Preempted job that shows as preempted (30.35 KB, text/plain) - 2020-11-23 13:25 MST, Michael Robbert
New slurmd.log from preempted job that shows cancelled (35.99 KB, text/plain) - 2020-12-02 17:58 MST, Michael Robbert
slurmd.log from sbatch wrap with debug2 (130.08 KB, text/plain) - 2020-12-03 11:32 MST, Michael Robbert
slurmctld.log from sbatch wrap with debug2 (6.19 MB, text/plain) - 2020-12-03 11:41 MST, Michael Robbert
Output of "scontrol show config" (7.83 KB, text/plain) - 2020-12-14 16:07 MST, Michael Robbert
slurmd running at debug3 (140.08 KB, text/plain) - 2020-12-15 15:17 MST, Michael Robbert
patch for 20.02 (WIP) (876 bytes, patch) - 2020-12-16 11:55 MST, Nate Rini

Description Michael Robbert 2020-11-11 09:09:19 MST
We have discovered that some jobs on our cluster are being preempted, but their output and our accounting records show the jobs as failed or cancelled. This took some time to figure out: we had no reason to suspect that Slurm was misreporting why the jobs ended, so we assumed there was a problem with the jobs themselves. Other jobs on the system do properly report preemption in their output and accounting, so this isn't system wide. Right now I believe the problem is application specific; I have been able to replicate it in two applications so far. The way we reproduce it is to take a job (same application, same inputs) that is known to complete when run in a non-preemptable QoS, run it in a preemptable QoS, and then start some dummy jobs in a preemptor QoS that request the same node.
The two applications where we've seen this are mt3d-usgs (a serial Fortran code) and LAMMPS.
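(For reference, a minimal command-level sketch of the reproduction described above; the QoS names, script name, and node are placeholders, not actual site configuration:)

# submit the known-good job under a preemptable QoS
sbatch --qos=preemptable job_script.sh
# once it is running, submit a dummy job under a preempting QoS that requests the same node
sbatch --qos=preemptor --nodelist=<node> --wrap="sleep 600"
# then check how Slurm recorded the first job's end state
sacct -j <preempted_jobid> --format=JobID,JobName,State,ExitCode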

We haven't turned on any extra debugging yet. Unless this is a known issue, I'm sure you'll want to see something, so just let me know exactly what you want; for now, here is a snapshot of what we're seeing with our standard logging settings.

slurmctld.log messages from the job exit:
[2020-11-11T08:28:47.335] _job_complete: JobId=4874366 WEXITSTATUS 143
[2020-11-11T08:28:47.336] _job_complete: JobId=4874366 done

slurmd.log from the node where the job was running and around the time of exit:
[2020-11-11T08:28:47.075] [4874366.batch] debug:  Handling REQUEST_STEP_UID
[2020-11-11T08:28:47.075] debug:  _rpc_signal_tasks: sending signal 18 to all steps job 4874366 flag 4
[2020-11-11T08:28:47.088] [4874366.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-11T08:28:47.088] [4874366.extern] debug:  _handle_signal_container for step=4874366.4294967295 uid=390 signal=18
[2020-11-11T08:28:47.088] [4874366.extern] Sent signal 18 to 4874366.4294967295
[2020-11-11T08:28:47.098] [4874366.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-11T08:28:47.098] [4874366.0] debug:  _handle_signal_container for step=4874366.0 uid=390 signal=18
[2020-11-11T08:28:47.098] [4874366.0] Sent signal 18 to 4874366.0
[2020-11-11T08:28:47.110] [4874366.batch] debug:  Handling REQUEST_STEP_UID
[2020-11-11T08:28:47.110] debug:  _rpc_signal_tasks: sending signal 15 to all steps job 4874366 flag 4
[2020-11-11T08:28:47.121] [4874366.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-11T08:28:47.121] [4874366.extern] debug:  _handle_signal_container for step=4874366.4294967295 uid=390 signal=15
[2020-11-11T08:28:47.121] [4874366.extern] Sent signal 15 to 4874366.4294967295
[2020-11-11T08:28:47.122] [4874366.extern] debug:  step_terminate_monitor_stop signaling condition
[2020-11-11T08:28:47.133] [4874366.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-11T08:28:47.134] [4874366.0] debug:  _handle_signal_container for step=4874366.0 uid=390 signal=15
[2020-11-11T08:28:47.134] [4874366.0] error: *** STEP 4874366.0 ON c039 CANCELLED AT 2020-11-11T08:28:47 ***
[2020-11-11T08:28:47.134] [4874366.0] Sent signal 15 to 4874366.0

Accounting data for the job:
[mrobbert@wendian001 scratch]$ sacct -j 4874366
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4874366          mrLMPS    compute   hpcgroup         18     FAILED     15:0
4874366.bat+      batch              hpcgroup         18     FAILED     15:0
4874366.ext+     extern              hpcgroup         18  COMPLETED      0:0
4874366.0       lmp_mpi              hpcgroup         18  CANCELLED     0:15

Is there any explanation for this? If not, what debugging do you want to see?

Thanks,
Mike
Comment 1 Nate Rini 2020-11-11 10:04:37 MST
Please attach your current slurm.conf (& friends) along with the following:
> sacctmgr show -p qos
> scontrol show nodes
> sacct -j 4874366 -p -o all

(In reply to Michael Robbert from comment #0)
> Right now I believe the problem is application specific, but we
> have seen it and I've been able to replicate it in 2 applications so far.
Is it possible to get the submission script (assuming they are sbatch jobs) for these jobs too?
Comment 2 Michael Robbert 2020-11-11 10:08:25 MST
Created attachment 16601 [details]
Submit script from LAMMPS example
Comment 3 Michael Robbert 2020-11-11 10:10:27 MST
Created attachment 16602 [details]
Submit script for mt3d-usgs example

In the mt3d-usgs case 'runModel' is a bash script with the following contents:

#!/bin/bash

./mf6
./mt3dusgs baseModelTran.nam
Comment 5 Michael Robbert 2020-11-17 08:27:45 MST
Any updates or thoughts on what might be causing this issue at our site?

Mike
Comment 6 Broderick Gardner 2020-11-17 12:33:57 MST
We need more information, namely
> slurm.conf
> sacctmgr show -p qos
> scontrol show nodes
> sacct -j 4874366 -p -o all

Thanks
Comment 7 Michael Robbert 2020-11-17 12:36:30 MST
Created attachment 16693 [details]
slurm.conf
Comment 8 Michael Robbert 2020-11-17 12:37:47 MST
Created attachment 16694 [details]
QoS configuration
Comment 9 Michael Robbert 2020-11-17 12:39:11 MST
Created attachment 16695 [details]
show nodes output
Comment 10 Michael Robbert 2020-11-17 12:39:59 MST
Created attachment 16696 [details]
Accounting data for JobID 4874366
Comment 11 Michael Robbert 2020-11-17 12:40:55 MST
I've added all of the requested output as attachments.

Mike
Comment 12 Michael Robbert 2020-11-23 09:45:13 MST
Any further updates on this issue?
Comment 13 Broderick Gardner 2020-11-23 10:08:42 MST
Thank you for the information. I have done some basic tests to try to reproduce your issue, and so far I have not been able to.

Can you attach the slurmd log covering two full jobs that get preempted:
one that reproduces the issue and one that does not. Please make a note of which job ID is which.

I'm looking into why the log message you already posted does not say it was cancelled due to preemption. Here is one from my tests:
error: *** JOB 2712 ON node1 CANCELLED AT 2020-11-17T13:34:27 DUE TO PREEMPTION ***

This is with qos preemption, PreemptMode=REQUEUE, JobRequeue=0 
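(For context, qos preemption with those settings corresponds roughly to the following slurm.conf lines; this is a generic sketch, not this site's actual configuration:)

PreemptType=preempt/qos
PreemptMode=REQUEUE
JobRequeue=0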

Thanks
Comment 14 Michael Robbert 2020-11-23 13:15:52 MST
Created attachment 16787 [details]
slurmd.log from preemption that got logged as cancelled

Here is the first part of what you're asking for, but now I'm having a problem finding a job that actually shows a state of preempted. I know that I've seen this in our accounting records in the past, but I'll need to do some research to find a job that I can try to replicate. The job in this log was HPL. I've also tried a job that just calls sleep 600, and even that is showing a state of cancelled.
Comment 15 Michael Robbert 2020-11-23 13:25:51 MST
Created attachment 16788 [details]
Preempted job that shows as preempted.

I spoke too soon in my last post. My final test did correctly show up as preempted in the accounting logs. I was moving slurmd.log and restarting the daemon between each test, so each log should only show the preempted job and the preemptor job for that test.
This test was done with just "sleep 600". I will note that running "srun sleep 600" still resulted in the job being recorded as cancelled.
Comment 16 Michael Robbert 2020-12-01 11:50:37 MST
Any thoughts on the logs that I sent last week?
I just did a few more tests today that appear to confirm the pattern: any job running a step under srun that gets preempted shows up as cancelled, while any job whose step is not started with srun shows up correctly as preempted, both in the logs and in the accounting records.

I have tried (a rough sketch follows the list):
srun sleep 600 vs sleep 600
and
srun xhpl vs mpirun xhpl
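(A sketch of those comparisons as submissions, using the "full" preemptable QoS that appears later in this ticket; the script names are hypothetical:)

sbatch --qos=full --wrap="srun sleep 600"   # step launched with srun: recorded as CANCELLED/FAILED
sbatch --qos=full --wrap="sleep 600"        # no srun step: recorded as PREEMPTED
sbatch --qos=full hpl_srun.sh               # hypothetical script running "srun xhpl"
sbatch --qos=full hpl_mpirun.sh             # hypothetical script running "mpirun xhpl"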

Let me know if you'd like to see any of those logs or if there are any other tests that you'd like me to run.

Mike
Comment 17 Broderick Gardner 2020-12-02 17:46:51 MST
I'm still looking into the problem. Could you send the slurmd.log for a preempted job that ends up failed again? A simple job, like an srun sleep that you found still showing as failed. Also include the slurmctld.log for the same period.

Also post the sacct output for the job.

Tip: tail -f <logfile> | tee save.log
That way you don't have to keep moving the log file.

I still haven't found why it's cancelling the job normally and not due to preemption.

Thanks
Comment 18 Michael Robbert 2020-12-02 17:58:24 MST
Created attachment 16936 [details]
New slurmd.log from preempted job that shows cancelled.

JobId 5131745 submitted with preemptable QoS inside a reservation. It ran "srun sleep 600". Then I submitted 5131747 with a preemptor QoS.

[mrobbert@wendian001 scratch]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5131745           sleep    compute   hpcgroup         36     FAILED     15:0
5131745.bat+      batch              hpcgroup         36     FAILED     15:0
5131745.ext+     extern              hpcgroup         36  COMPLETED      0:0
5131745.0         sleep              hpcgroup          1  CANCELLED     0:15
5131747           sleep    compute   hpcgroup         36    RUNNING      0:0
5131747.bat+      batch              hpcgroup         36    RUNNING      0:0
5131747.ext+     extern              hpcgroup         36    RUNNING      0:0
5131747.0         sleep              hpcgroup          1    RUNNING      0:0

Logs from slurmctld.log:

[2020-12-02T17:52:44.187] _slurm_rpc_submit_batch_job: JobId=5131745 InitPrio=10425 usec=1399
[2020-12-02T17:52:45.123] sched: Allocate JobId=5131745 NodeList=c042 #CPUs=36 Partition=compute
[2020-12-02T17:52:45.171] prolog_running_decr: Configuration for JobId=5131745 is complete
[2020-12-02T17:53:03.744] _slurm_rpc_submit_batch_job: JobId=5131746 InitPrio=106530 usec=1804
[2020-12-02T17:53:03.898] _slurm_rpc_submit_batch_job: JobId=5131747 InitPrio=110425 usec=1286
[2020-12-02T17:53:04.217] sched: Allocate JobId=5131746 NodeList=c039 #CPUs=1 Partition=compute
[2020-12-02T17:53:04.268] prolog_running_decr: Configuration for JobId=5131746 is complete
[2020-12-02T17:53:04.319] _job_complete: JobId=5131745 WEXITSTATUS 143
[2020-12-02T17:53:04.320] _job_complete: JobId=5131745 done
[2020-12-02T17:53:08.092] backfill: Started JobId=5131747 in compute on c042
[2020-12-02T17:53:09.959] prolog_running_decr: Configuration for JobId=5131747 is complete
Comment 19 Broderick Gardner 2020-12-03 11:07:56 MST
This is very strange. I'm not sure why the batch step is returning 15. The other 15 is expected (sort of): step 0 is killed by SIGTERM. But the number before the ':' is the process exit code, so the batch step is actually exiting with status 15 rather than being killed by a signal.
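(For reference: sacct reports ExitCode as <return code>:<signal>, so 15:0 means the process exited with status 15, while 0:15 means it was terminated by signal 15, SIGTERM. The "WEXITSTATUS 143" seen earlier in slurmctld.log follows the usual shell convention of 128 + 15 = 143 when a command is terminated by SIGTERM.)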

Can you reproduce it with this:
sbatch --qos=<low prio> --wrap="srun sleep 600"

If it takes more than that to reproduce, please include everything that is required to run the reproducing job.

Nothing appears wrong with your slurm.conf. It doesn't appear to be related to error codes or catching signals. 

Please increase the debug level of slurmctld to debug2 and do it again, posting the slurmctld.log covering the job.
Comment 20 Michael Robbert 2020-12-03 11:32:19 MST
Created attachment 16952 [details]
slurmd.log from sbatch wrap with debug2
Comment 21 Michael Robbert 2020-12-03 11:41:42 MST
Created attachment 16953 [details]
slurmctld.log from sbatch wrap with debug2

I've attached slurmd.log and slurmctld.log, both with debug2 set.
The preempted JobID is 5151474 and the preemptor was 5151500.
You may notice that there is also a JobID 5151481 that I submitted to the same reservation. I quickly realized that it didn't trigger preemption because two single-core jobs were able to fit on the node, so I cancelled that job and submitted a new one with the exclusive flag. Here are the full commands that I used to submit:

sbatch --qos=full --reservation=TestPreempt --wrap="srun sleep 600"
sbatch --qos=hpc --exclusive --reservation=TestPreempt --wrap="srun sleep 600"

The reservation that I used is restricted to the one node (c042). I am pretty sure that the problem happens without using a reservation so if you'd like me to reproduce outside of that I can. It would just take a little more work.

Mike
Comment 22 Michael Robbert 2020-12-09 09:34:30 MST
Did my most recent debug logs show you anything useful? Are we any closer to figuring out what might be causing this? 
I don't have any definite plans yet, but we are due for a full outage for upgrades on this cluster, and I'll probably update Slurm to 20.11 at that time. If you'd like me to try any system-level changes, that would be a good time to do it.

Mike
Comment 23 Michael Robbert 2020-12-14 11:24:33 MST
It has been almost two weeks since I've heard any response on the debugging data that I sent you. Can I please get this case escalated?

Thanks,
Mike
Comment 24 Jason Booth 2020-12-14 11:29:47 MST
Hi Mike - This was brought to my attention today. I will have Nathan from our team assist with this issue. Please be patient while he reviews this bug.
Comment 26 Nate Rini 2020-12-14 15:40:16 MST
(In reply to Michael Robbert from comment #21)
> Created attachment 16953 [details]
> slurmctld.log from sbatch wrap with debug2

Is there any reason that this test would have lost logs (possibly due to Syslog)?
Comment 28 Michael Robbert 2020-12-14 15:49:12 MST
I don't think so; we're logging directly to the file, not through syslog:

SlurmctldLogFile=/var/log/slurm/slurmctld.log

That file is on a local disk so nothing remote.
Is there something you're seeing that would make you think that something didn't get logged?

Mike
Comment 29 Nate Rini 2020-12-14 15:59:41 MST
(In reply to Michael Robbert from comment #28)
> That file is on a local disk so nothing remote.
> Is there something you're seeing that would make you think that something
> didn't get logged?

When jobs are pre-empted, they usually log that as an info() on slurmctld. However, there is one other code path that doesn't log, which we suspect your jobs have been using. If the logs are lossy, then it's hard to rule out the other code paths.

Please provide the job submission script and the reservation info of the test job in comment #21. Having the QoS configuration would also be helpful.

> sacctmgr show qos -p
> scontrol show res TestPreempt
> scontrol show config
Comment 30 Michael Robbert 2020-12-14 16:07:43 MST
Created attachment 17148 [details]
Output of "scontrol show config"
Comment 31 Michael Robbert 2020-12-14 16:12:07 MST
I just attached the output of "scontrol show config", and our slurm.conf is already attached to the ticket. The QoS config is already attached to the ticket as "QoS configuration".
The reservation that was in place from those logs is no longer there, but I found the command that I used to create it:

scontrol create Reservationname=TestPreempt Nodes=c042 Start=Now Duration=7-00:00:00 users=mrobbert

I will note that we never use reservations for user jobs so this problem exists independent of reservations. I only used one in order to make it easier to reproduce without impacting the rest of the cluster. If your current thinking is that reservations are the cause then I'll put in the extra work needed to reproduce the error without any reservations in place.
Comment 41 Nate Rini 2020-12-14 22:22:26 MST
(In reply to Michael Robbert from comment #31)
> I will note that we never use reservations for user jobs so this problem
> exists independent of reservations. I only used one in order to make it
> easier to reproduce without impacting the rest of the cluster. If your
> current thinking is that reservations are the cause then I'll put in the
> extra work needed to reproduce the error without any reservations in place.

Just wanted to make sure a different code path was not in use. That confirms it.

Is it possible to rerun the same test as in comment #18 with slurmd set to debug3?

I currently suspect that this is caused by a race condition between different job steps reporting in, possibly due to CPU/network load while running the applications noted in comment #0.

Also, is it possible to get a copy of the PrologSlurmctld script? I'm mainly looking to see whether it could pause for a different amount of time for these job steps.
Comment 44 Michael Robbert 2020-12-15 15:17:10 MST
Created attachment 17167 [details]
slurmd running at debug3

In this example JobID 5299934 preempted JobID 5299930
Comment 45 Michael Robbert 2020-12-15 15:21:11 MST
Here is what we're doing in our slurmctld prolog:

[root@gengan ~]# cat /etc/slurm/prolog-slurmctld
#!/bin/bash
PREJOBS_DIR=/etc/slurm/ctld-prologs
# execute our prologs
for script in ${PREJOBS_DIR}/*; do
    if [ -x "${script}" ]; then
        source "${script}"
    fi
done


And here is the contents of the only script that is in /etc/slurm/ctld-prologs:

[root@gengan ~]# cat /etc/slurm/ctld-prologs/beeond.sh
#!/bin/bash
if [[ ! ${SLURM_JOB_CONSTRAINTS} =~ "beeond" ]]
then
        #User didn't ask for BeeOND so don't set it up
        exit 0
fi
# slurm strips out PATH during hooks execution
export PATH=$PATH:/usr/bin
NODEFILE=/beegfs/shared/jobs-nodelists/nodes-${SLURM_JOB_ID}
CLIENT_MNT=/mnt/beeond-${SLURM_JOB_ID}
JOBS_DATA_DIR=/localscratch/beeond-${SLURM_JOB_ID}
# setup beeond
#mkdir -p $(dirname ${NODEFILE})
scontrol show hostnames "${SLURM_JOB_NODELIST}" > "${NODEFILE}"
sudo /usr/bin/beeond start -n "${NODEFILE}" -d "${JOBS_DATA_DIR}" -c "${CLIENT_MNT}"



I am pretty sure nobody is using this, so the script should be exiting without doing anything. I'm willing to disable it completely if you think it would help.
Comment 50 Nate Rini 2020-12-16 11:55:46 MST
Created attachment 17185 [details]
patch for 20.02 (WIP)

Michael,

We believe we have recreated the issue locally.

I'm attaching a patch for testing/verification that should stop the job steps from getting marked as FAILED; they will instead get marked as CANCELLED (batch) or COMPLETED (extern). If you can try this on your system, it would allow us to confirm that we have recreated the same issue.

The patch can be applied to Slurm with
> git am /tmp/bug10196.WIP.2002.v1.CSM-ONLY.patch
and then follow the normal make install procedure.
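(A rough sketch of that procedure for a plain source build; the configure options are examples only and should match how Slurm was originally built:)

git am /tmp/bug10196.WIP.2002.v1.CSM-ONLY.patch
./configure --prefix=/usr --sysconfdir=/etc/slurm   # example options
make -j
sudo make install
# restart slurmctld so the patched code takes effect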

Thanks,
--Nate
Comment 51 Michael Robbert 2020-12-16 12:18:18 MST
We have Slurm installed with RPMs that we build from the SRPMs. Can I do a manual build from source and then just replace the slurmctld binary in order to test this?

I'm also curious if you plan to create a patch that will get the preempted jobs to show up as PREEMPTED rather than FAILED?

Mike
Comment 52 Nate Rini 2020-12-16 12:36:10 MST
(In reply to Michael Robbert from comment #51)
> We have Slurm installed with RPMS that we build from the SRPMS. Can I do a
> manual build from source and then just replace the slurmctld binary in order
> to test this?
Yes, but it would be easier to just grab the compiled binary out of the RPM build dir by passing this arg to rpmbuild:
> -D 'noclean 1'
Apply the patch and then rebuild (using make, not rpmbuild) in that directory. Slurm uses a lot of libraries, so changes to the build should work, but this approach avoids any odd linker issues.
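(A hedged sketch of that workflow; the tarball name and rpmbuild paths are assumptions and should be adjusted to the local build layout:)

rpmbuild -ta slurm-20.02.4.tar.bz2 -D 'noclean 1'       # keeps the BUILD tree after packaging
cd ~/rpmbuild/BUILD/slurm-20.02.4                       # assumed default %_topdir layout
patch -p1 < /tmp/bug10196.WIP.2002.v1.CSM-ONLY.patch    # or git am in a git checkout
make -j
# copy the rebuilt slurmctld from the build tree over the installed binary, then restart slurmctld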

> I'm also curious if you plan to create a patch that will get the preempted
> jobs to show up as PREEMPTED rather than FAILED?
This poses an interesting accounting question that we will need to discuss internally. Mainly, job steps can run and end before the job gets preempted, and those steps arguably should not get tagged as preempted. By tagging all steps as PREEMPTED, we could in effect allow steps that ran to completion to go uncharged at some sites. Most likely any change in this regard will be deferred to the 21.08 release. For now, the current patch corrects the steps incorrectly getting marked as FAILED.
Comment 53 Michael Robbert 2020-12-16 14:04:49 MST
So, I think that did what you expected. Here is the standard accounting output from a job I preempted after applying the patch:

5307146            wrap    compute   hpcgroup          1     FAILED     0:15
5307146.bat+      batch              hpcgroup          1  CANCELLED     0:15
5307146.ext+     extern              hpcgroup          1  COMPLETED      0:0
5307146.0         sleep              hpcgroup          1  CANCELLED     0:15

However, I still don't understand why that is the behavior anybody would want to see when a job is preempted. Can you explain why this is happening on this cluster with such a simple job?
For reference, this is not happening on our other cluster, so what is the important difference? That cluster is still running 19.05, uses partition-based preemption, and doesn't have the same prolog scripts. This is what we see there for a job that had a single job step started with srun:

11568103     CuPW_N2O_+    compute                     8  PREEMPTED      0:0
11568103.ba+      batch                                8  CANCELLED     0:15
11568103.ex+     extern                                8  COMPLETED      0:0
11568103.0          g16                                8  CANCELLED      0:6

The user also sees the reason in their output: 
error: *** STEP 11568103.0 ON compute214 CANCELLED AT 2020-12-16T13:51:16 DUE TO PREEMPTION ***
error: *** JOB 11568103 ON compute214 CANCELLED AT 2020-12-16T13:51:16 DUE TO PREEMPTION ***

Mike
Comment 54 Michael Robbert 2021-01-04 09:41:27 MST
Can I get an answer to my most recent question about what is the difference between clusters that is causing this problem to happen on some, but not others? I.e., what might I change if I wanted to get preempted jobs to show a status of preempted rather than cancelled or failed?
Comment 55 Nate Rini 2021-01-04 09:57:04 MST
(In reply to Michael Robbert from comment #54)
> Can I get an answer to my most recent question about what is the difference
> between clusters that is causing this problem to happen on some, but not
> others. 
Sure. Holidays are now over for most of us.

> i.e. What might I change if I wanted to get preempted jobs to show a
> status of preempted rather than cancelled or failed?
There is a race condition that causes the PREEMPTED state to get updated with CANCELLED/COMPLETED when the job step reports to the controller. This causes it to be inconsistent and we are working on a patch for it. There are some complications to ensure that steps that already completed are not overridden with the wrong state.

(In reply to Michael Robbert from comment #53)
> As a reference, on our other cluster this is not happening so what is the
> important difference. Our other cluster is still running 19.05 and is doing
> partition based preemption and it doesn't have the same prolog scripts and
> this is what we see for a job that had a single job step started with srun:
What is the exact version of that cluster?
> slurmctld -V
Comment 56 Michael Robbert 2021-01-04 10:39:53 MST
Our other cluster that isn't seeing this problem is running:

slurm 19.05.8

Is this a newish bug that all sites are seeing, at least with certain versions of Slurm? It seems like a big enough problem, yet I haven't seen it mentioned on the mailing list, so I assumed it was due to some strange local configuration. Also, it appeared to be difficult for you to replicate in house. What did you have to do in order to replicate it?
Comment 57 Broderick Gardner 2021-01-04 13:57:51 MST
As far as I can tell, we have not seen this with any other sites. So far, we have only partially replicated it:
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
2761               wrap    general        dev         20  PREEMPTED      0:0 
2761.batch        batch                   dev          2     FAILED     15:0 
2761.extern      extern                   dev         20  COMPLETED      0:0 
2761.0            sleep                   dev         10  CANCELLED     0:15

These are our steps (a command-level sketch follows the list):
Add a PrologSlurmctld that sleeps for some time (eg 25s)
Start a preemptible job and wait for it to start (eg 25s)
Use gdb to pause the batch step of the job
Start a new job that preempts the old one
Wait some time (5s is too short, 20s is enough)
Continue the batch step to let it be stopped by preemption
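(A rough command-level sketch of these steps; the QoS names, the pgrep pattern, and the timings are assumptions for illustration only:)

# PrologSlurmctld configured in slurm.conf to a script that just sleeps, e.g. "sleep 25"
sbatch --qos=low --wrap="srun sleep 600"                 # preemptible job; wait for it to start
gdb -p $(pgrep -f 'slurmstepd: \[<jobid>\.batch\]')      # on the compute node: attaching pauses the batch step
sbatch --qos=high --exclusive --wrap="srun sleep 600"    # job that preempts the first one
sleep 20                                                 # 5s is too short, 20s is enough
# detach/quit gdb so the batch step resumes and is terminated by the pending preemption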

It's still unclear why your system is exhibiting behavior even remotely similar to this. The resulting logs from this mostly match yours. The main effect of pausing the batch step like this is to cause it to miss the RPC message that tells it to cancel due to preemption. Since the job step (e.g. 2761.0) is cancelled as well, it returns a failure, causing the batch step to fail (the batch step is supposed to have been cancelled by this time).
The PrologSlurmctld is required because it affects some job cancellation logic; that part is not fully clear yet.


Try removing the PrologSlurmctld and see if it affects the sacct result at all. You can test with the previous patch still applied if you have it running; please just make a note of whether or not you are still running with that patch.

Our reproducer and partial fix still don't touch the final state of the whole job. I'm currently in the middle of a deep dive into the preemption/signalling mechanism to find the cause.

Thanks
Comment 58 Broderick Gardner 2021-01-04 14:06:08 MST
Also try temporarily removing the GraceTime from the preemptible QoS to see if that has any effect.
Comment 59 Michael Robbert 2021-01-06 11:00:05 MST
I just got a chance to test this again. Disabling PrologSlurmctld did not fix the problem, but when I set GraceTime for the "full" QoS to 0, an srun job did show up as preempted.

sacctmgr modify qos where name=full set Gracetime=0

Is it possible that this is the only setting causing the problem? I don't recall why I set it in the first place, so I'm good with leaving it at 0 unless there is a good reason for it to be set higher.
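(A quick way to confirm the QoS change took effect, using standard sacctmgr output fields:)

sacctmgr show qos where name=full format=Name,GraceTime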
Comment 61 Michael Robbert 2021-02-11 16:48:20 MST
Unless you're leaving this ticket open in order to fix the bug I'm fine with closing it. I believe that everything has been working fine after setting Gracetime=0. We're fine with that being the fix for now.

Mike
Comment 62 Michael Hinton 2021-02-12 10:38:16 MST
Ok, I'll go ahead and close this out, then. Sorry that we have not been able to reproduce this fully. Feel free to reopen as needed.

Thanks!
-Michael
Comment 63 Chad Vizino 2021-03-16 13:39:20 MDT
There was a related issue in bug 10953 when using the partition_prio preemption type and the requeue preempt mode. Here are a couple of things from it that may be helpful to know, mainly for background.

Step termination when GraceTime > 0 is understood and can be (and was) reproduced in separate tests for each preempt type (qos and partition_prio). It is not a bug.

GraceTime handling behavior (notably for the requeue preempt mode) changed starting in 20.02 for both the qos and partition_prio preempt types. See bug 10953 comment 34 for a more thorough explanation and ways to work with it (the partition_prio type is addressed there, but the same applies to qos). Short version: setting GraceTime=0 is probably the easiest way to work with it, but there are other options available.
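(For reference, the two places GraceTime can be set; the sacctmgr line matches what was done in comment 59, and the partition line is a generic slurm.conf example with other options omitted:)

# QoS-level grace time
sacctmgr modify qos where name=full set GraceTime=0
# partition-level grace time in slurm.conf (relevant for partition_prio preemption)
PartitionName=compute GraceTime=0 ...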