Ticket 10953 - After upgrading to 20.11, preempted jobs are cancelled instead of requeued.
Summary: After upgrading to 20.11, preempted jobs are cancelled instead of requeued.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.3
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Chad Vizino
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-02-25 11:22 MST by pbisbal
Modified: 2021-03-16 19:29 MDT
CC List: 2 users

See Also:
Site: Princeton Plasma Physics Laboratory (PPPL)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name: PPPL Cluster
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (9.50 KB, text/plain) - 2021-02-25 14:16 MST, pbisbal
slurmctld.log (1.32 KB, text/x-log) - 2021-02-25 14:22 MST, pbisbal
slurmd.log file from the node where the cancellation occurred. (3.35 KB, text/x-log) - 2021-03-01 09:33 MST, pbisbal
slurmctld.log with SlurmctldDebug=debug5 (28.43 MB, text/plain) - 2021-03-02 15:03 MST, pbisbal
slurmd.log file with SlurmdDebug=debug5 (30.48 MB, text/plain) - 2021-03-02 15:07 MST, pbisbal

Description pbisbal 2021-02-25 11:22:21 MST
We recently upgraded from 19.05.8 to 20.11.3. In our Slurm configuration we have a partition named "interruptible" that can be preempted by our main partition, "general". This configuration had been working well for several years before the upgrade. After the upgrade, preempted jobs are cancelled instead of requeued.

I tested this by submitting a job named "interrupt_me" to the interruptible partition and then, once it started running, submitting an MPI "hello, world" job to the general partition, which should immediately preempt the interrupt_me job. Here's the output of 'squeue' right after I submitted the mpihello job:

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1076678   general mpihello  pbisbal PD       0:00      1 (Resources)
           1076677 interrupt interrup  pbisbal  R       0:30      1 ellis004

As you can see, both jobs are still in the queue, and the interruptible job is still running, as expected, since I just submitted the higher-priority job.

A second later, the mpihello job is the only job in the queue owned by me, and the interruptible job is completely gone from squeue output: 

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1076678   general mpihello  pbisbal PD       0:00      1 (Priority)


Checking sacct output for the job shows that it was cancelled:

$ sacct -j 1076677
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1076677      interrupt+ interrupt+       unix         16     FAILED     15:0 
1076677.bat+      batch                  unix         16     FAILED     15:0 
1076677.ext+     extern                  unix         16  COMPLETED      0:0 
1076677.0    interrupt+                  unix         16  CANCELLED     0:15 


And the error file shows it was cancelled, too: 

$ more interrupt_me-1076677.err
slurmstepd: error: *** STEP 1076677.0 ON ellis004 CANCELLED AT 2021-02-25T13:05:56 ***
srun: error: ellis004: tasks 0-15: Terminated
srun: Force Terminated StepId=1076677.0

I checked the documentation for preemptible jobs (https://slurm.schedmd.com/preempt.html), and everything seems to be in order with my configuration.

Here's the output of 'scontrol show config' in case I'm overlooking something: 

$ scontrol show config
Configuration data as of 2021-02-25T13:20:10
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = slurm.pppl.gov
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2021-02-16T06:13:34
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = test_pppl
CommunicationParameters = (null)
CompleteWait            = 302 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = TraceJobs
DefMemPerCPU            = 2000
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = /usr/pppl/nhc/1.3.1/sbin/nhc
InactiveLimit           = 300 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = job_submit/lua
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 300 sec
LaunchParameters        = use_interactive_step
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 21068
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 1076680
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 15-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = 
PriorityMaxAge          = 30-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS       = 10000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /usr/bin/reboot
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
SlurmUser               = slurm(123)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = slurm.pppl.gov(172.16.2.23)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6820-6830
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.11.3
SrunEpilog              = (null)
SrunPortRange           = 60001-63000
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurmctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/tree
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
VSizeFactor             = 0 percent
WaitTime                = 60 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /cgroup
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at slurm.pppl.gov is UP


Let me know if you need to see my slurm.conf file. I can attach it to this ticket if necessary. 

Thanks, 
Prentice
Comment 1 Chad Vizino 2021-02-25 14:02:36 MST
Hi. Can you supply your slurm.conf and the slurmctld.log covering your test runs?
Comment 2 pbisbal 2021-02-25 14:16:31 MST
Created attachment 18139 [details]
slurm.conf

slurm.conf file attached. Retrieving slurmctld.log now.
Comment 3 pbisbal 2021-02-25 14:22:27 MST
Created attachment 18140 [details]
slurmctld.log

Attached is the relevant section of my slurmctld.log file. It contains all log entries from when the first job starts until the second job ends. I figured you didn't want to scroll through thousands of log entries.
Comment 4 pbisbal 2021-02-25 14:45:24 MST
Here's some feedback from the user who pointed out this change in behavior. He used the interruptible partition regularly: 

submitted job: _sjob (the one with the _ in front of the name).
slurm output: _slurm-1075117.out (also with an _ )

At the end of the file slurm reports the job as CANCELLED:
slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***
In the past this was PREEMPTED for jobs that were interrupted.

It seems CANCELLED does not requeue the job, while PREEMPTED does.
The "#SBATCH --requeue" is present in _sjob

I checked the file "_sjob", and he does have --requeue as an option, even though that should be the default behavior for preempted jobs on our cluster: 

$ grep ^#SBATCH _sjob 
#SBATCH --job-name=WISCCX_3N16_0110_06
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE,TIME_LIMIT_90
#SBATCH --mail-user=gkramer@pppl.gov
#SBATCH --ntasks=256
#SBATCH --mem=1000M
#SBATCH --partition=interruptible
#SBATCH --time=2-00:00:00
#SBATCH --requeue

and here's the last 30 lines of output from his output/error file: 


$ tail -30 _slurm-1075117.out 
#P## send particle 126620 to processor 101
####### sending back from processor: 16
#P## send particle 126622 to processor 16
####### sending back from processor: 180
#G## init particle 126623 on processor 180
####### sending back from processor: 120
#G## init particle 126624 on processor 120
slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***
srun: error: greene015: tasks 208-210,212-223: Terminated
srun: error: greene014: tasks 192-201,203-207: Terminated
srun: error: greene009: tasks 112-117,119-127: Terminated
srun: error: greene017: tasks 240-255: Terminated
srun: error: greene012: tasks 160-175: Terminated
srun: error: greene005: tasks 48-63: Terminated
srun: error: greene010: tasks 128-143: Terminated
srun: error: greene008: tasks 96-111: Terminated
srun: error: greene007: tasks 80-95: Terminated
srun: error: greene006: tasks 64-79: Terminated
srun: error: greene003: tasks 16-31: Terminated
srun: error: greene016: tasks 224-239: Terminated
srun: error: greene002: tasks 0-15: Terminated
srun: error: greene011: tasks 144-159: Terminated
srun: error: greene004: tasks 32-47: Terminated
srun: error: greene013: tasks 176-191: Terminated
srun: error: greene015: task 211: Terminated
srun: error: greene014: task 202: Terminated
srun: error: greene009: task 118: Terminated
srun: Force Terminated StepId=1075117.0
Ended program at  : Thu Feb 25 16:07:48 EST 2021
Comment 6 Chad Vizino 2021-02-26 14:36:58 MST
We are having trouble reproducing your issue, but I am still trying on my system. If you can reproduce it, would you send the slurmctld.log and slurmd.log from the used node(s)?
Comment 7 Chad Vizino 2021-02-26 14:54:56 MST
Would you also send these?

>scontrol show part general
>scontrol show part interruptible
Comment 8 pbisbal 2021-03-01 09:32:16 MST
(In reply to Chad Vizino from comment #6)
> We are having trouble reproducing your issue, but I am still trying on my
> system. If you can reproduce it, would you send the slurmctld.log and
> slurmd.log from the used node(s)?

I have already included the slurmctld.log for when this problem occurs. The job IDs are the ones shown in my original description of the problem (1076677 and 1076678).

I'm now attaching the relevant sections of the slurmd.log file from the hosts where this problem occurred.

Prentice
Comment 9 pbisbal 2021-03-01 09:33:49 MST
Created attachment 18173 [details]
slurmd.log file from the node where the cancellation occurred.

These are the relevant lines from the slurmd.log file on the node where the interruptible job was cancelled instead of preempted.
Comment 10 pbisbal 2021-03-01 09:35:14 MST
(In reply to Chad Vizino from comment #7)
> Would you also send these?
> 
> >scontrol show part general
> >scontrol show part interruptible

$ scontrol show part general
PartitionName=general
   AllowGroups=ALL AllowAccounts=ALL AllowQos=general,debug
   AllocNodes=ALL Default=YES QoS=general
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=300 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dawson[027-030,032-036,038-040,042-044,046-052,054-062,064,083,085,088-090,092,097-099,101-102,105,113,116-117,119-153,155-162,200-236],ellis[001-010],ganesh[21-22,24-27],gpusrv[01-02],jassby[001-006],kruskal[001-003,006-007,010,012-013,015,022,024,033-036]
   PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=4592 TotalNodes=166 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED



$ scontrol show part interruptible
PartitionName=interruptible
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=interruptible
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=300 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dawson[027-030,032-036,038-040,042-044,046-052,054-062,064,083,085,088-090,092,097-099,101-102,105,113,116-117,119-153,155-162,200-236],ellis[001-010],ganesh[21-22,24-27],gpusrv[01-02],jassby[001-006],kruskal[001-003,006-007,010,012-013,015,022,024,033-036],greene[001-020,037-048]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=5104 TotalNodes=198 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Comment 11 Chad Vizino 2021-03-02 11:47:03 MST
(In reply to pbisbal from comment #3)
> Created attachment 18140 [details]
> slurmctld.log
> 
> Attached is the relevant section of my slurmctld.log file. It contains all
> log entries from when the first job starts until the second job ends. I
> figured you didn't want to scroll through thousands of log entries.

Thanks but there is info missing about the preemption action that I'd like to see. Can you supply everything relating to those two jobs between the submission of the first and end of the second?
Comment 12 pbisbal 2021-03-02 14:28:57 MST
(In reply to Chad Vizino from comment #11)
> (In reply to pbisbal from comment #3)
> > Created attachment 18140 [details]
> > slurmctld.log
> > 
> > Attached is the relevant section of my slurmctld.log file. It contains all
> > log entries from when the first job starts until the second job ends. I
> > figured you didn't want to scroll through thousands of log entries.
> 
> Thanks but there is info missing about the preemption action that I'd like
> to see. Can you supply everything relating to those two jobs between the
> submission of the first and end of the second?

That *is* everything from the submission of the first job to the end of the second. I just double-checked what's in my slurmctld.log file vs. what I uploaded, and that's *every* log entry from the time the first job is submitted: 

[2021-02-25T13:05:00.720] _slurm_rpc_submit_batch_job: JobId=1076677 InitPrio=10211 usec=1104

Until the second job is done: 

[2021-02-25T13:07:28.293] _job_complete: JobId=1076678 done

I think whatever preemption information you're looking for is missing as a result of this issue.

Let me increase the log level through SlurmctldDebug and SlurmdDebug, and attach the results of that.

Give me a few minutes...

Prentice
Comment 13 pbisbal 2021-03-02 15:03:36 MST
Created attachment 18201 [details]
slurmctld.log with SlurmctldDebug=debug5

slurmctld.log with SlurmctldDebug=debug5 while reproducing the error.
Comment 14 pbisbal 2021-03-02 15:07:04 MST
Created attachment 18202 [details]
slurmd.log file with SlurmdDebug=debug5

slurmd.log file with SlurmdDebug=debug5, from the node where the jobs ran while reproducing the problem.
Comment 15 Chad Vizino 2021-03-02 15:08:51 MST
(In reply to pbisbal from comment #14)
> Created attachment 18202 [details]
> slurmd.log file with SlurmdDebug=debug5
> 
> slurmd.log file with SlurmdDebug=debug5, from the node where the jobs ran
> while reproducing the problem.

Thanks for this new logging. I'll look it over. I do want to see what's in them but am also thinking a logging patch may be useful to help pinpoint where the controller is changing the job preempt mode to cancel. In theory would you be open to this if the logs don't help?
Comment 16 pbisbal 2021-03-02 15:12:11 MST
I just set SlurmctldDebug=debug5 and SlurmdDebug=debug5 and then reproduced the problem. A transcript of my commands and their output is below, which shows the job IDs and how the job 'interrupt_me' is stopped shortly after submitting 'mpihello':

$ sbatch interrupt_me.sbatch 
Submitted batch job 1081556

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081556 interrupt interrup  pbisbal  R       0:04      1 dawson083


$ sbatch mpihello.sbatch 
Submitted batch job 1081558

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081558   general mpihello  pbisbal PD       0:00      1 (Priority)
           1081556 interrupt interrup  pbisbal  R       0:18      1 dawson083

A couple seconds later... 

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081558   general mpihello  pbisbal PD       0:00      1 (Priority)


$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081558   general mpihello  pbisbal  R       0:05      1 dawson083
Comment 17 pbisbal 2021-03-02 15:14:03 MST
I wouldn't be able to apply the patch until our next scheduled downtime, which is two weeks from today. I'd rather not apply a patch if I can avoid it, but if that's what's needed to get to the bottom of this, I'm willing to do it.
Comment 18 Chad Vizino 2021-03-02 18:36:04 MST
Good to know about your patch situation--we'll keep that on the back burner for now.

I've looked over the slurmctld log you sent me (job 1081556 is the preemptee and 1081558 is the preemptor) and am following up on this:

>[2021-03-02T16:34:44.961] debug:  setting 300 sec preemption grace time for JobId=1081556 to reclaim resources for JobId=1081558
This comes from _job_check_grace_internal() which contains this check:

>416         if (grace_time) {
>417                 debug("setting %u sec preemption grace time for %pJ to reclaim resources for %pJ",
>418                       grace_time, job_ptr, preemptor_ptr);
>419                 job_signal(job_ptr, SIGCONT, 0, 0, 0);
>420                 if (preempt_send_user_signal && job_ptr->warn_signal &&
>421                     !(job_ptr->warn_flags & WARN_SENT))
>422                         send_job_warn_signal(job_ptr, true);
>423                 else
>424                         job_signal(job_ptr, SIGTERM, 0, 0, 0);
>425         } else
>426                 rc = 1;
Note on line 424 a SIGTERM is sent. job_signal()'s prototype is this:

>5468 extern int job_signal(job_record_t *job_ptr, uint16_t signal,
>5469                       uint16_t flags, uid_t uid, bool preempt)
So preempt is 0. Then we can see in job_signal() that:

>5596         if (preempt)
>5597                 job_term_state = JOB_PREEMPTED;
>5598         else
>5599                 job_term_state = JOB_CANCELLED;
This means the job's state is set to JOB_CANCELLED, but I would expect JOB_PREEMPTED here.

I looked back at the 19.05.8 source and there's similar logic (though the code has been refactored since then), so this could be a red herring, but I want to pursue it some more.

I also just ran a test on a 20.11.4 system, this time setting GraceTime=300 on the preemptible partition (like you have). I am still not seeing what you are, so I want to try a few more parameter changes in my tests.

Just out of curiosity, would you send the output from this?
>scontrol show qos -p
Comment 19 pbisbal 2021-03-03 12:23:20 MST
I think you meant 'sacctmgr show qos -p'. Here you go: 

$ sacctmgr show qos -p 
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00|||cluster|||1.000000||||||||||||||||||
dawson|100|00:00:00|||cluster|||1.000000|||||||cpu=1024||||cpu=1024|30|||||cpu=16|
ellis|100|00:00:00|||cluster|||1.000000|||||||cpu=15,node=1||||cpu=80|45||||||
kruskal|100|00:00:00|||cluster|||1.000000|||||||cpu=512||||cpu=512|8|||||cpu=16|
mque|100|00:00:00|||cluster|||1.000000|||||||cpu=128,node=4|||||20||||||
default|0|00:00:00|||cluster|||1.000000||||||||||||40||||||
mccune|100|00:00:00|||cluster|||1.000000|||||||cpu=512||||cpu=512|||||||
sque|100|00:00:00|||cluster|||1.000000|||||||cpu=512,node=100|||||||||||
fenx|100|00:00:00|||cluster|||1.000000|||||||cpu=40,node=16||||cpu=40,node=16|||||||
fielder|100|00:00:00|||cluster|||1.000000|||||||cpu=512,node=96||||cpu=96,node=12|||||||
gque|100|00:00:00|||cluster|||1.000000|||||||cpu=32,node=1|||||||||||
jassby|100|00:00:00|||cluster|||1.000000|||||||cpu=96,node=6||||cpu=96,node=6|||||||
greene|100|00:00:00|||cluster|||1.000000|||||||cpu=512,node=32||||cpu=512,node=32|||||||
pswift|100|00:00:00|||cluster|||1.000000||||||||||||||||||
interactive|100|00:00:00|||cluster|||1.000000||||||||||12:00:00|cpu=1024|1||||||
general|100|00:00:00|||cluster|||1.000000||||||||||2-00:00:00|cpu=1024|||||||
interruptible|1|00:00:00|||cluster|||1.000000||||||||||2-00:00:00|cpu=1536|||||||
debug|10000|00:00:00|||cluster|||1.000000||||||||||00:30:00|cpu=128|1|1|||||
Comment 20 Chad Vizino 2021-03-03 12:55:50 MST
(In reply to pbisbal from comment #19)
> I think you meant 'sacctmgr show qos -p'. 
Oops--yes. Thanks.

I have a patch to show the preempt mode for a job when it's referenced, but am trying to see if there's another way for you to see this without the patch. I'm also looking at the decision points in the code where the controller performs the preemption, for clues.
Comment 21 Chad Vizino 2021-03-04 13:30:41 MST
After looking through the Slurm code, there is a SIGTERM sent at the beginning of the grace time period, and I am wondering if your preemptible job is terminating when this is sent.

Some questions:

What were you running in the preemptible job script?
You mentioned the preemptor job was an MPI one.

Could you try this and send me the controller log afterwards?

First submit a job that will be preempted.

>sbatch -p interruptible -t 5 --wrap="sleep 600"
Wait for this to start running and then run this.

>sbatch -p general -t 5 --wrap="sleep 600"
You can change the node count for either job if needed to help trigger the preemption. I'm assuming that, in the past when you've run your test, all the nodes in the general partition were in use.

Let's start with this and the outcome will help direct me in a next step for you.
Comment 22 pbisbal 2021-03-04 14:34:12 MST
Chad, 

What I was running in the preemptible job script was an MPI "hello, world" program that sleeps for 5 minutes, prints hello statements, and then sleeps for 5 more minutes. Here's the code for the program:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"

#define SLEEP_TIME 300

int main(int argc, char* argv[]){
    int my_rank;
    int p;
    int source;
    int dest;
    int tag=0;
    char message[256];
    char my_name[32];
    MPI_Status status;

    /* Start up MPI */
    MPI_Init(&argc, &argv);
 
    /* Find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
 
    /* Find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* What's my hostname? */
    gethostname(my_name, 32); 
    sleep(SLEEP_TIME);

    if (my_rank == 0) {
        printf("MPIHello running on %i processors.\n", p);
        printf("Greetings from processor %i, on host %s.\n", my_rank, my_name);
        for (source=1; source<p; source++) {
            MPI_Recv(message, 256, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s", message);
        } 
    } else if (my_rank != 0) {
        sprintf(message, "Greetings from processor %i, on host %s.\n", my_rank, my_name);
        dest=0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD); 
    }
    sleep(SLEEP_TIME);
    MPI_Finalize();
}


And then this is the script I submit it with: 

#!/bin/bash

#SBATCH --mem=32G
#SBATCH -n 32
#SBATCH -w dawson083
#SBATCH -p interruptible
#SBATCH -t 00:15:00
#SBATCH -J interrupt_me
#SBATCH -o interrupt_me-%j.out
#SBATCH -e interrupt_me-%j.err
#SBATCH --mail-type=ALL

module load gcc/9.3.0
module load openmpi/4.0.3
srun --mpi=pmi2 ./interrupt_me


The program that preempts this one is built from the same source code, but with the sleep time defined as 30 seconds instead of 300:

#define SLEEP_TIME 30

And this is the batch script that submits that job: 

#!/bin/bash

#SBATCH --mem=32G
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -w dawson083
#SBATCH -p general
#SBATCH -t 00:10:00
#SBATCH -J mpihello
#SBATCH -o mpihello-%j.out
#SBATCH -e mpihello-%j.err
#SBATCH --mail-type=ALL
#SBATCH --export=ALL


module load gcc
module load openmpi
module load pmix

srun ./mpihello

As you can see, I specify the same node in each case and use all the cores on it, to force a situation that will trigger preemption.

When I first set up preemption and tested it, I used these same programs in exactly the same way, and preemption worked as desired: the interrupt_me job was preempted and rescheduled.
Comment 23 pbisbal 2021-03-04 14:36:32 MST
Chad, 

The general partition is completely full at the moment, so I can't run your experiment. There are a couple other "primary" partitions that the interruptible partition spans, but I don't want to muddy the waters right now by changing additional variables. I'll update the ticket when there are some free nodes in the general partition I can test with. 

Prentice
Comment 24 Chad Vizino 2021-03-04 16:34:48 MST
(In reply to pbisbal from comment #23)
> The general partition is completely full at the moment, so I can't run your
> experiment. There are a couple other "primary" partitions that the
> interruptible partition spans, but I don't want to muddy the waters right
> now by changing additional variables. I'll update the ticket when there are
> some free nodes in the general partition I can test with. 
No problem. Here's some new info that may help. The key here is that SIGTERM is sent at preemption time (see comment 21) and is sent to the job steps. When a job step gets a SIGTERM, it terminates *unless* SIGTERM is set to be ignored.

So in your MPI hello world code you can try this and ignore SIGTERM by adding "#include <signal.h>" in the header section and then "signal(SIGTERM, SIG_IGN);" after the declarations in main(). Recompile and rerun your test--you should see things working as expected (preempted job gets requeued).
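
As a minimal, self-contained sketch of that idea (a stand-in program with a plain sleep() in place of the MPI work; the real change is just the two additions Chad describes applied to interrupt_me.c):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Ignore SIGTERM so the step survives the signal sent at the start
     * of the preemption grace time (see comment 21). */
    signal(SIGTERM, SIG_IGN);

    /* Stand-in for the real MPI work. */
    sleep(600);
    printf("Still running after the grace-time SIGTERM.\n");
    return 0;
}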

The reason I couldn't reproduce the issue is that I was not running a job step in the preemptible job. When I submit this way:

>sbatch -p <my-interruptible-partition> -t 5 --wrap="srun sleep 600"
The job dies and is not requeued (reproduces your issue). If I do this (as per the test I suggested in comment 21):

>sbatch -p <my-interruptible-partition> -t 5 --wrap="sleep 600"
The job is requeued after the GraceTime because it doesn't have a job step and the SIGTERM is not sent to the batch script. This is what I was trying on my end and why I couldn't reproduce your problem initially.

Another way around this is to let the user pick the signal sent at preempt time (rather than SIGTERM) via --signal (see the sbatch man page) when submitting and setting up Slurm with SlurmctldParameters=preempt_user_signal in slurm.conf. I played around with using SIGURG so the step wouldn't get killed without changing the code run in the step. I'm not necessarily suggesting this, just mentioning the outcome of an experiment.
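
As a rough sketch of that approach (assuming SlurmctldParameters=preempt_user_signal is set in slurm.conf; the signal spelling here is illustrative, so check the exact --signal syntax in the sbatch man page):

>sbatch -p interruptible -t 5 --signal=URG@60 --wrap="srun sleep 600"

Per the code quoted in comment 18, with preempt_user_signal set the controller sends the job's --signal value at preemption time instead of SIGTERM, so a step that ignores that signal (sleep ignores SIGURG by default) can ride out the grace period.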

To explain the change in behavior from 19.05.8 to 20.11.3, I need to look into it some more and get back to you.
Comment 25 Chad Vizino 2021-03-05 10:07:25 MST
(In reply to Chad Vizino from comment #24)
> To explain the change in behavior from 19.05.8 to 20.11.3, I need to look
> into it some more and get back to you.
After some thought, this seems more like a bug and will pursue it that way.
Comment 34 Chad Vizino 2021-03-12 16:15:39 MST
(In reply to Chad Vizino from comment #25)
> (In reply to Chad Vizino from comment #24)
> > To explain the change in behavior from 19.05.8 to 20.11.3, I need to look
> > into it some more and get back to you.
> After some thought, this seems more like a bug and will pursue it that way.
I'm changing course on this one after looking at the code and running some tests with 19.05.8 to understand how things used to work. The short version is that there is no bug. If you are going to have GraceTime set on the preemptible partition, then jobs running in it will need to trap SIGTERM in the step (see comment 24). If users do not trap it, their steps will die and their jobs will not be requeued (as you have observed). If you want to avoid this without requiring users to change their jobs, then you will need to remove GraceTime from the partition (or set it to 0).

Backing up a bit to explain more: in 19.05, GraceTime was only meaningful for preempt mode CANCEL and not REQUEUE (slurm.conf(5) stated this about GraceTime: "Meaningful only for PreemptMode=CANCEL."). So if GraceTime had been set on the partition in 19.05, it would not have been used for requeue preemption. That would explain why things worked with requeue before your upgrade even if you had GraceTime specified on the partition.

Starting in 20.02, the restriction limiting GraceTime to CANCEL mode was lifted, so it now applies to REQUEUE mode as well. The 20.11 slurm.conf doc says this about GraceTime set on a partition (very similar to what the 19.05 version said, minus the comment about preempt mode CANCEL):

--
Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time. This second set of signals is sent to both the tasks and the containing batch script, if applicable.
--

So if you just kept your partition definition from 19.05 with the GraceTime (> 0) setting, then this is why your users are now seeing their jobs getting cancelled (step dies with the SIGTERM sent) and not requeued.

Again, if you want to avoid this, the simplest thing would be to just remove GraceTime from the partition.

Hopefully, this makes sense--let me know if you have any questions.
Comment 35 Chad Vizino 2021-03-12 16:23:32 MST
(In reply to Chad Vizino from comment #34)
> Again, if you want to avoid this, the simplest thing would be to just remove
> GraceTime from the partition.
I should have mentioned that you have this in your slurm.conf:

>PartitionName=default GraceTime=300
So that GraceTime is being applied to your "interruptible" partition. Try setting GraceTime=0 on "interruptible" and see if that fixes things for you.
Comment 36 pbisbal 2021-03-16 12:53:23 MDT
Chad, 

Thanks for the explanation. That seems to have fixed my problem. After setting GraceTime=0:

$ grep interruptible slurm.conf
PartitionName=interruptible Nodes=dawson[027-030,032-036,038-040,042-044,046-052,054-062,064,083,085,088-090,092,097-099,101,105,113,116-117,119-153,155,156-162,200-236],ellis[001-010],ganesh[22,24-27],gpusrv[01-02],jassby[001-006],kruskal[001-003,006-007,010,012-013,015,022,024,033-036],greene[001-020,037-048] MaxTime=48:00:00 State=Up PriorityTier=1 QOS=interruptible GraceTime=0

There were several jobs running in the interruptible partition, so I submitted a job that would compete for the same resources. After submitting my job, one of the other jobs was interrupted and mine started to run. The interrupted job is now back in the queue, waiting to run again.

Prentice
Comment 37 Chad Vizino 2021-03-16 13:04:38 MDT
Great--glad things are working for you.

Closing ticket.
Comment 38 pbisbal 2021-03-16 19:29:38 MDT
Thank you for your help.