We recently upgraded from 19.05.8 to 20.11.3. In our Slurm configuration we have a partition named "interruptible" that can be preempted by our main partition, "general". This configuration worked well for several years before the upgrade. After the upgrade, preempted jobs are cancelled and not requeued, when they should be requeued.

I tested this by creating a job named "interrupt_me" in the interruptible queue, and then, once it started running, submitting an MPI "hello, world" job. That should immediately preempt the interrupt_me job. Here's the output of 'squeue' right after I submitted the mpihello job:

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1076678   general mpihello  pbisbal PD       0:00      1 (Resources)
           1076677 interrupt interrup  pbisbal  R       0:30      1 ellis004

As you can see, both jobs are still in the queue, and the interruptible job is still running, as expected, since I just submitted the higher-priority job. A second later, the mpihello job is the only job in the queue owned by me, and the interruptible job is completely gone from the squeue output:

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1076678   general mpihello  pbisbal PD       0:00      1 (Priority)

Checking sacct output for the job shows that it was cancelled:

$ sacct -j 1076677
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1076677      interrupt+ interrupt+       unix         16     FAILED     15:0
1076677.bat+      batch                  unix         16     FAILED     15:0
1076677.ext+     extern                  unix         16  COMPLETED      0:0
1076677.0    interrupt+                  unix         16  CANCELLED     0:15

And the error file shows it was cancelled, too:

$ more interrupt_me-1076677.err
slurmstepd: error: *** STEP 1076677.0 ON ellis004 CANCELLED AT 2021-02-25T13:05:56 ***
srun: error: ellis004: tasks 0-15: Terminated
srun: Force Terminated StepId=1076677.0

I checked the documentation for preemptible jobs (https://slurm.schedmd.com/preempt.html), and everything seems to be in order with my configuration. Here's the output of 'scontrol show config' in case I'm overlooking something:

$ scontrol show config
Configuration data as of 2021-02-25T13:20:10
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = slurm.pppl.gov
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2021-02-16T06:13:34
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = test_pppl
CommunicationParameters = (null)
CompleteWait            = 302 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = TraceJobs
DefMemPerCPU            = 2000
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = /usr/pppl/nhc/1.3.1/sbin/nhc
InactiveLimit           = 300 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = job_submit/lua
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 300 sec
LaunchParameters        = use_interactive_step
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 21068
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 1076680
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 15-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           =
PriorityMaxAge          = 30-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS       = 10000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /usr/bin/reboot
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
SlurmUser               = slurm(123)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = slurm.pppl.gov(172.16.2.23)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6820-6830
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.11.3
SrunEpilog              = (null)
SrunPortRange           = 60001-63000
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurmctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/tree
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
VSizeFactor             = 0 percent
WaitTime                = 60 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /cgroup
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at slurm.pppl.gov is UP

Let me know if you need to see my slurm.conf file. I can attach it to this ticket if necessary.

Thanks,
Prentice
Hi. Can you supply your slurm.conf and the slurmctld.log covering your test runs?
Created attachment 18139 [details] slurm.conf slurm.conf file attached. Retrieving slurmctld.log now.
Created attachment 18140 [details] slurmctld.log Attached is the relevant section of my slurmctld.log file. It contains all log entries from when the first job starts until the second job ends. I figured you didn't want to scroll through thousands of log entries.
Here's some feedback from the user who pointed out this change in behavior. He used the interruptible partition regularly:

  submitted job: _sjob (the one with the _ in front of the name)
  slurm output: _slurm-1075117.out (also with an _)

At the end of the file, Slurm reports the job as CANCELLED:

slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***

In the past this was PREEMPTED for jobs that were interrupted. It seems CANCELLED does not requeue the job, while PREEMPTED does. The "#SBATCH --requeue" option is present in _sjob.

I checked the file "_sjob", and he does have --requeue as an option, even though that should be the default behavior for preempted jobs on our cluster:

$ grep ^#SBATCH _sjob
#SBATCH --job-name=WISCCX_3N16_0110_06
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE,TIME_LIMIT_90
#SBATCH --mail-user=gkramer@pppl.gov
#SBATCH --ntasks=256
#SBATCH --mem=1000M
#SBATCH --partition=interruptible
#SBATCH --time=2-00:00:00
#SBATCH --requeue

And here are the last 30 lines of output from his output/error file:

$ tail -30 _slurm-1075117.out
#P## send particle 126620 to processor 101
####### sending back from processor: 16
#P## send particle 126622 to processor 16
####### sending back from processor: 180
#G## init particle 126623 on processor 180
####### sending back from processor: 120
#G## init particle 126624 on processor 120
slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***
srun: error: greene015: tasks 208-210,212-223: Terminated
srun: error: greene014: tasks 192-201,203-207: Terminated
srun: error: greene009: tasks 112-117,119-127: Terminated
srun: error: greene017: tasks 240-255: Terminated
srun: error: greene012: tasks 160-175: Terminated
srun: error: greene005: tasks 48-63: Terminated
srun: error: greene010: tasks 128-143: Terminated
srun: error: greene008: tasks 96-111: Terminated
srun: error: greene007: tasks 80-95: Terminated
srun: error: greene006: tasks 64-79: Terminated
srun: error: greene003: tasks 16-31: Terminated
srun: error: greene016: tasks 224-239: Terminated
srun: error: greene002: tasks 0-15: Terminated
srun: error: greene011: tasks 144-159: Terminated
srun: error: greene004: tasks 32-47: Terminated
srun: error: greene013: tasks 176-191: Terminated
srun: error: greene015: task 211: Terminated
srun: error: greene014: task 202: Terminated
srun: error: greene009: task 118: Terminated
srun: Force Terminated StepId=1075117.0
Ended program at : Thu Feb 25 16:07:48 EST 2021
We are having trouble reproducing your issue, but I am still trying on my system. If you can reproduce it, would you send the slurmctld.log and slurmd.log from the used node(s)?
Would you also send these?

>scontrol show part general
>scontrol show part interruptible
(In reply to Chad Vizino from comment #6)
> We are having trouble reproducing your issue but am still trying on my
> system. If you can reproduce it, would you send the slurmctld.log and
> slurmd.log from the used node(s)?

I have already included the slurmctld.log for when this problem occurs. The job IDs are the ones shown in my original description of the problem (1076677 and 1076678). I'm attaching the relevant sections of the slurmd.log file from the hosts where this problem occurred now.

Prentice
Created attachment 18173 [details]
slurmd.log file from the node where the cancellation occurred.

These are the relevant lines from the slurmd.log file on the node where the interruptible job was cancelled instead of preempted.
(In reply to Chad Vizino from comment #7)
> Would you also send these?
>
> >scontrol show part general
> >scontrol show part interruptible

$ scontrol show part general
PartitionName=general
   AllowGroups=ALL AllowAccounts=ALL AllowQos=general,debug
   AllocNodes=ALL Default=YES QoS=general
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=300 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dawson[027-030,032-036,038-040,042-044,046-052,054-062,064,083,085,088-090,092,097-099,101-102,105,113,116-117,119-153,155-162,200-236],ellis[001-010],ganesh[21-22,24-27],gpusrv[01-02],jassby[001-006],kruskal[001-003,006-007,010,012-013,015,022,024,033-036]
   PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=4592 TotalNodes=166 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

$ scontrol show part interruptible
PartitionName=interruptible
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=interruptible
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=300 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dawson[027-030,032-036,038-040,042-044,046-052,054-062,064,083,085,088-090,092,097-099,101-102,105,113,116-117,119-153,155-162,200-236],ellis[001-010],ganesh[21-22,24-27],gpusrv[01-02],jassby[001-006],kruskal[001-003,006-007,010,012-013,015,022,024,033-036],greene[001-020,037-048]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=5104 TotalNodes=198 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
(In reply to pbisbal from comment #3)
> Created attachment 18140 [details]
> slurmctld.log
>
> Attached is the relevant section of my slurmctld.log file. It contains all
> log entries from when the first job starts until the second job ends. I
> figured you didn't want to scroll through thousands of log entries.

Thanks, but there is info missing about the preemption action that I'd like to see. Can you supply everything relating to those two jobs between the submission of the first and the end of the second?
(In reply to Chad Vizino from comment #11)
> Thanks but there is info missing about the preemption action that I'd like
> to see. Can you supply everything relating to those two jobs between the
> submission of the first and end of the second?

That *is* everything from the submission of the first job to the end of the second. I just double-checked what's in my slurmctld.log file vs. what I uploaded, and that's *every* log entry from the time the first job is submitted:

[2021-02-25T13:05:00.720] _slurm_rpc_submit_batch_job: JobId=1076677 InitPrio=10211 usec=1104

until the second job is done:

[2021-02-25T13:07:28.293] _job_complete: JobId=1076678 done

I think whatever preemption information you're looking for is missing as a result of this issue. Let me increase the log level through SlurmctldDebug and SlurmdDebug, and attach the results of that. Give me a few minutes...

Prentice
Created attachment 18201 [details]
slurmctld.log with SlurmctldDebug=debug5

slurmctld.log with SlurmctldDebug=debug5 while reproducing the error.
Created attachment 18202 [details]
slurmd.log file with SlurmdDebug=debug5

slurmd.log file with SlurmdDebug=debug5, from the node where the jobs ran while reproducing the problem.
(In reply to pbisbal from comment #14)
> Created attachment 18202 [details]
> slurmd.log file with SlurmdDebug=debug5
>
> slurmd.log file with SlurmdDebug=debug5, from the node where the jobs ran
> while reproducing the problem.

Thanks for this new logging. I'll look it over. I do want to see what's in the logs, but am also thinking a logging patch may be useful to help pinpoint where the controller is changing the job preempt mode to cancel. In theory, would you be open to this if the logs don't help?
I just set SlurmctldDebug=debug5 and SlurmdDebug=debug5 and then reproduced the problem. A transcript of my commands and their output is below, which shows the job IDs and how the job 'interrupt_me' is stopped shortly after submitting 'mpihello':

$ sbatch interrupt_me.sbatch
Submitted batch job 1081556

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081556 interrupt interrup  pbisbal  R       0:04      1 dawson083

$ sbatch mpihello.sbatch
Submitted batch job 1081558

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081558   general mpihello  pbisbal PD       0:00      1 (Priority)
           1081556 interrupt interrup  pbisbal  R       0:18      1 dawson083

A couple of seconds later...

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081558   general mpihello  pbisbal PD       0:00      1 (Priority)

$ squeue -u pbisbal
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1081558   general mpihello  pbisbal  R       0:05      1 dawson083
I wouldn't be able to apply the patch until our next scheduled downtime, which is two weeks from today. I'd rather not apply a patch if I can avoid it, but if that's what's needed to get to the bottom of this, I'm willing to do it.
Good to know about your patch situation--we'll keep that on the back burner for now.

I've looked over the slurmctld log you sent me (job 1081556 is the preemptee and 1081558 is the preemptor) and am following up on this:

>[2021-03-02T16:34:44.961] debug: setting 300 sec preemption grace time for JobId=1081556 to reclaim resources for JobId=1081558

This comes from _job_check_grace_internal() which contains this check:

>416         if (grace_time) {
>417                 debug("setting %u sec preemption grace time for %pJ to reclaim resources for %pJ",
>418                       grace_time, job_ptr, preemptor_ptr);
>419                 job_signal(job_ptr, SIGCONT, 0, 0, 0);
>420                 if (preempt_send_user_signal && job_ptr->warn_signal &&
>421                     !(job_ptr->warn_flags & WARN_SENT))
>422                         send_job_warn_signal(job_ptr, true);
>423                 else
>424                         job_signal(job_ptr, SIGTERM, 0, 0, 0);
>425         } else
>426                 rc = 1;

Note on line 424 a SIGTERM is sent. job_signal()'s prototype is this:

>5468 extern int job_signal(job_record_t *job_ptr, uint16_t signal,
>5469                       uint16_t flags, uid_t uid, bool preempt)

So preempt is 0. Then we can see in job_signal() that:

>5596         if (preempt)
>5597                 job_term_state = JOB_PREEMPTED;
>5598         else
>5599                 job_term_state = JOB_CANCELLED;

Which means that the state is JOB_CANCELLED, but I expected that we want JOB_PREEMPTED. I looked back at the 19.05.8 source and there's similar logic (but the code has been refactored since then), so this could be a red herring, but I want to pursue it some more.

I also just ran a test on a 20.11.4 system but tried setting GraceTime=300 on the preemptible partition (like you have) this time--I am still not seeing what you are, so I want to try a few more parameter changes in my tests.

Just out of curiosity, would you send the output from this?

>scontrol show qos -p
I think you meant 'sacctmgr show qos -p'. Here you go:

$ sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00|||cluster|||1.000000||||||||||||||||||
dawson|100|00:00:00|||cluster|||1.000000|||||||cpu=1024||||cpu=1024|30|||||cpu=16|
ellis|100|00:00:00|||cluster|||1.000000|||||||cpu=15,node=1||||cpu=80|45||||||
kruskal|100|00:00:00|||cluster|||1.000000|||||||cpu=512||||cpu=512|8|||||cpu=16|
mque|100|00:00:00|||cluster|||1.000000|||||||cpu=128,node=4|||||20||||||
default|0|00:00:00|||cluster|||1.000000||||||||||||40||||||
mccune|100|00:00:00|||cluster|||1.000000|||||||cpu=512||||cpu=512|||||||
sque|100|00:00:00|||cluster|||1.000000|||||||cpu=512,node=100|||||||||||
fenx|100|00:00:00|||cluster|||1.000000|||||||cpu=40,node=16||||cpu=40,node=16|||||||
fielder|100|00:00:00|||cluster|||1.000000|||||||cpu=512,node=96||||cpu=96,node=12|||||||
gque|100|00:00:00|||cluster|||1.000000|||||||cpu=32,node=1|||||||||||
jassby|100|00:00:00|||cluster|||1.000000|||||||cpu=96,node=6||||cpu=96,node=6|||||||
greene|100|00:00:00|||cluster|||1.000000|||||||cpu=512,node=32||||cpu=512,node=32|||||||
pswift|100|00:00:00|||cluster|||1.000000||||||||||||||||||
interactive|100|00:00:00|||cluster|||1.000000||||||||||12:00:00|cpu=1024|1||||||
general|100|00:00:00|||cluster|||1.000000||||||||||2-00:00:00|cpu=1024|||||||
interruptible|1|00:00:00|||cluster|||1.000000||||||||||2-00:00:00|cpu=1536|||||||
debug|10000|00:00:00|||cluster|||1.000000||||||||||00:30:00|cpu=128|1|1|||||
(In reply to pbisbal from comment #19) > I think you meant 'sacctmgr show qos -p'. Oops--yes. Thanks. I have a patch to show the preempt mode for a job when it's referenced but am trying to see if there's another way for you to see this w/o the patch. I'm also looking at the decision points in the code where the controller is going to do the preempt for some clues.
After looking through the Slurm code, there is a SIGTERM sent at the beginning of the grace time period, and I am wondering if your preemptible job is terminating when this is sent.

Some questions: What were you running in the preemptible job script? You mentioned the preemptor job was an MPI one.

Could you try this and send me the controller log afterwards? First submit a job that will be preempted.

>sbatch -p interruptible -t 5 --wrap="sleep 600"

Wait for this to start running and then run this.

>sbatch -p general -t 5 --wrap="sleep 600"

You can change the node count for either job if needed to help trigger the preemption. I'm assuming that in the past when you've run your test, all the nodes in the general partition were in use.

Let's start with this, and the outcome will help direct me to a next step for you.
Chad,

What I was running in the preemptible job script was an MPI "hello, world" program that sleeps for 5 minutes, prints hello statements, and then sleeps for 5 more minutes. Here's the code to the program:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"

#define SLEEP_TIME 300

int main(int argc, char* argv[]){
    int my_rank;
    int p;
    int source;
    int dest;
    int tag=0;
    char message[256];
    char my_name[32];
    MPI_Status status;

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* What's my hostname? */
    gethostname(my_name, 32);

    sleep(SLEEP_TIME);

    if (my_rank == 0) {
        printf("MPIHello running on %i processors.\n", p);
        printf("Greetings from processor %i, on host %s.\n", my_rank, my_name);
        for (source=1; source<p; source++) {
            MPI_Recv(message, 256, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s", message);
        }
    } else if (my_rank != 0) {
        sprintf(message, "Greetings from processor %i, on host %s.\n", my_rank, my_name);
        dest=0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    sleep(SLEEP_TIME);

    MPI_Finalize();
}

And then this is the script I submit it with:

#!/bin/bash
#SBATCH --mem=32G
#SBATCH -n 32
#SBATCH -w dawson083
#SBATCH -p interruptible
#SBATCH -t 00:15:00
#SBATCH -J interrupt_me
#SBATCH -o interrupt_me-%j.out
#SBATCH -e interrupt_me-%j.err
#SBATCH --mail-type=ALL

module load gcc/9.3.0
module load openmpi/4.0.3

srun --mpi=pmi2 ./interrupt_me

The program that preempts this one is the same source code, but the sleep amount is defined as 30 seconds instead of 300:

#define SLEEP_TIME 30

And this is the batch script that submits that job:

#!/bin/bash
#SBATCH --mem=32G
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -w dawson083
#SBATCH -p general
#SBATCH -t 00:10:00
#SBATCH -J mpihello
#SBATCH -o mpihello-%j.out
#SBATCH -e mpihello-%j.err
#SBATCH --mail-type=ALL
#SBATCH --export=ALL

module load gcc
module load openmpi
module load pmix

srun ./mpihello

As you can see, I specify the same node in each case and use all the cores on it, to force a situation that triggers preemption. When I first set up preemption and tested it, I used these exact same programs in the exact same way, and preemption worked as desired: the interrupt_me job was preempted and rescheduled.
Chad,

The general partition is completely full at the moment, so I can't run your experiment. There are a couple of other "primary" partitions that the interruptible partition spans, but I don't want to muddy the waters right now by changing additional variables. I'll update the ticket when there are some free nodes in the general partition I can test with.

Prentice
(In reply to pbisbal from comment #23)
> The general partition is completely full at the moment, so I can't run your
> experiment. There are a couple other "primary" partitions that the
> interruptible partition spans, but I don't want to muddy the waters right
> now by changing additional variables. I'll update the ticket when there are
> some free nodes in the general partition I can test with.

No problem. Here's some new info that may help.

The key here is that SIGTERM is sent at preemption time (see comment 21) and is sent to the job steps. When a job step gets a SIGTERM, it terminates *unless* SIGTERM is set to be ignored. So in your MPI hello world code you can try this: ignore SIGTERM by adding "#include <signal.h>" in the header section and then "signal(SIGTERM, SIG_IGN);" after the declarations in main(). Recompile and rerun your test--you should see things working as expected (preempted job gets requeued).

The reason I couldn't reproduce the issue is that I was not running a job step in the preemptible job. When I submit this way:

>sbatch -p <my-interruptible-partition> -t 5 --wrap="srun sleep 600"

the job dies and is not requeued (reproduces your issue). If I do this (as per the test I suggested in comment 21):

>sbatch -p <my-interruptible-partition> -t 5 --wrap="sleep 600"

the job is requeued after the GraceTime because it doesn't have a job step, and the SIGTERM is not sent to the batch script. This is what I was trying on my end and why I couldn't reproduce your problem initially.

Another way around this is to let the user pick the signal sent at preempt time (rather than SIGTERM) via --signal (see the sbatch man page) when submitting, and setting SlurmctldParameters=preempt_user_signal in slurm.conf. I played around with using SIGURG so the step wouldn't get killed without changing the code run in the step. I'm not necessarily suggesting this, just mentioning the outcome of an experiment.
To explain the change in behavior from 19.05.8 to 20.11.3, I need to look into it some more and get back to you.
(In reply to Chad Vizino from comment #24)
> To explain the change in behavior from 19.05.8 to 20.11.3, I need to look
> into it some more and get back to you.

After some thought, this seems more like a bug, and I will pursue it that way.
(In reply to Chad Vizino from comment #25)
> After some thought, this seems more like a bug and will pursue it that way.

I'm changing course on this one after looking at the code and running some tests with 19.05.8 to understand how things used to work.

The short version is that there is no bug. If you are going to have GraceTime set on the preemptible partition, then jobs running in it will need to trap SIGTERM in the step (see comment 24). If users do not trap it, then steps will die and jobs will not be requeued (as you have observed). If you want to avoid this without requiring users to change their jobs, then you will need to remove GraceTime from the partition (or set it to 0).

Backing up a bit to explain more: In 19.05, GraceTime was only meaningful for preempt mode CANCEL and not REQUEUE (slurm.conf(1) stated this about GraceTime: "Meaningful only for PreemptMode=CANCEL."). So if GraceTime had been set on the partition in 19.05, it would not have been used for requeue preemption. That would explain why things worked with requeue before your upgrade even if you had GraceTime specified on the partition.

Starting in 20.02, the restriction of GraceTime to CANCEL mode was lifted, and it applies to REQUEUE mode as well. The 20.11 doc for slurm.conf says this about GraceTime set on a partition (very similar to what the 19.05 version said, minus the comment about preempt mode CANCEL):

--
Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time. This second set of signals is sent to both the tasks and the containing batch script, if applicable.
--

So if you just kept your partition definition from 19.05 with the GraceTime (> 0) setting, then this is why your users are now seeing their jobs getting cancelled (the step dies with the SIGTERM sent) and not requeued. Again, if you want to avoid this, the simplest thing would be to just remove GraceTime from the partition.

Hopefully this makes sense--let me know if you have any questions.
(In reply to Chad Vizino from comment #34)
> Again, if you want to avoid this, the simplest thing would be to just remove
> GraceTime from the partition.

I should have mentioned that you have this in your slurm.conf:

>PartitionName=default GraceTime=300

So that GraceTime is being applied to your "interruptible" partition. Try setting GraceTime=0 on "interruptible" and see if that fixes things for you.
Chad,

Thanks for the explanation. That seems to have fixed my problem. After setting GraceTime=0:

$ grep interruptible slurm.conf
PartitionName=interruptible Nodes=dawson[027-030,032-036,038-040,042-044,046-052,054-062,064,083,085,088-090,092,097-099,101,105,113,116-117,119-153,155,156-162,200-236],ellis[001-010],ganesh[22,24-27],gpusrv[01-02],jassby[001-006],kruskal[001-003,006-007,010,012-013,015,022,024,033-036],greene[001-020,037-048] MaxTime=48:00:00 State=Up PriorityTier=1 QOS=interruptible GraceTime=0

There were several jobs running in the interruptible partition, so I submitted a job that would compete for the same resources. After submitting my job, one of the other jobs was interrupted and mine started to run. The interrupted job is now back in the queue waiting to run again.

Prentice
Great--glad things are working for you. Closing ticket.
Thank you for your help.