Bug 12350 - Magnetic reservation
Summary: Magnetic reservation
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.7
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-08-25 04:58 MDT by Bas van der Vlies
Modified: 2021-11-09 10:15 MST
CC List: 1 user

See Also:
Site: SURF
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.3
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Bas van der Vlies 2021-08-25 04:58:29 MDT
We have installed slurm 20.02.7 and I am trying to use this new reservation flag
MAGNETIC:
  * https://slurm.schedmd.com/reservations.html

From this page I understand that the job will land in the reservation even if we did not specify the reservation name. I tested it on our cluster setup, but it does not seem to work as expected.

I create a reservation for 1 node for user 'bas' with flag magnetic. I submit a
job and to my surprise the job is scheduled on a free node and not on the node
in the reservation. It is only scheduled in the reservation if all nodes are
occupied. Is this the default behavior or did I miss a setting?
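
(For reference, the reservation was created along these lines; the exact name and duration here are placeholders:)

$ scontrol create reservation reservationname=bas_magnetic nodecnt=1 user=bas starttime=now duration=1:00:00 flags=magnetic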

I have set all available nodes offline except the reservation node and then I
see this:
```
bas@batch2:~/src$ srun -N1 --pty /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 1713 queued and waiting for resources
srun: job 1713 has been allocated resources
```

From this I see that the "magnetic" reservation is considered last.

We do a lot of Jupyterhub course setups, and for each course we set up multiple reservations. For us it would ease the setup if the magnetic reservation were tried first. To me it also seems logical to use the magnetic reservation first, because we set up the reservation for a reason.

Regards

Bas van der Vlies
Comment 1 Ben Roberts 2021-08-25 10:31:34 MDT
Hi Bas,

The behavior you expected is the correct behavior: jobs that qualify for the reservation should go to the reservation before they run on other nodes.  I ran a quick test as an example of how it should work.

1.  I created a reservation on one node with the magnetic flag:

$ scontrol create reservation reservationname=magnetic nodes=node05 starttime=now duration=1:00:00 user=ben flags=magnetic
Reservation created: magnetic

$ scontrol show reservations 
ReservationName=magnetic StartTime=2021-08-25T11:09:45 EndTime=2021-08-25T12:09:45 Duration=01:00:00
   Nodes=node05 NodeCnt=1 CoreCnt=24 Features=(null) PartitionName=(null) Flags=SPEC_NODES,MAGNETIC
   TRES=cpu=24
   Users=ben Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)


2.  I submitted a job as my user so that it should qualify for that reservation.

$ sbatch -N1 --wrap='srun sleep 120'
Submitted batch job 31006


3.  It does go to node05, which is the reserved node.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31006     debug     wrap      ben  R       0:01      1 node05

$ scontrol show job 31006
JobId=31006 JobName=wrap
   UserId=ben(1000) GroupId=ben(1000) MCS_label=N/A
   Priority=8573 Nice=0 Account=sub1 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:09 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-08-25T11:10:04 EligibleTime=2021-08-25T11:10:04
   AccrueTime=2021-08-25T11:10:04
   StartTime=2021-08-25T11:10:05 EndTime=Unknown Deadline=N/A
   PreemptEligibleTime=2021-08-25T11:10:05 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-25T11:10:05 Scheduler=Main
   Partition=debug AllocNode:Sid=kitt:7074
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node05
   BatchHost=node05
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=15678M,node=1,billing=9
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=15678M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=magnetic
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/ben/slurm
   StdErr=/home/ben/slurm/slurm-31006.out
   StdIn=/dev/null
   StdOut=/home/ben/slurm/slurm-31006.out
   Power=



Can you send information on the steps you're taking to create the reservation?  I would also like to see the output of 'scontrol show node <node_name>' for the node you use for the reservation.  

Thanks,
Ben
Comment 2 Bas van der Vlies 2021-08-25 10:51:39 MDT
Dear Ben,

 Thanks for the quick response. I just tested it on the cluster and it works as expected. Looks like a quirk at our site. Sorry for the confusion. Tomorrow I will do some other tests and let you know if the issue can be closed.

Thanks
Comment 3 Bas van der Vlies 2021-08-25 11:37:54 MDT
Maybe I was lucky. But we did some more tests and now it is failing:

$ scontrol create reservation reservationname=magnetic nodecnt=1 starttime=now duration=1:00:00 account=bas flags=magnetic   

$ scontrol show reservationname=magnetic

ReservationName=magnetic StartTime=2021-08-25T19:26:01 EndTime=2021-08-25T20:26:01 Duration=01:00:00 Nodes=r13n1 NodeCnt=1 CoreCnt=16 Features=(null) PartitionName=normal Flags=MAGNETIC TRES=cpu=16 Users=(null) Accounts=bas Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a  MaxStartDelay=(null)


$ srun -N1 --pty /bin/bash 
19:34 r27n17.lisa.surfsara.nl:~ $ 

$ sbatch -N1 --wrap='srun sleep 120'
 8149075    normal     wrap      bas  R       0:04      1 r13n1

So 'sbatch' works as expected but 'srun' does not. We did all our tests with 'srun'; that explains a lot. Does 'srun' work on your cluster?
Comment 4 Ben Roberts 2021-08-25 14:02:53 MDT
Hi Bas,

You're right, it looks like jobs submitted with sbatch will go to the reservation, but jobs submitted with srun or salloc don't.  It looks like I can get around this by setting 'defer' as one of my SchedulerParameters to prevent the scheduler from trying to schedule these interactive jobs immediately.  Alternatively you can add '--begin=now+1' to the submission.  I'm still looking into this, but I'm curious if you see the same behavior when you add a delay in the start time of these jobs.
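
Concretely, the two workarounds would look roughly like this (assuming no other SchedulerParameters options are already set in slurm.conf):

# slurm.conf: keep the scheduler from trying to start submitted jobs immediately
SchedulerParameters=defer

$ scontrol reconfigure

# or, per submission, push the eligible start time back by one second
$ srun -N1 --begin=now+1 --pty /bin/bash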

Thanks,
Ben
Comment 5 Bas van der Vlies 2021-08-26 03:28:10 MDT
Hi Ben,

  Both solutions work:
    1. The '--begin=now+1' option for srun.
    2. Adding 'defer' to the SchedulerParameters also has the desired effect.

Thanks

Bas
Comment 6 Bas van der Vlies 2021-08-26 09:29:00 MDT
Hi Ben,

 We did some extensive tests with the magnetic reservation and we noticed that if we submit a lot of jobs one after another, most jobs are scheduled on the reservation node, but not all.

-- reservation ---
ReservationName=jupyterhub_course_jhldem001_2021-08-26 StartTime=2021-08-26T15:00:00 EndTime=2021-08-26T23:40:00 Duration=08:40:00
   Nodes=r11n1 NodeCnt=1 CoreCnt=16 Features=(null) PartitionName=shared Flags=SPEC_NODES,MAGNETIC TRES=cpu=16 Users=(null) Accounts=jhldem001 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)


script used: submit.sh
---
#!/bin/bash
#SBATCH --cpus-per-task=1 
#SBATCH --mem=10M
#SBATCH --partition=shared

hostname 
sleep 120
---

 * for i in `seq 1 30`; do echo $i; sbatch submit.sh&  done

Then all jobs run on this node. If I run the above script again while all jobs are in the 'R' state, the new jobs also end up on the same node due to the OVERSUBSCRIBE feature.
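
(The 'shared' partition is defined with oversubscription; I am assuming a partition definition roughly like the following, the exact values may differ:)

PartitionName=shared Nodes=r10n1,r11n1 OverSubscribe=YES State=UP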

But when one of the jobs is in the 'CG' state, the jobs overflow to other nodes even if there are plenty of resources left on the reserved node, e.g.:
 8151460  lcur0144 jhldem00    shared default  submit.sh  CG 2:52 1 1 1 r11n1                
 8151461  lcur0144 jhldem00    shared default  submit.sh  R  2:56 1 1 1 r11n1                
 8151462  lcur0144 jhldem00    shared default  submit.sh  CG 2:57 1 1 1 r11n1                
 8151463  lcur0144 jhldem00    shared default  submit.sh  CG 2:56 1 1 1 r11n1                
 8151517  lcur0144 jhldem00    shared default  submit.sh  R  4:56 1 1 1 r10n1                
 8151518  lcur0144 jhldem00    shared default  submit.sh  R  4:56 1 1 1 r10n1

When all 'CG' jobs are gone, the jobs end up on the reservation again. When we specify '--reservation' on the command line, everything works as expected and we do not see jobs in the 'CG' state preventing jobs from running in the reservation. Is this the intended behaviour?
Comment 7 Ben Roberts 2021-08-26 09:57:39 MDT
Hi Bas,

I'm glad to hear that setting 'defer' on SchedulerParameters or adding '--begin=now+1' to the submission works.  Using the SchedulerParameters seems like a better approach since it doesn't require any change on the part of the users.  Does setting 'defer' work for you as a solution or do you have other requirements preventing you from using that setting?  

For your most recent question, this does look like intended behavior.  When using a magnetic reservation, Slurm will try to place eligible jobs in it, but if they can't start in the reservation right away, they will go to other available nodes.  When you request the reservation explicitly, the job will wait until it is able to run in the reservation.
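
For example, requesting the reservation explicitly (using the 'magnetic' reservation name from the earlier test) would look like this, and the job then stays pending until it can start on the reserved node:

$ sbatch --reservation=magnetic -N1 --wrap='srun sleep 120'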

Thanks,
Ben
Comment 9 Bas van der Vlies 2021-08-26 11:39:35 MDT
Hi Ben,

 I have applied the 'defer' option; that is a suitable option for us. As for the other question: maybe I did not describe the problem we see properly. The case is this: the node can run, e.g., 16 one-core jobs. I submit 2 jobs without a reservation and the jobs are scheduled on the reservation node. As soon as one of these jobs is in the 'CG' state, the next job is scheduled on another node. The job is not stuck in the 'CG' state; it is just completing.

We do not see this behaviour when we specify the `--reservation` parameter: 2 jobs run, 1 is in the 'CG' state, and the new job will happily run in the reservation instead of sitting in the 'PD' state.

To me it feels strange that the job can run when "--reservation" is present but not without it. The resources are available in the reservation.

Regards

Bas
Comment 11 Ben Roberts 2021-08-26 16:20:04 MDT
I'm sorry, I didn't quite get what you meant about jobs in the completing state preventing new jobs from starting on the reserved node.  I re-read your description and I do see what you're talking about.  I set up a test to see if I could reproduce this.  I created a magnetic reservation like I did in my previous example.  I also configured an EpilogSlurmctld with a sleep so that jobs would stay in a completing state longer (making it easier to catch the problem).  I then submitted a series of short jobs that should be eligible to go to this reservation, with a 1 second pause between each submission:

$ for i in {1..10}; do sbatch -n1 --mem=100M --wrap='srun sleep 5'; sleep 1; done
Submitted batch job 31141
Submitted batch job 31142
Submitted batch job 31143
Submitted batch job 31144
Submitted batch job 31145
Submitted batch job 31146
Submitted batch job 31147
Submitted batch job 31148
Submitted batch job 31149
Submitted batch job 31150


While this was going I was collecting squeue output in another terminal and in my case the jobs did start on the reserved node (node05) while there were jobs in a 'CG' state.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31141     debug     wrap      ben CG       0:05      1 
             31146     debug     wrap      ben PD       0:00      1 (None)
             31145     debug     wrap      ben PD       0:00      1 (None)
             31144     debug     wrap      ben PD       0:00      1 (None)
             31142     debug     wrap      ben  R       0:03      1 node05
             31143     debug     wrap      ben  R       0:03      1 node05

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31141     debug     wrap      ben CG       0:05      1 
             31148     debug     wrap      ben PD       0:00      1 (None)
             31147     debug     wrap      ben PD       0:00      1 (None)
             31145     debug     wrap      ben  R       0:02      1 node05
             31146     debug     wrap      ben  R       0:02      1 node05
             31142     debug     wrap      ben  R       0:05      1 node05
             31143     debug     wrap      ben  R       0:05      1 node05
             31144     debug     wrap      ben  S       0:00      1 node05
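
For completeness, the EpilogSlurmctld delay mentioned above is nothing more than a script with a sleep; roughly like this (the path and duration are arbitrary):

# slurm.conf
EpilogSlurmctld=/etc/slurm/epilogctld.sh

# /etc/slurm/epilogctld.sh
#!/bin/bash
# Keep jobs in the completing (CG) state a little longer so the race is easier to observe
sleep 30
exit 0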


It seems like there is something else involved that is triggering this behavior that we haven't identified yet.  However, this issue doesn't look directly related to the original issue, where srun jobs don't go to magnetic reservations.  It sounds like you're OK using 'defer' to get the behavior you want, but I do think it should work the same way whether using sbatch or srun, without having to use 'defer' to introduce a brief delay in scheduling the jobs.  I am still investigating how to fix that issue.  To avoid confusing the two different issues, could you open a new ticket for the fact that jobs in a completing state prevent other jobs from starting in the magnetic reservation, so we can look into it further?  I'll keep working on the original issue and keep you updated.

Thanks,
Ben
Comment 12 Bas van der Vlies 2021-08-26 23:41:14 MDT
Hi Ben,

 You are right, this is another issue. Thanks for doing the test. May I ask which Slurm version you tested the 'CG' problem on?

Good luck with tackling the srun/salloc problem.

Greetz Bas
Comment 15 Ben Roberts 2021-08-27 11:05:48 MDT
Thank you for opening a new ticket for the CG issue.  I ran my previous test with version 21.08, but I just ran it again with version 20.11 and got the same results.  

I did some more investigation into the difference between sbatch and srun jobs and I can see the difference, but I'm handing it over to a colleague (Dominik) who is better qualified to put together a fix for it.  

Thanks,
Ben
Comment 23 Dominik Bartkiewicz 2021-11-09 07:19:37 MST
Hi

Sorry for the late reply.
These commits prevent eligible jobs from starting outside of the magnetic reservation:
https://github.com/SchedMD/slurm/commit/37523ba12d
https://github.com/SchedMD/slurm/commit/6a4accbe9b
Both are included in 21.08.3.
Let me know if we can close this ticket now.

Dominik
Comment 24 Bas van der Vlies 2021-11-09 09:09:47 MST
On this cluster we use an older version, but I will test it on another cluster. You can close it.

Thanks
Comment 25 Jason Booth 2021-11-09 10:15:25 MST
Fixed