We have installed Slurm 20.02.7 and I am trying to use the new reservation flag MAGNETIC:

* https://slurm.schedmd.com/reservations.html

From this page I understand that a job will land in the reservation even if we did not specify the reservation name. I tested it on our cluster setup but it does not seem to work as expected.

I created a reservation for 1 node for user 'bas' with the MAGNETIC flag. I submitted a job, and to my surprise the job was scheduled on a free node and not on the node in the reservation. It is only scheduled in the reservation if all other nodes are occupied. Is this the default behavior or did I miss a setting?

I have set all available nodes offline except the reservation node, and then I see this:

```
bas@batch2:~/src$ srun -N1 --pty /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 1713 queued and waiting for resources
srun: job 1713 has been allocated resources
```

From this I conclude that the magnetic reservation is considered last.

We do a lot of JupyterHub course setups, and for each course we set up multiple reservations. For us it would ease the setup if a magnetic reservation were tried first. To me it also seems logical to use the magnetic reservation first, because we set up this reservation for a reason.

Regards,

Bas van der Vlies
Hi Bas,

The behavior you expected is the correct behavior: jobs that qualify for the reservation should go to the reservation before they run on other nodes. I ran a quick test as an example of how it should work.

1. I created a reservation on one node with the magnetic flag:

```
$ scontrol create reservation reservationname=magnetic nodes=node05 starttime=now duration=1:00:00 user=ben flags=magnetic
Reservation created: magnetic

$ scontrol show reservations
ReservationName=magnetic StartTime=2021-08-25T11:09:45 EndTime=2021-08-25T12:09:45 Duration=01:00:00
   Nodes=node05 NodeCnt=1 CoreCnt=24 Features=(null) PartitionName=(null) Flags=SPEC_NODES,MAGNETIC
   TRES=cpu=24
   Users=ben Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   Watts=n/a MaxStartDelay=(null)
```

2. I submitted a job as my user so that it should qualify for that reservation:

```
$ sbatch -N1 --wrap='srun sleep 120'
Submitted batch job 31006
```

3. It does go to node05, which is the reserved node:

```
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31006     debug     wrap      ben  R       0:01      1 node05

$ scontrol show job 31006
JobId=31006 JobName=wrap
   UserId=ben(1000) GroupId=ben(1000) MCS_label=N/A
   Priority=8573 Nice=0 Account=sub1 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:09 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-08-25T11:10:04 EligibleTime=2021-08-25T11:10:04
   AccrueTime=2021-08-25T11:10:04
   StartTime=2021-08-25T11:10:05 EndTime=Unknown Deadline=N/A
   PreemptEligibleTime=2021-08-25T11:10:05 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-25T11:10:05 Scheduler=Main
   Partition=debug AllocNode:Sid=kitt:7074
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node05
   BatchHost=node05
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=15678M,node=1,billing=9
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=15678M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=magnetic
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/ben/slurm
   StdErr=/home/ben/slurm/slurm-31006.out
   StdIn=/dev/null
   StdOut=/home/ben/slurm/slurm-31006.out
   Power=
```

Can you send information on the steps you're taking to create the reservation? I would also like to see the output of 'scontrol show node <node_name>' for the node you use for the reservation.

Thanks,
Ben
Dear Ben,

Thanks for the quick response. I just tested it on the cluster and it works as expected. It looks like a quirk at our site. Sorry for the confusion. Tomorrow I will do some other tests and let you know if the issue can be closed.

Thanks
Maybe I was lucky. But we did some more tests and now it is failing:

```
$ scontrol create reservation reservationname=magnetic nodecnt=1 starttime=now duration=1:00:00 account=bas flags=magnetic

$ scontrol show reservationname=magnetic
ReservationName=magnetic StartTime=2021-08-25T19:26:01 EndTime=2021-08-25T20:26:01 Duration=01:00:00
   Nodes=r13n1 NodeCnt=1 CoreCnt=16 Features=(null) PartitionName=normal Flags=MAGNETIC
   TRES=cpu=16
   Users=(null) Accounts=bas Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

$ srun -N1 --pty /bin/bash
19:34 r27n17.lisa.surfsara.nl:~ $

$ sbatch -N1 --wrap='srun sleep 120'
8149075    normal     wrap      bas  R       0:04      1 r13n1
```

So 'sbatch' works as expected but 'srun' does not. We did all our tests with 'srun'; that explains a lot. Does 'srun' work on your cluster?
Hi Bas,

You're right, it looks like jobs submitted with sbatch will go to the reservation, but jobs submitted with srun or salloc don't. It looks like I can get around this by setting 'defer' as one of my SchedulerParameters to prevent the scheduler from trying to schedule these interactive jobs immediately. Alternatively, you can add '--begin=now+1' to the submission.

I'm still looking into this, but I'm curious if you see the same behavior when you add a delay in the start time of these jobs.

Thanks,
Ben
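[Editor's note] The two workarounds above might look like the following sketch. Assumptions: the slurm.conf shown has no other SchedulerParameters values (on a real cluster, 'defer' would be appended to the existing comma-separated list), and '--begin=now+1' delays eligibility by one second, since Slurm's now+N offset defaults to seconds.

```shell
# slurm.conf: add 'defer' so the scheduler skips the immediate
# scheduling attempt on every job submission; the job is then placed
# by the main scheduling loop, which honors magnetic reservations
SchedulerParameters=defer

# Per-job alternative: push the job's eligible time 1 second into the
# future so it cannot be scheduled immediately at submission
srun --begin=now+1 -N1 --pty /bin/bash
```

The 'defer' approach changes behavior cluster-wide and needs no action from users; '--begin' works per job but every submitter has to remember it.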
Hi Ben,

Both solutions work:

1. the '--begin=now+1' option for srun.
2. adding 'defer' to the SchedulerParameters also has the desired effect.

Thanks,
Bas
Hi Ben,

We did some extensive tests with the magnetic reservation and we noticed that if we submit a lot of jobs after each other, most jobs are scheduled on the reservation node, but not all.

The reservation:

```
ReservationName=jupyterhub_course_jhldem001_2021-08-26 StartTime=2021-08-26T15:00:00 EndTime=2021-08-26T23:40:00 Duration=08:40:00
   Nodes=r11n1 NodeCnt=1 CoreCnt=16 Features=(null) PartitionName=shared Flags=SPEC_NODES,MAGNETIC
   TRES=cpu=16
   Users=(null) Accounts=jhldem001 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
```

The script used, submit.sh:

```
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=10M
#SBATCH --partition=shared

hostname
sleep 120
```

Submitted with:

```
$ for i in `seq 1 30`; do echo $i; sbatch submit.sh& done
```

Then all jobs run on this node, and I can run the above loop again: when all jobs are in the 'R' state, the new jobs end up on the same node due to the OVERSUBSCRIBE feature. But when one of the jobs is in the 'CG' state, the new jobs overflow to other nodes even if there are plenty of resources left on the node, eg:

```
8151460 lcur0144 jhldem00 shared default submit.sh CG 2:52 1 1 1 r11n1
8151461 lcur0144 jhldem00 shared default submit.sh  R 2:56 1 1 1 r11n1
8151462 lcur0144 jhldem00 shared default submit.sh CG 2:57 1 1 1 r11n1
8151463 lcur0144 jhldem00 shared default submit.sh CG 2:56 1 1 1 r11n1
8151517 lcur0144 jhldem00 shared default submit.sh  R 4:56 1 1 1 r10n1
8151518 lcur0144 jhldem00 shared default submit.sh  R 4:56 1 1 1 r10n1
```

When all 'CG' jobs are gone, the jobs end up in the reservation again. When we specify '--reservation' on the command line, everything works as expected and we do not see the problem with jobs in the 'CG' state preventing jobs from running in the reservation.

Is this the intended behaviour?
Hi Bas,

I'm glad to hear that setting 'defer' in SchedulerParameters or adding '--begin=now+1' to the submission works. Using SchedulerParameters seems like a better approach since it doesn't require any change on the part of the users. Does setting 'defer' work for you as a solution, or do you have other requirements preventing you from using that setting?

For your most recent question, this does look like intended behavior. A magnetic reservation will try to attract jobs that are eligible to run in it, but if they can't start in the reservation right away, they will go to other available nodes. When you request the reservation explicitly, the job will wait until it is able to run in the reservation.

Thanks,
Ben
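[Editor's note] The contrast described above can be made concrete with the reservation name from the earlier example; this is an illustrative sketch, not output from either cluster:

```shell
# Magnetic (implicit): an eligible job is attracted to the reservation,
# but may start on any other available node if the reserved node is busy
sbatch submit.sh

# Explicit: the job is pinned to the reservation; if the reserved node
# is busy, it pends (PD) until resources inside the reservation free up
sbatch --reservation=jupyterhub_course_jhldem001_2021-08-26 submit.sh
```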
Hi Ben,

I have applied the 'defer' option; that is a suitable option for us.

As for the other question: maybe I did not describe the problem we see properly. The case is this: a node can run e.g. 16 one-core jobs. I submit 2 jobs without a reservation and the jobs are scheduled on the reservation node. As soon as one of these jobs is in the 'CG' state, the next job is scheduled on another node. The job is not stuck in the 'CG' state; it is just completing.

This behaviour does not occur when we specify the '--reservation' parameter: with 2 jobs running and 1 in the 'CG' state, the new job will happily run in the reservation and does not sit in the 'PD' state. To me this feels strange: why can the job run when '--reservation' is present but not without it? The resources are available in the reservation.

Regards,
Bas
I'm sorry, at first I didn't quite get what you meant about jobs in completing preventing new jobs from starting on the reserved node. I re-read your description and I do see what you're talking about.

I set up a test to see if I could reproduce this. I created a magnetic reservation like I did in my previous example. I also configured an EpilogSlurmctld with a sleep so that jobs would stay in a completing state longer (making it easier to catch the problem). I then submitted a series of short jobs that should be eligible to go to this reservation, with a 1 second pause between each submission:

```
$ for i in {1..10}; do sbatch -n1 --mem=100M --wrap='srun sleep 5'; sleep 1; done
Submitted batch job 31141
Submitted batch job 31142
Submitted batch job 31143
Submitted batch job 31144
Submitted batch job 31145
Submitted batch job 31146
Submitted batch job 31147
Submitted batch job 31148
Submitted batch job 31149
Submitted batch job 31150
```

While this was going on, I was collecting squeue output in another terminal, and in my case the jobs did start on the reserved node (node05) while there were jobs in a 'CG' state:

```
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31141     debug     wrap      ben CG       0:05      1
             31146     debug     wrap      ben PD       0:00      1 (None)
             31145     debug     wrap      ben PD       0:00      1 (None)
             31144     debug     wrap      ben PD       0:00      1 (None)
             31142     debug     wrap      ben  R       0:03      1 node05
             31143     debug     wrap      ben  R       0:03      1 node05

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31141     debug     wrap      ben CG       0:05      1
             31148     debug     wrap      ben PD       0:00      1 (None)
             31147     debug     wrap      ben PD       0:00      1 (None)
             31145     debug     wrap      ben  R       0:02      1 node05
             31146     debug     wrap      ben  R       0:02      1 node05
             31142     debug     wrap      ben  R       0:05      1 node05
             31143     debug     wrap      ben  R       0:05      1 node05
             31144     debug     wrap      ben  S       0:00      1 node05
```

It seems like there is something else involved that is triggering this behavior that we haven't identified yet. However, this issue doesn't look directly related to the original issue, where srun jobs don't go to magnetic reservations.
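[Editor's note] The EpilogSlurmctld trick mentioned above could be reproduced with a script along these lines, pointed to by an `EpilogSlurmctld=/etc/slurm/epilog_ctld.sh` line in slurm.conf; the path and the sleep duration are illustrative assumptions, not the actual test configuration:

```shell
#!/bin/bash
# Hypothetical EpilogSlurmctld script: it runs on the controller when a
# job finishes, and the sleep keeps the job in the COMPLETING (CG)
# state longer, making the scheduling race easier to observe
sleep 30
```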
It sounds like you're ok using 'defer' to get the behavior you want, but I do think that it should work the same way, whether using sbatch or srun, without having to use defer to introduce a brief delay in scheduling the jobs. I am still investigating how to fix that issue. In order to avoid confusing the two different issues I would ask if you could open a new ticket for the fact that jobs in a completing state are preventing other jobs from starting in the magnetic reservation so we can look into it further. I'll keep working on the original issue and keep you updated. Thanks, Ben
Hi Ben,

You are right, this is another issue. Thanks for running the test. May I ask which Slurm version you tested the 'CG' problem on?

Good luck with tackling the srun/salloc problem.

Greetz,
Bas
Thank you for opening a new ticket for the CG issue. I ran my previous test with version 21.08, but I just ran it again with version 20.11 and got the same results. I did some more investigation into the difference between sbatch and srun jobs and I can see the difference, but I'm handing it over to a colleague (Dominik) who is better qualified to put together a fix for it. Thanks, Ben
Hi,

Sorry for the late reply. These commits prevent eligible jobs from starting outside of the magnetic reservation:

https://github.com/SchedMD/slurm/commit/37523ba12d
https://github.com/SchedMD/slurm/commit/6a4accbe9b

Both are included in 21.08.3. Let me know if we can close this ticket now.

Dominik
On this cluster we use an older version, but I will test it on another cluster. You can close it. Thanks
Fixed