I noticed this morning that one of our users has a bunch of jobs queued on Pitzer that aren’t running for reason “Reservation”:

troy@pitzer-login04:~$ squeue -u srb
JOBID PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
13639 serial-40 1aki.ppn srb  PD 0:00 1     (Reservation)
13640 serial-40 1aki.ppn srb  PD 0:00 1     (Reservation)
13645 serial-40 1aki.ppn srb  PD 0:00 1     (Reservation)
13646 serial-40 1aki.ppn srb  PD 0:00 1     (Reservation)
13663 serial-40 1aki.ppn srb  PD 0:00 1     (Reservation)
13730 serial-40 c80qc.g1 srb  PD 0:00 1     (Reservation)
13733 serial-40 ch2.owen srb  PD 0:00 1     (Reservation)
13734 serial-40 ch2.owen srb  PD 0:00 1     (Reservation)
13822 serial-40 h2o.30Se srb  PD 0:00 1     (Reservation)
13865 serial-40 jac9999. srb  PD 0:00 1     (Reservation)
13869 serial-40 jac9999. srb  PD 0:00 1     (Reservation)
13721 serial-40 c80omp.o srb  PD 0:00 1     (Reservation)
13722 serial-40 c80omp.o srb  PD 0:00 1     (Reservation)
13817 serial-40 glidebat srb  PD 0:00 1     (Reservation)
13818 serial-40 glidebat srb  PD 0:00 1     (Reservation)
13896 serial-40 ni.scf.5 srb  PD 0:00 1     (Reservation)
13897 serial-40 ni.scf.5 srb  PD 0:00 1     (Reservation)
13934 serial-40 restart  srb  PD 0:00 1     (Reservation)
13935 serial-40 mynwchem srb  PD 0:00 1     (Reservation)
13886 serial-40 molcas   srb  PD 0:00 1     (Reservation)
13918 serial-40 nono2ch2 srb  PD 0:00 1     (Reservation)
13922 serial-40 nono2ch2 srb  PD 0:00 1     (Reservation)
13925 serial-40 nono2ch2 srb  PD 0:00 1     (Reservation)
13673 serial-40 1aki.ser srb  PD 0:00 1     (Reservation)
13674 serial-40 1aki.ser srb  PD 0:00 1     (Reservation)
13675 serial-40 1aki.ser srb  PD 0:00 1     (Reservation)
13676 serial-40 1aki.ser srb  PD 0:00 1     (Reservation)
13677 serial-40 1aki.ser srb  PD 0:00 1     (Reservation)
13678 serial-40 1aki.ser srb  PD 0:00 1     (Reservation)
13681 serial-40 8_04.oak srb  PD 0:00 1     (Reservation)
13682 serial-40 8_04.oak srb  PD 0:00 1     (Reservation)
13802 serial-40 ethylene srb  PD 0:00 1     (Reservation)
13808 serial-40 ethylene srb  PD 0:00 1     (Reservation)
13841 serial-40 h2o.seri srb  PD 0:00 1     (Reservation)
13882 serial-40 methanec srb  PD 0:00 1     (Reservation)
13883 serial-40 methane  srb  PD 0:00 1     (Reservation)
13913 serial-40 nono2ch2 srb  PD 0:00 1     (Reservation)
13914 serial-40 nono2ch2 srb  PD 0:00 1     (Reservation)
13933 serial-40 restart  srb  PD 0:00 1     (Reservation)
13931 serial-40 columbus srb  PD 0:00 1     (Reservation)
13932 serial-40 columbus srb  PD 0:00 1     (Reservation)

What seems odd about this is that Slurm seems to think that the shorter of these jobs asked for a reservation that they're not allowed to access:

troy@pitzer-login04:~$ scontrol show job 13841
JobId=13841 JobName=h2o.serial.oakley
   UserId=srb(8056) GroupId=PZS0710(5511) MCS_label=N/A
   Priority=100500639 Nice=0 Account=pzs0710 QOS=pitzer-all
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-08-14T17:16:03 EligibleTime=2020-08-14T17:16:03
   AccrueTime=2020-08-14T17:16:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-14T17:16:17
   Partition=serial-40core,serial-48core,gpubackfill-serial AllocNode:Sid=pitzer-login04:6343
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4556M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=PZS0708
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/appl/srb/qa/pitzer4slurm8142020/h2o.serial.oakley.pbs
   WorkDir=/users/appl/srb/qa/pitzer4slurm8142020
   Comment=stdout=/users/appl/srb/qa/pitzer4slurm8142020/%x.o%A
   StdErr=/users/appl/srb/qa/pitzer4slurm8142020/h2o.serial.oakley.o13841
   StdIn=/dev/null
   StdOut=/users/appl/srb/qa/pitzer4slurm8142020/h2o.serial.oakley.o13841
   Power=
   MailUser=srb@osc.edu MailType=END,FAIL

troy@pitzer-login04:~$ scontrol show reservation PZS0708
ReservationName=PZS0708 StartTime=2020-08-17T15:00:00 EndTime=2020-08-17T15:10:00 Duration=00:10:00
   Nodes=p0002 NodeCnt=1 CoreCnt=40 Features=c6420 PartitionName=batch Flags=FLEX,DAILY,MAGNETIC
   TRES=cpu=40
   Users=(null) Accounts=PZS0708 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is cause the reservation to attract reservations even if they aren't allowed to access the reservation's resources. Is that possible?
> I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is cause the reservation to attract reservations even if they aren't allowed to access the reservation's resources.
Ugh. Baer needs caffeine, badly. What I meant to say was:
I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is causing the reservation to attract jobs even if they aren't allowed to access the reservation's resources.
BTW, I should also mention that if I manually remove the reservation setting from these jobs (e.g. "scontrol update jobid=13817 reservation="), they start running immediately.
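For anyone who needs to apply the same workaround to many jobs at once, here is a minimal sketch that loops over a user's pending jobs and clears the reservation on those whose pending reason is "Reservation". The function name clear_stuck_reservations is hypothetical and not part of Slurm; the sketch assumes none of the matched jobs legitimately requested a reservation, since it would clear those too.

```shell
#!/usr/bin/env bash
# Sketch only: clear the reservation field on a user's pending jobs that are
# stuck with reason "Reservation". clear_stuck_reservations is a hypothetical
# helper name, not a Slurm command.
clear_stuck_reservations() {
    local user="$1" jobid reason
    # -h: no header, -t PD: pending jobs only, -o "%i %r": job ID and reason
    squeue -h -u "$user" -t PD -o "%i %r" |
    while read -r jobid reason; do
        if [ "$reason" = "Reservation" ]; then
            # Same command as the manual workaround above, one job at a time
            scontrol update jobid="$jobid" reservation=
        fi
    done
}
```

Running "clear_stuck_reservations srb" would then issue one "scontrol update" per affected job. Jobs pending for any other reason are left alone.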
Hi, I can recreate this. I'll let you know when the fix is in the repo. Dominik
Any updates on this?
Any updates? We have some use cases around classroom HPC usage where working magnetic reservations would be useful.
Troy, a fix for this, along with fixes for other issues with magnetic reservations found while looking at it, has been checked into 20.02.6 (commits 815751a629b3e3..4948d1adc87efc). If you have any other issues after testing, please reopen this or open a new bug on the matter. Thanks for reporting and helping figure it out :).