Ticket 9593

Summary: Magnetic reservations attract jobs not allowed to use them?
Product: Slurm
Reporter: Troy Baer <troy>
Component: reservations
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: bas.vandervlies, sts, tdockendorf
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: Ohio State OSC
Machine Name:
CLE Version:
Version Fixed: 20.02.6 20.11.0pre1
Target Release: ---
DevPrio: ---

Description Troy Baer 2020-08-17 08:44:10 MDT
I noticed this morning that one of our users has a bunch of jobs queued on Pitzer that aren’t running for reason “Reservation”:
troy@pitzer-login04:~$ squeue -u srb
             13639 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13640 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13645 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13646 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13663 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13730 serial-40 c80qc.g1      srb PD       0:00      1 (Reservation)
             13733 serial-40 ch2.owen      srb PD       0:00      1 (Reservation)
             13734 serial-40 ch2.owen      srb PD       0:00      1 (Reservation)
             13822 serial-40 h2o.30Se      srb PD       0:00      1 (Reservation)
             13865 serial-40 jac9999.      srb PD       0:00      1 (Reservation)
             13869 serial-40 jac9999.      srb PD       0:00      1 (Reservation)
             13721 serial-40 c80omp.o      srb PD       0:00      1 (Reservation)
             13722 serial-40 c80omp.o      srb PD       0:00      1 (Reservation)
             13817 serial-40 glidebat      srb PD       0:00      1 (Reservation)
             13818 serial-40 glidebat      srb PD       0:00      1 (Reservation)
             13896 serial-40 ni.scf.5      srb PD       0:00      1 (Reservation)
             13897 serial-40 ni.scf.5      srb PD       0:00      1 (Reservation)
             13934 serial-40  restart      srb PD       0:00      1 (Reservation)
             13935 serial-40 mynwchem      srb PD       0:00      1 (Reservation)
             13886 serial-40   molcas      srb PD       0:00      1 (Reservation)
             13918 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13922 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13925 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13673 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13674 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13675 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13676 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13677 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13678 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13681 serial-40 8_04.oak      srb PD       0:00      1 (Reservation)
             13682 serial-40 8_04.oak      srb PD       0:00      1 (Reservation)
             13802 serial-40 ethylene      srb PD       0:00      1 (Reservation)
             13808 serial-40 ethylene      srb PD       0:00      1 (Reservation)
             13841 serial-40 h2o.seri      srb PD       0:00      1 (Reservation)
             13882 serial-40 methanec      srb PD       0:00      1 (Reservation)
             13883 serial-40  methane      srb PD       0:00      1 (Reservation)
             13913 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13914 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13933 serial-40  restart      srb PD       0:00      1 (Reservation)
             13931 serial-40 columbus      srb PD       0:00      1 (Reservation)
             13932 serial-40 columbus      srb PD       0:00      1 (Reservation)
What seems odd about this is that Slurm seems to think that the shorter of these jobs asked for a reservation that they're not allowed to access:
troy@pitzer-login04:~$ scontrol show job 13841
JobId=13841 JobName=h2o.serial.oakley
   UserId=srb(8056) GroupId=PZS0710(5511) MCS_label=N/A
   Priority=100500639 Nice=0 Account=pzs0710 QOS=pitzer-all
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-08-14T17:16:03 EligibleTime=2020-08-14T17:16:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-14T17:16:17
   Partition=serial-40core,serial-48core,gpubackfill-serial AllocNode:Sid=pitzer-login04:6343
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   MailUser=srb@osc.edu MailType=END,FAIL
troy@pitzer-login04:~$ scontrol show reservation PZS0708
ReservationName=PZS0708 StartTime=2020-08-17T15:00:00 EndTime=2020-08-17T15:10:00 Duration=00:10:00
   Nodes=p0002 NodeCnt=1 CoreCnt=40 Features=c6420 PartitionName=batch Flags=FLEX,DAILY,MAGNETIC
   Users=(null) Accounts=PZS0708 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is cause the reservation to attract reservations even if they aren't allowed to access the reservation's resources.  Is that possible?
Comment 1 Troy Baer 2020-08-17 10:24:16 MDT
> I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is cause the reservation to attract reservations even if they aren't allowed to access the reservation's resources.

Ugh.  Baer needs caffeine, badly.  What I meant to say was:

I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is causing the reservation to attract jobs even if they aren't allowed to access the reservation's resources.
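The mismatch can be checked by hand from the output above: the job runs under Account=pzs0710, while the reservation lists only Accounts=PZS0708. A minimal sketch of that comparison follows. The helper names `field` and `account_allowed` are hypothetical, the case-insensitive match is an assumption, and the check looks only at the Accounts list (it ignores Users= and any other access rule Slurm applies).

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first KEY=VALUE token on stdin.
field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Hypothetical helper: succeed if account $1 appears in the comma-separated
# list $2. Comparing case-insensitively is an assumption of this sketch.
account_allowed() {
    job=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    list=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case ",$list," in
        *",$job,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Lines copied from the scontrol output in this ticket.
job_account=$(printf '%s\n' '   Priority=100500639 Nice=0 Account=pzs0710 QOS=pitzer-all' | field Account)
resv_accounts=$(printf '%s\n' '   Users=(null) Accounts=PZS0708 Licenses=(null) State=INACTIVE' | field Accounts)

if account_allowed "$job_account" "$resv_accounts"; then
    echo "job account $job_account may use the reservation"
else
    echo "job account $job_account is not in Accounts=$resv_accounts"
fi
```

On a live system the two `printf` lines could be replaced by `scontrol show job 13841` and `scontrol show reservation PZS0708` piped into `field`.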
Comment 2 Troy Baer 2020-08-17 13:27:24 MDT
BTW, I should also mention that if I manually remove the reservation setting from these jobs (e.g. "scontrol update jobid=13817 reservation="), they start running immediately.
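As a stopgap, the same workaround could be applied to every affected job at once. The sketch below is a dry run under stated assumptions: the function name `clear_reservation_cmds` is hypothetical, it expects "JOBID REASON" pairs on stdin (for example from `squeue -h -u srb -t PD -o "%i %r"`), and it only prints the `scontrol` commands rather than running them. Job 13900 in the sample input is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "JOBID REASON" pairs on stdin and print (not run)
# the scontrol command that clears the reservation field on each job whose
# pending reason is exactly "Reservation".
clear_reservation_cmds() {
    while read -r jobid reason; do
        if [ "$reason" = "Reservation" ]; then
            echo "scontrol update jobid=$jobid reservation="
        fi
    done
}

# Dry run on two sample lines; 13900 is an invented job with another reason.
printf '13817 Reservation\n13900 Priority\n' | clear_reservation_cmds
```

Printing the commands first makes it possible to review them before piping the output to `sh` to apply the change.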
Comment 4 Dominik Bartkiewicz 2020-08-18 08:58:52 MDT

I can recreate this.
I'll let you know when the fix is in the repo.

Comment 6 Troy Baer 2020-09-11 07:20:03 MDT
Any updates on this?
Comment 12 Troy Baer 2020-09-18 09:21:58 MDT
Any updates?  We have some use cases around classroom HPC usage where working magnetic reservations would be useful.
Comment 17 Danny Auble 2020-09-21 09:15:56 MDT
Troy, a fix for this, along with other magnetic reservation issues found while looking into it, has been checked into 20.02.6 in commits 815751a629b3e3..4948d1adc87efc.

If you have any other issues after testing, please reopen this or open a new bug on the matter.

Thanks for reporting and helping figure it out :).