Bug 9593 - Magnetic reservations attract jobs not allowed to use them?
Status: RESOLVED FIXED
Product: Slurm
Classification: Unclassified
Component: reservations
Version: 20.02.4
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Dominik Bartkiewicz
Reported: 2020-08-17 08:44 MDT by Troy Baer
Modified: 2020-10-02 16:45 MDT
Site: Ohio State OSC
Version Fixed: 20.02.6 20.11.0pre1
Description Troy Baer 2020-08-17 08:44:10 MDT
I noticed this morning that one of our users has a bunch of jobs queued on Pitzer that aren’t running for reason “Reservation”:
 
troy@pitzer-login04:~$ squeue -u srb
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             13639 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13640 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13645 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13646 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13663 serial-40 1aki.ppn      srb PD       0:00      1 (Reservation)
             13730 serial-40 c80qc.g1      srb PD       0:00      1 (Reservation)
             13733 serial-40 ch2.owen      srb PD       0:00      1 (Reservation)
             13734 serial-40 ch2.owen      srb PD       0:00      1 (Reservation)
             13822 serial-40 h2o.30Se      srb PD       0:00      1 (Reservation)
             13865 serial-40 jac9999.      srb PD       0:00      1 (Reservation)
             13869 serial-40 jac9999.      srb PD       0:00      1 (Reservation)
             13721 serial-40 c80omp.o      srb PD       0:00      1 (Reservation)
             13722 serial-40 c80omp.o      srb PD       0:00      1 (Reservation)
             13817 serial-40 glidebat      srb PD       0:00      1 (Reservation)
             13818 serial-40 glidebat      srb PD       0:00      1 (Reservation)
             13896 serial-40 ni.scf.5      srb PD       0:00      1 (Reservation)
             13897 serial-40 ni.scf.5      srb PD       0:00      1 (Reservation)
             13934 serial-40  restart      srb PD       0:00      1 (Reservation)
             13935 serial-40 mynwchem      srb PD       0:00      1 (Reservation)
             13886 serial-40   molcas      srb PD       0:00      1 (Reservation)
             13918 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13922 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13925 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13673 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13674 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13675 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13676 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13677 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13678 serial-40 1aki.ser      srb PD       0:00      1 (Reservation)
             13681 serial-40 8_04.oak      srb PD       0:00      1 (Reservation)
             13682 serial-40 8_04.oak      srb PD       0:00      1 (Reservation)
             13802 serial-40 ethylene      srb PD       0:00      1 (Reservation)
             13808 serial-40 ethylene      srb PD       0:00      1 (Reservation)
             13841 serial-40 h2o.seri      srb PD       0:00      1 (Reservation)
             13882 serial-40 methanec      srb PD       0:00      1 (Reservation)
             13883 serial-40  methane      srb PD       0:00      1 (Reservation)
             13913 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13914 serial-40 nono2ch2      srb PD       0:00      1 (Reservation)
             13933 serial-40  restart      srb PD       0:00      1 (Reservation)
             13931 serial-40 columbus      srb PD       0:00      1 (Reservation)
             13932 serial-40 columbus      srb PD       0:00      1 (Reservation)
 
What is odd is that Slurm appears to think that the shorter of these jobs asked for a reservation that they're not allowed to access:
 
troy@pitzer-login04:~$ scontrol show job 13841
JobId=13841 JobName=h2o.serial.oakley
   UserId=srb(8056) GroupId=PZS0710(5511) MCS_label=N/A
   Priority=100500639 Nice=0 Account=pzs0710 QOS=pitzer-all
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-08-14T17:16:03 EligibleTime=2020-08-14T17:16:03
   AccrueTime=2020-08-14T17:16:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-14T17:16:17
   Partition=serial-40core,serial-48core,gpubackfill-serial AllocNode:Sid=pitzer-login04:6343
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4556M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=PZS0708
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/appl/srb/qa/pitzer4slurm8142020/h2o.serial.oakley.pbs
   WorkDir=/users/appl/srb/qa/pitzer4slurm8142020
   Comment=stdout=/users/appl/srb/qa/pitzer4slurm8142020/%x.o%A 
   StdErr=/users/appl/srb/qa/pitzer4slurm8142020/h2o.serial.oakley.o13841
   StdIn=/dev/null
   StdOut=/users/appl/srb/qa/pitzer4slurm8142020/h2o.serial.oakley.o13841
   Power=
   MailUser=srb@osc.edu MailType=END,FAIL
 
troy@pitzer-login04:~$ scontrol show reservation PZS0708
ReservationName=PZS0708 StartTime=2020-08-17T15:00:00 EndTime=2020-08-17T15:10:00 Duration=00:10:00
   Nodes=p0002 NodeCnt=1 CoreCnt=40 Features=c6420 PartitionName=batch Flags=FLEX,DAILY,MAGNETIC
   TRES=cpu=40
   Users=(null) Accounts=PZS0708 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
 
I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is cause the reservation to attract reservations even if they aren't allowed to access the reservation's resources.  Is that possible?
Comment 1 Troy Baer 2020-08-17 10:24:16 MDT
> I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is cause the reservation to attract reservations even if they aren't allowed to access the reservation's resources.

Ugh.  Baer needs caffeine, badly.  What I meant to say was:

I'm fairly sure that the user didn't explicitly request this reservation, which makes me suspect that the MAGNETIC (formerly PROMISCUOUS) flag on the reservation is causing the reservation to attract jobs even if they aren't allowed to access the reservation's resources.
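The mismatch can be checked mechanically from the two scontrol outputs in the description. The following is a minimal sketch, not Slurm's actual access logic: the helper names `parse_scontrol` and `job_matches_reservation_acl` are hypothetical, the comparison is assumed to be case-insensitive, and only plain allow lists in Users= and Accounts= are handled (entries prefixed with "-" and any other access rules are ignored).

```python
def parse_scontrol(text):
    """Parse whitespace-separated Key=Value tokens from `scontrol show ...` output."""
    record = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            # Keep the first occurrence of each key.
            record.setdefault(key, value)
    return record


def _names(value):
    """Turn a comma-separated list into a lowercase set; "(null)" means empty."""
    if value in ("", "(null)"):
        return set()
    return {name.lower() for name in value.split(",")}


def job_matches_reservation_acl(job, resv):
    """True if the job's user or account appears in the reservation's allow lists.

    Assumption: case-insensitive match on plain names only.
    """
    user = job.get("UserId", "").split("(")[0].lower()
    account = job.get("Account", "").lower()
    return user in _names(resv.get("Users", "")) or account in _names(resv.get("Accounts", ""))


# Excerpts copied from the outputs in the description.
job_text = """UserId=srb(8056) GroupId=PZS0710(5511) MCS_label=N/A
Priority=100500639 Nice=0 Account=pzs0710 QOS=pitzer-all
Reservation=PZS0708"""
resv_text = """ReservationName=PZS0708 StartTime=2020-08-17T15:00:00
Users=(null) Accounts=PZS0708 Licenses=(null) State=INACTIVE"""

print(job_matches_reservation_acl(parse_scontrol(job_text), parse_scontrol(resv_text)))
```

For job 13841 this prints False: the job's account is pzs0710, while the reservation only lists PZS0708, yet the job still shows Reservation=PZS0708.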
Comment 2 Troy Baer 2020-08-17 13:27:24 MDT
BTW, I should also mention that if I manually remove the reservation setting from these jobs (e.g. "scontrol update jobid=13817 reservation="), they start running immediately.
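Applying that workaround to every affected job could be scripted roughly as follows. This is a sketch under assumptions: it parses squeue text in the default column layout shown in the description, assumes job names contain no spaces, and only builds the command strings rather than running them. The function names are hypothetical.

```python
def reservation_blocked_jobs(squeue_text):
    """Return job IDs of pending jobs whose reason column is (Reservation)."""
    job_ids = []
    for line in squeue_text.splitlines():
        fields = line.split()
        # Expected columns: JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
        if (
            len(fields) == 8
            and fields[0].isdigit()
            and fields[4] == "PD"
            and fields[7] == "(Reservation)"
        ):
            job_ids.append(fields[0])
    return job_ids


def clear_reservation_commands(job_ids):
    """Build (but do not run) one scontrol command per job ID."""
    return [f"scontrol update jobid={job_id} reservation=" for job_id in job_ids]


# Two lines copied from the squeue output in the description.
sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             13817 serial-40 glidebat      srb PD       0:00      1 (Reservation)
             13818 serial-40 glidebat      srb PD       0:00      1 (Reservation)
"""
for command in clear_reservation_commands(reservation_blocked_jobs(sample)):
    print(command)
```

Printing the commands instead of executing them leaves room to review the list first, since clearing the reservation field is only appropriate for jobs that never asked for one.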
Comment 4 Dominik Bartkiewicz 2020-08-18 08:58:52 MDT
Hi

I can recreate this.
I will let you know when the fix is in the repo.

Dominik
Comment 6 Troy Baer 2020-09-11 07:20:03 MDT
Any updates on this?
Comment 12 Troy Baer 2020-09-18 09:21:58 MDT
Any updates?  We have some use cases around classroom HPC usage where working magnetic reservations would be useful.
Comment 17 Danny Auble 2020-09-21 09:15:56 MDT
Troy, a fix for this issue, along with fixes for other magnetic reservation problems found while investigating it, has been checked into 20.02.6 in commits 815751a629b3e3..4948d1adc87efc.

If you have any other issues after testing, please reopen this or open a new bug on the matter.

Thanks for reporting and helping figure it out :).
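A site wanting to check whether a given Slurm version string is at or past the fixed release could use something like the sketch below. It assumes a plain major.minor.micro comparison against 20.02.6 (taken from the Version Fixed field) and ignores suffixes such as "pre1"; the names `FIXED_IN`, `version_tuple`, and `has_magnetic_fix` are hypothetical.

```python
import re

# From the "Version Fixed" field of this bug: 20.02.6 (and 20.11.0pre1).
FIXED_IN = (20, 2, 6)


def version_tuple(version):
    """Extract the leading numeric major.minor.micro part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_magnetic_fix(version):
    """True if the version is 20.02.6 or later (assumption: simple tuple compare)."""
    return version_tuple(version) >= FIXED_IN


for version in ("20.02.4", "20.02.6", "20.11.0pre1"):
    print(version, has_magnetic_fix(version))
```

The version reported in this bug, 20.02.4, compares as older than the fixed release, so it would need an upgrade to pick up the fix.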