Ticket 12369

Summary: Magnetic reservation does not attract jobs if a job on the magnetic reservation node is in 'CG' state
Product: Slurm
Reporter: Bas van der Vlies <bas.vandervlies>
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: jaap.dijkshoorn
Version: 20.02.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12350
Site: SURF

Description Bas van der Vlies 2021-08-27 06:40:15 MDT
I had to open a separate issue for the magnetic reservation problem. It originated in:
 * bug 12350, comment 6

And I got a response:
 * bug 12350, comment 11

To summarize: I create a magnetic reservation on a 16-core node. The way to trigger the problem is not the test from the response, but the test I describe here:
 * terminal 1: watch -n1 "squeue -u <username> | sort"
 * terminal 2: submit 4 jobs --> these jobs are scheduled on the magnetic reservation node.
 * terminal 1: see if a job is in "CG" state
 * terminal 2: quickly submit another 4 jobs --> these jobs are scheduled on other nodes due to the "CG" state.
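
The timing of the second batch depends on spotting a "CG" job in the squeue output. A small helper can do that check; this is only a sketch, and it assumes the default squeue column layout, where the state is the 5th column. The function name count_completing is made up for illustration.

```shell
#!/bin/sh
# Count jobs in the COMPLETING (CG) state from squeue-style output on stdin.
# Assumption: default squeue layout, i.e.
#   JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# so the job state is the 5th whitespace-separated field.
count_completing() {
    # n + 0 forces a numeric 0 when no line matched.
    awk '$5 == "CG" { n++ } END { print n + 0 }'
}
```

For example, `squeue -u <username> | count_completing` prints the number of completing jobs; the second batch of 4 jobs is submitted while that number is greater than 0.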

If there are no jobs in the "CG" state I can just submit the jobs and they are scheduled to the reservation:
 * terminal 2: submit 4 jobs
 * wait 1 sec 
 * terminal 2: again submit 4 jobs
 * All these jobs end up on the magnetic reservation node.

regards 

Bas
Comment 1 jaap.dijkshoorn@surf.nl 2021-08-27 06:55:39 MDT
This is even triggered with only 1 job on a node. As soon as that job is in the CG state, other jobs are scheduled on another node.
Comment 2 Bas van der Vlies 2021-08-27 08:19:34 MDT
When we add '--reservation=magentic' on the command line, the job goes to the 'PD' state:
 * 2739    shared submit.s      bas PD       0:00      1 (Resources)

and waits until the job in the 'CG' state has finished. Then we can submit jobs again, and the node accepts them until another job enters the 'CG' state.

The question is: is the 'CG' state a blocking state for scheduling jobs on a node, and is there an option to override it?

Thanks

Bas
Comment 3 Bas van der Vlies 2021-08-30 03:39:56 MDT
I read in the documentation that this is the expected behaviour and that there are options for it: CompleteWait and reduce_completing_frag. We have to reduce the time spent in the slurmd Epilog script.
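
To see what a cluster currently uses for these settings, the output of `scontrol show config` can be filtered. This is a rough sketch; the function name completing_settings is made up, and it assumes the usual "Name = value" layout of that output. As far as I understand, reduce_completing_frag is set as a value of SchedulerParameters in slurm.conf.

```shell
#!/bin/sh
# Keep only the settings that matter for jobs in the COMPLETING state.
# Reads `scontrol show config` style output ("Name   = value") on stdin.
# The exact-name match avoids related keys such as EpilogMsgTime.
completing_settings() {
    grep -E '^(CompleteWait|SchedulerParameters|Epilog)[[:space:]]*='
}
```

Usage would be `scontrol show config | completing_settings`.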

This issue can be closed.

Bas