Ticket 5579 - Scheduling issues (sequential execution) for heterogeneous jobs with uneven number of nodes per packgroup
Summary: Scheduling issues (sequential execution) for heterogeneous jobs with uneven number of nodes per packgroup
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Heterogeneous Jobs
Version: 17.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
 
Reported: 2018-08-17 02:50 MDT by Benedikt von St. Vieth
Modified: 2019-03-18 07:06 MDT
CC: 4 users

Site: Jülich
Version Fixed: 18.08.5, 19.05.0pre2


Attachments
Testsystem slurm.conf (4.88 KB, text/plain)
2018-08-17 06:21 MDT, Benedikt von St. Vieth
topology.conf (414 bytes, text/plain)
2018-08-17 07:57 MDT, Benedikt von St. Vieth
gres.conf (866 bytes, text/plain)
2018-08-17 07:57 MDT, Benedikt von St. Vieth
qos.txt (2.93 KB, text/plain)
2018-08-17 07:57 MDT, Benedikt von St. Vieth
slurmctld.log debug2 (4.92 KB, application/x-bzip2)
2018-08-17 08:22 MDT, Benedikt von St. Vieth
slurmctld.log debug2 backfill (16.06 KB, application/x-bzip2)
2018-08-20 08:06 MDT, Benedikt von St. Vieth
slurmctld.log with backfill-backfillmap-priority-heterojobs (30.69 KB, application/x-bzip2)
2018-08-22 06:36 MDT, Benedikt von St. Vieth
sinfo and squeue output each minute via debug with backfill-backfillmap-priority-heterojobs (967 bytes, application/x-bzip2)
2018-08-22 06:36 MDT, Benedikt von St. Vieth
17.11 patch (9.47 KB, patch)
2018-09-03 08:27 MDT, Alejandro Sanchez
18.08 bf_max_job_test especial treatment for hetjobs (3.57 KB, patch)
2018-11-05 04:12 MST, Alejandro Sanchez
17.11 bf_max_job_test especial treatment for hetjobs (1.76 KB, patch)
2018-11-05 05:36 MST, Alejandro Sanchez
18.08 - hetjob_prio v3 (15.87 KB, patch)
2018-12-21 11:14 MST, Alejandro Sanchez
17.11 - bf_hetjob_prio (17.79 KB, patch)
2019-01-07 04:48 MST, Alejandro Sanchez
1711 - standalone combined prio immediate state reason (30.28 KB, patch)
2019-02-06 07:46 MST, Alejandro Sanchez

Description Benedikt von St. Vieth 2018-08-17 02:50:33 MDT
Hi

at the moment we are evaluating Slurm 17.11 on our testsystem, focus is on this partition:

PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
batch*       up 1-00:00:00        16/5/0/21  j3c[006-020,025,035,053-056]

When submitting single-node jobs, they are scheduled fine.

When a jobpack is submitted with "sbatch -N1 : -N1" everything is fine and again 20 nodes are allocated all the time.
(for job in $(seq -w 0 10); do sbatch -J PACK --time=00:60:00 -N 1 --wrap='sleep 120; hostname' : -N 1 --wrap='sleep 120; hostname'; done)

When a jobpack is submitted with "sbatch -N4 : -N1" only 5 and not 20 nodes are allocated.
(for job in $(seq -w 0 10); do sbatch -J PACK --time=00:60:00 -N 4 --wrap='sleep 120; hostname' : -N 1 --wrap='sleep 120; hostname'; done)
Comment 1 Alejandro Sanchez 2018-08-17 04:32:06 MDT
If I use --exclusive, all the nodes are allocated:

alex@ibiza:~/t$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
p1*          up   infinite     20   idle compute[1-20]
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
alex@ibiza:~/t$ for job in $(seq -w 1 4); do sbatch -J PACK --time=00:60:00 --exclusive -N 4 --wrap='sleep 120; hostname' : --exclusive -N 1 --wrap='sleep 120; hostname'; done
Submitted batch job 20036
Submitted batch job 20038
Submitted batch job 20040
Submitted batch job 20042
alex@ibiza:~/t$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
p1*          up   infinite     20  alloc compute[1-20]
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20036+0        p1     PACK     alex  R       0:08      4 compute[1-4]
           20036+1        p1     PACK     alex  R       0:08      1 compute5
           20038+0        p1     PACK     alex  R       0:08      4 compute[6-9]
           20038+1        p1     PACK     alex  R       0:08      1 compute10
           20040+0        p1     PACK     alex  R       0:08      4 compute[11-14]
           20040+1        p1     PACK     alex  R       0:08      1 compute15
           20042+0        p1     PACK     alex  R       0:08      4 compute[16-19]
           20042+1        p1     PACK     alex  R       0:08      1 compute20
alex@ibiza:~/t$

If I don't use --exclusive, then different hetjob components can share node resources and thus I don't get all the nodes allocated, only the ones needed to allocate my request:

alex@ibiza:~/t$ scancel -u alex
alex@ibiza:~/t$ for job in $(seq -w 1 4); do sbatch -J PACK --time=00:60:00 -N 4 --wrap='sleep 120; hostname' : -N 1 --wrap='sleep 120; hostname'; done
Submitted batch job 20044
Submitted batch job 20046
Submitted batch job 20048
Submitted batch job 20050
alex@ibiza:~/t$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
p1*          up   infinite      5  alloc compute[1-5]
p1*          up   infinite     15   idle compute[6-20]
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20044+0        p1     PACK     alex  R       0:07      4 compute[1-4]
           20044+1        p1     PACK     alex  R       0:07      1 compute5
           20046+0        p1     PACK     alex  R       0:07      4 compute[1-4]
           20046+1        p1     PACK     alex  R       0:07      1 compute5
           20048+0        p1     PACK     alex  R       0:07      4 compute[1-4]
           20048+1        p1     PACK     alex  R       0:07      1 compute5
           20050+0        p1     PACK     alex  R       0:07      4 compute[1-4]
           20050+1        p1     PACK     alex  R       0:07      1 compute5
alex@ibiza:~/t$
Comment 2 Benedikt von St. Vieth 2018-08-17 05:11:02 MDT
When I use --exclusive it does not change the behaviour; I expect your cluster was large enough to schedule all the jobs, and that is why it worked for you.

When using "sbatch --exclusive -N2 : --exclusive -N1" and submitting 20 jobs (60 nodes in total, we only have 21) it is exactly the same as before: only one jobpack gets scheduled at a time.

When submitting only 5 jobs (with -N2 : -N1, 3 nodes each, so 15 nodes in total) all of them are scheduled. Whether I use --exclusive or not does not matter; it works in both cases.

Nevertheless, isn't this wrong behaviour? When more nodes are requested than are available, shouldn't the scheduler still start more than one jobpack at a time?
Comment 3 Benedikt von St. Vieth 2018-08-17 05:15:29 MDT
For the whole picture:

We tried to slim the workload down to a minimal working example that shows the problem.
We noticed the issue when four of our users, early adopters in this case, submitted jobs and at some point the scheduler was only executing one job at a time.
Comment 4 Alejandro Sanchez 2018-08-17 06:02:17 MDT
Benedikt, I'm lowering the severity to 3.

I'm sorry, but I don't fully understand what the issue is. Can you attach your slurm.conf and then give a clarifying example: the initial state, a job submission, what you expect after the submission, what the final state actually is, and why it differs from what you expect?
Comment 5 Benedikt von St. Vieth 2018-08-17 06:20:02 MDT
Start:
> -bash-4.1$ sinfo -s
> PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
> batch*       up 1-00:00:00        0/21/0/21  j3c[006-020,025,035,053-056]

Submission of 20 jobpacks, 3 nodes per jobpack (2/1):

> for job in $(seq -w 0 20); do sbatch -J PACK --time=00:60:00 -N 2 --wrap='sleep 120; hostname' : -N 1 --wrap='sleep 120; hostname'; done
or
> for job in $(seq -w 0 20); do sbatch -J PACK --time=00:60:00 --exclusive -N 2 --wrap='sleep 120; hostname' : --exclusive -N 1 --wrap='sleep 120; hostname'; done

Expectation:
7 jobpacks will run, the other 13 are in the queue for later execution.

Situation:
> -bash-4.1$ squeue --state=R; sinfo -s
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>         15693426+0     batch     PACK bvsvieth  R       0:11      2 j3c[006-007]
>         15693426+1     batch     PACK bvsvieth  R       0:11      1 j3c008
> PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
> batch*       up 1-00:00:00        3/18/0/21  j3c[006-020,025,035,053-056]

Only one jobpack runs at a time, everything else is blocked.
Comment 6 Benedikt von St. Vieth 2018-08-17 06:21:29 MDT
Created attachment 7631 [details]
Testsystem slurm.conf
Comment 9 Alejandro Sanchez 2018-08-17 07:43:28 MDT
I can't reproduce this so far after copying most of your config locally for testing. A few considerations come to mind that I want you to be aware of:

1. Each component of the job counts as a job with respect to limits.
2. Heterogeneous jobs are only started by the backfill scheduler, so dispatching jobs will be slower than normal, and it can be helpful to set the SchedulerParameters appropriately (especially for testing, e.g. set "bf_interval=5"; see the sketch after this list).
3. MPI does not work properly if two components of the same het-job are on the same node, and there might be some bug in that logic preventing all the jobs from starting. What is this: MpiDefault="pspmi"?
4. Just to note, bf_window=1440 is lower than your default partition options DefaultTime=06:00:00 and MaxTime=24:00:00; you are submitting with --time=00:60:00, so that shouldn't be a problem, but keep it in mind.
5. Are you doing anything special in the job submit / spank plugins?
6. Can you attach your topology.conf and your gres.conf?
7. Do you have any limits set in your QOS?
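
For reference on point 2, a minimal sketch of what that could look like in slurm.conf, assuming sched/backfill is already the configured scheduler (the interval value is only illustrative for testing):

SchedulerType=sched/backfill
SchedulerParameters=bf_interval=5

followed by an "scontrol reconfigure" (or a slurmctld restart) to pick up the change.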
Comment 11 Benedikt von St. Vieth 2018-08-17 07:57:00 MDT
Thanks for the fast feedback/hints.

As long as the total number of allocated nodes is lower than the number of available nodes, everything is scheduled as expected (see my example in comment 2 where I submit 5 jobpacks, using 15 nodes in total, and all jobs start right away).

(In reply to Alejandro Sanchez from comment #9)
> 2. Heterogeneous jobs are only started by the backfill scheduler, so
> dispatching jobs will be slower than normal and it can be helpful to set the
> SchedulingParameters appropriately (especially for testing, say set
> "bf_interval=5")
It would be okay if it were just slower, but the scheduler actually only starts one job after the other.

> 3. MPI does not work properly if two components of the same het-job are on
> the same node. There might be some bug in that logic preventing all the jobs
> from starting. What's this? MpiDefault="pspmi".
The example only uses sleep/hostname, and when the total number of requested nodes fits into the number of available nodes it works fine, so I think for now we can ignore our special PMI/MPI version. :)

> 5. Are you doing anything special in the job submit / spank plugins?
No

> 6. Can you attach your topology.conf and your gres.conf?
Attached

> 7. Do you have any limits set in your QOS?
The QOS is not very special; I attached it as a file.
Comment 12 Benedikt von St. Vieth 2018-08-17 07:57:28 MDT
Created attachment 7632 [details]
topology.conf
Comment 13 Benedikt von St. Vieth 2018-08-17 07:57:41 MDT
Created attachment 7633 [details]
gres.conf
Comment 14 Benedikt von St. Vieth 2018-08-17 07:57:57 MDT
Created attachment 7634 [details]
qos.txt
Comment 16 Alejandro Sanchez 2018-08-17 08:09:07 MDT
I suspect this is due to TopologyPlugin=topology/tree. I'm going to ask you a few more things, if you don't mind:

1. Temporarily increase debuglevel to debug2 and submit (while all nodes are idle) this job like before

for job in $(seq -w 0 20); do sbatch -J PACK --time=00:60:00 --exclusive -N 2 --wrap='sleep 120; hostname' : --exclusive -N 1 --wrap='sleep 120; hostname'; done

then attach slurmctld.log, I suspect we will see lines similar to

slurmctld: debug2: backfill: entering _try_sched for job 20703.

2. Do you mind disabling TopologyPlugin and RoutePlugin temporarily and see if that helps with this as a workaround?

3. The Partition Shared option is deprecated. Please use OverSubscribe instead.

Thanks for your collaboration.
Comment 17 Alejandro Sanchez 2018-08-17 08:10:04 MDT
(In reply to Alejandro Sanchez from comment #16)
> then attach slurmctld.log, I suspect we will see lines similar to
> 
> slurmctld: debug2: backfill: entering _try_sched for job 20703.

sorry... lines similar to this

slurmctld: debug:  _job_test_topo: could not find resources for job 20703
Comment 18 Benedikt von St. Vieth 2018-08-17 08:22:14 MDT
(In reply to Alejandro Sanchez from comment #17)
> (In reply to Alejandro Sanchez from comment #16)
> > then attach slurmctld.log, I suspect we will see lines similar to
> > 
> > slurmctld: debug2: backfill: entering _try_sched for job 20703.
We see this one a lot.
 
> sorry... lines similar to this
> 
> slurmctld: debug:  _job_test_topo: could not find resources for job 20703
This one we do not: bzcat slurmctld.log.bz2 | grep _job_test_topo returns nothing.
Comment 19 Benedikt von St. Vieth 2018-08-17 08:22:34 MDT
Created attachment 7635 [details]
slurmctld.log debug2
Comment 20 Benedikt von St. Vieth 2018-08-17 08:35:13 MDT
(In reply to Alejandro Sanchez from comment #16)
> 2. Do you mind disabling TopologyPlugin and RoutePlugin temporarily and see
> if that helps with this as a workaround?

j3b01:slurm # diff slurm.conf slurm.conf.20180817
627,628c627
< #RoutePlugin="route/topology"
< RoutePlugin="route/default"
---
> RoutePlugin="route/topology"
812,813c811
< #TopologyPlugin="topology/tree"
< TopologyPlugin="topology/none"
---
> TopologyPlugin="topology/tree"

I did, unfortunately this did not change the behaviour.
Comment 21 Alejandro Sanchez 2018-08-20 07:49:20 MDT
Can you set the 'backfill' debug flag temporarily

scontrol setdebugflags +backfill

and submit this again?

for job in $(seq -w 0 20); do sbatch -J PACK --time=00:60:00 --exclusive -N 2 --wrap='sleep 120; hostname' : --exclusive -N 1 --wrap='sleep 120; hostname'; done

then attach slurmctld.log? I'm curious to see why only 3 nodes are allocated for you. I can only reproduce this with topology/tree enabled; you mention the issue persists without it, but I can't reproduce it locally with the topology plugin disabled.
Comment 22 Benedikt von St. Vieth 2018-08-20 08:06:05 MDT
Created attachment 7646 [details]
slurmctld.log debug2 backfill
Comment 23 Benedikt von St. Vieth 2018-08-20 08:07:26 MDT
I attached the corresponding logs. There you can see that the scheduler repeatedly pushes the expected start time back, often by one hour at a time.
Comment 24 Alejandro Sanchez 2018-08-22 03:42:27 MDT
Benedikt, just to make sure: did you restart the slurmctld and slurmd's all across the cluster when you disabled the topology and route plugins so that changes take effect?
Comment 25 Benedikt von St. Vieth 2018-08-22 05:56:38 MDT
(In reply to Alejandro Sanchez from comment #24)
> Benedikt, just to make sure: did you restart the slurmctld and slurmd's all
> across the cluster when you disabled the topology and route plugins so that
> changes take effect?

At first I had only restarted slurmctld. Following your request, I restarted the corresponding daemons on all nodes as well, but nothing changed.
Comment 26 Alejandro Sanchez 2018-08-22 06:08:22 MDT
All right. I'll ask you for more logs, if you don't mind. Please enable debug2 and the DebugFlags Backfill, BackfillMap, Priority and HeteroJobs:

scontrol setdebug debug2
scontrol setdebugflags +backfill,backfillmap,priority,heterojobs

then with the 21 nodes in the batch partition idle submit 

for job in $(seq -w 0 20); do sbatch -J PACK --time=00:60:00 --exclusive -N 2 --wrap='sleep 120; hostname' : --exclusive -N 1 --wrap='sleep 120; hostname'; done

then, after bf_interval has passed (wait for a backfill cycle to be executed), report sinfo -s and squeue output and attach the full slurmctld.log without filters.

Finally, restore the debug level and debug flags with:

scontrol setdebug info
scontrol setdebugflags -backfill,backfillmap,priority,heterojobs

Thanks for your collaboration.
Comment 27 Benedikt von St. Vieth 2018-08-22 06:36:05 MDT
Created attachment 7666 [details]
slurmctld.log with backfill-backfillmap-priority-heterojobs
Comment 28 Benedikt von St. Vieth 2018-08-22 06:36:33 MDT
Created attachment 7667 [details]
sinfo and squeue output each minute via debug with backfill-backfillmap-priority-heterojobs
Comment 29 Alejandro Sanchez 2018-08-22 07:28:13 MDT
Is there something wrong or different with node j3c025?
I see a few orphan jobs reported on that node.
Are all daemons started with the same slurm.conf?
Comment 30 Benedikt von St. Vieth 2018-08-22 07:31:59 MDT
(In reply to Alejandro Sanchez from comment #29)
> Is there something wrong/different with node j3c025?
> I see a few oprhan jobs reported on such node
> Are all daemons started with the same slurm.conf?

There is nothing specific about that node; it is like the others.
Yes, I copied over the slurm.conf and restarted before I ran the test.
Comment 34 Alejandro Sanchez 2018-08-22 08:32:07 MDT
We can finally reproduce this on our end; sorry for asking you for so much information. Moe had a pretty good suggestion about what could be triggering the issue here:

The key is in the configuration:
PriorityType="priority/multifactor"
PriorityWeightJobSize=14500

This causes jobs of different sizes to have different scheduling weights. Each hetjob component is considered independently for backfill scheduling. When a collection of hetjob pairs is submitted with one large and one small component each, all of the large hetjob components might be considered for scheduling first and have resources reserved for them (to prevent starvation), and only then are the small components considered. If both components of a hetjob can start, that entire hetjob starts. A small hetjob component that cannot start (because it is blocked by the large pending hetjob components) delays its entire hetjob, even if resources remain idle.

While we think about how to better address this, the straightforward workaround that comes to my mind is to lower or entirely disable PriorityWeightJobSize.
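
A minimal sketch of that workaround in slurm.conf (assuming priority/multifactor stays enabled and only the job size weight is zeroed out):

PriorityType=priority/multifactor
PriorityWeightJobSize=0

followed by an "scontrol reconfigure" (or a slurmctld restart) so the new weight takes effect.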
Comment 40 Benedikt von St. Vieth 2018-08-22 10:08:45 MDT
(In reply to Alejandro Sanchez from comment #34)
> We can finally reproduce on our end, sorry for asking you so much
> information. Moe pretty well suggested what could be triggering the issue
> here:
No reason for saying sorry.

I have applied the workaround and completely disabled PriorityWeightJobSize. It works now, at least for one user. We will further stress our test system with different users/workloads and see how it behaves.

Looking forward to hearing your thoughts about a proper solution for this issue. :)
Comment 41 Dorian Krause 2018-08-23 05:23:27 MDT
Would it be possible to build a patch that assigns the same weight to all jobs within one pack?

We are under quite some time pressure with our 17.11 deployment and would be interested in such a workaround. We cannot fully disable the size-dependent prioritization on our production machines, but we might be okay with favouring job packs unfairly.

I understand that such a solution may not be acceptable for a Slurm release, but it would enable us to take a step forward in our deployment.

Thanks,
Dorian Krause
Comment 42 Alejandro Sanchez 2018-08-23 06:01:51 MDT
(In reply to Dorian Krause from comment #41)
> Would it be possible to build a patch that would assign the same weight to
> all jobs within one pack?

Note that the stalling problem can be triggered not only by the JobSize weight but by any multifactor weight and/or partition PriorityTier.

Imagine you even have multifactor disabled but have two partitions:

PartitionName=lowprio PriorityTier=1
PartitionName=highprio PriorityTier=2

Jobs submitted to highprio will always be scheduled before lowprio ones. Since hetjob components are scheduled independently, this scenario can trigger the same stalling (hetjob reservation deadlock) problem:

$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
lowp*        up   infinite        0/10/0/10  compute[1-10]
highp        up   infinite        0/10/0/10  compute[1-10]
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20015+0     highp     wrap     alex PD       0:00      5 (None)
           20015+1      lowp     wrap     alex PD       0:00      1 (None)
           20017+0     highp     wrap     alex PD       0:00      5 (Priority)
           20017+1      lowp     wrap     alex PD       0:00      1 (Priority)

At this point no more jobs will ever start until one of these two hetjobs is manually killed. This proves that setting the JobSize weight to be the same for all hetjob components wouldn't be a solution either. Even setting the result of all multifactor weights to be the same on all hetjob components would be a solution either.

We're studying alternatives to address this, such as adding a new "bf_hetjob_resv_limit" option to limit the total resources that can be reserved by pending hetjobs. For example, if the total resources reserved by all hetjob components processed by the backfill scheduler exceed 50% of the nodes, then we would not reserve resources for any hetjob that we haven't already reserved resources for.

We're open to suggestions, and to hearing about flaws in this approach, as well.
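
A hedged reproduction sketch consistent with the squeue output above (partition names, sizes and commands are illustrative, not the exact submissions used):

$ sbatch -p highp -N 5 --wrap "sleep 9999" : -p lowp -N 1 --wrap "sleep 9999"
$ sbatch -p highp -N 5 --wrap "sleep 9999" : -p lowp -N 1 --wrap "sleep 9999"
$ squeue --start   # both hetjobs stay pending, reserving against each other on the 10-node cluster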
Comment 43 Alejandro Sanchez 2018-08-23 06:12:06 MDT
On top of that I've also discovered a similar problem that can trigger starvation of all newer jobs:

alex@ibiza:~/t$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
p1*          up   infinite        0/10/0/10  compute[1-10]
alex@ibiza:~/t$ sbatch -N 9 : -N 2 --wrap "sleep 9999"
Submitted batch job 20023
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20023+0        p1     wrap     alex PD       0:00      9 (None)
           20023+1        p1     wrap     alex PD       0:00      2 (None)
alex@ibiza:~/t$ sbatch --wrap "sleep 999"
Submitted batch job 20025
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20023+0        p1     wrap     alex PD       0:00      9 (None)
           20023+1        p1     wrap     alex PD       0:00      2 (None)
             20025        p1     wrap     alex  R       0:02      1 compute10
alex@ibiza:~/t$

The hetjob components' requests are individually satisfiable (each -N < 10), but their sum is > 10, which will never be satisfied unless more nodes are added. The resources are reserved nonetheless, preventing newer jobs from ever starting.
Comment 44 Alejandro Sanchez 2018-08-23 06:22:35 MDT
Sorry, the example should be like this:

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20027+0        p1     wrap     alex PD       0:00     10 (None)
           20027+1        p1     wrap     alex PD       0:00      1 (None)
             20029        p1     wrap     alex PD       0:00      1 (Priority)
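
In the meantime, a never-satisfiable hetjob like this can be cleared manually so it stops holding reservations; a minimal sketch using the job id from the example above (cancelling the pack leader should cancel all of its components):

$ scancel 20027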
Comment 45 Alejandro Sanchez 2018-08-23 07:17:32 MDT
*** Ticket 5615 has been marked as a duplicate of this ticket. ***
Comment 48 Dorian Krause 2018-08-23 09:52:35 MDT
(In reply to Alejandro Sanchez from comment #42)
> Even setting the result of all multifactor weights to be the same on all hetjob components would be a solution either.
>

Do you mean it would or would not be a solution?

I had something along those lines in mind: after calculating the priorities, loop over all hetjobs and set the priority of every component within a hetjob to the maximum priority in that hetjob.

Is that feasible? It would imply that the priority difference between two hetjobs is the same for each of their components. If I understand the problem correctly, that should be sufficient, right?
Comment 49 Alejandro Sanchez 2018-08-24 04:25:16 MDT
(In reply to Dorian Krause from comment #48)
> (In reply to Alejandro Sanchez from comment #42)
> > Even setting the result of all multifactor weights to be the same on all hetjob components would be a solution either.
> >
> 
> Do you mean it would or would not be a solution?

Would _not_ be a solution, sorry.
 
> I had something along that line in my mind. After calculating the
> priorities, loop over all hetjobs and set the priority of all packs with in
> the hetjob to the max priority.
> 
> Is that feasible? It would imply that there is the same priority difference
> between each pack within two hetjobs. If I understand the problem correctly
> that should be sufficient, right?

I personally don't like that idea much. Note that each hetjob component might request multiple partitions (with different PriorityTier) or have a different size/QOS/age, and overriding/faking those values (which in turn change dynamically over time) doesn't look like a good approach to me.

We're working on other alternatives and we'll come back to you.
Comment 50 Benedikt von St. Vieth 2018-08-24 07:28:07 MDT
Hi,

i am out of office until 4th of September.

If there are issues with JURECA or JUROPA3/JUAMS, feel free to contact jj-adm@fz-juelich.de

best regards
benedikt


Comment 51 Dorian Krause 2018-08-28 06:07:58 MDT
(In reply to Alejandro Sanchez from comment #49)
> We're working on other alternatives and we'll come back to you.

Do you expect that you could have a fix by the end of the week?

Thank you.

Best regards,
Dorian
Comment 52 Alejandro Sanchez 2018-08-28 06:14:55 MDT
(In reply to Dorian Krause from comment #51)
> (In reply to Alejandro Sanchez from comment #49)
> > We're working on other alternatives and we'll come back to you.
> 
> Do you expect that you could have a fix by the End of the week?

I'm trying, but I can't 100% guarantee it.
Comment 66 Alejandro Sanchez 2018-09-03 02:34:26 MDT
Hi Dorian. Just to update you: we have had a patch ready since Friday, pending review. We'll give you a fix as soon as possible.
Comment 68 Alejandro Sanchez 2018-09-03 08:27:29 MDT
Created attachment 7744 [details]
17.11 patch

Dorian, will you test the attached patch? You can apply it, excluding the NEWS file to avoid conflicts, by using:

$ git apply --exclude=NEWS /path/to/bug_5579_v6.patch

with your working directory at the root of the Slurm source code. Then you need to rebuild Slurm and restart the daemons. There will be two new SchedulerParameters, bf_hetjob_count and bf_hetjob_resv_limit, with their descriptions in the slurm.conf man page.

I'd suggest setting bf_hetjob_resv_limit=50 and seeing if that helps to alleviate the issue presented in this bug. Please let us know how it goes, thank you.
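
A hedged sketch of the whole sequence, assuming an already configured 17.11 build tree (paths are illustrative; keep your existing SchedulerParameters and only extend them):

$ cd /path/to/slurm-17.11
$ git apply --exclude=NEWS /path/to/bug_5579_v6.patch
$ make -j && make install      # rebuild and reinstall
# in slurm.conf, append to the existing SchedulerParameters, for example:
#   SchedulerParameters=bf_interval=30,bf_hetjob_resv_limit=50
# then restart slurmctld (and the slurmds where the rebuilt code is installed)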
Comment 69 Alejandro Sanchez 2018-09-03 08:31:55 MDT
Note also that enabling DebugFlags=Backfill will show info() messages like these:

backfill: skipping JobID=X+Y(Z) due to bt_hetjob_count

or

backfill: skipping JobID=X+Y(Z) due to bt_hetjob_resv_limit

and such a JobID won't reserve resources in the current backfill cycle if either of these two limits is hit. Let us know if you have any questions.
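
A quick, hedged way to check whether either limit is actually being hit is to grep the controller log for those messages once the Backfill debug flag is enabled (the log path is illustrative):

$ grep -E 'due to bt_hetjob_(count|resv_limit)' /var/log/slurmctld.log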
Comment 70 Benedikt von St. Vieth 2018-09-04 01:06:14 MDT
I can confirm that when using your patch with
> bf_hetjob_resv_limit=50
up to 50% of the available nodes are used.
Comment 71 Chrysovalantis Paschoulas 2018-09-04 02:33:25 MDT
As far as I understand, the parameters bt_hetjob_count and bt_hetjob_resv_limit solve the scheduling problem by bringing some limits for the jobpacks, right?

Is this really the best approach? I mean, by using those options can we have a jobpack which allocates all partitions (whole cluster)? If not then I really don't like it.

I would also like to propose a solution similar to what Dorian proposed: the whole jobpack should be scheduled like a single job. This is of course not trivial. For example, in one possible implementation all jobs in the pack could inherit the lowest priority (priority factors) and lowest priority tier (is it possible to override the priority tier of a job in a certain partition?) from the job with the lowest values in the pack, and this should happen at submission time, which could simplify it a bit. It is expected that in this case some jobpacks will be heavily degraded scheduling-wise. This approach could still be implemented and work in addition to the parameters bt_hetjob_count and bt_hetjob_resv_limit.

Any thoughts on this?

And of course I have thought of another solution: never reserve resources for a jobpack when we cannot reserve resources for all jobs of the pack. But this could cause total starvation for jobpacks :P
Comment 72 Alejandro Sanchez 2018-09-04 03:29:06 MDT
(In reply to Chrysovalantis Paschoulas from comment #71)
> As far as I understand, the parameters bt_hetjob_count and
> bt_hetjob_resv_limit solve the scheduling problem by bringing some limits
> for the jobpacks, right?

Correct.
 
> Is this really the best approach? 

It is a very good balance between code-change simplicity and providing the means to solve most of the issues. It is not perfect, but we think it is an excellent approach.

> I mean, by using those options can we have a jobpack which allocates all
> partitions (whole cluster)? If not then I really don't like it.

Yes, you can. Note that the limit is only imposed when backfill is considering a hetjob component none of whose sibling components from the same hetjob have yet reserved resources. If any has previously reserved resources, then no limit is imposed, meaning that a hetjob can perfectly well allocate all nodes from all partitions (the whole cluster).

The idea is to favor hetjob components whose "partners" have already reserved resources over those where none have, so that the former are more likely to make a full reservation (meaning all of the components get the chance to reserve) and thus get the hetjob started.

> I would also like to propose a solution similar to what Dorian proposed: the
> whole jobpack should be scheduled like a single job. This of course is not
> so trivial. For example in different impl. all jobs in the pack could
> inherit the lowest priority (priority factors) and lowest priority tier (is
> it possible to override the prio tier of a job in a certain partition?) from
> the jobs with the lowest values in the pack and this should happen in the
> beginning during submission so it could be a little bit simplified. It is
> expected in this case some jobpacks will be heavily degraded
> scheduling-wise. This approach can be still implemented and work in addition
> with the parameters bt_hetjob_count and bt_hetjob_resv_limit.
> 
> Any thoughts on this?

Thanks for the proposal. I do have many concerns with it though:

1. To begin with, heterogeneous job components are scheduled independently by design. In _many_ parts of the code, existing logic was reused to handle individual hetjob components simply by iterating over the list of components of a hetjob and treating each one like a regular job (with some extra treatment for the pack_job_offset and such).

2. Changing the code so that all components inherit the priority and tier is a very complex and involved idea, requiring changes in many spots, and simply not that feasible. Take into account that a) each individual hetjob component can make multi-partition requests (and each partition contributes differently to the tier and to the multifactor plugin), b) there is no "meta" record holding information about the whole hetjob, so the values in all individual components would probably need to be overwritten, losing current information, and c) during the life cycle of a job the state can change (different multifactor terms change with time, the config can change, there can be job updates including changes of partitions), so a "truncation" to the highest-priority component's values isn't that straightforward.
 
> And of course I have thought another solution: never reserve resources for a
> jobpack when we cannot reserve resources for all jobs of the pack. But this
> case cause total starvation for jobpacks :P

I think that would imply potential starvation of hetjobs vs regular jobs and/or job arrays, as you suggested. It would also imply extra overhead in the backfill logic and altering the current order in which jobs are considered. I highly prefer what we provided over this.

In any case, thanks for the suggestions and feedback. Please, let us know if you encounter further issues and/or have more questions.
Comment 73 Chrysovalantis Paschoulas 2018-09-04 04:01:14 MDT
(In reply to Alejandro Sanchez from comment #72)
> [...]

Thanks for the quick reply. I now understand better what you have implemented; before, I was trying to work out the functionality just from the names of the parameters and made many assumptions ;)
Comment 75 Benedikt von St. Vieth 2018-09-05 01:29:44 MDT
After further testing, the scheduling still seems somewhat strange.

When I submit heterogeneous jobs to one partition (3/1), only 12 out of 21 nodes are used, with 3 jobpacks running. When I try to fill up the system with single-node jobs, they are not scheduled.

When I submit heterogeneous jobs spanning two partitions (5/1), I have 20 nodes in the first and 4 nodes in the second partition in use.
Comment 76 Alejandro Sanchez 2018-09-05 04:12:15 MDT
Hi Benedikt. I can reproduce what you describe with 21 idle nodes, but I think this is a different problem, because it happens regardless of whether bf_hetjob_resv_limit=50 is set or the option is disabled (no bf_hetjob* limits imposed). If I submit:

$ for job in $(seq -w 0 5); do sbatch --exclusive -N 3 : --exclusive -N 1 --wrap='sleep 9999'; done

12 nodes are allocated. To see the scheduled nodes for the pending jobs I use

squeue --start

and it seems that different components of the same hetjob are incorrectly reserving the same nodes, or reserving nodes that are currently allocated instead of idle ones, no matter how many subsequent backfill cycles I wait.

Let's see what we find.
Comment 77 Alejandro Sanchez 2018-09-05 04:20:35 MDT
alex@ibiza:~/t$ squeue --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
           20485+0        p1     wrap     alex PD 2018-09-05T12:13:57      3 compute[13-15]       (None)
           20487+0        p1     wrap     alex PD 2018-09-05T12:13:57      3 compute[16-18]       (None)
           20489+0        p1     wrap     alex PD 2018-09-05T12:13:57      3 compute[19-21]       (None)
           20485+1        p1     wrap     alex PD 2018-09-06T12:07:43      1 compute1             (Priority)
           20487+1        p1     wrap     alex PD 2018-09-06T12:07:43      1 compute2             (Priority)
           20489+1        p1     wrap     alex PD 2018-09-06T12:07:43      1 compute3             (Priority)
             20491        p1     wrap     alex PD 2018-09-06T12:07:43      1 compute4             (Priority)
alex@ibiza:~/t$

12 nodes are allocated: compute[1-12]
9 nodes are idle: compute[13-21]

Then we have the list of PD jobs and the SCHEDNODES for each one. The problem is that, since the first component has higher priority, the three +0 components reserve the 9 idle nodes, and the three +1 components end up reserving compute1, 2 and 3 respectively (which are allocated).
Comment 78 Alejandro Sanchez 2018-09-05 04:32:33 MDT
Lowering bf_hetjob_resv_limit from 50 to 20 makes it possible to allocate 20 nodes in this situation, although it takes a few more backfill cycles to get them all allocated (meaning fewer full hetjobs can start within the same backfill cycle).
Comment 80 Alejandro Sanchez 2018-09-07 04:02:30 MDT
Hey Benedikt. Did lowering bf_hetjob_resv_limit to 20 help? Alternatively, and as a more drastic approach, you can set bf_hetjob_count=1; then only components from one hetjob will be able to reserve resources at a time, which ensures no deadlock due to components from other hetjobs, but delays the initiation of other hetjobs by a few more backfill cycles. At least you would be safe in terms of deadlocking.
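
A minimal slurm.conf sketch for that more conservative setting (the other parameters shown are placeholders for whatever you already have):

SchedulerParameters=bf_interval=30,bf_hetjob_count=1

followed by an "scontrol reconfigure".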
Comment 81 Benedikt von St. Vieth 2018-09-07 04:32:56 MDT
Hi Alejandro,

we did not test lowering this value; reserving that low a percentage of nodes for heterogeneous jobs did not sound very reasonable for our use case.
Also, does this explain the different behaviour between single-partition and cross-partition heterogeneous jobs?
Comment 82 Benedikt von St. Vieth 2018-09-07 04:49:41 MDT
We have changed it to 20 now, but I think one has to empty the queue to get rid of an already blocked scheduler, so the situation did not improve: we are again back to a blocked scheduler where only one or a few jobs run at a time.

Some sample output from our scheduler:
[2018-09-07T12:45:05.114] debug2: backfill: entering _try_sched for job 15717041.
[2018-09-07T12:45:05.114] Test job 15717041 at 2018-09-07T14:13:00 on j3c025
[2018-09-07T12:45:05.115] backfill: skipping JobID=15717041+0(15717041) due to bt_hetjob_resv_limit
[2018-09-07T12:45:05.115] Job 15717041+1 (15717042) in partition batch set to start in 31536000 secs
[2018-09-07T12:45:05.115] Job 15717041+1 (15717042) in partition batch expected to start in 31536000 secs
[2018-09-07T12:45:05.115] Job 15717041+2 (15717043) in partition batch set to start in 31536000 secs
[2018-09-07T12:45:05.115] Job 15717041+2 (15717043) in partition batch expected to start in 31536000 secs
...

I submitted some new non-hetero jobs, one got scheduled this way:
[2018-09-07T12:48:05.498] backfill test for JobID=15717066 Prio=100870 Partition=batch
[2018-09-07T12:48:05.498] debug2: backfill: entering _try_sched for job 15717066.
[2018-09-07T12:48:05.498] Test job 15717066 at 2018-09-07T12:48:05 on j3c[018-020,035,053-056]
[2018-09-07T12:48:05.498] backfill: Try later job 15717066 later_start 1536317340
[2018-09-07T12:48:05.498] debug2: backfill: entering _try_sched for job 15717066.
[2018-09-07T12:48:05.498] Test job 15717066 at 2018-09-07T17:39:00 on j3c[012-017]
[2018-09-07T12:48:05.499] Job 15717066 to start at 2018-09-07T17:39:00, end at 2018-09-07T17:49:00 on j3c[012-014]

While there are free nodes:
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
batch*       up 1-00:00:00        9/12/0/21  j3c[006-020,025,035,053-056]
Comment 83 Alejandro Sanchez 2018-09-07 05:20:54 MDT
(In reply to Benedikt von St. Vieth from comment #81)
> Hi Alejandro,
> 
> we did not test lowering this value, having a that-low percentage of nodes
> for heterogeneous jobs was sounding not very reasonable for our approach.

Why not? Hetjobs can still reserve all the nodes in the cluster; it just takes a few more backfill cycles to start them all.

> Also, does this describe the different behaviour for inter-partition and
> cross-partition heterogeneous jobs?

I guess you are referring to this from comment 75:

"When i submit heterogeneous jobs spanning across two partitions (5/1), i have 20 nodes in the first and 4 nodes in the second partition used."

Can you show the exact submission command, what you see, and what you expect?

(In reply to Benedikt von St. Vieth from comment #82)
> We changed it to 20 now, but i think one has to empty the queue to get rid
> of a currently blocked scheduler, so the situations did not enhance. Again
> we are now back to a blocked scheduler where only one/few jobs at a time run.

Note that, for example, bf_hetjob_count=1 does _not_ prevent _running_ more than one hetjob at a time; it prevents _reserving_ resources for more than one hetjob at a time, which is very different.
Comment 84 Benedikt von St. Vieth 2018-09-07 05:36:59 MDT
(In reply to Alejandro Sanchez from comment #83)
> Why not? hetjobs can still reserve all nodes in the cluster, just it takes a
> few more backfill cycles to start them all.
I reconfigured it to "bf_hetjob_resv_limit=20" and indeed, after some time, it is now using all available nodes.

> Can you show the exact submission command and what you see and what you expect?
The system was empty. I submitted jobs via
* for job in $(seq -w 0 30); do sbatch -p batch -J PACK --time=00:60:00 --exclusive -N 3 --wrap='sleep 120; hostname' : --exclusive -p batch -N 1 --wrap='sleep 120; hostname'; done
* for job in $(seq -w 0 30); do sbatch -p batch -J SGL --time=00:10:00 -N 1 --wrap='sleep 120; hostname'; done

In general, all 21 nodes that make up the batch partition should be allocated, but only 12 were used (3 pack jobs).

Next test: the system was empty, and I submitted jobs via
* for job in $(seq -w 0 30); do sbatch -p batch -J PACK --time=00:60:00 --exclusive -N 5 --wrap='sleep 120; hostname' : --exclusive -p knl -N 1 --wrap='sleep 120; hostname'; done
* for job in $(seq -w 0 30); do sbatch -p batch -J SGL --time=00:10:00 -N 1 --wrap='sleep 120; hostname'; done

21 of the available 21 nodes were filled with jobs.
Comment 85 Alejandro Sanchez 2018-09-07 06:08:07 MDT
So I guess we are fine with checking the patch in and closing the bug, given these results?
Comment 87 Benedikt von St. Vieth 2018-09-12 02:44:28 MDT
We have applied the patch but still see issues.

If we span a job across two partitions that have different QOSes, this could lead to the same issue we saw with different job sizes: differing weights per job of the pack, leading to diverging priorities. Do you agree?

Also, the backfilling still does not work as expected. One question:
We have set "bf_max_job_test=128". When we have 130 jobpacks, does it try to schedule 260 jobs? If so, it could happen that the two jobs of a pack are never scheduled because it stops testing after 128 jobs.
Comment 88 Alejandro Sanchez 2018-09-12 08:26:25 MDT
(In reply to Benedikt von St. Vieth from comment #87)
> We implemented the patch but still see issues.
> 
> If we span a job across two partitions which have a different QOS, this
> could lead to the same issue that we saw with different job-sizes. Differing
> weight per job of the pack, leading to diverging priorities. Do you agree?

The original problem can be caused not only by PriorityWeightJobSize but by anything contributing to a job's/component's priority. That's why the two options provided in the patch help address the issue.

Do you have a counter-example job submission where you still hit scheduling deadlocks even with bf_hetjob_resv_limit set low enough or bf_hetjob_count set to 1?
 
> Also, the backfilling still does not work as expected, one question:
> We have set "bf_max_job_test=128". When we have 130 jobpacks, is he trying
> to schedule 260 jobs? If so, it could happen that two jobs of a pack are
> never scheduled because he  skips the scheduling after he tested 128 jobs.

Currently each hetjob component counts separately toward this limit, so if you have bf_max_job_test=2 and you submit:

$ sbatch -N1 : -N1 : -N1 --wrap "sleep 999" (3 hetjob components) this job will never be scheduled.
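
A hedged way to check the currently active value and, if needed, raise it so that it comfortably exceeds the total number of pending hetjob components (the value 1000 is only illustrative):

$ scontrol show config | grep -i SchedulerParameters
# in slurm.conf:
#   SchedulerParameters=...,bf_max_job_test=1000
$ scontrol reconfigure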
Comment 89 Benedikt von St. Vieth 2018-09-12 20:59:43 MDT
(In reply to Alejandro Sanchez from comment #88)
> Do you have a counter-example job submission where you are encountering more
> sched deadlocks even if setting bf_hetjob_resv_limit low enough or
> bf_hetjob_count to 1?
No, unfortunately no example, and fortunately no complete deadlocks.

We will try to find an example, because in the last few days we have often seen packjobs not completely blocking the scheduler, but again causing jobs to be scheduled sequentially while nodes are idle.

> Currently each hetjob component is computed separately, so if you have
> bf_max_job_test=2 and you submit:
> 
> $ sbatch -N1 : -N1 : -N1 --wrap "sleep 999" (3 hetjob components) this job
> will never be scheduled.
Are you working on a solution here / do you accept this as an issue? Should we create a new bug report for it?
Comment 92 Alejandro Sanchez 2018-09-13 08:10:35 MDT
(In reply to Benedikt von St. Vieth from comment #89)

> > Currently each hetjob component is computed separately, so if you have
> > bf_max_job_test=2 and you submit:
> > 
> > $ sbatch -N1 : -N1 : -N1 --wrap "sleep 999" (3 hetjob components) this job
> > will never be scheduled.
> Do you work on a solution here/accept this issue? Should we create a new bug
> report for this?

We'll just document in slurm.conf that each component of a heterogeneous job counts as one job with respect to this kind of limits.
Comment 99 Dorian Krause 2018-09-16 08:36:54 MDT
(In reply to Alejandro Sanchez from comment #92)
> > Do you work on a solution here/accept this issue? Should we create a new bug
> > report for this?
> 
> We'll just document in slurm.conf that each component of a heterogeneous job
> counts as one job with respect to this kind of limits.

While the limitation in this specific example (number of components in a heterogeneous job > backfill depth) is not a major problem for us, the underlying reason for it is a major concern, as it affects overall scheduler efficiency.
There are workload mixes for which the current scheduling will result in a critically underutilized system if heterogeneous jobs unnecessarily block resources because one component has too low a priority.
At the moment our assessment is that the quality of the scheduling decisions made by Slurm 17.11/18.08 will be significantly lower for our workloads and that the scheduler currently does not meet our expectations.
We would kindly ask for a phone conference in the coming week to discuss the problems we observe and the available options. Would it be possible to schedule one in the second half of the coming week (19th or 20th)?

Thank you.

Best regards,
Dorian Krause
Comment 104 Alejandro Sanchez 2018-09-17 12:31:28 MDT
Hi Dorian. Would Wednesday the 19th at 5pm CEST work for you? That way Moe and Jacob could also join the teleconference.

Does Google Hangouts work as an audio-conferencing tool for you? If so, will you send me an e-mail address we can add to the Hangouts meeting?

Thanks.
Comment 105 Dorian Krause 2018-09-17 12:54:18 MDT
(In reply to Alejandro Sanchez from comment #104)
> Hi Dorian. Will Wednesday 19th at 5pm CEST work for you? That way Moe and
> Jacob could also join the teleconf.
> 
> Does Google Hangouts work as a tool for audioconf for you? If so, will you
> send me an e-mail address we would add to the Hangouts Meeting?
> 
> Thanks.

Thank you for making yourself available for a call and for proposing a date. Wednesday would work for me; I am checking with the rest of the team, but I am confident that this date fits.
I have no experience with Google Hangouts. We'll check whether there are restrictions on our side that might be a blocker. Alternatively, if Google Hangouts does not work, we can offer a phone dial-in option ourselves. We will get back to you on that.
Comment 117 Dorian Krause 2018-10-18 03:55:13 MDT
Hi,

do you have an update for us regarding the analysis and the next steps?

Thanks,
Dorian
Comment 118 Alejandro Sanchez 2018-10-18 04:05:28 MDT
Unfortunately, we're still evaluating ways to address this. The design decision of keeping each hetjob component in a separate job_record has a lot of undesired consequences.

At submit time there is a node bitmap of potentially selected nodes for a job allocation request. We're studying re-using that bitmap, as well as the license count, for the next hetjob component's resource allocation request, instead of starting from a new independent one. But the code around allocation requests wasn't designed to re-use resources from previous requests, so this isn't easy to accomplish. If we manage to do it, at least we would know that all components fit in the available resources (at least in terms of nodes and licenses). Even then, we would still need to do something about each component having a different priority, being scheduled independently, and potentially blocking the scheduling flow.
Comment 119 Alejandro Sanchez 2018-10-18 04:22:16 MDT
As an example, following is a hetjob allocate request code path:

_slurm_rpc_allocate_pack() <- iterate over hetjob comps. requests. For each one:
  validate_job_create_req()
  job_allocate() <- this would require a signature change to re-use from prev.
    _job_create() <- this too
      license_validate() <- this as well
      gres/bb/other resources validations would be more complex to validate as a whole
      _select_nodes_parts() <- this as well
        job_limits_check()
        select_nodes() <- this as well
          ...
          specific select plugin

This whole path was initially designed to handle independent requests. Trying to re-use accumulated resources from previous iterations, in such a sensitive code path, is complex and prone to breaking other things.

If hetjob info were held in a single job_record, with the job_details holding a list of the different components, all of this would be easier to handle.

I'm just mentioning this so you have a better feel for the problem.
Comment 127 Alejandro Sanchez 2018-10-25 11:01:18 MDT
Hi. Sorry this is taking so long; there is quite a bit of internal discussion going on around this.

We're now focusing on bf_max_job_test, which you mentioned as an issue in comment 87. Currently each component of a hetjob counts as 1 with regard to this limit, so some components of a hetjob might be tested within a scheduling cycle while others are not because the limit has been reached. Would the following idea be useful for you?

Once we reach bf_max_job_test, instead of breaking out of the scheduling cycle, identify every hetjob that could start now; if any of its components have not yet been tested, ignore the limit for that hetjob and test whether the remaining components can start immediately.
Comment 128 Dorian Krause 2018-10-25 11:01:41 MDT
Dear Sender,

thank you for your message. I am out-of-office from 19.10 to 30.10 on vacation and a business trip. Please expect a delayed response.

In urgent cases, please include "URGENT" in the subject of your message and add Michael Stephan (m.stephan@fz-juelich.de) on Cc.

Thank you.

Best regards,
Dorian Krause



Comment 129 Chrysovalantis Paschoulas 2018-10-26 05:45:12 MDT
(In reply to Alejandro Sanchez from comment #127)
> Hi. Sorry if this is taking so long we're having a decent bunch of internal
> discussion going on around this. 
> 
> We're now focusing on bf_max_job_test as you commented that is an issue in
> comment 87. Since currently each component of a hetjob counts as 1 with
> regards to this limit, and some components might have been tested within a
> scheduling cycle and some others might not due to the limit has been
> reached, would it be useful for you the following idea?
> 
> Once we reach the bf_max_job_test, instead of breaking out the scheduling
> cycle, identify every hetjob that can start now and if any component has not
> yet been tested, even if the limit is reached, ignore it for such hetjob and
> test if the rest of components can start immediately.

Yes, it should fix the main problem, and this is more or less what I suggested in our last telco.

I only hope that this will not create any other problems, e.g. what if there are too many components of hetjobs to be tested and it takes that long that we will reach the next backfilling interval. In such case I guess the only solution is to increase that interval, right?

Anyway, only with further testing afterwards will we see whether all hetjob scheduling issues are fixed.
Comment 132 Alejandro Sanchez 2018-11-05 04:12:44 MST
Created attachment 8199 [details]
18.08 bf_max_job_test especial treatment for hetjobs

Hi. The attached patch continues accumulating start times for hetjob components even when bf_max_job_test is reached. I've been testing it and it seems to solve the problem where some components fit in the table but others don't because of the limit. It applies against 18.08. I'm not sure if you have already upgraded to this branch; if not, let me know and I can prepare a 17.11 version of this patch. Thank you.
Comment 133 Chrysovalantis Paschoulas 2018-11-05 04:53:14 MST
(In reply to Alejandro Sanchez from comment #132)
> Created attachment 8199 [details]
> 18.08 bf_max_job_test especial treatment for hetjobs
> 
> Hi. Attached patch continues accumulating start times for hetjob components
> even when bf_max_job_test is reached. I've been testing it and this seems to
> solve the problem where some components fit in the table but others aren't
> due to the limit. This applies against 18.08. I'm not sure if you already
> upgraded to this branch; otherwise, let me know and I can prepare a 17.11
> version of this patch. Thank you.

Great news thanks!

But it will take some time until we move to 18.08, so it would be nice if you could backport it (or provide us a patch) for 17.11. Anyway tomorrow we will upgrade to 17.11.12 ;)

Regards
Comment 134 Alejandro Sanchez 2018-11-05 05:36:35 MST
Created attachment 8200 [details]
17.11 bf_max_job_test especial treatment for hetjobs

Please see attached the 17.11 standalone patch.
Comment 135 Alejandro Sanchez 2018-11-19 07:40:54 MST
Hi. Any feedback on the patch? Thanks.
Comment 136 Benedikt von St. Vieth 2018-11-19 10:42:13 MST
Hi Alejandro,

we plan to test a Slurm 17.11 build with your patch toward the middle or end of this week. Afterwards we will let you know whether our main issue is solved.

Thanks for coming back to us!
best
benedikt
Comment 137 Benedikt von St. Vieth 2018-11-21 07:46:26 MST
Hi again,

it looks like the situation did not improve, apart from the fact that the Reason printed by squeue got a bit better:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        15818038+0       knl sleep 60 bvsvieth PD       0:00      4 (None)
        15818038+1       knl sleep 60 bvsvieth PD       0:00      1 (None)
          15818040       knl     bash bvsvieth PD       0:00      1 (Priority)

We have a jobpack that cannot be scheduled, but the backfiller is not taking the other, lower-priority job into account.

          JOBID PARTITION   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
       15818038 knl           104326         20          0       4306     100000
       15818039 knl           101137         20          0       1117     100000
       15818040 knl           101079         16          0       1063     100000

When we boost the priority of job 15818040 to be higher than that of the modular job, it is immediately scheduled.
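(For reference, such a priority bump can be done with e.g. "scontrol update JobId=15818040 Priority=110000"; the value is only illustrative, anything above the pack leader's 104326 has the same effect.)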

best
Benedikt
Comment 138 Benedikt von St. Vieth 2018-12-07 08:09:56 MST
Hi Alejandro,

is there anything new regarding this topic? Were you able to reproduce the issues?

best
Benedikt
Comment 139 Alejandro Sanchez 2018-12-07 08:25:25 MST
What you report now looks like a different problem. The patch we provided avoids the problem of a hetjob not starting because some of its components fit within bf_max_job_test but the rest do not. From your previous comment it looks like the problem now is that, because a hetjob is not being scheduled, a lower-priority regular job can't be scheduled either, since the hetjob is blocking it. Why is the hetjob not starting now? Theoretically, bf_max_job_test shouldn't be a problem with the given patch.
Comment 143 Benedikt von St. Vieth 2018-12-10 08:58:12 MST
Hi Alejandro,

the initial problem this patch addresses is yet to be tested; unfortunately our test system is in maintenance.
As for the "new" problem we described, the job can never be scheduled because it requests too many nodes (4+1 while the partition only has 4).
Nevertheless, it should either be rejected or at least not deadlock the system, do you agree?

best
Benedikt
Comment 144 Alejandro Sanchez 2018-12-10 09:16:23 MST
I agree, although that is a separate issue tracked in bug 5656. I'd suggest we focus on one issue per bug; otherwise we try to tackle/discuss many things at once and end up not addressing anything.

We're now working on two measures that have been suggested:

1. min/avg the priority of all components of a hetjob
2. try to schedule all components of a hetjob in a row, not interleaved with other jobs/hetjobs as currently happens, to avoid deadlocking.
Comment 150 Dorian Krause 2018-12-13 13:09:09 MST
(In reply to Alejandro Sanchez from comment #144)
> I agree although that is a separate issue tracked in bug 5656. I'd suggest
> we focus on one issue per bug. Otherwise, we try to tackle/discuss many
> things at once and we end up not addressing anything.
> 
> We're now working on either two measures that had been suggested around:
> 
> 1. min/ave the priority of all components of a hetjob 
> 2. try to schedule all components of a hetjob in a row and not interleaved
> with other jobs/hetjobs as currently happens to avoid deadlocking.


Dear Alejandro,

one question concerning 1: will this work with `bf_max_job_part != 0`? In our setup, job packs will in most cases span different partitions, each of which may have different homogeneous jobs with their own priority distribution.

If possible, we would also like to have a max option for 1.; I would suggest leaving the choice to the administrators via a slurm.conf setting.
In our setting we will likely be more interested in a policy that favors heterogeneous jobs in the scheduler. My understanding is that option 2 would also do that, though it is not quite clear how the interleaving would be avoided. Do you block scheduling of homogeneous jobs until the heterogeneous jobs are started?

Thanks,
Dorian
Comment 152 Benedikt von St. Vieth 2018-12-21 11:14:50 MST
Hi,

i am out of office until 2nd of January.

If there are issues with JURECA, JUROPA3/JUAMS or JUZEA1, feel free to contact jj-adm@fz-juelich.de

best regards
benedikt
Comment 177 Alejandro Sanchez 2019-01-07 01:52:17 MST
(In reply to Dorian Krause from comment #150)
> Dear Alejandro,

Hi Dorian,

A new bf_hetjob_prio option has been added, available since 18.08.5 release:

https://github.com/SchedMD/slurm/commit/b24f673ecb6902

This will make components belonging to the same hetjob be attempted consecutively by the backfill scheduler (avoiding the interleaving problem).
 
> one question concering 1: Will this work with `bf_max_job_part != 0`? In our
> settings, job packs will in most cases be in different partitions that each
> may have a different homogeneous jobs with their own priority distribution.

The patch doesn't affect how hetjobs/components are counted with regard to bf_max* options or other limits; it just alters the scheduling sort order so that they are considered in a row. There's an internal bug 5957 tracking how hetjobs are counted against limits.

> If possible, we would also like to have the max option for 1. - if it is
> possible I would suggest to leave the choice to the administrators via a
> slurm.conf setting.

The option accepts any of [min|avg|max]. A more detailed description can be found in the slurm.conf man page.
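For example, in slurm.conf (illustrative only; keep whatever other SchedulerParameters you already use and merge the option into that list):

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_hetjob_prio=max,bf_max_job_test=32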

> In our setting we will likely be more interested in a policy that favors
> heterogeneous jobs in the scheduler. My understanding is that option 2 would
> also do that, though it is not quite clear how the interlaving would be
> avoided. Do you block scheduling of homogeneous job until the heterogeneous
> jobs are started?

bf_hetjob_prio does not favor hetjobs as such; it just alters the sorting so that all components belonging to the same hetjob are attempted consecutively (by biasing all of them to the min|avg|max component priority, depending on the configured value, as mentioned).

bf_hetjob_immediate is not yet implemented; we'll get back to you once it is ready. In any case, the original problem described in this bug (scheduler deadlocking due to interleaved components, more specifically due to PriorityWeightJobSize combined with components of different sizes) is tackled with bf_hetjob_prio.

Are you still running 17.11? If so, are you planning to upgrade to 18.08, or do you need me to attach a 17.11 version of the patch for you to maintain locally?
 
> Thanks,
> Dorian

Thanks to you, and sorry for the delay on this bug resolution.
Comment 178 Benedikt von St. Vieth 2019-01-07 02:58:24 MST
Dear Alejandro,

yes, unfortunately we are still on 17.11.

Since we already have the two patch files you provided, it would be nice to get a backport of this one too.
Also, can we remove some of the old patches?

best
Benedikt
Comment 179 Alejandro Sanchez 2019-01-07 04:48:13 MST
Created attachment 8859 [details]
17.11 - bf_hetjob_prio

Attached is a 17.11 standalone patch of bf_hetjob_prio that applies to the current 17.11 HEAD.
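In case it helps, assuming the attachment is saved as bf_hetjob_prio-1711.patch (the name is just an example) and you are at the top of the 17.11 source tree, something like the following should apply it before rebuilding and restarting slurmctld:

{{{
patch -p1 < bf_hetjob_prio-1711.patch
# or equivalently: git apply bf_hetjob_prio-1711.patch
}}}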

I'd remove the patch that added bf_hetjob_count and bf_hetjob_resv_limit and apply this bf_hetjob_prio one instead, since it's a different (and better) approach to attacking the same problem.

I'd also remove the bf_max_job_test special treatment for hetjobs patch, unless you're continuously hitting that issue. How hetjobs/components are counted against scheduler and accounting limits can be tracked globally as a separate issue in the mentioned bug 5957 (I'll ask if we can make it public); otherwise we're adding too much overhead here. While the bf_max_job_test special treatment patch could be maintained, I think it is better that hetjobs and limits are all tackled at once, in the same way, in that bug. Attacking bf_max_job_test alone wouldn't make sense when there are potential issues also with bf_max_job_[assoc|part|user|user_part], slurm.conf MaxJobs, or other limits set on different entities such as the accounting ones.

So in short I'd remove all the previous patches provided here and just work with bf_hetjob_prio and see if the original deadlocking problem is mitigated. In the meantime we'll work on bf_hetjob_immediate.

Please, let me know if you have further questions. Thanks.
Comment 180 Chrysovalantis Paschoulas 2019-01-18 09:23:11 MST
Dear Alejandro,

I would like to bring some good news: it seems that the latest patch, which adds the bf_hetjob_prio option, fixes both bug 5579 and bug 5957, at least for what we have tested so far.

We had 2 major issues:
1. The first case was that we could create a scheduling deadlock where no jobs could run on a partition when a hetjob was blocked because of unavailable resources. This is easily reproducible when one job from the pack would allocate a whole partition and other jobs from the pack request nodes from that same partition (a reproducer sketch follows this list).
2. The second case was that, while a hetjob was waiting for Resources and one of its jobs had the highest priority in a partition and "reserved" nodes, those "reserved" nodes couldn't be used for backfilling lower priority jobs.
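For case 1, a reproducer looks roughly like the following (assuming a partition named "small" with only 4 nodes; names, sizes and times are just an example). The pack as a whole can never start, yet while it is pending it blocks the jobs behind it:

{{{
# +0 component fills the whole 4-node partition; +1 needs a 5th node in it
sbatch -J PACK --time=01:00:00 -p small -N 4 --wrap='sleep 600' : -p small -N 1 --wrap='sleep 600'
# a later, lower priority single-node job in "small" then stays pending too
sbatch -J LOW --time=00:10:00 -p small -N 1 --wrap='hostname'
}}}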

It seems both cases are solved with that patch! :D Even with many blocked hetjobs (more than bf_max_job_test, default_queue_depth and the other bf_* limits) the tests passed successfully and low prio jobs could be scheduled! The only concern is that the Reason field has some unexpected values: all blocked hetjob jobs had Resources in the REASON field. Is this expected or not?

We still need to do more tests and cover more cases.

Thank you so much for the great work! Cya :)
Comment 181 Chrysovalantis Paschoulas 2019-01-24 08:54:05 MST
Just one remark: when I set bf_max_job_test=2, no hetjob with more than 2 jobs inside could be scheduled, because backfill was testing only the first 2 jobs and never the remaining jobs of the hetjob.

This is an extreme corner case and normally doesn't hurt, but it seems to be a restriction on the hetjobs side. I mean if we have that option set to 64 then a hetjob with 65 jobs will never be scheduled!

Is this the intended behaviour?
Comment 182 Alejandro Sanchez 2019-01-28 04:50:13 MST
(In reply to Chrysovalantis Paschoulas from comment #180)
> Dear Alejandro,
> 
> I would like to bring some good news, it seems that the latest patch which
> adds the option bf_hetjob_prio fixes both bugs 5579 and 5957, or at least
> what we have tested until now.
> 
> We had 2 major issues:
> 1. The first case was that we could create a scheduling deadlock where no
> jobs could run on a partition when a hetjob was blocked because of
> unavailable resources. Easily reproducible when a job from the pack
> allocates a whole partition and other jobs from the pack request nodes from
> that partition.
> 2. The second case was that while a hetjob was waiting for Resources and one
> of its jobs had highest prio in a partition and "reserved" nodes for that
> job then those "reserved" nodes couldn't be used for backfilling low prio
> jobs.
> 
> Both cases it seems that they are solved with that patch! :D Even with many
> blocked hetjobs (more than bf_max_test_job, default_queue_depth and other
> bf_$ limits) the tests passed successfully and low prio jobs could be
> scheduled! 

Glad to hear the new option is being useful.

> The only concern is that the Reason field has some unexpected
> values => all blocked hetjob jobs had Resources in the REASON field, is this
> expected or not?

Well, I don't see all blocked/PD hetjob components always with Reason set to Resources. What I see from my tests is that sometimes the Reason is set to Priority, sometimes to Resources, and sometimes it is left with no reason set (None). I think there's room for improvement around this, but I'd vote for tracking that as a separate bug.

Currently, there is no Reason associated with jobs that are pending due to bf_max* or other sched limits. I.e. if bf_max_job_part or a similar limit is hit, a message like this is logged to slurmctld.log:

info("backfill: have already checked %u jobs for user %u on partition %s; skipping %pJ")

but the job won't change its reason to BackfillMaxJobPart or similar. This happens for all jobs, whether hetjobs, arrays or regular jobs, and I'm not even sure it would make sense to set a reason for these cases.
 
> We still need to do more tests and cover more cases.
> 
> Thank you so much for the great work! Cya :)

Thanks for your feedback.

(In reply to Chrysovalantis Paschoulas from comment #181)
> Just one remark, when I set bf_max_job_test=2 then no hetjob with more than
> jobs inside could be scheduled, because BF was testing only the 2 first jobs
> and never the rest jobs of the hetjob.
> 
> This is an extreme corner case and normally doesn't hurt, but it seems to be
> a restriction on the hetjobs side. I mean if we have that option set to 64
> then a hetjob with 65 jobs will never be scheduled!
> 
> Is this the intended behaviour?

That should be addressed alongside the rest of Slurm limits calculation in bug 5957, as I've said a few times in other comments.
Comment 184 Chrysovalantis Paschoulas 2019-01-30 09:23:45 MST
All my tests on a small test cluster went fine, but now that we have deployed to a big production cluster we can see some weird behaviour in the scheduling of hetjobs... On both clusters we have Slurm 17.11.12 with the latest patch, which adds the bf_hetjob_prio parameter.

On the production cluster we can see that, while the hetjobs are pending in the queue, they "never" get a reason and instead keep "(None)" as their reason. On the small test cluster, even with many normal jobs in the queue (I tested with 5k), the hetjobs immediately get the "(Priority)" reason.
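(For reference, the pending reasons can be inspected with e.g. squeue -t PD -o "%.18i %.9P %.2t %.30r"; the exact format string is only an example.)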

Here are the differences in the scheduler parameters, a is the small test cluster and b is the big production one:
{{{
diff --git a/p/home/jusers/paschoulas1/juropa3exp/shared/params_j3exp_sorted b/p/home/jusers/paschoulas1/juropa3exp/shared/params_jumels_sorted
index f939101..d69d1df 100644
--- a/p/home/jusers/paschoulas1/juropa3exp/shared/params_j3exp_sorted
+++ b/p/home/jusers/paschoulas1/juropa3exp/shared/params_jumels_sorted
@@ -1,30 +1,35 @@
 batch_sched_delay=4
 bf_continue
 bf_hetjob_prio=max
-bf_interval=60
+bf_interval=120
+bf_job_part_count_reserve=0
 bf_max_job_array_resv=20
 bf_max_job_part=0
 bf_max_job_start=0
-bf_max_job_test=10
+bf_max_job_test=32
 bf_max_job_user=0
 bf_min_age_reserve=0
 bf_min_prio_reserve=0
-bf_resolution=60
-bf_window=1440
+bf_resolution=180
+bf_window=1500
 bf_yield_interval=2000000
 bf_yield_sleep=500000
 build_queue_timeout=2000000
-default_queue_depth=512
+default_queue_depth=1024
+defer
 enable_hetero_steps
 kill_invalid_depend
-max_depend_depth=16
-max_rpc_cnt=0
+max_depend_depth=64
+max_rpc_cnt=128
 max_sched_time=2
 max_script_size=4194304
+max_switch_wait=300
 no_env_cache
+nohold_on_prolog_fail
 partition_job_depth=0
 preempt_reorder_count=1
 requeue_setup_env_fail
+salloc_wait_nodes
 sched_interval=60
 sched_max_job_start=0
 sched_min_interval=1000000
}}}

In general it looks to us like the hetjobs are not being scheduled as "fast" as we would expect even with bf_hetjob_prio=max, but we saw that a small hetjob was actually scheduled in the end after many hours in the queue. So we cannot say that the scheduling of the hetjobs is completely blocked, it is just "slower" than expected.

Any ideas why the Reason field differs between those 2 clusters?
Comment 185 Alejandro Sanchez 2019-02-05 02:27:43 MST
Hi Chrysovalantis,

I can reproduce hetjobs' state reason remaining WAIT_NO_REASON (None) even after a backfill cycle evaluation. Although this should be tracked as a separate bug, I've prepared a patch that seems to update the state reason when appropriate; it is currently pending review. We'll come back to you shortly.
Comment 191 Alejandro Sanchez 2019-02-06 07:46:02 MST
Created attachment 9094 [details]
1711 - standalone combined prio immediate state reason

The attached patch combines bf_hetjob_prio, bf_hetjob_immediate and the fix for the hetjobs state reason, to be used/maintained as a standalone 17.11 patch. Please try it out and let us know if the state reason is now set appropriately.

Next week we'll give you more information about the commits related to this in 18.08 and potential changes on default values for future 19.05.

Thanks.
Comment 192 Alejandro Sanchez 2019-02-12 02:41:49 MST
Any updates on this? Is the PD reason properly updated now? Thanks.
Comment 193 Chrysovalantis Paschoulas 2019-02-12 03:11:00 MST
(In reply to Alejandro Sanchez from comment #192)
> Any updates on this? is the PD reason properly updated now? thanks.

Hi Alejandro,

it seems that it has improved but still there are some issues.

On our small test cluster I submitted hetjobs requesting nodes from 2 different partitions, together with some normal jobs, mixed over time.

What happened was:

-- REASONS --
1. The hetjobs now get a Reason while pending, but not immediately like the normal jobs. Some got a Reason quickly, while other hetjobs kept the None Reason for a "long period" (all jobs in the pack).

2. For most hetjobs the Reason for the +0 pack jobs was Resources and for the +1 pack jobs it was None, while for all eligible normal jobs it was Priority. Only a few times did I see a hetjob with Priority as its Reason.

-- SCHEDULING --
3. Even with the scheduler parameters bf_hetjob_prio=max,bf_hetjob_immediate, the normal jobs were always scheduled first, even when hetjobs had Resources as their Reason. For example, when the queue had only hetjobs, some of them with Resources as their Reason, and I then submitted some small low-priority normal jobs, the backfiller scheduled the normal jobs instead of the high-priority hetjobs.

I think the third issue is really the most important! But it is nice that the Reasons get updated, and my suggestion is that all Reasons should always be updated; for example, when a job is not checked, set its reason to something like Unchecked instead of leaving it at None.
Comment 202 Alejandro Sanchez 2019-02-26 11:27:21 MST
I've opened bug 6593 and bug 6594 to track the new issues from comment 193.

The bf_hetjob_immediate patch and another minor patch are still pending review here; once they are checked in we will close this ticket, leaving those two separate bugs to track the rest of the new concerns.
Comment 204 Alejandro Sanchez 2019-03-18 07:06:38 MDT
Hi,

I'm gonna go ahead and close this bug as per the fixes in:

https://github.com/SchedMD/slurm/commit/44ed6bc3b3

and

https://github.com/SchedMD/slurm/commit/b24f673ecb

I've also opened 3 separate bugs to track other issues with hetjobs and decouple them from here:

1. Make bf_max_job_start sensitive to hetjobs (bug 6710)
2. Main scheduler preferring regular jobs vs hetjobs when it shouldn't (bug 6593)
3. Improve pending hetjobs state_reason (bug 6594)