Ticket 6384 - Would like to have NodeWeight functional in a topology/tree-configured environment
Summary: Would like to have NodeWeight functional in a topology/tree-configured environment
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
 
Reported: 2019-01-18 13:16 MST by Ryan Novosielski
Modified: 2020-02-14 09:01 MST
CC: 3 users

Site: Rutgers
Linux Distro: CentOS
Machine Name: perceval


Attachments
slurm.conf (5.19 KB, text/plain), 2019-01-23 15:46 MST, Ryan Novosielski
topology.conf (336 bytes, text/plain), 2019-01-23 15:46 MST, Ryan Novosielski

Description Ryan Novosielski 2019-01-18 13:16:43 MST
We are aware of the following limitation (from https://slurm.schedmd.com/topology.html):

"NOTE:Slurm first identifies the network switches which provide the best fit for pending jobs and then selectes the nodes with the lowest "weight" within those switches. If optimizing resource selection by node weight is more important than optimizing network topology then do NOT use the topology/tree plugin."

This is problematic in our environment: as it happens, the topology/tree plugin largely ensures that the nodes we have configured with the highest weights are used first, or nearly first:

# COMPUTE NODES
NodeName=node[001-130] CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=node[131-132] Weight=2 CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=cuda[001-008] Weight=4 CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
NodeName=memnode[001] Weight=8 CPUs=48 RealMemory=1548076 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN

Our highest-weight node is on a switch by itself with the storage array, to provide the fastest access to the storage. Our GPU nodes are also grouped together on a single switch, and that switch also contains the login and infrastructure nodes that are not accounted for by the scheduler, so there are only 20 nodes on that switch vs. 24 on the others. The result appears to be that single-node jobs will always prefer this equipment, and there are always single-node jobs running.
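
In topology.conf terms, the layout is roughly the following (switch names and node groupings here are only illustrative; the actual topology.conf is attached further down in this ticket):

SwitchName=sw-storage Nodes=memnode001          # highest-weight node, alone with the storage array
SwitchName=sw-gpu Nodes=cuda[001-008]           # GPU switch; also carries login/infrastructure nodes that Slurm does not know about
SwitchName=sw-rackN Nodes=node[...]             # ordinary leaf switches, 24 compute nodes each
SwitchName=root Switches=sw-storage,sw-gpu,sw-rackN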

Any suggestion for a clever workaround would be helpful too, but I suspect this conflict makes both topology/tree and the node weight options much less useful to many sites.
Comment 3 Alejandro Sanchez 2019-01-23 04:07:45 MST
Hi,

(In reply to Ryan Novosielski from comment #0)
> We are aware of the following limitation (from
> https://slurm.schedmd.com/topology.html):
> 
> "NOTE:Slurm first identifies the network switches which provide the best fit
> for pending jobs and then selectes the nodes with the lowest "weight" within
> those switches. If optimizing resource selection by node weight is more
> important than optimizing network topology then do NOT use the topology/tree
> plugin."

This note was added in commit:

https://github.com/SchedMD/slurm/commit/f1a3e958e19c1cf34663a0

as a result of the analysis in bug 1979.
 
> This is problematic in our environment, as coincidentally the topology/tree
> plugin largely ensures that the nodes we have configured with the highest
> weights will be used first, or nearly first:
> 
> # COMPUTE NODES
> NodeName=node[001-130] CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12
> ThreadsPerCore=1 State=UNKNOWN
> NodeName=node[131-132] Weight=2 CPUs=24 RealMemory=128502 Sockets=2
> CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> NodeName=cuda[001-008] Weight=4 CPUs=24 RealMemory=128502 Sockets=2
> CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
> NodeName=memnode[001] Weight=8 CPUs=48 RealMemory=1548076 Sockets=4
> CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> 
> Our highest weight node is on a switch by itself with the storage array to
> provide the fastest access to the storage. Our GPU nodes are also grouped
> together on a single switch, and that switch also contains the login and
> infrastructure nodes that are not accounted for by the scheduler, so there
> are only 20 nodes on that switch vs. 24 on the others. The result appears to
> be that single node jobs will always prefer this equipment, and there are
> always single node jobs running.

I'm doing some tests locally and what you describe isn't what I'm seeing. My simple config contains:

slurm.conf:
TopologyPlugin=topology/tree
NodeName=compute1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7000 NodeHostname=polaris State=UNKNOWN Port=61201 Weight=2
NodeName=compute[2-3] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=4000 NodeHostname=polaris State=UNKNOWN Port=61202-61203 Weight=1
PartitionName=p1 Nodes=ALL Default=YES State=UP DefMemPerCPU=100

topology.conf:
SwitchName=root Switches=s[0-1]
SwitchName=s0 Nodes=compute1
SwitchName=s1 Nodes=compute[2-3]

compute[2-3] are low-memory nodes and have Weight=1. compute1 is a higher-memory node, has Weight=2 (the highest defined weight), and is on a switch by itself (trying to emulate your memnode[001] node).

alex@polaris:~/t$ sbatch -N1 --switch=1 --exclusive --wrap "sleep 9999"
Submitted batch job 20016
alex@polaris:~/t$ sbatch -N1 --switch=1 --exclusive --wrap "sleep 9999"
Submitted batch job 20017
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20017        p1     wrap     alex  R       0:01      1 compute3
             20016        p1     wrap     alex  R       0:02      1 compute2
alex@polaris:~/t$

It looks like the lowest-Weight nodes are selected first.

Could you attach your slurm.conf, topology.conf and an example job submission request?

I'm wondering whether the single-node jobs landing on memnode001 are being placed there because their resource requests (for instance memory) can only be satisfied on that node, or because the other nodes are full and that one is free. I don't have this information yet, so I'm just guessing for now.
 
> Any suggestion for a clever workaround would be helpful too, but I suspect
> this conflict makes both topology/tree and the node weight options much less
> useful to many sites.

I'm not seeing the behavior you describe. But in any case there are a few workarounds that come to mind:

1. Consider using TopologyParam=TopoOptional[1].

2. Put nodes with different hardware in separate partitions.

3. Make use of Features, e.g. highmem or switchX, and then use -C with a 'Matching OR'[2] constraint, e.g. -C [s0|s1|s2]. A rough sketch of options 1 and 3 follows the references below.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_TopoOptional

[2] https://slurm.schedmd.com/sbatch.html#OPT_Matching-OR
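
Roughly, for options 1 and 3 (the feature names s0/s1/s2 and the node lines below are placeholders, not your actual config):

# slurm.conf - option 1: only apply topology-aware placement to jobs that explicitly request --switches
TopologyParam=TopoOptional

# slurm.conf - option 3: tag each switch's nodes with a feature
NodeName=node[001-130] Feature=s0 ...
NodeName=cuda[001-008] Weight=4 Feature=s1,gpu ...
NodeName=memnode[001] Weight=8 Feature=s2,highmem ...

# a multi-node job can then be kept on a single feature set (i.e. a single switch) with a "Matching OR" constraint:
sbatch -N4 -C "[s0|s1|s2]" --wrap "sleep 60"

With TopoOptional, jobs that don't request --switches should fall back to the normal Weight-based node selection.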
Comment 4 Ryan Novosielski 2019-01-23 12:45:15 MST
I've not looked at the source code of the plugin so I'm not exactly sure that this is what happens, but it seems to be how it plays out. I'm going to attach my configuration, and maybe it will make it clearer.

However, I would think that someone who /is/ familiar with the source code of the plugin would be able to tell fairly easily what the logic should dictate. To me, it appears that for a single-node job the logic targets the switch that is least likely to be able to host a multi-node job. If I'm not correct about that, something else is going on.

I'll attach the configs now.
Comment 5 Ryan Novosielski 2019-01-23 15:46:08 MST
Created attachment 8995 [details]
slurm.conf

slurm.conf from perceval, the cluster where I noticed this behavior
Comment 6 Ryan Novosielski 2019-01-23 15:46:40 MST
Created attachment 8996 [details]
topology.conf

topology.conf from perceval, a cluster where I've noticed this behavior
Comment 7 Ryan Novosielski 2019-01-23 15:47:35 MST
FYI, we run SLURM 17.11.7. I've put 18.04.4 in this request because my understanding is that nothing has changed in this area. If that's not true, feel free to modify the version.
Comment 9 Alejandro Sanchez 2019-01-24 05:19:09 MST
I've built Slurm 17.11.7 with the same nodes/partitions/topology as you attached, but I can't reproduce what you describe:

alex@polaris:~/t$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
bg           up 3-00:00:00    141   idle cuda[001-008],memnode001,node[001-132]
oarc         up 3-00:00:00    141   idle cuda[001-008],memnode001,node[001-132]
testing      up       2:00      2   idle node[131-132]
main*        up 3-00:00:00    141   idle cuda[001-008],memnode001,node[001-132]
perceval     up 7-00:00:00    141   idle cuda[001-008],memnode001,node[001-132]
largemem     up 2-00:00:00      1   idle memnode001
gpu          up 2-00:00:00      8   idle cuda[001-008]
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
alex@polaris:~/t$ sbatch -N1 --wrap "sleep 9999"
Submitted batch job 20004
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20004      main     wrap     alex  R       0:02      1 node073
alex@polaris:~/t$

The job is submitted by default to the main partition, which includes Nodes=node[001-132],cuda[001-008],memnode[001], and is allocated to node073, which is one of the nodes with the lowest Weight (1 by default).

It doesn't look like "the logic appears to target the switch that is the least likely to be able to host a multi-node job", as you speculated. I suspect there are other reasons for single-node jobs landing on this node.

Can you give me an example 'sinfo' state followed by an example job submission request landing on memnode001?

What partition(s) and resources (CPUs, memory, etc.) are the jobs allocated to memnode001 requesting?
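
For instance, something along these lines would capture the state I'm after (the job id is just a placeholder):

sinfo -N -o "%N %P %c %m %f %w %t"
squeue -w memnode001 -o "%i %P %u %C %m %R"
scontrol show job <jobid>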
Comment 10 Alejandro Sanchez 2019-02-08 02:54:32 MST
Hi, any updates on this?
Comment 11 Ryan Novosielski 2019-02-08 07:13:45 MST
Sorry, not at this time. Unfortunately I've been out sick a number of days and otherwise engaged during the days I was in the office. This is still an important issue for us and I hope to be able to return to it in the near future (just came up as a question on our support line in the last day or two).
Comment 12 Alejandro Sanchez 2019-03-28 04:35:38 MDT
Hi Ryan,

Any further information related to this bug? Thank you.
Comment 13 Alejandro Sanchez 2019-05-01 06:05:02 MDT
Hi,

We've not had any feedback for more than a month now. I'm inclined to tag this as timed out. Please feel free to reopen when you can provide more information.