We are aware of the following limitation (from https://slurm.schedmd.com/topology.html):

"NOTE: Slurm first identifies the network switches which provide the best fit for pending jobs and then selects the nodes with the lowest "weight" within those switches. If optimizing resource selection by node weight is more important than optimizing network topology then do NOT use the topology/tree plugin."

This is problematic in our environment: as it happens, the topology/tree plugin largely ensures that the nodes we have configured with the highest weights will be used first, or nearly first:

# COMPUTE NODES
NodeName=node[001-130] CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=node[131-132] Weight=2 CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=cuda[001-008] Weight=4 CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
NodeName=memnode[001] Weight=8 CPUs=48 RealMemory=1548076 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN

Our highest-weight node is on a switch by itself with the storage array, to provide the fastest access to storage. Our GPU nodes are also grouped together on a single switch, and that switch also contains the login and infrastructure nodes that are not accounted for by the scheduler, so there are only 20 nodes on that switch vs. 24 on the others. The result appears to be that single-node jobs will always prefer this equipment, and there are always single-node jobs running.

Any suggestion for a clever workaround would be helpful too, but I suspect this conflict makes both topology/tree and the node weight options much less useful to many sites.
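Purely as an illustration of the layout described above (the switch names and exact node groupings here are assumptions, not our real configuration; the actual topology.conf is attached later in this ticket), the topology looks roughly like this:

# Hypothetical topology.conf sketch -- switch names made up for illustration
SwitchName=root Switches=sw[01-08]
SwitchName=sw01 Nodes=node[001-024]                 # 24 compute nodes per full leaf switch
SwitchName=sw07 Nodes=cuda[001-008],node[121-132]   # GPU leaf switch: only ~20 scheduler-visible nodes,
                                                    # the rest are login/infrastructure nodes
SwitchName=sw08 Nodes=memnode001                    # high-memory node alone on a switch with the storage array

The smaller leaf switches (sw07 and sw08 in this sketch) are exactly the ones holding our highest-weight nodes, which is where the best-fit switch selection and the weight ordering appear to conflict.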
Hi,

(In reply to Ryan Novosielski from comment #0)
> We are aware of the following limitation (from
> https://slurm.schedmd.com/topology.html):
>
> "NOTE: Slurm first identifies the network switches which provide the best fit
> for pending jobs and then selects the nodes with the lowest "weight" within
> those switches. If optimizing resource selection by node weight is more
> important than optimizing network topology then do NOT use the topology/tree
> plugin."

This note was added in commit https://github.com/SchedMD/slurm/commit/f1a3e958e19c1cf34663a0 as a result of the analysis in bug 1979.

> This is problematic in our environment, as coincidentally the topology/tree
> plugin largely ensures that the nodes we have configured with the highest
> weights will be used first, or nearly first:
>
> # COMPUTE NODES
> NodeName=node[001-130] CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node[131-132] Weight=2 CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> NodeName=cuda[001-008] Weight=4 CPUs=24 RealMemory=128502 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:4
> NodeName=memnode[001] Weight=8 CPUs=48 RealMemory=1548076 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
>
> Our highest weight node is on a switch by itself with the storage array to
> provide the fastest access to the storage. Our GPU nodes are also grouped
> together on a single switch, and that switch also contains the login and
> infrastructure nodes that are not accounted for by the scheduler, so there
> are only 20 nodes on that switch vs. 24 on the others. The result appears to
> be that single node jobs will always prefer this equipment, and there are
> always single node jobs running.

I'm doing some tests locally and what you describe isn't what I'm seeing. My simple config contains:

slurm.conf:
TopologyPlugin=topology/tree
NodeName=compute1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7000 NodeHostname=polaris State=UNKNOWN Port=61201 Weight=2
NodeName=compute[2-3] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=4000 NodeHostname=polaris State=UNKNOWN Port=61202-61203 Weight=1
PartitionName=p1 Nodes=ALL Default=YES State=UP DefMemPerCPU=100

topology.conf:
SwitchName=root Switches=s[0-1]
SwitchName=s0 Nodes=compute1
SwitchName=s1 Nodes=compute[2-3]

compute[2-3] are low-memory nodes and have Weight=1. compute1 is the higher-memory node, has Weight=2 (the highest defined weight), and is on a switch by itself (trying to emulate your memnode[001] node).

alex@polaris:~/t$ sbatch -N1 --switch=1 --exclusive --wrap "sleep 9999"
Submitted batch job 20016
alex@polaris:~/t$ sbatch -N1 --switch=1 --exclusive --wrap "sleep 9999"
Submitted batch job 20017
alex@polaris:~/t$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  20017        p1  wrap  alex  R  0:01     1 compute3
  20016        p1  wrap  alex  R  0:02     1 compute2
alex@polaris:~/t$

It looks like the lowest-Weight nodes are selected first.

Could you attach your slurm.conf, topology.conf, and an example job submission request? I'm wondering whether the single-node jobs landing on memnode001 are being placed there because their resource request (for instance, memory) can only be satisfied on that node, or whether the other nodes are full and this one is free, so jobs end up there. I don't have this information yet, so I'm just guessing for now.
> Any suggestion for a clever workaround would be helpful too, but I suspect
> this conflict makes both topology/tree and the node weight options much less
> useful to many sites.

I'm not seeing the behavior you describe, but in any case a few workarounds come to mind:

1. Consider using TopologyParam=TopoOptional[1].
2. Put nodes with different hardware in separate partitions.
3. Make use of Features, e.g. highmem,switchX, and then use -C with "Matching OR"[2], e.g. -C [s0|s1|s2].

See the sketch after this comment for roughly what each option might look like.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_TopoOptional
[2] https://slurm.schedmd.com/sbatch.html#OPT_Matching-OR
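Very roughly, and purely as an illustration (the partition names, feature names, and switch tags below are made up, not taken from your configuration), the three options could look like:

# 1. Only enforce topology when a job asks for it (slurm.conf):
TopologyParam=TopoOptional
#    Jobs that care about placement then request it explicitly, e.g.:
#    sbatch --switches=1 -N4 job.sh

# 2. Separate partitions per hardware type (slurm.conf, hypothetical names):
PartitionName=main     Nodes=node[001-132]   Default=YES
PartitionName=gpu      Nodes=cuda[001-008]
PartitionName=largemem Nodes=memnode001

# 3. Features plus "Matching OR" constraints (slurm.conf, hypothetical feature names):
NodeName=node[001-130] Feature=s1
NodeName=cuda[001-008] Feature=gpu,s2
NodeName=memnode001    Feature=highmem,s3
#    A job that is happy to run on any one of those switches:
#    sbatch -C "[s1|s2|s3]" -N2 job.sh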
I've not looked at the plugin's source code, so I'm not certain this is what happens, but it seems to be how it plays out in practice. I'm going to attach my configuration, which may make it clearer. That said, I would think someone who /is/ familiar with the plugin's source could tell fairly easily what the logic dictates. To me it looks as though, for a single-node job, the logic targets the switch that is least likely to be able to host a multi-node job. If I'm not correct about that, something else is going on. I'll attach the configs now.
Created attachment 8995 [details]
slurm.conf

slurm.conf from perceval, the cluster where I noticed this behavior
Created attachment 8996 [details]
topology.conf

topology.conf from perceval, a cluster where I've noticed this behavior
FYI, we run SLURM 17.11.7. I've put 18.04.4 in this request because my understanding is that nothing has changed in this area. If that's not true, feel free to modify the version.
I've built a Slurm 17.11.7 test setup with the same nodes/partitions/topology as you attached, but I can't reproduce what you describe:

alex@polaris:~/t$ sinfo
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
bg           up 3-00:00:00   141  idle cuda[001-008],memnode001,node[001-132]
oarc         up 3-00:00:00   141  idle cuda[001-008],memnode001,node[001-132]
testing      up       2:00     2  idle node[131-132]
main*        up 3-00:00:00   141  idle cuda[001-008],memnode001,node[001-132]
perceval     up 7-00:00:00   141  idle cuda[001-008],memnode001,node[001-132]
largemem     up 2-00:00:00     1  idle memnode001
gpu          up 2-00:00:00     8  idle cuda[001-008]
alex@polaris:~/t$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
alex@polaris:~/t$ sbatch -N1 --wrap "sleep 9999"
Submitted batch job 20004
alex@polaris:~/t$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  20004      main  wrap  alex  R  0:02     1 node073
alex@polaris:~/t$

The job is launched by default into the main partition, which includes Nodes=node[001-132],cuda[001-008],memnode[001], and it is allocated to node073, one of the nodes with the lowest Weight (1 by default). It doesn't look like "the logic appears to target the switch that is the least likely to be able to host a multi-node job", as you speculated. I suspect there are other reasons for single-node jobs landing on that node.

Can you give me an example 'sinfo' state followed by an example job submission request that lands on memnode001? What partition(s) and resources (CPUs, memory, etc.) are the jobs allocated to memnode001 requesting?
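If it helps to gather that information, something along these lines should show each node's weight and current usage, plus the full request of a job that landed on memnode001 (the sinfo format string is only a suggestion, and <jobid> is a placeholder for a real job ID):

# Per-node partition, state, weight, memory, and CPU usage (allocated/idle/other/total):
sinfo -N -o "%N %P %t %w %m %C"

# Full request and allocation details of a job that ended up on memnode001:
scontrol show job <jobid>

# Optionally, see where a submission *would* be placed without actually running it:
sbatch --test-only -N1 --wrap "sleep 60"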
Hi, any updates on this?
Sorry, not at this time. Unfortunately I've been out sick a number of days and otherwise engaged during the days I was in the office. This is still an important issue for us and I hope to be able to return to it in the near future (just came up as a question on our support line in the last day or two).
Hi Ryan, any further information related to this bug? Thank you.
Hi,

We've not had any feedback for more than a month now, so I'm inclined to tag this as timedout. Please feel free to reopen when you can provide more information.