Ticket 1979 - node weights ignored?
Summary: node weights ignored?
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.9
Hardware: Linux
OS: Linux
Priority: ---
Severity: 2 - High Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-09-27 09:46 MDT by rl303f
Modified: 2015-09-29 04:38 MDT
CC List: 4 users

See Also:
Site: NIH
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (12.17 KB, text/plain)
2015-09-27 09:46 MDT, rl303f

Description rl303f 2015-09-27 09:46:24 MDT
Created attachment 2258 [details]
slurm.conf

We seem to be having a problem where:

- we have an 'interactive' partition including multiple ranges of nodes

- the various ranges of nodes have different weights (some lower, some higher)

- we expect jobs submitted to this interactive partition to be allocated
  to the range of nodes (with available resources) having the lowest weight

- we find that the lowest weight nodes with available resources are ignored
  and skipped over and that jobs are instead inappropriately allocated to
  nodes having much higher weights 

We are at a loss to explain this behavior.  We set SlurmctldDebug=debug5
and submitted a job to try to get some explanation for this behavior, but
the log contents did not shed any light on the matter.
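
For reference, raising the controller log level amounts to something like
the following (a sketch; the log file path is only an example, not our
actual setting):

# in slurm.conf, followed by 'scontrol reconfigure' (or a slurmctld restart):
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log   # example path only
# or temporarily at runtime, without editing slurm.conf:
scontrol setdebug debug5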

We seem to have either a misconfiguration issue or to have stumbled upon a bug.

Thanks for any help you can provide!
Comment 1 Moe Jette 2015-09-27 10:12:24 MDT
Could you attach some logs and perhaps squeue output showing this? For jobs started on undesirable nodes, also include the output of 'scontrol show job' so we can check for job constraints (they might explicitly be requesting large-memory nodes or something of that sort).

Comment 2 rl303f 2015-09-28 01:32:45 MDT
Many thanks, Moe.  As a simplified example, we define two NodeName's
(Weight=40 and Weight=300) and put them both in the same partition:

NodeName=cn[0001-0310] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=129022 TmpDisk=6451 State=UNKNOWN Weight=40  Feature=cpu32,core16,g128,ssd800,x2650,10g Gres=lscratch:800

NodeName=cn[0603-0622] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=129007 TmpDisk=6450 State=UNKNOWN Weight=300 Feature=cpu32,core16,g128,ssd800,x2650,gpuk20x Gres=gpu:k20x:2,lscratch:800

PartitionName=interactive AllowGroups=ALL AllowAccounts=ALL AllowQos=staff,interactive AllocNodes=ALL Default=NO DefaultTime=08:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=cn[0001-0310,0603-0622] Priority=5000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF State=UP TotalCPUs=10560 TotalNodes=330 SelectTypeParameters=N/A DefMemPerCPU=768 MaxMemPerNode=UNLIMITED
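
As a quick cross-check (not part of the original test, just a sketch), the
configured weight and current state of every node in the partition can be
listed with:

# node name, short state and scheduling weight for the interactive partition
sinfo -N -p interactive -o "%N %t %w"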

Next, we see that the very first node in the cluster has
    30 free CPUs          (CPUTot=32 less CPUAlloc=2) and
125950 MB of free memory  (RealMemory=129022 less AllocMem=3072):

# scontrol show node cn0001
NodeName=cn0001 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=2 CPUErr=0 CPUTot=32 CPULoad=1.00 Features=cpu32,core16,g128,ssd800,x2650,10g
   Gres=lscratch:800
   GresDrain=N/A
   GresUsed=gpu:0,lscratch:0
   NodeAddr=cn0001 NodeHostName=cn0001 Version=14.11
   OS=Linux RealMemory=129022 AllocMem=3072 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=6451 Weight=40
   BootTime=2015-08-19T09:18:11 SlurmdStartTime=2015-09-25T15:13:57
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Then we submit as follows: 

salloc --partition=interactive --job-name=interactive srun --pty --preserve-env --x11=first /bin/bash

Yet the job ends up on a node with Weight=300, even though a lower-weight
node (cn0001, Weight=40) is available with 30 free CPUs and roughly 123 GB
of free memory:

JobId=2933567 JobName=interactive
   UserId=testuser(12345) GroupId=staff(49)
   Priority=604246 Nice=0 Account=sb QOS=interactive
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:33 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2015-09-27T21:11:40 EligibleTime=2015-09-27T21:11:40
   StartTime=2015-09-27T21:11:40 EndTime=2015-09-28T05:11:40
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=interactive AllocNode:Sid=biowulf2:23977
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn0615
   BatchHost=cn0615
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=cn0615 CPU_IDs=8-9 Mem=1536
   MinCPUsNode=1 MinMemoryCPU=768M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/spin1/users/testuser/slurm/sinteractive/debug
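
For completeness, the weight of the node that was actually allocated can be
confirmed with a one-liner along these lines:

scontrol show node cn0615 | grep -o "Weight=[0-9]*"    # prints Weight=300 here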

Below are 'debug5' log messages from time of submission
to where it is allocated the wrong node:

[2015-09-27T21:11:40.444] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=12345
[2015-09-27T21:11:40.474] job_submit.lua: NIH_job_submit: job from testuser
[2015-09-27T21:11:40.474] job_submit.lua: NIH_job_submit: job from testuser, partition interactive, setting default qos: interactive
[2015-09-27T21:11:40.474] Job submit request: account:(null) begin_time:0 dependency:(null) name:interactive partition:interactive qos:interactive submit_uid:12345 time_limit:4294967294 user_id:12345
[2015-09-27T21:11:40.474] debug3: JobDesc: user_id=12345 job_id=N/A partition=interactive name=interactive
[2015-09-27T21:11:40.474] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2015-09-27T21:11:40.474] debug3:    -N min-[max]: 1-[4294967294]:65534:65534:65534
[2015-09-27T21:11:40.474] debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2015-09-27T21:11:40.474] debug3:    immediate=0 features=(null) reservation=(null)
[2015-09-27T21:11:40.474] debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
[2015-09-27T21:11:40.474] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2015-09-27T21:11:40.474] debug3:    kill_on_node_fail=-1 script=(null)
[2015-09-27T21:11:40.474] debug3:    stdin=(null) stdout=(null) stderr=(null)
[2015-09-27T21:11:40.474] debug3:    work_dir=/spin1/users/testuser/slurm/sinteractive/debug alloc_node:sid=biowulf2:23977
[2015-09-27T21:11:40.474] debug3:    resp_host=10.1.201.9 alloc_resp_port=38630  other_port=37619
[2015-09-27T21:11:40.474] debug3:    dependency=(null) account=(null) qos=interactive comment=(null)
[2015-09-27T21:11:40.474] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2015-09-27T21:11:40.474] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2015-09-27T21:11:40.474] debug3:    end_time=Unknown signal=0@0 wait_all_nodes=-1
[2015-09-27T21:11:40.474] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2015-09-27T21:11:40.474] debug3:    mem_bind=65534:(null) plane_size:65534
[2015-09-27T21:11:40.474] debug3:    array_inx=(null)
[2015-09-27T21:11:40.474] debug3: found correct user
[2015-09-27T21:11:40.474] debug:  we are looking for a user association
[2015-09-27T21:11:40.474] debug:  we are looking for a user association
[2015-09-27T21:11:40.474] debug:  not the right user 12345 != 36469
[2015-09-27T21:11:40.474] debug3: found correct association
[2015-09-27T21:11:40.474] debug3: found correct qos
[2015-09-27T21:11:40.474] debug3: acct_policy_validate: MPC: job_memory set to 768
[2015-09-27T21:11:40.475] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2015-09-27T21:11:40.475] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2015-09-27T21:11:40.475] debug2: initial priority for job 2933567 is 604246
[2015-09-27T21:11:40.476] debug2: found 310 usable nodes from config containing cn[0001-0310]
[2015-09-27T21:11:40.476] debug2: found 20 usable nodes from config containing cn[0603-0622]
[2015-09-27T21:11:40.476] debug3: _pick_best_nodes: job 2933567 idle_nodes 815 share_nodes 1676
[2015-09-27T21:11:40.476] debug2: select_p_job_test for job 2933567
[2015-09-27T21:11:40.479] debug3: acct_policy_job_runnable_post_select: job 2933567: MPC: job_memory set to 1536
[2015-09-27T21:11:40.479] debug3: cons_res: _add_job_to_res: job 2933567 act 0 
[2015-09-27T21:11:40.479] debug3: cons_res: adding job 2933567 to part interactive row 0
[2015-09-27T21:11:40.479] debug2: _adjust_limit_usage: job 2933567: MPC: job_memory set to 1536
[2015-09-27T21:11:40.479] debug4: acct_policy_job_begin: after adding job 2933567, assoc sb grp_used_cpu_run_secs is 57600
[2015-09-27T21:11:40.479] debug4: acct_policy_job_begin: after adding job 2933567, assoc sb grp_used_cpu_run_secs is 57600
[2015-09-27T21:11:40.479] debug4: acct_policy_job_begin: after adding job 2933567, assoc root grp_used_cpu_run_secs is 68560632669758
[2015-09-27T21:11:40.479] debug2: Spawning RPC agent for msg_type REQUEST_LAUNCH_PROLOG
[2015-09-27T21:11:40.479] debug2: _adjust_limit_usage: job 2933567: MPC: job_memory set to 1536
[2015-09-27T21:11:40.479] debug2: sched: JobId=2933567 allocated resources: NodeList=cn0615
[2015-09-27T21:11:40.479] sched: _slurm_rpc_allocate_resources JobId=2933567 NodeList=cn0615 usec=35225

We're not quite sure why we end up on the Weight=300 node
when the preferred Weight=40 node is available.  Does the
information provided above help explain it?  Is there any
other data we can provide?
Comment 3 Moe Jette 2015-09-28 04:07:39 MDT
Could you attach your topology.conf file?

==============================

The nodes shown in partition interactive differ between the log and the configuration file. How did that happen?

Log:
PartitionName=interactive AllowGroups=ALL AllowAccounts=ALL
MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=cn[0001-0310,0603-0622] 

slurm.conf:
PartitionName=interactive Nodes=cn[0001-0310,1125-1884] State=UP Default=NO  Priority=5000 DefaultTime=08:00:00  MaxTime=36:00:00  DefMemPerCPU=768 AllowQOS=staff,interactive

==============================

Your jobs are running consistently, just not on the desired nodes, correct?
Comment 4 Moe Jette 2015-09-28 04:21:31 MDT
I see what is happening. Slurm first identifies the network switches which provide the best fit for a job and then selects the nodes with the lowest "weight" within those switches. If optimizing resource selection by node weight is more important than optimizing network topology, just comment out the "TopologyPlugin=topology/tree" line in slurm.conf.
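
A minimal sketch of that change in slurm.conf (the surrounding lines are illustrative, not taken from your attached config):

# disable topology-aware selection so node weights drive placement
#TopologyPlugin=topology/tree
# then restart slurmctld; a plugin change generally requires a restart rather
# than just 'scontrol reconfigure'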
Comment 5 Moe Jette 2015-09-28 04:33:48 MDT
(In reply to Moe Jette from comment #4)
> I see what is happening. Slurm first identifies the network switches which
> provide the best fit for a job and then selects the nodes with the lowest
> "weight" within those switches. If optimizing resource selection by node
> weight is more important than optimizing network topology, just comment out
> the "TopologyPlugin=topology/tree" line in slurm.conf.

The above behaviour is documented here:
https://github.com/SchedMD/slurm/commit/f1a3e958e19c1cf34663a074d6a156b28ad55cfa
Comment 6 rl303f 2015-09-28 05:01:50 MDT
Below is our topology.conf:

SwitchName=sw0  Switches=sw[1-12] 
SwitchName=sw1  Nodes=cn[0411-0426,0523-0530] 
SwitchName=sw2  Nodes=cn[0427-0450]
SwitchName=sw3  Nodes=cn[0451-0474] 
SwitchName=sw4  Nodes=cn[0475-0498]
SwitchName=sw5  Nodes=cn[0499-0522]
SwitchName=sw6  Nodes=cn[0531-0554]
SwitchName=sw7  Nodes=cn[0555-0578]
SwitchName=sw8  Nodes=cn[0579-0602] 
SwitchName=sw9  Nodes=cn[0603-0626]
SwitchName=sw10 Nodes=cn[0001-0410,0627-1884]
SwitchName=sw11 Nodes=cn[1885-2134]
SwitchName=sw12 Nodes=cn[2135-2150]

================================================================

Yes, the nodes shown in partition interactive differ between the log and
the configuration file.  Good catch, and sorry for the confusion.  That is
why I tried to provide the complete details of the test.  What we
currently have in slurm.conf is hopefully only a temporary workaround
to keep interactive sessions from incorrectly defaulting to IB or GPU
nodes (which are our highest-weight nodes).  Please note the commented-out
line marked 'BROKEN', which is our desired config.

Solely for the purpose of providing a simplified example along with
debugging output, we used 'scontrol update' to set partition interactive
to only two 'NodeName' ranges of differing 'Weight':

PartitionName=interactive AllowGroups=ALL AllowAccounts=ALL AllowQos=staff,interactive AllocNodes=ALL Default=NO DefaultTime=08:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=cn[0001-0310,0603-0622] Priority=5000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF State=UP TotalCPUs=10560 TotalNodes=330 SelectTypeParameters=N/A DefMemPerCPU=768 MaxMemPerNode=UNLIMITED

and then performed the test.  Afterwards we used 'scontrol reconfig'
to revert partition interactive back to the slurm.conf settings.  We used
this method because it is faster and easier than modifying slurm.conf and
doing a shutdown/restart, and we did not want to risk users landing on the
wrong nodes again.  From what we have seen it does not matter whether
partition interactive is defined via slurm.conf or 'scontrol update':
the results are the same and jobs wrongly end up on IB nodes.
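
In concrete terms, the test sequence was along these lines (a reconstruction
of the commands, not a verbatim transcript):

# temporarily narrow the partition to the two node ranges of interest
scontrol update PartitionName=interactive Nodes=cn[0001-0310,0603-0622]
# submit the test job
salloc --partition=interactive --job-name=interactive srun --pty --preserve-env --x11=first /bin/bash
# afterwards, revert the partition to the slurm.conf definition
scontrol reconfigure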

Another way of explaining this: if we comment out our current partition
interactive definition, uncomment the 'BROKEN' line, and bounce Slurm, the
same thing happens: IDLE/MIXED low-weight nodes with available resources
are skipped over and jobs end up on high-weight nodes.

================================================================

Yes, our jobs are running consistently, just not on the desired nodes.
Comment 7 Moe Jette 2015-09-28 05:39:26 MDT
(In reply to rl303f from comment #6)
> Below is our topology.conf:

Your topology is consistent with my analysis. Slurm is doing a best fit on the leaf switches and then picking the lowest-weight nodes within that leaf switch. In your case cn0615 sits on leaf switch sw9 together with the other Weight=300 GPU nodes, while the Weight=40 nodes cn[0001-0310] are on the much larger sw10. If you want to optimize resource allocation based upon node weights rather than network topology, you'll need to comment out the "TopologyPlugin=topology/tree" line in slurm.conf.
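
Once the change is in place, one way to confirm which topology plugin the controller is actually using is something like:

scontrol show config | grep -i topology
# expect 'TopologyPlugin = topology/none' (the default) once topology/tree is
# no longer configured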
Comment 8 Moe Jette 2015-09-29 04:38:00 MDT
Closing ticket. Please re-open if the problem persists after removing the topology optimization configuration option.