Created attachment 2258 [details]
slurm.conf

We seem to be having a problem where:

- we have an 'interactive' partition including multiple ranges of nodes

- the various ranges of nodes have different weights (some lower, some higher)

- we expect jobs submitted to this interactive partition to be allocated to the range of nodes (with available resources) having the lowest weight

- we find that the lowest weight nodes with available resources are ignored and skipped over, and that jobs are instead inappropriately allocated to nodes having much higher weights

We are at a loss to explain this behaviour. We set SlurmctldDebug=debug5 and submitted a job to try to get some explanation for this behaviour, but the log contents did not shed any light on the matter.

We seem to have either a misconfiguration issue or to have stumbled upon a bug.

Thanks for any help you can provide!
Could you attach some logs and perhaps squeue output showing this? For jobs started on undesirable nodes, also include the output of 'scontrol show job' so we can check for job constraints (they might explicitly be requesting large memory nodes or something of that sort).
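For example, something along these lines would be enough to start with (just a sketch of the kind of output that helps; adjust the format fields as you see fit):

squeue -p interactive -o "%i %j %u %T %N"
scontrol show job <jobid>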
Many thanks, Moe.

As a simplified example, we define two NodeName entries (Weight=40 and Weight=300) and put them both in the same partition:

NodeName=cn[0001-0310] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=129022 TmpDisk=6451 State=UNKNOWN Weight=40 Feature=cpu32,core16,g128,ssd800,x2650,10g Gres=lscratch:800
NodeName=cn[0603-0622] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=129007 TmpDisk=6450 State=UNKNOWN Weight=300 Feature=cpu32,core16,g128,ssd800,x2650,gpuk20x Gres=gpu:k20x:2,lscratch:800

PartitionName=interactive AllowGroups=ALL AllowAccounts=ALL AllowQos=staff,interactive AllocNodes=ALL Default=NO DefaultTime=08:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=cn[0001-0310,0603-0622] Priority=5000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF State=UP TotalCPUs=10560 TotalNodes=330 SelectTypeParameters=N/A DefMemPerCPU=768 MaxMemPerNode=UNLIMITED

Next, we see that the very first node in the cluster has 30 free CPUs (CPUTot=32 less CPUAlloc=2) and 125950 MB of free memory (RealMemory=129022 less AllocMem=3072):

# scontrol show node cn0001
NodeName=cn0001 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=2 CPUErr=0 CPUTot=32 CPULoad=1.00
   Features=cpu32,core16,g128,ssd800,x2650,10g
   Gres=lscratch:800
   GresDrain=N/A GresUsed=gpu:0,lscratch:0
   NodeAddr=cn0001 NodeHostName=cn0001 Version=14.11
   OS=Linux RealMemory=129022 AllocMem=3072 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=6451 Weight=40
   BootTime=2015-08-19T09:18:11 SlurmdStartTime=2015-09-25T15:13:57
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Then we submit as follows:

salloc --partition=interactive --job-name=interactive srun --pty --preserve-env --x11=first /bin/bash

Why does the job end up on a node with Weight=300 when a node (cn0001) with the lower Weight=40, 30 free CPUs, and roughly 123 GB of free memory is available?
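For reference, the per-node weights and free CPUs across the partition can also be checked at a glance with something like the following (a sketch using standard sinfo format fields, not output we captured during the test):

# sinfo -p interactive -N -o "%N %t %w %C"

The 'scontrol show job' output for the misplaced job follows: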
JobId=2933567 JobName=interactive
   UserId=testuser(12345) GroupId=staff(49)
   Priority=604246 Nice=0 Account=sb QOS=interactive
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:33 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2015-09-27T21:11:40 EligibleTime=2015-09-27T21:11:40
   StartTime=2015-09-27T21:11:40 EndTime=2015-09-28T05:11:40
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=interactive AllocNode:Sid=biowulf2:23977
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn0615
   BatchHost=cn0615
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=cn0615 CPU_IDs=8-9 Mem=1536
   MinCPUsNode=1 MinMemoryCPU=768M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/spin1/users/testuser/slurm/sinteractive/debug

Below are the 'debug5' log messages from the time of submission to the point where the job is allocated to the wrong node:

[2015-09-27T21:11:40.444] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=12345
[2015-09-27T21:11:40.474] job_submit.lua: NIH_job_submit: job from testuser
[2015-09-27T21:11:40.474] job_submit.lua: NIH_job_submit: job from testuser, partition interactive, setting default qos: interactive
[2015-09-27T21:11:40.474] Job submit request: account:(null) begin_time:0 dependency:(null) name:interactive partition:interactive qos:interactive submit_uid:12345 time_limit:4294967294 user_id:12345
[2015-09-27T21:11:40.474] debug3: JobDesc: user_id=12345 job_id=N/A partition=interactive name=interactive
[2015-09-27T21:11:40.474] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2015-09-27T21:11:40.474] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2015-09-27T21:11:40.474] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2015-09-27T21:11:40.474] debug3: immediate=0 features=(null) reservation=(null)
[2015-09-27T21:11:40.474] debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
[2015-09-27T21:11:40.474] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2015-09-27T21:11:40.474] debug3: kill_on_node_fail=-1 script=(null)
[2015-09-27T21:11:40.474] debug3: stdin=(null) stdout=(null) stderr=(null)
[2015-09-27T21:11:40.474] debug3: work_dir=/spin1/users/testuser/slurm/sinteractive/debug alloc_node:sid=biowulf2:23977
[2015-09-27T21:11:40.474] debug3: resp_host=10.1.201.9 alloc_resp_port=38630 other_port=37619
[2015-09-27T21:11:40.474] debug3: dependency=(null) account=(null) qos=interactive comment=(null)
[2015-09-27T21:11:40.474] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2015-09-27T21:11:40.474] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2015-09-27T21:11:40.474] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2015-09-27T21:11:40.474] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2015-09-27T21:11:40.474] debug3: mem_bind=65534:(null) plane_size:65534
[2015-09-27T21:11:40.474] debug3: array_inx=(null)
[2015-09-27T21:11:40.474] debug3: found correct user
[2015-09-27T21:11:40.474] debug: we are looking for a user association
[2015-09-27T21:11:40.474] debug: we are looking for a user association
[2015-09-27T21:11:40.474] debug: not the right user 12345 != 36469
[2015-09-27T21:11:40.474] debug3: found correct association
[2015-09-27T21:11:40.474] debug3: found correct qos
[2015-09-27T21:11:40.474] debug3: acct_policy_validate: MPC: job_memory set to 768
[2015-09-27T21:11:40.475] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2015-09-27T21:11:40.475] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2015-09-27T21:11:40.475] debug2: initial priority for job 2933567 is 604246
[2015-09-27T21:11:40.476] debug2: found 310 usable nodes from config containing cn[0001-0310]
[2015-09-27T21:11:40.476] debug2: found 20 usable nodes from config containing cn[0603-0622]
[2015-09-27T21:11:40.476] debug3: _pick_best_nodes: job 2933567 idle_nodes 815 share_nodes 1676
[2015-09-27T21:11:40.476] debug2: select_p_job_test for job 2933567
[2015-09-27T21:11:40.479] debug3: acct_policy_job_runnable_post_select: job 2933567: MPC: job_memory set to 1536
[2015-09-27T21:11:40.479] debug3: cons_res: _add_job_to_res: job 2933567 act 0
[2015-09-27T21:11:40.479] debug3: cons_res: adding job 2933567 to part interactive row 0
[2015-09-27T21:11:40.479] debug2: _adjust_limit_usage: job 2933567: MPC: job_memory set to 1536
[2015-09-27T21:11:40.479] debug4: acct_policy_job_begin: after adding job 2933567, assoc sb grp_used_cpu_run_secs is 57600
[2015-09-27T21:11:40.479] debug4: acct_policy_job_begin: after adding job 2933567, assoc sb grp_used_cpu_run_secs is 57600
[2015-09-27T21:11:40.479] debug4: acct_policy_job_begin: after adding job 2933567, assoc root grp_used_cpu_run_secs is 68560632669758
[2015-09-27T21:11:40.479] debug2: Spawning RPC agent for msg_type REQUEST_LAUNCH_PROLOG
[2015-09-27T21:11:40.479] debug2: _adjust_limit_usage: job 2933567: MPC: job_memory set to 1536
[2015-09-27T21:11:40.479] debug2: sched: JobId=2933567 allocated resources: NodeList=cn0615
[2015-09-27T21:11:40.479] sched: _slurm_rpc_allocate_resources JobId=2933567 NodeList=cn0615 usec=35225

We're not quite sure why we end up on the Weight=300 node when the preferred Weight=40 node is available. Does the information provided here help explain it? Is there any other data we can provide?
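Regarding the earlier question about constraints: the job record above shows Features=(null), Gres=(null), and ReqNodeList=(null), so nothing in the request should be steering it toward the GPU nodes. A quick way to double-check this on any test job is something like the following (a sketch, using the job id from the run above):

# scontrol show job 2933567 | grep -E 'Features|Gres|ReqNodeList|MinMemory'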
Could you attach your topology.conf file?

==============================

The nodes shown in partition interactive differ between the log and the configuration file. How did that happen?

Log:
PartitionName=interactive AllowGroups=ALL AllowAccounts=ALL MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=cn[0001-0310,0603-0622]

slurm.conf:
PartitionName=interactive Nodes=cn[0001-0310,1125-1884] State=UP Default=NO Priority=5000 DefaultTime=08:00:00 MaxTime=36:00:00 DefMemPerCPU=768 AllowQOS=staff,interactive

==============================

Your jobs are running consistently, just not on the desired nodes, correct?
I see what is happening. Slurm first identifies the network switches which provide the best fit for a job and then selects the nodes with the lowest "weight" within those switches. If optimizing resource selection by node weight is more important to you than optimizing network topology, just comment out the "TopologyPlugin=topology/tree" line in slurm.conf.
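Concretely, that means something like the following change in slurm.conf (a sketch; everything else stays as-is), typically followed by restarting the daemons, since plugin changes are generally not picked up by 'scontrol reconfig' alone:

#TopologyPlugin=topology/tree

With topology optimization disabled, node selection is ordered by node weight rather than by best-fit on the leaf switches.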
(In reply to Moe Jette from comment #4)
> I see what is happening. Slurm first identifies the network switches which
> provide the best fit for a job and then selects the nodes with the lowest
> "weight" within those switches. If optimizing resource selection by node
> weight is more important to you than optimizing network topology, just
> comment out the "TopologyPlugin=topology/tree" line in slurm.conf.

Documented the above behaviour here:
https://github.com/SchedMD/slurm/commit/f1a3e958e19c1cf34663a074d6a156b28ad55cfa
Below is our topology.conf:

SwitchName=sw0 Switches=sw[1-12]
SwitchName=sw1 Nodes=cn[0411-0426,0523-0530]
SwitchName=sw2 Nodes=cn[0427-0450]
SwitchName=sw3 Nodes=cn[0451-0474]
SwitchName=sw4 Nodes=cn[0475-0498]
SwitchName=sw5 Nodes=cn[0499-0522]
SwitchName=sw6 Nodes=cn[0531-0554]
SwitchName=sw7 Nodes=cn[0555-0578]
SwitchName=sw8 Nodes=cn[0579-0602]
SwitchName=sw9 Nodes=cn[0603-0626]
SwitchName=sw10 Nodes=cn[0001-0410,0627-1884]
SwitchName=sw11 Nodes=cn[1885-2134]
SwitchName=sw12 Nodes=cn[2135-2150]

================================================================

Yes, the nodes shown in partition interactive differ between the log and the configuration file. Good catch, and sorry for the confusion; that is why I tried to provide the complete details of the test.

What we currently have in slurm.conf is hopefully only a temporary workaround to keep interactive sessions from wrongly defaulting to IB or GPU nodes (which are our highest weight nodes). Please note the commented line 'BROKEN', which is our desired config.

Solely for the purpose of providing a simplified example along with debugging output, we used 'scontrol update' to set partition interactive to only two NodeName ranges of differing Weight (see the sketch at the end of this comment):

PartitionName=interactive AllowGroups=ALL AllowAccounts=ALL AllowQos=staff,interactive AllocNodes=ALL Default=NO DefaultTime=08:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=cn[0001-0310,0603-0622] Priority=5000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF State=UP TotalCPUs=10560 TotalNodes=330 SelectTypeParameters=N/A DefMemPerCPU=768 MaxMemPerNode=UNLIMITED

and then performed the test. Afterwards we used 'scontrol reconfig' to revert partition interactive back to the slurm.conf settings. We used this method because it is faster/easier than modifying slurm.conf and doing a shutdown/restart, and we did not want to chance users getting on the wrong nodes again. From what we have seen, it does not matter whether partition interactive is defined via slurm.conf or via 'scontrol update': the results are the same and jobs wrongly end up on IB nodes.

Another way of explaining this: if we comment out our current partition interactive definition, uncomment the 'BROKEN' line, and bounce Slurm, the same thing happens: IDLE/MIXED low weight nodes with available resources are skipped over and jobs end up on high weight nodes.

================================================================

Yes, our jobs are running consistently, just not on the desired nodes.
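For completeness, the temporary-change workflow described above was roughly the following (a sketch; node ranges as in the test):

# restrict the partition to just the Weight=40 and Weight=300 node ranges
scontrol update PartitionName=interactive Nodes=cn[0001-0310,0603-0622]
# ... run the salloc test ...
# revert the partition to the definition in slurm.conf
scontrol reconfig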
(In reply to rl303f from comment #6)
> Below is our topology.conf:

Your topology is consistent with my analysis. Slurm is doing a best-fit on the leaf switches and then picking the lowest weight nodes within that leaf switch. If you want to optimize resource allocation based upon node weights rather than network topology, you'll need to comment out the "TopologyPlugin=topology/tree" line in slurm.conf.
Closing ticket. Please re-open if the problem persists after removing the topology optimization configuration option.