Ticket 19722 - Constraints ignored when determining node suitability
Summary: Constraints ignored when determining node suitability
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: - Unsupported Older Versions
Hardware: Cray Shasta Linux
Importance: --- 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact: Alejandro Sanchez
URL:
Depends on:
Blocks:
 
Reported: 2024-04-29 00:35 MDT by William Davey
Modified: 2024-05-08 10:02 MDT
CC List: 3 users

See Also:
Site: Pawsey
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: SUSE
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm configuration 2024-05-01 (38.95 KB, text/plain)
2024-05-01 02:06 MDT, William Davey

Description William Davey 2024-04-29 00:35:24 MDT
We are currently running Slurm 22.05.11 due to a limitation from HPE and our supported version of CSM/COS.

As part of our daily system tests we run a number of jobs across the system.
For this we would ideally like to use a hidden partition with a higher priority than the others, so that the tests run in a reasonably expedient manner.

The issue I'm seeing is that, because this hidden partition contains all of the nodes in our system, some jobs submitted with constraints get a "Requested node configuration is not available" message even though there are nodes in the partition with the constrained features that do meet the requirements set in the job.

To me it seems like the constraints are not being evaluated when determining suitability for the sbatch request; only the partition as a whole is looked at. If any of the nodes in the partition cannot meet the requirements, then the job is rejected.

I am also seeing some oddities with this, as described below.
For example, we have copy nodes, worker nodes, and gpu nodes:
---
sinfo --partition copy -o "%m %f"
MEMORY AVAIL_FEATURES
120000 AMD_EPYC_7502P,copy
--
sinfo --partition work -o "%m %f"
MEMORY AVAIL_FEATURES
245000 AMD_EPYC_7763,cpu,work
--
sinfo --partition gpu -o "%m %f"
MEMORY AVAIL_FEATURES
245000 AMD_EPYC_7A53,AMD_INSTINCT_MI200,gpu,work
---

If a job is launched with a constraint of "cpu&work", requesting 128 tasks per node and a mem-per-cpu of 1840M, I would expect the scheduler to use the nodes in the work partition. Instead it seems to decide that the copy nodes cannot meet this requirement, and we get the "Requested node configuration is not available" error. If I drop the tasks per node to 64, or change the memory requirement to 920M per CPU, then the job will launch. Oddly, the 'ntasks-per-node' requirement that cannot be met by the copy partition nodes is still accepted so long as the 'mem-per-cpu' requirement can be.
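
As a one-line illustration of the kind of submission described above (the full batch scripts and the hidden partition's definition are shown further down; --wrap is only used here to keep the example to a single command):
---
sbatch --partition=acceptance --constraint="cpu&work" \
       --ntasks=128 --ntasks-per-node=128 --mem-per-cpu=1840M \
       --wrap "srun hostname"
---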

Is there some way of working around this, or some other option available to allow us to blanket prioritise those jobs over others?

Regards,
William Davey
Comment 3 Marcin Stolarek 2024-04-30 04:26:49 MDT
William,

I'm not sure I fully understand the issue. Could you please share the configuration and commands you use?

Looking at what you shared - if the job is submitted with "cpu&work" it will require a node that has both the cpu and work features, i.e. if there is no node with both features the job is rejected since it can never run.

From the sinfo commands you shared it looks like this job won't be able to run in the copy and gpu partitions. Based on what we have I can't say

>requesting 128 tasks per node and mem-per-cpu of 1840M I would expect the scheduler to use the nodes in the work partition.
You need to specify the partition where you want to run a job, or the default partition will be used.

>If I drop the tasks per node to 64 or change the memory requirement to 920M per cpu then the job will launch. Oddly the 'ntasks-per-node' requirement that cannot be met by the copy partition nodes is still accepted so long as the 'mem-per-cpu' requirement can be.
This may be a nuance related to the allocation of tasks on individual cores while two threads per core (CPUs) are in use. Depending on the configuration details, the number of CPUs used in the multiplication by --mem-per-cpu may not be equal to the number of tasks (e.g. if each task makes use of 2 CPUs).
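
As an illustration only (assuming, for the sake of the arithmetic, a node with 128 physical cores / 256 hardware threads, 245000M of memory, and one core bound per task):
---
# 128 tasks -> 128 cores -> 256 CPUs charged
# 256 CPUs x 1840M = 471040M  > 245000M per node -> cannot fit
# 256 CPUs x  920M = 235520M  < 245000M per node -> fits
#  64 tasks -> 128 CPUs x 1840M = 235520M < 245000M -> fits
---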

cheers,
Marcin
Comment 4 Marcin Stolarek 2024-04-30 04:28:31 MDT
PS. Sorry, I pressed send too early:

I wrote:
>[...] Based on what we have I can't say 
Should be: "Based on the sinfo output, the job should be able to run on the "work" partition from the perspective of node features."
Comment 5 William Davey 2024-04-30 19:37:05 MDT
Thanks for the response Marcin, I'll try to clarify.

We have a partition called 'Acceptance'. It contains all nodes in the system.
It also has a PriorityJobFactor of 2 so it gets a higher priority by default when a job is launched on it.
---
wdavey@setonix-07:~> scontrol show partition acceptance
PartitionName=acceptance
   AllowGroups=ALL AllowAccounts=pawsey0001,pawsey0001-gpu,pawsey0006,pawsey0006-gpu,pawsey0012,pawsey0012-gpu,pawsey0014,pawsey0014-gpu,benchmark AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=YES
   MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=casda-an01,nid[001000-002056,002058,002060,002062,002064,002066,002068,002070,002072,002074,002076,002078,002080,002082,002084,002086,002088,002090,002092,002094,002096,002098,002100,002102,002104,002106,002108,002110,002112,002114,002116,002118,002120,002122,002124,002126,002128,002130,002132,002134,002136,002138,002140,002142,002144,002146,002148,002150,002152,002154,002156,002158,002160,002162,002164,002166,002168,002170,002172,002174,002176,002178,002180,002182,002184,002186,002188,002190,002192,002194,002196,002198,002200,002202,002204,002206,002208,002210,002212,002214,002216,002218,002220,002222,002224,002226,002228,002230,002232,002234,002236,002238,002240,002242,002244,002246,002280-002824,002826,002828,002830,002832,002834,002836,002838,002840,002842,002844,002846,002848,002850,002852,002854,002856,002858,002860,002862,002864,002866,002868,002870,002872,002874,002876,002878,002880,002882,002884,002886,002888,002890,002892,002894,002896,002898,002900,002902,002904,002906,002908,002910,002912,002914,002916,002918,002920,002922,002924,002926,002928,002930,002932,002934,002936,002938,002940,002942,002944,002946,002948,002950,002952,002954,002956,002958,002960,002962,002964,002966,002968,002970,002972,002974,002976,002978,002980,002982,002984,002986,002988,002990,002992,002994,002996,002998,003000,003002,003004,003006,003008,003010,003012,003014],setonix-dm[01-04,07]
   PriorityJobFactor=2 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=434560 TotalNodes=1798 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=920 MaxMemPerCPU=1840
   TRES=cpu=434560,mem=455270000M,node=1798,billing=434560,gres/gpu=1536
   TRESBillingWeights=CPU=1
---

For our daily system check I would like to use this partition, but use constraints to specify which nodes in the partition should be selected for a particular job/check. Each partition in our system has unique constraints, so I should be able to use the acceptance partition but restrict the job to nodes in one of the other partitions using constraints.

Here is an example job:
---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=acceptance
#SBATCH --mem=230G
module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
sbatch: error: Batch job submission failed: Requested partition configuration not available now
---

This job should happily run on the nodes that make up the work partition (which is a subset of the nodes in the acceptance partition), as they all have enough CPUs and memory:
---
sinfo --partition work -o "%m %c %f"
MEMORY CPUS AVAIL_FEATURES
245000 256 AMD_EPYC_7763,cpu,work
--
scontrol show partition work
PartitionName=work
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=nid[001000-001503,001512-002055,002280-002595,002792-002823]
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=357376 TotalNodes=1396 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=920 MaxMemPerCPU=1840
   TRES=cpu=357376,mem=342020000M,node=1396,billing=357376
   TRESBillingWeights=CPU=1
---

If I change the job script to use the work partition, which again is a subset of the acceptance partition, the job will be scheduled.
---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=work
#SBATCH --mem=230G
module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
Submitted batch job 11400744
---

Likewise if I leave the partition as acceptance, but change the memory down to something that the lowest common node has available to it (112G for example), the job will schedule fine.
---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=acceptance
#SBATCH --mem=115G
module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
Submitted batch job 11401070
---

My questions are:
- Why are the constraints not taken into consideration at schedule time?
- If this is intended, is there some other way you can suggest where we're able to launch a job and have it automatically have a higher priority in the queue?
Comment 6 Marcin Stolarek 2024-05-01 01:51:31 MDT
Could you please attach your slurm.conf? I'd like to reproduce this to better understand the mechanics underneath. The way you describe it, this looks like incorrect behavior.

Could you please additionally check whether adding -N1, or directly specifying a node where the job could run with the -w option, changes the behavior?

cheers,
Marcin
Comment 7 William Davey 2024-05-01 02:05:04 MDT
Hello Marcin,

I tried the same script with both --nodes=1 and --nodelist=nid001495 but I get the same result:
---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=acceptance
#SBATCH --mem=230G
#SBATCH --nodelist=nid001495
module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
sbatch: error: Batch job submission failed: Requested partition configuration not available now

I've attached the current slurm.conf from the system.
Comment 8 William Davey 2024-05-01 02:06:46 MDT
Created attachment 36165 [details]
Slurm configuration 2024-05-01
Comment 9 Marcin Stolarek 2024-05-01 06:46:53 MDT
William,

I see where the behavior comes from. It's _valid_pn_min_mem[1], which implements the MaxMemPerCPU specification for a partition. In the case of a job with a per-node memory specification and a per-CPU limit set on the partition, we attempt to estimate the per-node maximum as the number of CPUs on the first node of the partition times MaxMemPerCPU. In your case, the first node in the acceptance partition is casda-an01, which has only 64 CPUs.
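
Putting the numbers from this ticket into that estimate (illustrative arithmetic only):
---
# first node in acceptance: casda-an01, 64 CPUs
# partition MaxMemPerCPU:   1840M
# estimated per-node max:   64 x 1840M = 117760M (exactly 115G)
# --mem=230G -> above the estimate -> submission rejected
# --mem=115G -> within the estimate -> submission accepted
---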

While this looks like a wrong assumption, I'll have to study the code and its history a little more to see if and how we can improve the behavior. A simple workaround for you would be to remove MaxMemPerCPU from the acceptance partition. At the same time, I'd like to take a step back and ask why you use the MaxMemPerCPU[2] setting on the partition?
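
As a sketch of that workaround only (an illustrative partition line, not your full definition; the node list and the other options stay as they are):
---
# before
PartitionName=acceptance Hidden=YES PriorityTier=2 PriorityJobFactor=2 DefMemPerCPU=920 MaxMemPerCPU=1840 Nodes=...
# after: drop MaxMemPerCPU (or set it to 0, i.e. UNLIMITED)
PartitionName=acceptance Hidden=YES PriorityTier=2 PriorityJobFactor=2 DefMemPerCPU=920 Nodes=...
---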

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/blob/slurm-23-11-6-1/src/slurmctld/job_mgr.c#L10195-L10224
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerCPU
Comment 10 Ashley Chew 2024-05-01 21:50:40 MDT
I'm going to use this notation:

LC = Logical core (i.e. a real core, an SMT core, or an HT core)
C = Real physical core
T = Thread core / virtual core belonging to the physical core

DefMemPerCPU / MaxMemPerCPU apply to logical cores.
* You cannot allocate a "virtual" core on its own; in Slurm the virtual core is always allocated along with its real physical core.
* I.e. if threads-per-core is one, you are allocated 2 x LC, but "T" is unused and cannot be made available separately to a separate job.
* I.e. if threads-per-core is two, you still get 2 x LC, but both C and T are used from the same unit.

Because it's a Cray EX, HT/SMT cannot be turned off (I'm not going to go into this).

For typical jobs (from what I remember):
* If threads-per-core=1 is set and you use the whole node, the memory available is halved, because DefMemPerCPU applies to "LC" cores while you are only using the real physical cores.
* Setting MaxMemPerCPU allows the job to flex up and use the entire memory of the node when threads-per-core=1 is set.

I'd imagine you wouldn't have this issue if HT/SMT could just be turned off, so that a logical core would simply be a real core and threads-per-core=1 would be the hardware default.
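
To put rough numbers on that (assumed values only: a 128-core / 256-thread node with 245000M of memory, DefMemPerCPU=920 and MaxMemPerCPU=1840):
---
# whole node, threads-per-core=2: 256 LC x 920M  = 235520M available
# whole node, threads-per-core=1: 128 LC x 920M  = 117760M (roughly half the node)
# with MaxMemPerCPU=1840M the per-LC charge can grow up to 1840M:
#                                 128 LC x 1840M = 235520M (most of the node again)
---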
Comment 11 Sam Yates 2024-05-01 23:59:14 MDT
On the use of DefMemPerCPU in our configuration, I can also point you towards our Lua cli plugin[1] (specifically lines 237-249), which uses this value to determine the appropriate proportional memory for an allocation request when the request is for less than a full node.

[1]https://raw.githubusercontent.com/PawseySC/pawsey-slurm-plugin-configuration/main/lua-plugins/cli_filter.lua.in
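
As a rough sketch of the kind of proportional calculation involved (this is not the plugin's actual Lua code; the request size is illustrative):
---
# illustrative proportional-memory calculation for a sub-node request
def_mem_per_cpu_mb=920        # DefMemPerCPU on the work partition
requested_cpus=32             # hypothetical sub-node request
echo "$(( requested_cpus * def_mem_per_cpu_mb ))M"   # -> 29440M
---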
Comment 12 William Davey 2024-05-02 00:14:09 MDT
Sam, that section of the cli_filter plugin is ignored when the acceptance partition is used; check the if condition. The logic wouldn't be valid for a heterogeneous-node partition like the acceptance one.

I think what Marcin wants to know is what we hope to gain by setting the MaxMemPerCPU value on the partition. For the acceptance partition it adds nothing, as only Pawsey staff will have access to it. On other partitions I believe we set it because our fairshare policy uses CPU as the accounting factor; without a max memory per CPU, a user would be able to request a single CPU, use all the memory on the node, and only be charged at a single-CPU rate, as our nodes are not exclusive.
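
To illustrate the charging effect we're after (assumed numbers; this relies on MaxMemPerCPU raising a job's CPU count when the per-CPU memory would otherwise exceed the cap):
---
# without MaxMemPerCPU:    a request of 1 CPU and --mem=230G is charged as 1 CPU
#                          while holding most of a node's memory
# with MaxMemPerCPU=1840M: 230G = 235520M; 235520M / 1840M = 128
#                          -> the job's CPU count is raised to 128 to satisfy the cap
---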

Marcin, I've tested the job with --mem=230 and MaxMemPerCPU=0 (UNLIMITED) and the job runs as you suggested, so as a workaround this is fine. I'll need to perform some further testing but I think it should suffice. Thank you.

Regards,
William
Comment 14 Marcin Stolarek 2024-05-07 03:16:30 MDT
I'm sending a patch for the reported bug to our Q/A team. I'll keep you posted on the progress.

cheers,
Marcin