We are currently running Slurm 22.05.11 due to a limitation from HPE and our supported version of CSM/COS.

As part of our daily system tests we run a number of jobs across the system. For this we would ideally like to be able to use a hidden partition that has a higher priority than the others, so the tests run in a reasonably timely manner. The issue I'm seeing is that because this hidden partition contains all of the nodes in our system, some jobs submitted with constraints are seeing a "Requested node configuration is not available" message, even though there are nodes in the partition whose defined features do meet the requirements set in the job. It seems to me that the constraints are not being evaluated when determining suitability for the sbatch request; only the partition as a whole is looked at, and if any of the nodes in the partition cannot meet the requirements, the job is rejected. I am also seeing some oddities with this, as described below.

For example, we have copy nodes, worker nodes and GPU nodes:

---
sinfo --partition copy -o "%m %f"
MEMORY AVAIL_FEATURES
120000 AMD_EPYC_7502P,copy
--
sinfo --partition work -o "%m %f"
MEMORY AVAIL_FEATURES
245000 AMD_EPYC_7763,cpu,work
--
sinfo --partition gpu -o "%m %f"
MEMORY AVAIL_FEATURES
245000 AMD_EPYC_7A53,AMD_INSTINCT_MI200,gpu,work
---

If a job is launched with a constraint of "cpu&work", requesting 128 tasks per node and mem-per-cpu of 1840M, I would expect the scheduler to use the nodes in the work partition. Instead it seems to decide that the copy nodes cannot meet this requirement, and we get the "Requested node configuration is not available" error. If I drop the tasks per node to 64, or change the memory requirement to 920M per CPU, then the job will launch. Oddly, the 'ntasks-per-node' requirement that cannot be met by the copy partition nodes is still accepted so long as the 'mem-per-cpu' requirement can be.
Is there some way of working around this, or some other option available that would allow us to blanket-prioritise those jobs over others?

Regards,
William Davey
William,

I'm not sure I fully understand the issue. Could you please share the configuration and commands you use?

Looking at what you shared: if the job is submitted with "cpu&work" it will require a node that has both the cpu and work features, i.e. if there is no node with both features the job is rejected, since it can never run. From the sinfo output you shared it looks like this job won't be able to run in the copy and gpu partitions. Based on what we have I can't say

>requesting 128 tasks per node and mem-per-cpu of 1840M I would expect the scheduler to use the nodes in the work partition.

You need to specify the partition where you want to run the job, or the default partition will be used.

>If I drop the tasks per node to 64 or change the memory requirement to 920M per cpu then the job will launch. Oddly the 'ntasks-per-node' requirement that cannot be met by the copy partition nodes is still accepted so long as the 'mem-per-cpu' requirement can be.

This may be a nuance related to the allocation of tasks on individual cores while two threads per core (CPU) are in use. Depending on the configuration details, the number of CPUs multiplied by --mem-per-cpu may not correspond to the number of tasks (e.g. each task makes use of 2 CPUs).

cheers,
Marcin
PS. Sorry, pressed send too early. I wrote:

>[...] Based on what we have I can't say

That should be: based on the sinfo output, the job should be able to run on the "work" partition from the perspective of node features.
Thanks for the response Marcin, I'll try to clarify.

We have a partition called 'acceptance'. It contains all nodes in the system. It also has a PriorityJobFactor of 2, so it gets a higher priority by default when a job is launched on it.

---
wdavey@setonix-07:~> scontrol show partition acceptance
PartitionName=acceptance
   AllowGroups=ALL AllowAccounts=pawsey0001,pawsey0001-gpu,pawsey0006,pawsey0006-gpu,pawsey0012,pawsey0012-gpu,pawsey0014,pawsey0014-gpu,benchmark AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=YES
   MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=casda-an01,nid[001000-002056,002058,002060,002062,002064,002066,002068,002070,002072,002074,002076,002078,002080,002082,002084,002086,002088,002090,002092,002094,002096,002098,002100,002102,002104,002106,002108,002110,002112,002114,002116,002118,002120,002122,002124,002126,002128,002130,002132,002134,002136,002138,002140,002142,002144,002146,002148,002150,002152,002154,002156,002158,002160,002162,002164,002166,002168,002170,002172,002174,002176,002178,002180,002182,002184,002186,002188,002190,002192,002194,002196,002198,002200,002202,002204,002206,002208,002210,002212,002214,002216,002218,002220,002222,002224,002226,002228,002230,002232,002234,002236,002238,002240,002242,002244,002246,002280-002824,002826,002828,002830,002832,002834,002836,002838,002840,002842,002844,002846,002848,002850,002852,002854,002856,002858,002860,002862,002864,002866,002868,002870,002872,002874,002876,002878,002880,002882,002884,002886,002888,002890,002892,002894,002896,002898,002900,002902,002904,002906,002908,002910,002912,002914,002916,002918,002920,002922,002924,002926,002928,002930,002932,002934,002936,002938,002940,002942,002944,002946,002948,002950,002952,002954,002956,002958,002960,002962,002964,002966,002968,002970,002972,002974,002976,002978,002980,002982,002984,002986,002988,002990,002992,002994,002996,002998,003000,003002,003004,003006,003008,003010,003012,003014],setonix-dm[01-04,07]
   PriorityJobFactor=2 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=434560 TotalNodes=1798 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=920 MaxMemPerCPU=1840
   TRES=cpu=434560,mem=455270000M,node=1798,billing=434560,gres/gpu=1536
   TRESBillingWeights=CPU=1
---

For our daily system check I would like to use this partition, but use constraints to specify which nodes in the partition should be selected for a particular job/check. Each partition in our system has unique constraints, so I should be able to use the acceptance partition but restrict the job to nodes in one of the other partitions using constraints. Here is an example job:

---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=acceptance
#SBATCH --mem=230G

module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
sbatch: error: Batch job submission failed: Requested partition configuration not available now
---

This job should happily run on the nodes that make up the work partition (which is a subset of the nodes in the acceptance partition), as they all have enough CPUs and memory:

---
sinfo --partition work -o "%m %c %f"
MEMORY CPUS AVAIL_FEATURES
245000 256 AMD_EPYC_7763,cpu,work
--
scontrol show partition work
PartitionName=work
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=nid[001000-001503,001512-002055,002280-002595,002792-002823]
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=357376 TotalNodes=1396 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=920 MaxMemPerCPU=1840
   TRES=cpu=357376,mem=342020000M,node=1396,billing=357376
   TRESBillingWeights=CPU=1
---

If I change the job script to use the work partition, which again is a subset of the acceptance partition, the job will be scheduled:

---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=work
#SBATCH --mem=230G

module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
Submitted batch job 11400744
---

Likewise, if I leave the partition as acceptance but bring the memory down to something that the lowest common node has available to it (112G for example), the job will schedule fine:

---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=acceptance
#SBATCH --mem=115G

module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
Submitted batch job 11401070
---

My questions are:
- Why are the constraints not taken into consideration at schedule time?
- If this is intended, is there some other way you can suggest where we're able to launch a job and have it automatically get a higher priority in the queue?
Could you please attach your slurm.conf? I'd like to reproduce this to better understand the underlying mechanics. The way you describe it, this looks like incorrect behavior.

Could you please additionally check whether adding -N1, or directly specifying a node where the job could run via the -w option, changes the behavior?

cheers,
Marcin
Hello Marcin,

I tried the same script with both --nodes=1 and --nodelist=nid001495, but I get the same result:

---
> cat rfm_tConvolveMPI_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_tConvolveMPI_job"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --output=rfm_tConvolveMPI_job.out
#SBATCH --error=rfm_tConvolveMPI_job.err
#SBATCH --time=0:5:0
#SBATCH --exclusive
#SBATCH --constraint=cpu&work
#SBATCH --partition=acceptance
#SBATCH --mem=230G
#SBATCH --nodelist=nid001495

module load PrgEnv-gnu
module load cray-mpich/8.1.25
./copy.sh
srun ./tConvolveMPI.amd

> sbatch rfm_tConvolveMPI_job.sh
sbatch: error: Batch job submission failed: Requested partition configuration not available now
---

I've attached the current slurm.conf from the system.
Created attachment 36165: Slurm configuration 2024-05-01
William,

I see where the behavior comes from. It's _valid_pn_min_mem[1], which implements the MaxMemPerCPU specification for a partition. In the case of a job with a per-node memory specification and a per-CPU limit set on the partition, we attempt to estimate the per-node maximum as the number of CPUs on the first node of the partition times MaxMemPerCPU. In your case, the first node in the acceptance partition is casda-an01, which has only 64 CPUs. This looks like a wrong assumption; I'll have to study the code and its history a little more to see if and how we can improve the behavior.

A simple workaround for you would be to remove MaxMemPerCPU from the acceptance partition. At the same time, I'd like to take a step back and ask why you use the MaxMemPerCPU[2] setting on the partition.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-23-11-6-1/src/slurmctld/job_mgr.c#L10195-L10224
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerCPU
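A back-of-the-envelope sketch of that estimate, using the values reported earlier in this thread (this is illustrative arithmetic, not Slurm source code):

```shell
#!/bin/sh
# Illustrative sketch of the per-node cap estimate described above:
# Slurm derives the partition's per-node maximum from the FIRST node's
# CPU count times MaxMemPerCPU, even though the job is constrained to
# the 128-core work nodes.
max_mem_per_cpu=1840                          # acceptance MaxMemPerCPU, MB
first_node_cpus=64                            # casda-an01 has only 64 CPUs
cap=$((first_node_cpus * max_mem_per_cpu))    # 117760 MB, i.e. 115G

job_mem=$((230 * 1024))                       # --mem=230G expressed in MB

if [ "$job_mem" -gt "$cap" ]; then
    echo "rejected: ${job_mem} MB > estimated cap ${cap} MB"
else
    echo "accepted"
fi
```

This also lines up with the earlier observation that the same job submits cleanly with --mem=115G: 115 x 1024 = 117760 MB, exactly the estimated cap, so the check passes.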
I'm going to use this notation:

LC = logical core (i.e. a real core, SMT core or HT core)
C = real physical core
T = thread/virtual core belonging to a physical core

DefMemPerCPU / MaxMemPerCPU apply to logical cores.

* You cannot allocate a "virtual" core on its own; in Slurm, a virtual core is always allocated along with its real physical core.
* I.e. if threads-per-core is one, you are allocated 2 x LC, but "T" is unused, and it cannot be made available separately to a different job.
* I.e. if threads-per-core is two, you are allocated 2 x LC, and both C and T of the same unit are used.

Because it's a Cray EX, HT/SMT cannot be turned off (I'm not going to go into this).

For typical jobs (from what I remember):

* When threads-per-core=1 is set and you use the whole node, the memory available is halved, because DefMemPerCPU applies to "LC" cores but you are only using the real physical cores.
* Setting MaxMemPerCPU allows this to flex, so the entire memory of the node can be used when threads-per-core=1 is set.

I'd imagine we wouldn't have this issue if HT/SMT could just be turned off, since a logical core would then just be a real core, as threads-per-core=1 is the default hardware-wise.
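To put concrete numbers on the halving: the sketch below assumes a work node with 128 physical cores and 2 SMT threads each (256 logical cores, matching the 256 CPUs shown in the sinfo output), with the DefMemPerCPU/MaxMemPerCPU values from the partition configuration. Illustrative arithmetic only, not Slurm behavior verbatim:

```shell
#!/bin/sh
# Assumed node shape: 128 physical cores x 2 SMT threads = 256 logical cores.
physical=128
logical=$((physical * 2))
def_mem_per_cpu=920        # DefMemPerCPU, MB per LOGICAL core
max_mem_per_cpu=1840       # MaxMemPerCPU, MB per logical core

# Whole node with both threads in use: every logical core counts.
full=$((logical * def_mem_per_cpu))          # 235520 MB

# threads-per-core=1: only the 128 physical cores count as "used" CPUs,
# so the default memory grant is halved.
halved=$((physical * def_mem_per_cpu))       # 117760 MB

# MaxMemPerCPU=1840 lets those 128 CPUs flex back up to full node memory.
flexed=$((physical * max_mem_per_cpu))       # 235520 MB

echo "$full $halved $flexed"
```

This is presumably why MaxMemPerCPU is set to exactly twice DefMemPerCPU: it restores access to the whole node's memory for threads-per-core=1 jobs.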
On the use of DefMemPerCPU in our configuration, I can also point you towards our Lua CLI plugin[1] (specifically lines 237-249), which uses this value to determine the appropriate proportional memory allocation when a request is for less than a full node.

[1] https://raw.githubusercontent.com/PawseySC/pawsey-slurm-plugin-configuration/main/lua-plugins/cli_filter.lua.in
Sam, that section of the cli_filter plugin is ignored when the acceptance partition is used; check the if condition. The logic wouldn't be valid for a heterogeneous-node partition like acceptance. I think what Marcin wants to know is what we hope to gain by setting the MaxMemPerCPU value on the partition. For the acceptance partition it adds nothing, as only Pawsey staff will have access to it. On other partitions I believe we set it because our fairshare policy uses CPU as the accounting factor: without a max memory per CPU, a user could request a single CPU and use all the memory on the node while only being charged at a single-CPU rate, as our nodes are not exclusive.

Marcin, I've tested the job with --mem=230G and MaxMemPerCPU=0 (UNLIMITED), and the job runs as you suggested, so as a workaround this is fine. I'll need to perform some further testing, but I think it should suffice. Thank you.

Regards,
William
I'm sending a patch for the reported bug to our Q/A team. I'll keep you posted on the progress.

cheers,
Marcin