Ticket 5240 - Important memory request regression in 17.11.7
Summary: Important memory request regression in 17.11.7
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
Duplicates: 5269 5283 5382 5592 6183
 
Reported: 2018-05-31 14:21 MDT by Kilian Cavalotti
Modified: 2024-05-01 06:46 MDT

Site: Stanford
Version Fixed: 17.11.8 18.08.0pre2


Attachments
Current Sherlock config (45.34 KB, text/plain)
2018-05-31 14:59 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2018-05-31 14:21:28 MDT
[This is for 17.11.7, even though the ticket is filed against version 17.11.6, as 17.11.7 doesn't seem to be available in the Bugzilla version list yet]

Hi!

We upgraded from 17.11.6 to 17.11.7 yesterday and are now seeing a lot of our users' jobs that request memory being rejected with the following reason:

"Unable to allocate resources: Requested partition configuration not available now"

All of those jobs were scheduled and dispatched correctly with 17.11.6.


Here are some details:

-- 8< --------------------------------------------------------------------
$ scontrol show partition test
PartitionName=test
   AllowGroups=sh_sysadm AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=sh-101-[59-60],sh-06-[04,34],sh-09-06
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=88 TotalNodes=5 SelectTypeParameters=NONE
   DefMemPerCPU=4000 MaxMemPerCPU=6400

$ scontrol show node sh-101-60
NodeName=sh-101-60 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.01
   AvailableFeatures=CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz
   ActiveFeatures=CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz
   Gres=(null)
   NodeAddr=sh-101-60 NodeHostName=sh-101-60 Version=17.11
   OS=Linux 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018
   RealMemory=128000 AllocMem=0 FreeMem=125852 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=102161 Owner=N/A MCS_label=N/A
   Partitions=test
   BootTime=2018-05-24T12:40:04 SlurmdStartTime=2018-05-30T14:19:49
   CfgTRES=cpu=20,mem=125G,billing=20
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ srun -p test --mem 125G -w sh-101-60 -n 20 -N 1 --pty bash
srun: error: Unable to allocate resources: Requested partition configuration not available now
-- 8< --------------------------------------------------------------------


It appears I can submit jobs with --mem=102400, but --mem=102401 fails:

-- 8< --------------------------------------------------------------------
$ srun -p test --mem 102400 -w sh-101-60 -n 20 -N 1 --pty bash
sh-101-60 $ scontrol -dd show job $SLURM_JOBID
JobId=19068504 JobName=bash
   UserId=kilian(215845) GroupId=ruthm(32264) MCS_label=N/A
   Priority=4233 Nice=0 Account=ruthm QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:04 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2018-05-31T13:20:43 EligibleTime=2018-05-31T13:20:43
   StartTime=2018-05-31T13:20:43 EndTime=2018-06-02T13:20:43 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-05-31T13:20:43
   Partition=test AllocNode:Sid=sh-ln03:32434
   ReqNodeList=sh-101-60 ExcNodeList=(null)
   NodeList=sh-101-60
   BatchHost=sh-101-60
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,mem=100G,node=1,billing=20
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=sh-101-60 CPU_IDs=0-19 Mem=102400 GRES_IDX=
   MinCPUsNode=1 MinMemoryNode=100G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/users/kilian
   Power=


$ srun -p test --mem 102401 -w sh-101-60 -n 20 -N 1 --pty bash
srun: error: Unable to allocate resources: Requested partition configuration not available now
-- 8< --------------------------------------------------------------------

It looks like this commit https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69
may have introduced a breaking change or regression. We would be very grateful if you could look into it, as it has a significant impact on our users right now.

Thanks!
-- 
Kilian
Comment 1 Tim Wickberg 2018-05-31 14:41:53 MDT
Can you attach the current config?

If you remove the MaxMemPerCPU limit on the partition temporarily does that serve as an effective workaround?
Comment 2 Kilian Cavalotti 2018-05-31 14:59:22 MDT
Hi Tim, 

Yes, removing MaxMemPerCPU from the partition definition works around the problem:

$ scontrol show partition test
PartitionName=test
   AllowGroups=sh_sysadm AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=sh-101-[59-60],sh-06-[04,34],sh-09-06
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=88 TotalNodes=5 SelectTypeParameters=NONE
   DefMemPerCPU=4000 MaxMemPerNode=UNLIMITED

$ srun -p test --mem 125G -w sh-101-60 -n 20 -N 1 --pty bash
sh-101-60 $

Configuration file is on its way.

Thanks!
-- 
Kilian
Comment 3 Kilian Cavalotti 2018-05-31 14:59:41 MDT
Created attachment 6972 [details]
Current Sherlock config
Comment 4 Tim Wickberg 2018-05-31 15:05:59 MDT
> Yes, removing MaxMemPerCPU from the partition definition works around the
> problem:

Okay. I'll assume that's alright for the time being, and let Alex take over from here.
Comment 5 Kilian Cavalotti 2018-05-31 15:10:16 MDT
Hi Tim, 

(In reply to Tim Wickberg from comment #4)
> > Yes, removing MaxMemPerCPU from the partition definition works around the
> > problem:
> 
> Okay. I'll assume that's alright for the time being, and let Alex take over
> from here.

Well, thanks for suggesting the workaround, but as you initially wrote, it can only be temporary: removing MaxMemPerCPU will stop adjusting the number of CPUs in jobs based on the amount of memory requested, and we'll end up with unallocatable idle cores (the typical example being jobs requesting 1 CPU and all the memory on a node), which is why we use MaxMemPerCPU in the first place.

Actually, for us, moving back to 17.11.6 while this is sorted out may be a better workaround, unfortunately. :(

Cheers,
-- 
Kilian
Comment 6 Alejandro Sanchez 2018-06-01 04:05:47 MDT
Hi Kilian,

that commit stops Slurm (among other things) from doing automatic adjustments under the covers when the --mem[-per-cpu] requested by a job exceeds the cluster/partition MaxMemPer[CPU|Node]. The previous adjustments included increasing pn_min_cpus (sometimes incorrectly, beyond the number of CPUs available on the nodes, as reported in other bugs) or tricks such as increasing cpus_per_task while decreasing mem_per_cpu (which is also incorrect, since the defined limit shouldn't be touched at all).

Another consideration to take into account is that you are defining a partition with heterogeneous nodes, not only with respect to the number of cores but also with respect to the amount of memory, which we usually discourage:

alex@ibiza:~/t$ sinfo -N -o "%N %z %c %m %P %w"
NODELIST S:C:T CPUS MEMORY PARTITION WEIGHT
sh-06-04 2:8:1 16 64000 test* 100081
sh-06-34 2:8:1 16 64000 test* 100081
sh-09-06 2:8:1 16 64000 test* 140081
sh-101-59 2:10:1 20 128000 test* 102161
sh-101-60 2:10:1 20 128000 test* 102161
alex@ibiza:~/t$

alex@ibiza:~/t$ scontrol show part test | grep MemPer
   DefMemPerCPU=4000 MaxMemPerCPU=6400
alex@ibiza:~/t$

alex@ibiza:~/t$ scontrol show conf | grep EnforcePartLimits
EnforcePartLimits       = ANY
alex@ibiza:~/t$

Note that the partition MaxMemPer[CPU|Node] limit applies equally to all the nodes in the partition, so if the partition contains heterogeneous nodes that can be problematic, which is why we discourage having partitions like that.

With these premises, what the new logic does is find the node with the lowest amount of allocatable memory in the partition and base the calculations for all the nodes in that partition on it.

So if I make the same request you did:

alex@ibiza:~/t$ salloc -p test --mem 125G -w sh-101-60 -n 20 -N 1 bash
salloc: error: Job submit/allocate failed: Requested partition configuration not available now
salloc: Job allocation 20001 has been revoked.
alex@ibiza:~/t$

this is logged from the new validation logic:
slurmctld: debug2: _valid_pn_min_mem: job 20001 mem_per_node=128000M > MaxMemPerNode=102400M in partition test

Note that you have defined MaxMemPerCPU=6400. Since MaxMemPer* limits apply equally to ALL the nodes in the partition (they are partition-wide limits), 6400M * 16 cores = 102400M is the per-node equivalent of your MaxMemPerCPU, which lets Slurm compare your --mem=125G request against the partition limit in per-node form. 125 GiB * 1024 MiB/GiB = 128000M, and that request exceeds the partition-wide 102400M.
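
For illustration, the same comparison can be reproduced with a tiny standalone C sketch (not Slurm source; the variable names are made up and the numbers simply mirror the example above):

#include <stdio.h>

int main(void)
{
    /* All values in MiB, taken from the example above. */
    unsigned long long max_mem_per_cpu  = 6400;    /* partition MaxMemPerCPU */
    unsigned long long cpus_first_node  = 16;      /* CPUs on the partition's first node */
    unsigned long long job_mem_per_node = 128000;  /* --mem=125G */

    /* Per-CPU limit converted to its per-node equivalent. */
    unsigned long long max_mem_per_node = max_mem_per_cpu * cpus_first_node;  /* 102400 */

    if (job_mem_per_node > max_mem_per_node)
        printf("mem_per_node=%lluM > MaxMemPerNode=%lluM in partition test\n",
               job_mem_per_node, max_mem_per_node);
    return 0;
}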

Does it make sense?
Comment 7 Alejandro Sanchez 2018-06-01 05:12:40 MDT
You can still allocate a full 125G node in this partition with this config:

alex@ibiza:~/t$ salloc -p test -n20 --mem-per-cpu=6400 bash
salloc: Granted job allocation 20013
alex@ibiza:~/t$ scontrol show job 20013 | grep -i tres
   TRES=cpu=20,mem=125G,node=2,billing=20
alex@ibiza:~/t$ exit
salloc: Relinquishing job allocation 20013

But you can't exceed your MaxMemPerCPU=6400M:

alex@ibiza:~/t$ salloc -p test -n20 --mem-per-cpu=6401 bash
salloc: error: Job submit/allocate failed: Requested partition configuration not available now
salloc: Job allocation 20014 has been revoked.
alex@ibiza:~/t$

The problem arises when you have a MaxMemPerCPU limit but your request is --mem (per node), or vice versa, i.e. the request and the limit are of different types. Slurm needs both to be of the same type, so it tries to transform the per-cpu component to per-node form:

alex@ibiza:~/t$ salloc -p test -n20 --mem 125G bash
salloc: error: Job submit/allocate failed: Requested partition configuration not available now
salloc: Job allocation 20015 has been revoked.
alex@ibiza:~/t$

slurmctld: debug2: _valid_pn_min_mem: job 20015 mem_per_node=128000M > MaxMemPerNode=102400M in partition test

In this scenario, we have a --mem=125G request (a per-node request) and Slurm needs to compare it with MaxMemPerCPU=6400M.

To get both into the same type, the logic grabs the node with the lowest allocatable memory in the partition (the lowest RealMemory - MemSpecLimit value), so it calls:

lowest_mem = _part_node_lowest_mem(part_ptr);

which returns 64GiB. Then it grabs the number of CPUs per node in this partition; to simplify (since the job could end up being allocated nodes with different cores/memory for multi-node submissions), it takes the CPU count from the first node in the partition:

avail_cpus_per_node = _cpus_per_node_part(part_ptr);

which returns 16.

Then it does:

/*
 * Job has per-node form and limit has per-cpu one. Use the
 * avail_cpus_per_node as a factor to obtain the limit per-node
 * form and compare it with the job memory.
 */
max_mem &= (~MEM_PER_CPU); // Transform MaxMemPerCPU to MaxMemPerNode
max_mem *= avail_cpus_per_node;

MaxMemPerNode = 6400MiB (old MaxMemPerCPU) * 16 (avail_cpus_per_node) = 102400MiB = 100GiB

So the job's --mem=125G exceeds MaxMemPerNode=100GiB, which is what is logged:

slurmctld: debug2: _valid_pn_min_mem: job 20015 mem_per_node=128000M > MaxMemPerNode=102400M in partition test
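
Putting the pieces above together, a self-contained sketch of the check might look like the following. It is illustrative only: the MEM_PER_CPU flag value and the helper body are stand-ins for the real slurmctld code (_cpus_per_node_part()), and only the two request forms discussed above are handled.

#include <stdio.h>
#include <stdint.h>

/* Per-CPU memory values are tagged with a flag bit (MEM_PER_CPU in the Slurm
 * source); the value used here is only a stand-in for the sketch. */
#define MEM_PER_CPU_FLAG 0x8000000000000000ULL

/* Stand-in for _cpus_per_node_part(): CPUs of the first node in the partition. */
static uint64_t cpus_per_node_part(void) { return 16; }

/* Returns 1 if the job memory request fits under the partition limit. */
static int valid_pn_min_mem(uint64_t job_mem, uint64_t max_mem)
{
    if ((job_mem & MEM_PER_CPU_FLAG) && (max_mem & MEM_PER_CPU_FLAG)) {
        /* Both per-CPU: compare directly. */
        return (job_mem & ~MEM_PER_CPU_FLAG) <= (max_mem & ~MEM_PER_CPU_FLAG);
    }
    if (!(job_mem & MEM_PER_CPU_FLAG) && (max_mem & MEM_PER_CPU_FLAG)) {
        /* Job is per-node, limit is per-CPU: scale the limit up using the
         * CPU count of the first node, as described above. */
        uint64_t limit = (max_mem & ~MEM_PER_CPU_FLAG) * cpus_per_node_part();
        return job_mem <= limit;
    }
    /* Other combinations are omitted from this sketch (the real code also
     * fetches the lowest allocatable node memory via _part_node_lowest_mem(),
     * as described above). */
    return 1;
}

int main(void)
{
    uint64_t max_mem = 6400 | MEM_PER_CPU_FLAG;  /* MaxMemPerCPU=6400M */

    /* --mem=125G (128000 MiB per node): 128000 > 6400 * 16 -> rejected. */
    printf("--mem=125G         -> %s\n",
           valid_pn_min_mem(128000, max_mem) ? "ok" : "reject");

    /* --mem-per-cpu=6400: equal to the limit -> accepted. */
    printf("--mem-per-cpu=6400 -> %s\n",
           valid_pn_min_mem(6400 | MEM_PER_CPU_FLAG, max_mem) ? "ok" : "reject");
    return 0;
}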
Comment 8 Alejandro Sanchez 2018-06-01 09:03:15 MDT
Here's an example with your config but on Slurm 17.11.6, whose behavior I think is clearly incorrect:

Create a reservation containing 1 node of 125G and 20 cores:

NODELIST S:C:T CPUS MEMORY PARTITION WEIGHT
sh-101-60 2:10:1 20 128000 test* 102161

alex@ibiza:~/t$ scontrol create reservationname=test start=now end=now+1day nodes=sh-101-60 users=alex
Reservation created: test
alex@ibiza:~/t$ scontrol show res
ReservationName=test StartTime=2018-06-01T16:58:06 EndTime=2018-06-02T16:58:06 Duration=1-00:00:00
   Nodes=sh-101-60 NodeCnt=1 CoreCnt=20 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=20
   Users=alex Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

Now launch a job requesting 1MB more than the available memory:
alex@ibiza:~/t$ sbatch -p test --mem=128001 --wrap "true" --reservation=test
Submitted batch job 20052
alex@ibiza:~/t$

Here you see the automatic adjustments:
slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
slurmctld: debug:  Setting job's pn_min_cpus to 21 due to memory limit
slurmctld: debug2: initial priority for job 20052 is 25000
slurmctld: _build_node_list: No nodes satisfy job 20052 requirements in partition test

Note that Slurm was trying to increase pn_min_cpus so that it could satisfy the MaxMemPer* limit, but it was setting pn_min_cpus beyond the number of CPUs in that node, which is 20, so to 21 in this case (if I requested even more memory, you would see an even greater increase of pn_min_cpus).
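
The adjustment appears to boil down to a ceiling division; a standalone sketch with the numbers from this example (not the slurmctld code):

#include <stdio.h>

int main(void)
{
    unsigned mem_request     = 128001;  /* --mem, MiB */
    unsigned max_mem_per_cpu = 6400;    /* partition MaxMemPerCPU, MiB */
    unsigned node_cpus       = 20;      /* CPUs on sh-101-60 */

    /* Old behavior: raise pn_min_cpus until pn_min_cpus * MaxMemPerCPU covers --mem. */
    unsigned pn_min_cpus = (mem_request + max_mem_per_cpu - 1) / max_mem_per_cpu;

    printf("Setting job's pn_min_cpus to %u due to memory limit (node has %u CPUs)\n",
           pn_min_cpus, node_cpus);
    return 0;
}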

And this is incorrect. This is one of the 4 issues fixed in that commit.
Comment 9 Alejandro Sanchez 2018-06-01 09:30:47 MDT
Another issue that was incorrect in 17.11.6:

alex@ibiza:~/t$ sinfo -N -n sh-06-04 -o "%N %z %c %m %P %w"
NODELIST S:C:T CPUS MEMORY PARTITION WEIGHT
sh-06-04 2:8:1 16 64000 test* 100081
alex@ibiza:~/t$ scontrol show part test | grep -i MaxMemPerCPU
   DefMemPerCPU=3000 MaxMemPerCPU=3500 <-- Lowered to 3500
alex@ibiza:~/t$ sbatch -p test -w sh-06-04 -n 16 --mem-per-cpu=0 --wrap "sleep 9999"
Submitted batch job 20063
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20063      test     wrap     alex  R       0:00      1 sh-06-04
alex@ibiza:~/t$ scontrol show job 20063 | grep -i tre
   TRES=cpu=16,mem=62.50G,node=1,billing=16
alex@ibiza:~/t$ echo "(62.5 * 1024) / 16" | bc -l
4000.00000000000000000000
alex@ibiza:~/t$

When requesting --mem[-per-cpu]=0, this special case was ignoring MaxMemPer[CPU|Node]; the commit in .7 fixes that as well.

There's another issue with regard to multi-partition requests that is solved there as well.

I'm open to discussing why you think the commit in .7 is an important regression. Thanks!
Comment 10 Kilian Cavalotti 2018-06-01 13:48:23 MDT
Hi Alejandro,

First of all, thanks a lot for taking the time to walk me through the change. I understand parts of it, but I strongly disagree with others, and I'll get back to that in detail.

But most importantly: this is a MAJOR change in behavior that has not been announced or documented in any way, and which has been introduced in a minor version. This is extremely unsettling and, to summarize, it breaks our setup and we're not too happy about that.

We've been relying on [Def|Max]MemPerCPU since the first 2.6 versions we installed on Sherlock about 4 years ago, and that behavior has been consistent over time. Having it change completely between 17.11.6 and 17.11.7 is completely unexpected, and has not been very well received, to say the least.


Now to the technical part of the issue:

> that commit stops Slurm (among other things) from doing automatic
> adjustments under the covers when the --mem[-per-cpu] requested by
> the job exceeds the cluster/partition MaxMemPer[CPU|Node]. 

I understand this could have caused issues in certain cases, but disabling the behavior entirely seems a bit of a strong move. 

If the behavior that automatically adjusts the number of CPUs when the amount of requested memory exceeds the MaxMemPerCPU value is removed, what is the point of having a MaxMemPerCPU option at all?

And again, I really have a hard time understanding how a behavior that has been in place for years, and is still explicitly described in the documentation, could be radically changed in a minor version.
For reference, the slurm.conf man page says, under "MaxMemPerCPU"

    NOTE: If a job specifies a memory per CPU limit that exceeds this system 
    limit, that job's count of CPUs per task will automatically be increased. 
    This may result in the job failing due to CPU count limits.

Also, in the "sbatch" man page:

    Note that if the job's --mem-per-cpu value exceeds the configured 
    MaxMemPerCPU, then the user's limit will be treated as a memory limit per
    task; --mem-per-cpu will be reduced to a value no larger than MaxMemPerCPU; 
    --cpus-per-task will be set and the value of --cpus-per-task multiplied by 
    the new --mem-per-cpu value will equal the original --mem-per-cpu value  
    specified  by  the  user.

All of this has been documented and consistent across versions, and we've been relying on it for literally years.
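
For concreteness, the documented adjustment amounts to roughly the following (a standalone sketch of the rule as written in the man page, assuming the job starts with --cpus-per-task=1; not the actual Slurm implementation):

#include <stdio.h>

int main(void)
{
    unsigned max_mem_per_cpu   = 6400;   /* configured MaxMemPerCPU, MiB */
    unsigned req_mem_per_cpu   = 12800;  /* user's --mem-per-cpu */
    unsigned req_cpus_per_task = 1;      /* user's --cpus-per-task */

    if (req_mem_per_cpu > max_mem_per_cpu) {
        /* Treat the request as memory per task: scale --cpus-per-task up and
         * --mem-per-cpu down so their product equals the original request. */
        unsigned factor = (req_mem_per_cpu + max_mem_per_cpu - 1) / max_mem_per_cpu;
        req_cpus_per_task *= factor;
        req_mem_per_cpu = (req_mem_per_cpu + req_cpus_per_task - 1) / req_cpus_per_task;
    }
    printf("--cpus-per-task=%u --mem-per-cpu=%u\n", req_cpus_per_task, req_mem_per_cpu);
    return 0;
}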


> you are defining a partition with heterogeneous nodes with respect 
> not only to the number of cores but also with respect to the amount 
> of memory, which we usually discourage to do:

Well, yes, we are, and heterogeneous nodes within a cluster are an unfortunate reality that we have to live with. I guess only National Labs and large CSPs have the luxury of homogeneous hardware these days. But even so, the administrative boundaries (ownership, permissions...) don't necessarily match the hardware specs, with hardware evolving over time, computing needs varying with applications, and so on. Assuming that partitions offer a unique type of hardware seems a bit of a stretch, especially in academic environments.

So yes, we have to and will continue to group nodes with different memory-per-core ratios under the same partitions. Heterogeneous partitions are something that Slurm has supported pretty well so far, thanks to its great flexibility, and they're not going away, quite the contrary, actually. So if Slurm were to stop supporting them well, that would be a considerable step backwards. And I'm pretty sure we wouldn't be the only ones affected.


> Note that the partition MaxMemPer[CPU|Node] limit applies equally to
> all the nodes in the partition, so if the partition contains
> heterogeneous nodes that can be problematic and that's why we
> discourage having partitions like that.

Yes, that's why we have a DefMemPerCPU, which for us defines the lowest memory/core ratio in the partition, and a MaxMemPerCPU, which defines the highest ratio. And again, it's been working quite well, with the notable exception of jobs submitted to multiple partitions, because, I believe, the memory/CPU adjustments are done before the partition where the job will run is chosen. So to me, a fix would have been to choose a partition first and then adjust the memory/CPU values, rather than disabling the behavior entirely.

Which leads to another question: how does that new logic work out for jobs submitted to multiple partitions, when each partition has a different memory/core ratio? Doesn't it completely reproduce the situation where you have heterogeneous nodes within a single partition?


> With these premises, the new logic what it does is to go find the
> node with the lowest amount of allocatable memory in the partition
> and do the calculations for all the nodes in such partition based off
> that.

Why? That would assume that all the nodes in the partition have the same specs, and that's not something that can be taken for granted.

If I understand things correctly, MaxMemPer[CPU|Node] is now really a partition-wide limit, that applies consistently to all the nodes in the partition, whereas before, it was not, and was more of a per-node value. Is that correct?

There's more that I don't get:

> the logic goes and grabs the node
> with the lowest allocatable memory in the partition (lowest
> RealMemory - MemSpecLimit value)
> [...]
> Then it grabs the number of cpus per node in this partition, and [...]
> it takes the number of cpus per node from the first node in the partition:

I have several concerns with this:

1. This logic assumes that all the nodes in the partition have the same number of CPUs. Why? That would effectively forbid heterogeneous partitions.

2. Why grab the memory info from one node, and the CPU count from another? Why not get everything from the same node? That seems to introduce a bit of randomness, and certainly a risk of unpredictable results.

3. Why use the first node in the partition as a reference for the number of CPUs? Wouldn't it be better to use the node with the lowest (or highest) number of CPUs in the partition? The first one (and in which order, actually?) is not necessarily significant in any way.


So, again, this all assumes that all the nodes in a partition have the same number of CPUs and the same amount of memory. This is not necessarily the case.



Regarding the 17.11.6 examples you kindly provided, I agree there are some issues that would be better fixed, but maybe differently from what 17.11.7 does.

> Slurm was trying to increase pn_min_cpus so that it could
> satisfy the MaxMemPer, but it was giving pn_min_cpus beyond the
> number of CPUs in that node, which is 20, so to 21 in this case

Absolutely. And after doing so, it should have rejected the job, instead of keeping it in queue. Since the --mem=128001 argument could not be satisfied by any node in the partition/reservation, the job should not even be accepted at all, and no CPU adjustment would even be necessary. 
(That's actually what happens when you submit that type of job to the partition directly, without using a reservation. I'm not sure why the behavior with a reservation is different). Here, it looks like the job is accepted and stays in queue despite having two unsatisfiable requests: memory *and* CPU count.

So to me, a natural solution would be to reject this un-runnable job at submission, rather than removing the CPU count adjustment.


> When requesting --mem[-per-cpu]=0 this special case was ignoring
> MaxMemPer[CPU|Node] and with the commit in .7 that is fixed as well.

The "sbatch" man pages says this about "--mem"

    NOTE: A memory size specification of zero is treated as a special case and 
    grants the job access to all of the memory on each node

That sounds to me like MaxMemPer[CPU|Node] *should* be ignored with --mem[-per-cpu]=0, doesn't it?



I'm sorry for the long rant-like comment, and I appreciate the willingness to make things better and fix issues, as well as the transparency of explaining the new behavior here, but this is pretty concerning for us, in several ways:

* that commit seems to have introduced major behavior changes in a minor version, without any announcement or documentation. Actually, some of them seem to go *against* the existing documentation.

* the new logic seems to strongly assume that nodes within a partition are completely homogeneous in terms of number of CPUs and amount of memory. This is definitely not the case in our setup, and I don't think it could be used as a general premise. That would greatly reduce Slurm's flexibility and be a significant step backwards.

* it looks like the issues addressed in that commit are real, but somewhat on the corner-case side, and the way chosen to fix them was simply to remove a documented behavior that has been in place and working fine for years.


Right now, we have downgraded Sherlock to 17.11.6, despite the known vulnerabilities, and we won't be able to move to 17.11.7 with that new logic and our current configuration.

So, to move forward, here are our requirements:

1. we need to continue having heterogeneous nodes in a single partition

2. we need to be able to allocate nodes in "slices", with a fixed per-node memory/core ratio. We don't want situations where a job allocates 1 CPU and all the memory of a node, and all the remaining CPUs show up idle despite being unusable for lack of available memory.
That means, on a 16-CPU, 64GB node, that each requested CPU should automatically allocate 4GB of memory, and each additional chunk of 4GB of requested memory should automatically allocate one more CPU. And I understand this would be better defined at the node level rather than at the partition level.
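
As a concrete sketch of that slicing rule (illustrative only, assuming a 16-CPU, 64000 MB node with a 4000 MB-per-CPU ratio):

#include <stdio.h>

/* CPUs a request should consume on a node sliced at `ratio_mb` per CPU:
 * whichever is larger, the CPUs asked for or the CPUs implied by memory. */
static unsigned slice_cpus(unsigned req_cpus, unsigned req_mem_mb, unsigned ratio_mb)
{
    unsigned mem_cpus = (req_mem_mb + ratio_mb - 1) / ratio_mb;  /* round up */
    return req_cpus > mem_cpus ? req_cpus : mem_cpus;
}

int main(void)
{
    unsigned ratio = 4000;  /* MiB per CPU on a 16-CPU, 64000 MiB node */

    printf("1 CPU, 64000M -> %u CPUs charged\n", slice_cpus(1, 64000, ratio)); /* 16 */
    printf("1 CPU,  4001M -> %u CPUs charged\n", slice_cpus(1, 4001, ratio));  /*  2 */
    printf("4 CPUs, 8000M -> %u CPUs charged\n", slice_cpus(4, 8000, ratio));  /*  4 */
    return 0;
}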

So given that, how do you suggest we modify our configuration to satisfy those requirements in 17.11.7?


Thanks again for your explanations and taking the time to look into this.

Cheers,
-- 
Kilian
Comment 11 Alejandro Sanchez 2018-06-04 06:49:24 MDT
Hi Kilian,

(In reply to Kilian Cavalotti from comment #10)
> HI Alejandro, 
> 
> First of all, thanks a lot for taking the time to walk me through the
> change. I understand parts of it, but I strongly disagree with others, and
> I'll get back to that in detail.
> 
> But most importantly: this is a MAJOR change in behavior, that has not been
> announced, nor documented in any way, and which has been introduced in a
> minor version. This is extremely unsettling, and to summarize, this breaks
> our setup and we're not too happy about that.
> 
> We've been relying on [Def|Max]MemPerCPU since the first 2.6 versions we
> installed on Sherlock about 4 years ago, and that behavior has been
> consistent over time. Having it change completely between 17.11.6 and
> 17.11.7 is completely unexpected, and has not been very well received, to
> say the least.

thanks for sharing your insights and the detailed feedback about your concerns regarding the commit changes.

Independently of the technical details, I agree that, in trying to fix bug 4895 and bug 4976, I removed the automatic adjustments applied when a job's memory request exceeded MaxMemPer*, and that change should at least have landed in 18.08, with the documentation changed accordingly (the slurm.conf MaxMemPerCPU section and the submission commands' --mem-per-cpu option).

In any case, I apologize for stopping Slurm from behaving, in this specific case, the way you expected it to. I'll discuss internally with the team how to proceed... one option we've considered is to restore the adjustments in 17.11 and discuss what to do for 18.08, but we haven't decided anything yet. We'll let you know before committing anything.

I'm gonna share my thinking about the changes introduced in .7 and the reasoning behind them.
 
> Now to the technical part of the issue:
> 
> > that commit stops Slurm (among other things) from doing automatic
> > adjustments under the covers when the --mem[-per-cpu] requested by
> > the job exceeds the cluster/partition MaxMemPer[CPU|Node]. 
> 
> I understand this could have caused issues in certain cases, but disabling
> the behavior entirely seems a bit of a strong move. 
> 
> If the behavior that automatically adjusts the number of CPUs when the
> amount of requested memory exceeds the MaxMemPerCPU value, what is the point
> of having a MaxMemPerCPU option at all?
> 
> And again, I really have a hard time understanding how a behavior that has
> been in place for years, and is still explicitly described in the
> documentation, could be radically changed in a minor version.
> For reference, the slurm.conf man page says, under "MaxMemPerCPU"
> 
>     NOTE: If a job specifies a memory per CPU limit that exceeds this system 
>     limit, that job's count of CPUs per task will automatically be
> increased. 
>     This may result in the job failing due to CPU count limits.
> 
> Also, in the "sbatch" man page:
> 
>     Note that if the job's --mem-per-cpu value exceeds the configured 
>     MaxMemPerCPU, then the user's limit will be treated as a memory limit per
>     task; --mem-per-cpu will be reduced to a value no larger than
> MaxMemPerCPU; 
>     --cpus-per-task will be set and the value of --cpus-per-task multiplied
> by 
>     the new --mem-per-cpu value will equal the original --mem-per-cpu value  
>     specified  by  the  user.
> 
> All of this has been documented and consistent across versions, and we've
> been relying on it for literally years.

My reasoning behind this was that Slurm should try to allocate what users requested and validate the request against the admin-defined limits, but not modify the request. If users see that their original request ends up being allocated as something different from what they requested, that might be confusing to them. If admins want to modify the user request, they have the Job Submit plugins available to do so, and every site can then justify its specific modifications to its users as desired.

For instance, in bug 4976 they didn't understand why Slurm was modifying the user request so that it automatically requested more CPUs than were available on the partition's nodes.

I understand, though, that for the specific case of requesting memory, ntasks and cpus-per-task, perhaps the majority of users (and probably admin sites too) prefer to offload these tricky adjustments from a Job Submit plugin and let Slurm do them under the covers, so that the job ends up being allocated with such adjustments rather than having the request rejected outright when it exceeds the defined limits.

Also, Slurm had been behaving like this for many versions; I understand that changing this behavior in a micro release wasn't a good decision.
 
> 
> > you are defining a partition with heterogeneous nodes with respect 
> > not only to the number of cores but also with respect to the amount 
> > of memory, which we usually discourage to do:
> 
> Well, yes, we are, and heterogeneous nodes within a cluster is an
> unfortunate reality that we have to live with. I guess only National Labs
> and large CSPs have the luxury of homogeneous hardware these days. But even
> though, the administrative boundaries (ownership, permissions...) don't
> necessarily match the hardware specs, with hardware evolving over time,
> computing needs varying with applications, and so on. Assuming that
> partitions offer a unique type of hardware seems a bit of a stretch,
> especially in academic environments.
> 
> So yes, we have to and will continue to regroup nodes with different memory
> per core ratios under the same partitions. Heterogeneous partitions are a
> thing that Slurm supported pretty well so far, thanks to its great
> flexibility, and they're not going away, quite the contrary, actually. So if
> Slurm were to stop working, that would be a considerable step backwards. And
> I'm pretty sure we wouldn't be the only ones affected.

I'm not saying it is a strict rule that partitions can't have heterogeneous nodes, just that from our experience that can cause issues. And we work every day towards improving Slurm's flexibility and functionality.
 
> 
> > Note that the partition MaxMemPer[CPU|Node] limit applies equally to
> > all the nodes in the partition, so if the partition contains
> > heterogeneous nodes that can be problematic and that's why we
> > discourage having partitions like that.
> 
> Yes, that's why we have a DefMemPerCPU, which for us defines the lowest
> memory/core ratio in the partition, and a MaxMemPerCPU, which defines the
> highest ratio. And again, it's been working quite well, with the notable
> exception of jobs submitted to multiple partitions, because, I believe, the
> memory/cpu adjustments are done before the partition where to run the job is
> chosen. So to me, a fix would have been to choose a partition first and then
> adjust the memory/cpu limits next, rather than disabling the behavior
> entirely.
> 
> Which leads to another question: how does that new logic work out for jobs
> submitted to multiple partitions, when each partition has a different
> memory/core ratio? Doesn't it completely reproduce the situation where you
> have heterogeneous nodes within a single partition?

The function that "supposedly" validates the job memory request against the limit is _valid_pn_min_mem(), and I say "supposedly" because it doesn't just validate but also changes the request in some cases. That is where the idea of adding extra members to the job_record came from:

* Fields subject to change and their original values are as follows:
 * min_cpus		orig_min_cpus
 * max_cpus		orig_max_cpus
 * cpus_per_task	orig_cpus_per_task
 * pn_min_cpus		orig_pn_min_cpus
 * pn_min_memory	orig_pn_min_memory

The problem with this function is that it not only validates but also changes the request. When looping over the partitions of a multi-partition request, checking one partition could modify the user's request, so the next partition's limits were no longer validated against the original request but against a version already modified to suit the previous partition.
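
A minimal sketch of that failure mode (illustrative only; the field names mirror the orig_* members listed above, and the limit values are made-up numbers):

#include <stdio.h>
#include <stdint.h>

struct job_rec {
    uint64_t pn_min_memory;       /* may be rewritten per partition */
    uint64_t orig_pn_min_memory;  /* saved copy of the user's request */
};

/* Per-partition check that (incorrectly) also rewrites the request. */
static int check_and_adjust(struct job_rec *job, uint64_t part_limit)
{
    if (job->pn_min_memory > part_limit) {
        job->pn_min_memory = part_limit;   /* adjustment made for this partition */
        return 0;                          /* ...but the limit was exceeded */
    }
    return 1;
}

int main(void)
{
    struct job_rec job = { .pn_min_memory = 100000, .orig_pn_min_memory = 100000 };

    /* Buggy multi-partition loop: partition B is checked against the value
     * already clamped for partition A, so it wrongly passes. */
    int ok_a = check_and_adjust(&job, 60000);   /* fails, clamps request to 60000 */
    int ok_b = check_and_adjust(&job, 80000);   /* "passes" against 60000, not 100000 */
    printf("buggy : A=%d B=%d\n", ok_a, ok_b);

    /* With the orig_* copy, each partition is validated against the
     * original request instead of the mutated one. */
    job.pn_min_memory = job.orig_pn_min_memory;
    printf("fixed : B=%d\n", (int)(job.pn_min_memory <= 80000));
    return 0;
}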

> 
> > With these premises, the new logic what it does is to go find the
> > node with the lowest amount of allocatable memory in the partition
> > and do the calculations for all the nodes in such partition based off
> > that.
> 
> Why? That would assume that all the nodes in the partition have the same
> specs, and that's not something that can be taken for granted.
> 
> If I understand things correctly, MaxMemPer[CPU|Node] is now really a
> partition-wide limit, that applies consistently to all the nodes in the
> partition, whereas before, it was not, and was more of a per-node value. Is
> that correct?
> 
> There's more that I don't get:
> 
> > the logic goes and grabs the node
> > with the lowest allocatable memory in the partition (lowest
> > RealMemory - MemSpecLimit value)
> > [...]
> > Then it grabs the number of cpus per node in this partition, and [...]
> > it takes the number of cpus per node from the first node in the partition:
> 
> I have several concerns with this:
> 
> 1. This logic assumes that all the nodes in the partition have the same
> amount of CPUs. Why? That would effectively forbid heterogeneous partitions.

The problem is that the EnforcePartLimits option, among other limits, enforces MaxMemPer* at submit time when set to ANY and/or ALL.

So at submission time, when slurmctld does not yet know which nodes in a partition will be selected for the job, if the job memory request is per-node and the partition limit is per-cpu, we have to compare apples with apples and therefore convert the limit to a per-node limit as well.

Now, to convert the limit to a per-node one, you need a memory/CPU ratio... but if you still don't know which nodes the job will be allocated, and the partition can contain nodes with different memory and CPU sizes, then one approach is to make simplifications such as using the lowest node memory, or the CPU count of the first node in the partition.
 
> 2. Why grabbing the memory info from one node, and then the CPU count info
> from another node? Why not getting everything from the same node? That seems
> to be introducing a bit of randomness, and certainly a risk of unpredictable
> results.

That is a good point, and something I already had in mind as subject to change.

> 3. Why using the first node in the partition as a reference for the number
> of CPUs? Wouldn't it be better to use the node with the lowest (or highest)
> number of CPUs in the partition? The first one (and in which order,
> actually?) is not necessarily significant in any way.

I agree this can be improved, taking into account the lack of information about node selection at submission time. I also want to point out that _valid_pn_min_mem() called a function named _cpus_per_node_part() to retrieve the number of CPUs of the first node in the partition, and that logic was there before my changes for .7; in my willingness to leave as many things untouched as possible (except the adjustments), I didn't modify it either.

But yes, I agree that using the first node in the partition is a simplification. I also want you to understand that, without knowing which type of nodes the job will end up being allocated (it could even be allocated nodes of different types), computing ratios for the automatic adjustments is a problem.

> 
> So, again, this all assumes that all the nodes in a partition have the same
> amount of CPUs, and the same amount of memory. This is not necessarily the
> case.
> 
> 
> 
> Regarding the 17.11.6 examples you kindly provided, I agree there are some
> issues, that would be better fixed, but maybe differently from what 17.11.7
> does.
> 
> > Slurm was trying to increase pn_min_cpus so that it could
> > satisfy the MaxMemPer, but it was giving pn_min_cpus beyond the
> > number of CPUs in that node, which is 20, so to 21 in this case
> 
> Absolutely. And after doing so, it should have rejected the job, instead of
> keeping it in queue. Since the --mem=128001 argument could not be satisfied
> by any node in the partition/reservation, the job should not even be
> accepted at all, and no CPU adjustment would even be necessary. 
> (That's actually what happens when you submit that type of job to the
> partition directly, without using a reservation. I'm not sure why the
> behavior with a reservation is different). Here, it looks like the job is
> accepted and stays in queue despite having two unsatisfiable requests:
> memory *and* CPU count.
> 
> So to me, a natural solution would be to reject this un-runnable job at
> submission, rather than removing the CPU count adjustment.

We will study whether we can restore the old behavior, with all my considerations above in mind.
 
> 
> > When requesting --mem[-per-cpu]=0 this special case was ignoring
> > MaxMemPer[CPU|Node] and with the commit in .7 that is fixed as well.
> 
> The "sbatch" man pages says this about "--mem"
> 
>     NOTE: A memory size specification of zero is treated as a special case
> and 
>     grants the job access to all of the memory on each node
> 
> That sounds to me like MaxMemPer[CPU|Node] *should* be ignored with
> --mem[-per-cpu]=0, doesn't it?
> 

This specific case can be moved back out of _valid_pn_min_mem() and handled deep in select/cons_res.c again, so that it grants all the memory on the node.

Whether such a request should ignore the limit is arguable, even if the documentation says it's a special case that grants the job access to all of the memory on each node.

If an admin sets MaxMemPerNode and a user can bypass that limit just by requesting --mem=0, then there's no point in having the limit at all. In any case, I can discuss this internally and hear more opinions.

But this is not the big problem; the main thing is the automatic adjustments.
 
> 
> I'm sorry for the long rant-like comment, and I appreciate the willingness
> to make things better and fix issues, as well as the transparency of
> explaining the new behavior here, but this is pretty concerning for us, in
> several ways:
> 
> * that commit seems to have introduced major behavior changes in a minor
> version, without any announce nor documentation. Actually some of them seem
> to be *against* the existing documentation.

I agree, and apologize for that.

> * the new logic seems to strongly assumes that nodes within a partition are
> completely homogeneous in terms of number of CPUs and memory amount. This is
> definitely not the case in our setup, and I don't think it could be used as
> a general premise. That would great reduce Slurm's flexibility and be a
> significant step backwards.

I explained this a few paragraphs above.

> * it looks like the issues addressed in that commit are real, but somewhat
> on the corner-case side, and that the way that was chosen to fix them was to
> simply remove a documented behavior that has been in place and working fine
> for years.

Well, multi-partition requests aren't such a corner case, and neither is submitting jobs against reservations. But I agree that perhaps I went too far with the fix.

> 
> Right now, we have downgraded Sherlock to 17.11.6, despite the known
> vulnerabilities, and we won't be able to move to 17.11.7 with that new logic
> and our current configuration.
> 
> So, to move forward. here are our requirements:
> 
> 1. we need to continue having heterogeneous nodes in a single partition
> 
> 2. we need to be able to allocate nodes in "slices", with a fixed per-node
> memory/core ratio. We don't want situations where a job allocate 1 CPU and
> all the memory of a node, and all the remaining CPUs show up idle, despite
> being unusable for lack of available memory. 
> That means, on a 16-CPU, 64GB node, that each requested CPU should
> automatically allocates 4GB of memory, and each additional chunk of 4GB of
> requested memory should automatically allocate one more CPU. And I
> understand this would be better defined at the node-level rather than at the
> partition-level.
> 
> So given that, how do you suggest we modify our configuration to satisfy
> those requirements in 17.11.7?

We're gonna study what the best thing to do is and get back to you.

> 
> Thanks again for your explanations and taking the time to look into this.
> 
> Cheers,
> -- 
> Kilian

Thanks for sharing your expectations about the behavior.

Alex
Comment 12 Kilian Cavalotti 2018-06-04 06:49:47 MDT
Hi!

I'm out of the office and will return on June 21st.

For urgent matters, please contact srcc-support@stanford.edu

Cheers,
--
Kilian
Comment 15 Alejandro Sanchez 2018-06-06 15:01:40 MDT
*** Ticket 5269 has been marked as a duplicate of this ticket. ***
Comment 16 Alejandro Sanchez 2018-06-08 06:08:51 MDT
*** Ticket 5283 has been marked as a duplicate of this ticket. ***
Comment 24 Alejandro Sanchez 2018-06-27 03:04:46 MDT
Hi,

the change in functionality has been reverted here:

d52d8f4f0ce1a5b86bb0691630da0dc3dace1683

and we added this commit on top of the revert:

f07f53fc138b22485e7c26903968fa470cc9d98f

to fix a problem with multi-partition requests. They will be in 17.11.8 and onwards. I'm gonna go ahead and close this; sorry for the delayed response. Thank you.
Comment 25 Alejandro Sanchez 2018-07-04 04:53:31 MDT
*** Ticket 5382 has been marked as a duplicate of this ticket. ***
Comment 26 Kilian Cavalotti 2018-07-05 11:53:27 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #24)
> the change in functionality has been reverted in here:
> d52d8f4f0ce1a5b86bb0691630da0dc3dace1683
> 
> and we added this commit on top of the revert:
> f07f53fc138b22485e7c26903968fa470cc9d98f
> 
> to fix a problem on multi-partition requests. They will be in 17.11.8 and
> onwards. I'm gonna go ahead and close this; sorry for the delayed response.
> Thank you.

Thank you very much for sharing the reasoning behind the changes introduced in .7. It definitely makes much more sense now.

I particularly wasn't aware of the specifics of multi-partition submissions, and of the fact that the limits were not validated against the original requests but instead against a value already modified by the check on the previous partition. That indeed sounds like the root of the problem reported in #4976 and that you addressed in f07f53fc138b22485e7c26903968fa470cc9d98f.

Thanks again for reverting the changes in .8, we will be sure to test and validate the allocation behavior on partitions with multiple memory/core ratios once it's released.

Cheers,
-- 
Kilian
Comment 27 Felip Moll 2018-08-21 05:35:08 MDT
*** Ticket 5592 has been marked as a duplicate of this ticket. ***
Comment 28 Martin Siegert 2018-08-21 05:35:19 MDT
I am on vacation from Aug 10 to 24 and may not be able to respond to your message before I return.

For help related to the Cedar/Compute Canada systems send email to support@computecanada.ca.

For matters concerning research computing at SFU send email to research-support@sfu.ca.

--
Martin Siegert
Director, Research Computing, IT Services
WestGrid/ComputeCanada Site Lead
Simon Fraser University | ASB 10977
8888 University Dr., Burnaby, BC V5A1S6
T: 778.782.4691 | sfu.ca/itservices
Comment 29 Nate Rini 2018-12-06 10:54:33 MST
*** Ticket 6183 has been marked as a duplicate of this ticket. ***