Ticket 5269 - --mem=0 assigns incorrect amount of memory to job
Summary: --mem=0 assigns incorrect amount of memory to job
Status: RESOLVED DUPLICATE of ticket 5240
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.7
Hardware: Linux
OS: Linux
Importance: --- 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-06 14:32 MDT by Martin Siegert
Modified: 2018-06-06 15:01 MDT
CC List: 1 user

See Also:
Site: Simon Fraser University


Description Martin Siegert 2018-06-06 14:32:12 MDT
We have partitions that contain mixed node types, e.g.,

NodeName=cdr[41-96,98-103,119-148,165-196,213-244,909,918,921] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000 Weight=116 TmpDisk=864097 Feature=broadwell
NodeName=cdr[1001-1092,1108-1109,1112,1128] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192000 Weight=116 TmpDisk=864097 Feature=skylake
NodeName=cdr[633,672,736,745,747-748,762,766,769,790,792,799,802-804,810,812,818,820-821,827-828,830,836,853,858,860,871-872,874,907,915] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257000 Weight=216 TmpDisk=864097 Feature=broadwell
PartitionName=cpubase_bycore_b6 Nodes=cdr[41-96,98-103,119-148,165-196,213-244,633,672,736,745,747-748,762,766,769,790,792,799,802-804,810,812,818,820-821,827-828,830,836,853,858,860,871-872,874,907,909,915,918,921,1001-1092,1108-1109,1112,1128] MaxTime=672:00:00 PriorityJobFactor=1 TRESBillingWeights=CPU=1.0,Mem=0.25G Default=no MinNodes=1 AllowGroups=ALL PriorityTier=10 DisableRootJobs=NO RootOnly=NO Hidden=NO OverSubscribe=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=256 AllowAccounts=ALL AllowQos=ALL AllocNodes=cedar1,cedar5,gateway,lcg-ce[1-4],cdr[1-1999] DefaultTime=1:00:00 ExclusiveUser=NO
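
For reference, the per-node memory of each type can be confirmed with scontrol; a minimal check against the node definitions above (node names taken from those lines) would be:

scontrol show node cdr41 | grep RealMemory     # broadwell node: RealMemory=128000
scontrol show node cdr1001 | grep RealMemory   # skylake node: RealMemory=192000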

Jobs are constrained to run on one node type (either broadwell or skylake); the default constraint features="[broadwell|skylake]" is set in job_submit.lua, and users can specify --constraint=broadwell or --constraint=skylake. Other constraints are rejected. When a job is submitted with --mem=0 --ntasks-per-node=48, i.e., a job that can only run on the skylake nodes, the amount of memory assigned to the job is 128000MB instead of the expected 192000MB. Apparently, Slurm assigns the minimum amount of memory across all nodes in the partition, regardless of the node the job actually runs on. This does not change even when the job is explicitly submitted with --constraint=skylake. The job gets OOM-killed as soon as it uses more than 128000MB of memory, even though there is plenty of memory available on the nodes; it is just the cgroup setting that is overly restrictive.
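
For concreteness, a reproduction sketch (the job script name here is hypothetical):

sbatch --mem=0 --ntasks-per-node=48 --constraint=skylake job.sh
scontrol show job <jobid> | grep -i mem

Given the behaviour described above, the memory reported for the job comes back as 128000M rather than the 192000M the skylake nodes actually have.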

Another use case would be running a truly heterogeneous job, see, e.g.,
https://software.intel.com/en-us/mkl-linux-developer-guide-heterogeneous-support-in-the-intel-distribution-for-linpack-benchmark
To support that case, --mem=0 would need to assign to the job's cgroup the maximum amount of memory available on each node, e.g., 128000MB on our broadwell nodes and 192000MB on the skylake nodes (we tested this a few days ago with the constraints removed and were only able to use a maximum of 128000MB per node; side remark: it would be really helpful for running such jobs if there were an option --cpus-per-task=0 that similarly assigned all available cores to the job on each node).
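
As a rough sketch of that case (node count and script name are hypothetical), such a run would be submitted with the node-type constraint removed, along the lines of:

sbatch --mem=0 --nodes=4 ./run_hpl.sh

and the desired behaviour would be a per-node cgroup limit matching each node's RealMemory, i.e., 128000MB on the broadwell nodes and 192000MB on the skylake nodes, rather than 128000MB everywhere.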

- Martin
Comment 1 Alejandro Sanchez 2018-06-06 15:01:40 MDT
Hi Martin,

This is a duplicate of ticket 5240. We are working on partially reverting the commit mentioned there so that the --mem=0 logic moves back to where it was before.

*** This ticket has been marked as a duplicate of ticket 5240 ***