We have partitions that contain mixed node types, e.g.:

NodeName=cdr[41-96,98-103,119-148,165-196,213-244,909,918,921] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000 Weight=116 TmpDisk=864097 Feature=broadwell
NodeName=cdr[1001-1092,1108-1109,1112,1128] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192000 Weight=116 TmpDisk=864097 Feature=skylake
NodeName=cdr[633,672,736,745,747-748,762,766,769,790,792,799,802-804,810,812,818,820-821,827-828,830,836,853,858,860,871-872,874,907,915] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257000 Weight=216 TmpDisk=864097 Feature=broadwell

PartitionName=cpubase_bycore_b6 Nodes=cdr[41-96,98-103,119-148,165-196,213-244,633,672,736,745,747-748,762,766,769,790,792,799,802-804,810,812,818,820-821,827-828,830,836,853,858,860,871-872,874,907,909,915,918,921,1001-1092,1108-1109,1112,1128] MaxTime=672:00:00 PriorityJobFactor=1 TRESBillingWeights=CPU=1.0,Mem=0.25G Default=no MinNodes=1 AllowGroups=ALL PriorityTier=10 DisableRootJobs=NO RootOnly=NO Hidden=NO OverSubscribe=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=256 AllowAccounts=ALL AllowQos=ALL AllocNodes=cedar1,cedar5,gateway,lcg-ce[1-4],cdr[1-1999] DefaultTime=1:00:00 ExclusiveUser=NO

Jobs are constrained to run on one node type (either broadwell or skylake); the default constraint, features="[broadwell|skylake]", is set in job_submit.lua, and users can specify --constraint=broadwell or --constraint=skylake. Other constraints are rejected.

When a job is submitted with --mem=0 --ntasks-per-node=48, i.e., a job that can only run on the skylake nodes, the amount of memory assigned to the job is 128000MB instead of the expected 192000MB. Apparently, Slurm assigns the minimum amount of memory of any of the nodes in the partition, regardless of the node the job actually runs on. This does not change even when the job is explicitly submitted with --constraint=skylake.
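A minimal reproduction on the partition above might look like the following (the --wrap command is just a placeholder workload):

```shell
# 48 tasks per node only fit on the 2x24-core skylake nodes, yet the
# memory assigned to the job is the partition minimum of 128000MB:
sbatch --partition=cpubase_bycore_b6 --constraint=skylake \
       --mem=0 --ntasks-per-node=48 --wrap="sleep 60"
```

Inspecting the job with scontrol show job afterwards shows the 128000MB allocation rather than the 192000MB (RealMemory) of the skylake nodes it runs on.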
The job gets oom-killed as soon as it uses more than 128000MB of memory, even though there is lots of memory available on the nodes; it is just the cgroup limit that is overly restrictive.

Another use case would be running a truly heterogeneous job, see, e.g., https://software.intel.com/en-us/mkl-linux-developer-guide-heterogeneous-support-in-the-intel-distribution-for-linpack-benchmark

To support that case, --mem=0 would need to assign to the job's cgroup the maximum amount of memory available on each node, e.g., 128000MB on our broadwell nodes and 192000MB on the skylake nodes. (We tested that a few days ago with the constraints removed and were able to use a maximum of 128000MB per node only. Side remark: for running such jobs it would be really helpful if there were an option --cpus-per-task=0 that similarly assigned all available cores to the job on each node.) - Martin
Hi Martin, this is a duplicate of bug 5240, and we're working on partially reverting the commit mentioned there so that we move the --mem=0 logic back to where it was before.

*** This ticket has been marked as a duplicate of ticket 5240 ***