Bug 9010

Summary: sacct reporting MaxRSS higher than the node's available memory
Product: Slurm    Reporter: dl_support-hpc
Component: Other    Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: 20.02.1
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8656
Site: WEHI

Description dl_support-hpc 2020-05-07 23:41:39 MDT
Hi Support,

One of our users reported that his job's MaxRSS value was higher than the node's configured memory. What could explain this?

 sacct -j 34256 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,ReqMem,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS     ReqMem  MaxVMSize   NNodes      NCPUS        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- ---------- -------- ---------- --------------- 
   bedo.j 34256        bionix-xH+    regular     FAILED 1-00:00:00 2020-05-08T14:27:33 2020-05-08T14:56:44   00:29:11                  15Gn                   1          1  milton-sml-002 
          34256.extern     extern             COMPLETED            2020-05-08T14:27:33 2020-05-08T14:56:44   00:29:11       888K       15Gn    142576K        1          1  milton-sml-002 
          34256.0          chroot            OUT_OF_ME+            2020-05-08T14:27:36 2020-05-08T14:56:44   00:29:08 320389892K       15Gn 372041024K        1          1  milton-sml-002 
[root@milton-login01 ~]# scontrol show node milton-sml-002
NodeName=milton-sml-002 Arch=x86_64 CoresPerSocket=28 
   CPUAlloc=26 CPUTot=56 CPULoad=7.61
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=milton-sml-002 NodeHostName=milton-sml-002 Version=20.02.1
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 
   RealMemory=110000 AllocMem=78848 FreeMem=80128 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1000 Owner=N/A MCS_label=N/A
   Partitions=regular,long,debug 
   BootTime=2020-03-25T09:38:26 SlurmdStartTime=2020-04-20T13:05:12
   CfgTRES=cpu=56,mem=110000M,billing=270
   AllocTRES=cpu=26,mem=77G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
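
For scale: the reported MaxRSS of 320389892K is roughly 305 GB, while the node's RealMemory=110000M is only about 107 GB.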

Thanks and Regards,
Laszlo
Comment 3 Albert Gil 2020-05-12 07:50:06 MDT
Hi Laszlo,

These values look wrong.

We have an open issue regarding accounting with cgroups in bug 8656, but I'm not certain whether that applies to your case.
I'm also wondering whether the step ending in OUT_OF_MEMORY may be related to the wrong value; do you see a similar scenario with any other jobs?

Could you attach your slurm.conf, slurmdbd.conf and, in case you use cgroups, your cgroup.conf?
Also, could you post the output of:

$ sacct -j 34256 --long -p

I'll need to reproduce your issue, so any hint that you can provide to reproduce it will help.

Thanks,
Albert
Comment 4 dl_support-hpc 2020-05-13 00:33:29 MDT
Created attachment 14213 [details]
slurm.conf
Comment 5 dl_support-hpc 2020-05-13 00:37:57 MDT
Created attachment 14214 [details]
slurm.conf
Comment 6 dl_support-hpc 2020-05-13 00:38:47 MDT
Created attachment 14215 [details]
cgroup.conf
Comment 7 dl_support-hpc 2020-05-13 00:42:34 MDT
Created attachment 14216 [details]
slurmdbd.conf
Comment 8 dl_support-hpc 2020-05-13 00:48:12 MDT
Hi Albert,

As requested

[root@milton-login01 ~]# sacct -j 34256 --long -p
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
34256|34256|bionix-xHLA|regular||||||||||||||||||1|00:29:11|FAILED|1:0||Unknown|Unknown|Unknown|15Gn|0|||||||||||billing=31,cpu=1,mem=15G,node=1|billing=31,cpu=1,mem=15G,node=1||||||||||||||
34256.extern|34256.extern|extern||142576K|milton-sml-002|0|142576K|888K|milton-sml-002|0|888K|0|milton-sml-002|0|0|00:00:00|milton-sml-002|0|00:00:00|1|1|00:29:11|COMPLETED|0:0|2.59M|0|0|0|15Gn|0|0.00M|milton-sml-002|0|0.00M|0|milton-sml-002|0|0||||billing=31,cpu=1,mem=15G,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|energy=0,fs/disk=0|energy=milton-sml-002,fs/disk=milton-sml-002|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
34256.0|34256.0|chroot||372041024K|milton-sml-002|0|372041024K|320389892K|milton-sml-002|0|320389892K|24172|milton-sml-002|0|19|00:12:10|milton-sml-002|0|00:00:00|1|1|00:29:08|OUT_OF_MEMORY|0:125|2.59M|Unknown|Unknown|Unknown|15Gn|0|4161.86M|milton-sml-002|0|4161.86M|4578.07M|milton-sml-002|0|4578.07M||||cpu=1,mem=15G,node=1|cpu=00:00:00,energy=0,fs/disk=4364022996,mem=320389892K,pages=19,vmem=372041024K|cpu=00:12:10,energy=0,fs/disk=4364022996,mem=320389892K,pages=24172,vmem=372041024K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:12:10,energy=0,fs/disk=4364022996,mem=320389892K,pages=24172,vmem=372041024K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=4364022996,mem=320389892K,pages=19,vmem=372041024K|energy=0,fs/disk=4800456113|energy=milton-sml-002,fs/disk=milton-sml-002|fs/disk=0|energy=0,fs/disk=4800456113|energy=0,fs/disk=4800456113|

We have not seen this behaviour before; we do not have many users on the system, so it could have slipped under the radar.

The requested files are attached.

Thanks,
Laszlo
Comment 11 Albert Gil 2020-05-19 07:43:12 MDT
Hi Laszlo,

I'm still not certain what may be going on.
Can you attach the slurmd logs from milton-sml-002 for 2020-05-08?
The slurmctld and slurmdbd logs may also help.

What is the actual physical memory of milton-sml-002?
Is the job actually running a chroot command?

Are you able to reproduce the issue?

Thanks,
Albert

PS: While looking at your config files, I've detected a couple of minor issues totally unrelated to the bug:

- JobAcctGatherParams=NoOverMemoryKill was deprecated in 19.05 (https://github.com/SchedMD/slurm/blob/slurm-19.05/RELEASE_NOTES#L50)
- DefaultStorage* are unnecessary because you're already using AccountingStorage* and JobCompType=jobcomp/none.
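
In case it is useful, a minimal sketch of that cleanup (illustrative only; keep your own values, and note that the AccountingStorageType line is an assumption based on slurmdbd being in use):

# slurm.conf (relevant lines only)
JobCompType=jobcomp/none
AccountingStorageType=accounting_storage/slurmdbd
# Removed: JobAcctGatherParams=NoOverMemoryKill   (deprecated since 19.05)
# Removed: DefaultStorageType=...                 (redundant with the two lines above)
# Removed: DefaultStorageHost=...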
Comment 12 dl_support-hpc 2020-05-28 01:22:24 MDT
Hi Albert,

I do not have the logs from milton-sml-002; this is what I found in the slurmctld log:

[2020-05-08T14:27:33.160] sched: _slurm_rpc_allocate_resources JobId=34256 NodeList=milton-sml-002 usec=296
[2020-05-08T14:27:33.160] prolog_running_decr: Configuration for JobId=34256 is complete
[2020-05-08T14:56:44.586] _job_complete: JobId=34256 WEXITSTATUS 1
[2020-05-08T14:56:44.586] _job_complete: JobId=34256 done

no details in slurmdbd logs

The memory on milton-sml-002

[root@milton-sml-002 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            113           5         105           0           2         106
Swap:             4           0           4

We do not use a chroot environment.

Thanks,
Laszlo
Comment 13 Albert Gil 2020-05-28 05:26:24 MDT
Hi Laszlo,

> I do not have the logs from milton-sml-002; this is what I found in the
> slurmctld log:
> 
> [2020-05-08T14:27:33.160] sched: _slurm_rpc_allocate_resources JobId=34256
> NodeList=milton-sml-002 usec=296
> [2020-05-08T14:27:33.160] prolog_running_decr: Configuration for JobId=34256
> is complete
> [2020-05-08T14:56:44.586] _job_complete: JobId=34256 WEXITSTATUS 1
> [2020-05-08T14:56:44.586] _job_complete: JobId=34256 done
> 
> no details in slurmdbd logs
> 
> The memory on milton-sml-002
> 
> [root@milton-sml-002 ~]# free -g
>               total        used        free      shared  buff/cache  
> available
> Mem:            113           5         105           0           2        
> 106
> Swap:             4           0           4

Thanks for the information.
Please note that the problem seems to lie in the information gathered from the OS of milton-sml-002 while the job was running on it.

You are using jobacct_gather/linux, meaning that the usage information was gathered from /proc.
The job is marked OUT_OF_MEMORY, and since you are using task/cgroup with ConstrainRAMSpace=yes, the step was constrained and killed by cgroups.
This combination works just fine, and is actually the recommended one.
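
For reference, the combination being described corresponds to configuration along these lines (a sketch only; your attached files are the authoritative source):

# slurm.conf
JobAcctGatherType=jobacct_gather/linux   # usage figures are sampled from /proc on the node
TaskPlugin=task/cgroup                   # memory limits are enforced via cgroups

# cgroup.conf
ConstrainRAMSpace=yes                    # the step is killed when it exceeds its memory
                                         # limit, which is why it shows OUT_OF_MEMORY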

Without knowing what was going on on milton-sml-002, and without being able to reproduce it, I'm afraid that we won't be able to provide a clear explanation.

> We do not use a chroot environment.

Ok.
I was just curious because the out-of-memory step (34256.0) has the JobName "chroot", and unless the user sets it manually, the step name defaults to the command name.

If we are not able to reproduce the issue, we should close it as "cannot reproduce".

Regards,
Albert
Comment 14 dl_support-hpc 2020-06-04 21:56:05 MDT
Hi Albert,

Let's close it; if it recurs I will open a new ticket.

Thanks,
Laszlo
Comment 15 Albert Gil 2020-06-08 03:32:30 MDT
Ok Laszlo,

Closing as infogiven.

Regards,
Albert
Comment 16 Albert Gil 2020-06-18 07:13:58 MDT
Hi Laszlo,

I don't know how I missed it, but I just realized that you are not using either of these two JobAcctGatherParams:

NoShared
  Exclude shared memory from accounting.
UsePss
  Use PSS value instead of RSS to calculate real usage of memory. The PSS value will be saved as RSS.

Higher MaxRSS values may be related to a job/step with multiple processes that share some memory: because the accounting is done per PID by default, as /proc reports it, that shared memory may be counted several times.
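
As a rough illustration of that aggregation, this is the kind of check one could run on the node while a multi-process step is active ($PIDS is a placeholder for the step's process IDs, not something Slurm provides under that name):

# Sum per-process RSS the way a per-PID gatherer would (shared pages counted once per process):
for p in $PIDS; do awk '/^VmRSS/ {print $2}' /proc/$p/status; done \
  | awk '{s+=$1} END {print s, "kB total RSS"}'

# Sum PSS instead, which splits shared pages among the processes mapping them:
for p in $PIDS; do awk '/^Pss:/ {s+=$2} END {print s}' /proc/$p/smaps; done \
  | awk '{s+=$1} END {print s, "kB total PSS"}'

With many processes mapping the same large shared region, the first sum can exceed the physical memory of the node even though the real usage does not.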

Maybe you want to use one or both options.
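
For reference, enabling both would be a single slurm.conf line (taking effect after the configuration is redistributed and the daemons reconfigured):

JobAcctGatherParams=NoShared,UsePss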

Regards,
Albert
Comment 17 dl_support-hpc 2020-07-21 01:23:18 MDT
Hi Albert,

I am doing this:

salloc -n 1
salloc: Pending job allocation 172612
salloc: job 172612 queued and waiting for resources
salloc: job 172612 has been allocated resources
salloc: Granted job allocation 172612
bash-4.2$ scontrol show job 172612
JobId=172612 JobName=sh
   UserId=kun.l(3899) GroupId=allstaff(10908) MCS_label=N/A
   Priority=456 Nice=0 Account=wehi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:22 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-21T17:13:42 EligibleTime=2020-07-21T17:13:42
   AccrueTime=2020-07-21T17:13:42
   StartTime=2020-07-21T17:14:04 EndTime=2020-07-23T17:14:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-21T17:14:04
   Partition=regular AllocNode:Sid=slurm-login:23983
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=milton-sml-006
   BatchHost=milton-sml-006
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=10M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/stornext/Home/data/allstaff/k/kun.l
   Power=
   MailUser=(null) MailType=NONE


sacct -o reqmem,maxrss,maxvm,elapsed -j 172612
    ReqMem     MaxRSS  MaxVMSize    Elapsed 
---------- ---------- ---------- ---------- 
      10Mc                         00:03:34 
      10Mc                         00:03:34 
      10Mc                         00:03:31 

bash-4.2$ scontrol show config | grep -i def
AcctGatherNodeFreq      = 0 sec
CpuFreqDef              = Unknown
DefMemPerCPU            = 10

bash-4.2$ scontrol show config | grep -i JobAcctGatherParams
JobAcctGatherParams     = NoShared,UsePss

Why am I not seeing any value for MaxRSS or MaxVMSize?

Thanks,
Laszlo
Comment 18 Albert Gil 2020-07-21 08:33:43 MDT
Hi Laszlo,

> Why am I not seeing any value for MaxRSS or MaxVMSize?

This probably belongs in a new ticket, just to keep each ticket focused on one main issue or question.
But anyway, I'll try to answer it here for now.

The MaxRSS and MaxVMSize are values gathered "on tasks".
So, for example if I do:

>> salloc -n 1
salloc: Granted job allocation 16152
>> sleep 5
>> srun sleep 10
>> sleep 15                                                                                                     
>> srun sleep 20
>> exit                                                                                                         
exit
salloc: Relinquishing job allocation 16152

Then I get:

>> sacct -o jobid,jobname,reqmem,maxrss,maxvm,elapsed -j 16152
       JobID    JobName     ReqMem     MaxRSS  MaxVMSize    Elapsed 
------------ ---------- ---------- ---------- ---------- ---------- 
16152              bash       16Gn                         00:01:26 
16152.0           sleep       16Gn       983K    424864K   00:00:10 
16152.1           sleep       16Gn       995K    424864K   00:00:20 

So you can see that I only get MaxRSS and MaxVMSize for the actual tasks launched with srun (the 10s and 20s sleeps, not the 5s or 15s ones).
We don't gather those values for the job allocation itself either, only for actual steps.

A similar thing happens with sbatch:

>> sbatch -n1 --wrap "sleep 5; srun sleep 10; sleep 15; srun sleep 20"                                                                                         
Submitted batch job 16153

>> sacct -o jobid,jobname,reqmem,maxrss,maxvm,elapsed -j 16153
       JobID    JobName     ReqMem     MaxRSS  MaxVMSize    Elapsed 
------------ ---------- ---------- ---------- ---------- ---------- 
16153              wrap       16Gn                         00:00:50 
16153.batch       batch       16Gn      1159K    356512K   00:00:50 
16153.0           sleep       16Gn       963K    424868K   00:00:10 
16153.1           sleep       16Gn       981K    424868K   00:00:20 

As you can see, in this case the .batch step/task is gathered as well, though.
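
A related note on your interactive example: since only steps are accounted, one way to get MaxRSS/MaxVMSize for interactive work is to run the shell (or the commands themselves) through srun inside the allocation. A minimal sketch, with the job ID and workload as placeholders:

salloc -n 1
srun --pty bash        # the shell now runs as a step and is accounted
# ... run the workload ...
exit                   # leave the step
exit                   # release the allocation
sacct -o jobid,jobname,reqmem,maxrss,maxvm,elapsed -j <jobid>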

Does that answer your question?
Or are you having different results?

If you have further questions on this, please open a new bug referring to these comments so we can track the issue better.

Thanks,
Albert