Ticket 10538 - Conflicting memory limit failure
Summary: Conflicting memory limit failure
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.2
Hardware: Linux
Severity: 2 - High Impact
Assignee: Felip Moll
Duplicates: 10588
 
Reported: 2021-01-04 06:34 MST by Marios Hadjieleftheriou
Modified: 2022-05-30 10:11 MDT
CC List: 5 users

Site: Lion Cave Capital
Version Fixed: 20.11.3


Attachments
enable_debug_jobacctgather.patch (1.15 KB, patch)
2021-01-07 07:20 MST, Felip Moll

Comment 4 Marios Hadjieleftheriou 2021-01-04 08:20:00 MST
> sacct -j 60970517
Slurm accounting storage is disabled

> dmesg -T
dmesg: read kernel buffer failed: Operation not permitted
(I do not have sudo; if really needed please let me know)
Comment 7 Marios Hadjieleftheriou 2021-01-04 09:00:55 MST
None of these errors make any sense to me anymore. I just re-ran one of the jobs with mem-trace and it uses only KBs of memory. Unless I am missing something major, something is wrong with memory accounting in slurm 20. Again, we had no issues at all with slurm 18.
Comment 8 Felip Moll 2021-01-04 10:03:39 MST
That's off-topic, but you may want to remove the deprecated/unrecognized option from slurm.conf:

CryptoType=crypto/openssl

Later in the file you have CredType=cred/none (which leaves your system unsecured, but I assume that is what you want).


---

I will give you more information about the JobAcctGather/linux plugin. This is just informative, but it may give us a lead.

I see you are using proctrack/pgid, jobacct_gather/linux with OverMemoryKill, and only task/affinity; in other words, no cgroups.

So, the way memory is tracked and limits are enforced is through the JobAcctGather plugin. This plugin reads the stats for every task in the job once per JobAcctGatherFreq interval. It runs from slurmstepd and kills a step the moment it finds the limit has been exceeded, but this is not exact, because the limit can be exceeded during the interval between two polls before it is detected.

The JobAcctGather plugin bases its calculations on the /proc/<pid>/stat and /proc/<pid>/statm values for every pid.

- If you use jobacct_gather/linux, MaxRSS will be the value read from the 24th field (rss) of /proc/<pid>/stat, multiplied by the page size (the value is given in pages).
- If you use jobacct_gather/linux with the option UsePss, MaxRSS will be the maximum of Rss and Pss, where Pss is the sum of all Pss values in /proc/<pid>/smaps.
- If you use jobacct_gather/linux with the option NoShared, MaxRSS will be the 2nd field (resident) of /proc/<pid>/statm minus the 3rd field (shared), multiplied by the page size (the values are given in pages).
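For reference, this is a rough shell equivalent of the first computation above (a sketch only; it assumes the comm field in /proc/<pid>/stat contains no spaces, otherwise the field positions shift):

# RSS the way jobacct_gather/linux derives it: stat field 24 (pages) * page size
pid=$$                                    # any pid you want to inspect
pagesize=$(getconf PAGESIZE)
rss_pages=$(awk '{print $24}' "/proc/$pid/stat")
echo "RSS: $((rss_pages * pagesize)) bytes"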

From the manual:
/proc/[pid]/stat
              Status information about the process.
                  (24) rss  %ld
                        Resident Set Size: number of pages the process has
                        in real memory.  This is just the pages which count
                        toward text, data, or stack space.  This does not
                        include pages which have not been demand-loaded in,
                        or which are swapped out.

/proc/[pid]/statm
              Provides information about memory usage, measured in pages.            
                  resident   (2) resident set size (in number of pages).
                             (same as VmRSS in /proc/[pid]/status)

/proc/[pid]/status
              Provides much of the information in /proc/[pid]/stat and
              /proc/[pid]/statm in a format that's easier for humans to
              parse.
                     * VmRSS: Resident set size.
                              (Since Linux 4.5: the value here is the sum of RssAnon, RssFile, and RssShmem.)

                     * RssAnon: Size of resident anonymous memory.  (since Linux
                                4.5).

                     * RssFile: Size of resident file mappings.  (since Linux 4.5).

                     * RssShmem: Size of resident shared memory (includes System V
                       shared memory, mappings from tmpfs(5), and shared anonymous
                       mappings).  (since Linux 4.5).


About the 'UsePss' option: this changes the way used memory is counted from RSS to PSS, so that if the application is heavily threaded the shared memory is not counted once per thread/process, which would otherwise produce an apparently high memory usage. PSS divides the shared memory usage between all the separate processes, so when the values are summed back together you get a more realistic view of the memory consumption.

Finally, the 'NoShared' option can also be useful if you are caching a lot of data, for example when using NFS filesystems.
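To get a feel for what UsePss would report, you can sum the Pss lines of smaps yourself (a sketch; smaps reports these values in KiB):

pid=$$
awk '/^Pss:/ {sum += $2} END {print sum " KiB total PSS"}' "/proc/$pid/smaps"

Enabling the options is just a matter of adding them to JobAcctGatherParams in slurm.conf (e.g. JobAcctGatherParams=UsePss or JobAcctGatherParams=NoShared).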

More information:

We have found in previous bugs (and I am aware it has happened in other resource managers too) that summing per-process values from /proc can inappropriately kill jobs owing to double counting of memory.
This can happen, for example, when a large-memory process fork()s: if the summation occurs between the fork() and the exec(), both processes will appear to be using a large amount of memory even though that is not true, since the pages have not actually been copied yet. The same applies whenever an application relies on copy-on-write pages after a fork.
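A quick way to see this double counting outside of Slurm (a bash-only illustration with a GNU userland, not the Slurm mechanism itself): inflate an interactive bash shell's RSS and compare it with a forked subshell's.

big=$(head -c 50M /dev/zero | tr '\0' 'x')   # ~50 MB now resident in this shell
( ps -o pid=,rss= -p "$$,$BASHPID"; : )      # ':' keeps the subshell alive as bash while ps runs
unset big

Both rows report the full RSS even though the pages are shared copy-on-write, so a naive sum roughly doubles the real usage.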

-----------------

Now I have a few theories:

1. You should use UsePss, since your app is heavily threaded.
2. Your application is doing forks, which leads to double accounting.
3. The value of the .batch step is counting the same thing twice.
4. Consider using cgroups for memory limit enforcement; is that a possibility?

So, can you try setting JobAcctGatherParams=UsePss, check again, and let me know if you see anything different?

Also, what exactly does your 'regression test' do?

> We switched to slurm 20 last night.
20.11.2, right?

Thanks
Comment 11 Marios Hadjieleftheriou 2021-01-04 14:11:04 MST
We were using 20.11.2, but we reverted to 18.08.8. After we reverted, the issue disappeared.

I cannot try UsePss, as we already reverted (v20 was unusable, so we had to do something quickly). We will set up a second slurm installation for testing soon and start experimenting more.

The jobs I am having issues with have input on the order of KBs. Forking or not, threads or not, there is nowhere near enough input to produce 3, 6, 10 or 20 GBs of stuff :) The job is actually a bash script; it does no backgrounding and only spawns the occasional sub-shell. It will have at most two processes running at a time. I would not call it "heavily threaded".

I re-ran the script on the command line and gathered per-second samples of memory usage using "ps -p $pid -o rss,vsz,%mem". The max memory I got was a few KBs, as expected. Whatever slurm v20 is reporting does not make any sense to me. I will keep looking though...
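For reference, a minimal per-second sampling loop of that kind (with $pid standing for the process of interest) might look like:

while kill -0 "$pid" 2>/dev/null; do
    ps -p "$pid" -o rss=,vsz=,%mem=
    sleep 1
done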

My suggestion is to see what has changed between v18 MemLimitEnforce and v20 OverMemoryKill. My impression is that it should be something egregious, unless I am missing something major with respect to the nature of the job (but 10GB is a ridiculously large number for what this job actually is).

Pending questions:
1. What changed between v18 and v20 MemLimitEnforce
2. You called the job "heavily threaded". Is there something in the logs to suggest that? Looking at the script it should never have more than two threads. Should I be looking more carefully?
3. You mentioned that in the log there were *two* steps, one with 3GB memory and another with 6GB memory. I only see one step reporting two different memory sizes, which is confusing. What am I misunderstanding?
Comment 12 Marios Hadjieleftheriou 2021-01-04 14:12:08 MST
Regarding cgroups: once we install a v20 cluster we will try to configure them. Do they need the slurm DB or munge? Because, in the short term, that is not a possibility.
Comment 13 Felip Moll 2021-01-06 09:11:33 MST
(In reply to Marios Hadjieleftheriou from comment #11)
> We were using 20.11.2, but we reverted to 18.08.8. After we reverted the
> issue has disappeared.
> 
> I cannot use UsePSS, as we already reverted (v20 was unusable, so we had to
> do something quickly). We will install a 2nd slurm installation for testing
> soon and start experimenting more.

Good, this second installation will surely help us see what is going on.

> The jobs I am having issues with have input in the order of KBs. Forking or
> not, threads or not, there is nowhere near enough input to produce 3, 6, 10
> or 20 GBs of stuff :) The job is actually a bash script, it has no
> backgrounding and the occasional sub-shell. It will have at most two
> processes running at a time. I would not call it "heavily threaded".

Totally understood. So my theories seem not to apply.

> I re-run the script on the command line and gathered per second samples of
> memory used using "ps -p $pid rss,vsz,%mem". The max memory I got was a few
> KBs as expected. Whatever slurm v20 is reporting does not make any sense to
> me. I will keep looking though...

I didn't expect that, so it must be an error in Slurm adding something more than once.
 
> My suggestion is to see what has changed between v18 MemLimitEnforce and v20
> OverMemoryKill. My impression is that it should be something egregious,
> unless I am missing something major with respect to the nature of the job
> (but 10GB is a ridiculously large number for what this job actually is).

This doesn't really have an impact; I'll instead look at how jobacct_gather/linux interacts with proctrack/pgid.

Note that OverMemoryKill is only the mechanism that kills steps/jobs and thereby exposes the problem; the real issue seems to be that we are counting the used RSS incorrectly. That happens with or without OverMemoryKill.

> Pending questions:
> 1. What changed between v18 and v20 MemLimitEnforce

This parameter was removed in 19.05 and the functionality moved to the OverMemoryKill option of JobAcctGatherParams.
As I said, there were no changes relevant to your case, because the accounting is incorrect whether this parameter is active or not.

I leave this here for your own knowledge:

commit fac06f2d80c09991bba622560641077843d74ca0
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Fri Dec 20 17:04:25 2019 +0100
Commit:     Douglas Wightman <wightman@schedmd.com>
CommitDate: Thu Jan 16 14:49:05 2020 -0700

    Fix deprecation of MemLimitEnforce
    
    There was a mistake in the documentation where the old parameter was still
    documented. It has been fixed. Also if MemLimitEnforce was set in slurm.conf
    it was ignored and one may had the value to 'yes' and see how the memory was
    not enforced at all. With this fix we will give a non-fatal error in slurmd
    and ctld if the parameter appears in the conf, and will add OverMemoryLimit
    to JobAcctGatherParams if it was MemLimitEnforce=yes.
    
    Bug 8258


commit 4ae29f1a2e3177ff20720d61d474718e0892a225
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Mon Jan 14 11:52:17 2019 +0100
Commit:     Felip Moll <felip.moll@schedmd.com>
CommitDate: Mon Jan 14 11:52:17 2019 +0100

    Remove MemLimitEnforce parameter
    
    This patch removes the MemLimitEnforce parameter from Slurm and moves the
    condition to enable the feature to the JobAcctGatherParam=OverMemoryKill.
    Setting OverMemoryKill will now enable the mechanism that was before enabled
    with MemLimitEnforce=yes, so activation of killing jobs or steps due to
    exceeding memory will be done in the same place. Also reconfiguring or
    starting slurmd/ctld with incorrect configuration will cause a fatal error.
    
    Bug 5479


> 2. You called the job "heavily threaded". Is there something in the logs to
> suggest that? Looking at the script it should never have more than two
> threads. Should I be looking more carefully?

No, there is not; it was just a guess. Nevertheless, can you share the script with me, or is it confidential / does it have external dependencies?
We can make the attachment private if it is a concern.

> 3. You mentioned that in the log there were *two* steps, one with 3GB memory
> and another with 6GB memory. I only see one step reporting two different
> memory sizes, which is confusing. What am I misunderstanding?

Sorry, it was only one step but two log entries.
You are not misunderstanding anything; sorry for misleading you.


I will try to reproduce what you are seeing on my testbed and let you know of anything I find.



> Regarding cgorups. Once we install a v20 cluster we will try to configure cgroups. Does it need the slurm DB or munge?

Nope, it doesn't.
The ideal configuration would be:

slurm.conf
-----------
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/affinity,task/cgroup

cgroup.conf
-----------
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no

This assumes /sys/fs/cgroup is working, the memory, freezer, devices and cpuset subsystems are up (check /proc/mounts), and you are using cgroups v1.
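A quick way to check for the cgroup v1 controllers (a sketch; the column test assumes the usual /proc/mounts layout):

awk '$3 == "cgroup" {print $2}' /proc/mounts
# expect mount points ending in memory, freezer, devices and cpuset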
Comment 14 Marios Hadjieleftheriou 2021-01-06 09:28:00 MST
> > Pending questions:
> > 1. What changed between v18 and v20 MemLimitEnforce
> 
> This parameter was removed in 19.05 and the functionality moved to
> OverMemoryLimit.
> There were no relevant changes to your case as I said, because the
> accounting is incorrect be this parameter active or unactive.
> 

I will try to answer the other questions too, but I think this point is the most important and we should dig deeper. We are encountering a problem with v20 that does not happen in v18. If you compare the v18 and v20 source code for how memory accounting is done with jobacct_gather/linux, my intuition tells me you will see something glaringly obvious that has changed.
Comment 15 Marios Hadjieleftheriou 2021-01-06 10:12:46 MST
> 
> > 2. You called the job "heavily threaded". Is there something in the logs to
> > suggest that? Looking at the script it should never have more than two
> > threads. Should I be looking more carefully?
> 
> No, there is not. It was just a guess. Nevertheless, can you share the
> script with me? or is confidential/has external dependencies?
> We can make the attachment private if it is a concern.
> 

It is a 540-line bash script with a ton of external dependencies. I could still provide it if you think it might be useful.

The script prints debug information while running, so I know exactly where it gets killed. It executes a sequence of calls to an R script on a bunch of different files. The total size of the files it operates on is a few KBs (and frankly, I have never seen any of these files exceed a few megabytes).

I took 1s samples of both the script itself and the R script invocations using ps, and neither reported more than a few KBs of memory.

Still, the discrepancy is so big that I should take another look on my end to make sure I am not missing something big. I haven't gotten around to it yet.
Comment 16 Marios Hadjieleftheriou 2021-01-06 10:14:35 MST
> assuming /sys/fs/cgroup is working and you have
> memory,freezer,devices,cpuset subsystems up (check /proc/mounts) and you are
> using cgroups v1.

I am not an admin, but I checked one of our nodes and I do not see /proc/mounts or freezer, cpuset, etc. in there.

I'll ask our admins if they know anything about these, but is there documentation somewhere on how to install them if we need to?
Comment 21 Felip Moll 2021-01-07 07:01:57 MST
I will respond to some of your questions below, then I will post again requesting a new test for you to carry out on your system with Slurm 20.11:

> If you compare the source code between v18
> and v20 regarding how memory accounting is done when using
> jobacct_gather/linux between those two version, my intuition tells me that
> you should see something glaringly obvious that has changed.

Have you seen anything in the code that suggests this?


> The job is actually a bash script, it has no backgrounding and the occasional sub-shell. It will have at most two processes running at a time. I would not call it "heavily threaded".

Given that sentence I thought that the script was simple, but after your comments: 

> It is a 540 lines bash script with a ton of external dependencies. I could
> still provide it if you think it might be useful.

> It is executing a sequence of calls to an R script on a bunch
> of different files.

This makes me think this is more likely to be the cause.
Why it is not happening in 18.08 I don't know; maybe it simply wasn't detected correctly before.

> The total size of the files it operates on are a few KBs
> (and frankly, I have never seen any of these files exceed a few megabytes).

Are the files you are opening in a remote filesystem like NFS?


> I took 1s samples of both the script itself and the R script invocations
> using ps, and neither reported more than a few KBs of memory.

The batch step's memory consumption adds up all of the children's memory and can be significantly bigger than the limit before it is caught by jobacctgather.


> > assuming /sys/fs/cgroup is working and you have
> > memory,freezer,devices,cpuset subsystems up (check /proc/mounts) and you are
> > using cgroups v1.
> 
> I am not an admin, but I checked one of our nodes and I do not see
> /proc/mounts or freezer, cpuset, etc. in there.
> 
> I 'll ask our admins if they know anything about these, but is there
> documentation somewhere on how to install them if we need to?

/proc/mounts is a file that lists which filesystems are mounted; you need to run 'cat /proc/mounts'.
You will see the list of cgroup controllers available on your system, which should include freezer, cpuset, memory, etc.
You need to ask your sysadmins whether cgroups v1 are available on your system.
They probably are, because cgroups come with the kernel; they are not some "package" you can install, but part of the kernel.
Comment 22 Felip Moll 2021-01-07 07:20:22 MST
Created attachment 17373 [details]
enable_debug_jobacctgather.patch

I need you to do the following test.

1. Apply the patch I uploaded here. It just sets the _DEBUG constant to 1 in a couple of files.

2. Compile and install Slurm.

3. Run a job like this:

sbatch --mem=30M --wrap "srun sleep 1000"

4. Look at the slurmd log on the node where the job landed; you should see something similar to this:

Lines for .batch step:
-----------------------
Note here that jag_common_poll_data and _aggregate_prec record stats for 3 pids: 200335, 200332 and 200330.

[2021-01-07T15:05:31.720] [7262.batch] debug2: profile signaling type Task
[2021-01-07T15:05:31.725] [7262.batch] jobacct_gather/linux: jag_common_poll_data: pid:200330 ppid:200326 rss:3207168 B
[2021-01-07T15:05:31.725] [7262.batch] jobacct_gather/linux: _aggregate_prec: pid:200332 ppid:200330 rss:8269824 B
[2021-01-07T15:05:31.725] [7262.batch] jobacct_gather/linux: _aggregate_prec: pid:200335 ppid:200332 rss:798720 B
[2021-01-07T15:05:31.725] [7262.batch] debug2: jobacct_gather/linux: jag_common_poll_data: energycounted = 0
[2021-01-07T15:05:31.725] [7262.batch] debug2: jobacct_gather/linux: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0
[2021-01-07T15:05:31.725] [7262.batch] debug:  jobacct_gather/linux: jag_common_poll_data: jag_common_poll_data: Task 0 pid 200330 ave_freq = 2891 mem size/max 12275712/12275712 vmem size/max 532193280/532193280, disk read size/max (54603/54603), disk write size/max (2567/2567), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2021-01-07T15:05:31.725] [7262.batch] debug:  StepId=7262.batch memory used:12275712 limit:31457280 B

If I do a ps for these pids I get:

]# ps -o ppid,pid,vsz,rss,command -p 200335 200332 200330
   PPID     PID    VSZ   RSS COMMAND
 200326  200330 216732  3132 /bin/sh /home/user/slurm/20.11/inst/spool/gamba1/job07262/slurm_script
 200330  200332 286328  8076 srun sleep 1000
 200332  200335  16660   780 srun sleep 1000

The first pid, 200330, is the bash script.
The second pid, 200332, is the srun that launched the step for the command sleep 1000.
The third pid, 200335, is the command-line srun that created the srun job.

That's the structure of the job, just for your information (pstree -psla):

  |   |-slurmstepd,200326                   
  |   |   |-slurm_script,200330 /home/lipi/slurm/20.11/inst/spool/gamba1/job07262/slurm_script
  |   |   |   `-srun,200332 sleep 1000
  |   |   |       |-srun,200335 sleep 1000
  |   |   |       |-{srun},200336
  |   |   |       |-{srun},200337
  |   |   |       |-{srun},200338
  |   |   |       `-{srun},200339


A) Now, going back to these lines:
[2021-01-07T15:05:31.725] [7262.batch] jobacct_gather/linux: jag_common_poll_data: pid:200330 ppid:200326 rss:3207168 B
[2021-01-07T15:05:31.725] [7262.batch] jobacct_gather/linux: _aggregate_prec: pid:200332 ppid:200330 rss:8269824 B
[2021-01-07T15:05:31.725] [7262.batch] jobacct_gather/linux: _aggregate_prec: pid:200335 ppid:200332 rss:798720 B

798720+8269824+3207168 = 12275712 Bytes

B) That result matches this line:

[2021-01-07T15:05:31.725] [7262.batch] debug:  jobacct_gather/linux: jag_common_poll_data: jag_common_poll_data: Task 0 pid 200330 ave_freq = 2891 mem size/max 12275712/12275712

C) Now, I go back to the output of ps and sum the 3 RSS values:

3132+8076+780 = 11988KiB * 1024 = 12275712 Bytes

As you can see, the ps output matches jobacct_gather/linux for the batch step.
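The same arithmetic can be done in one line, using the pids from this example (ps reports rss in KiB):

ps -o rss= -p 200330,200332,200335 | awk '{sum += $1} END {print sum * 1024, "bytes"}'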

------------------------------

The same would apply for .step 0:

[2021-01-07T15:05:31.747] [7262.0] debug2: profile signaling type Task
[2021-01-07T15:05:31.753] [7262.0] jobacct_gather/linux: jag_common_poll_data: pid:200349 ppid:200343 rss:536576 B
[2021-01-07T15:05:31.753] [7262.0] debug2: jobacct_gather/linux: jag_common_poll_data: energycounted = 0
[2021-01-07T15:05:31.753] [7262.0] debug2: jobacct_gather/linux: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0
[2021-01-07T15:05:31.753] [7262.0] debug:  jobacct_gather/linux: jag_common_poll_data: jag_common_poll_data: Task 0 pid 200349 ave_freq = 3745 mem size/max 536576/2736128 vmem size/max 220463104/358502400, disk read size/max (17069/17069), disk write size/max (1862/1862), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2021-01-07T15:05:31.753] [7262.0] debug:  StepId=7262.0 memory used:536576 limit:31457280 B


  |   |-slurmstepd,200343                       
  |   |   |-sleep,200349 1000
  |   |   |-{slurmstepd},200344
  |   |   |-{slurmstepd},200345
  |   |   |-{slurmstepd},200346
  |   |   |-{slurmstepd},200347
  |   |   `-{slurmstepd},200348

5. Once you have verified that this simple job works as expected, run your bash script with R and check the same thing. If processes are ending/dying very fast it will be difficult for you to do the maths; you can send the logs here and I can help you interpret them too.

Another aspect to consider is the virtual or shared memory your processes actually use. This can grow if the scripts/processes/files reside on a shared filesystem, where such pages may count toward RSS; in that case UsePss and maybe NoShared would really be needed. I have had problems with R scripts in the past.

Let me know about this experiment; it will then also be easier for me to see the difference between 18.08 and a more recent version.
Comment 23 Marios Hadjieleftheriou 2021-01-07 07:47:13 MST
> Have you seen anything in the code that suggests this?
> 
I have not looked at the code; I wouldn't know where to start. Also, we pay for support so that we do not have to spend valuable cycles looking at code that we don't have to :)


> 
> Are the files you are opening in a remote filesystem like NFS?
> 
They are generated locally and stored locally. 


> 
> The batch step memory consumption adds all the child memory and can be
> significantly bigger than the limit until it is caught by jobacctgather.
> 
The script is 540 lines of mostly bash code aggregating data. It is large, but it is not complicated. We are calling various other executables but the aggregate size of the files processed is 40MBs.

I profiled the script itself and it used 3.9MBs of memory. I profiled the child invocations where the script was killed, and each invocation used no more than a few KBs of memory.

Now, if slurm v20 aggregates the memory of all children that have already finished and released that memory, and considers that memory as still used, then that is a slurm bug. Although, even in that case, I doubt we are anywhere close to 3GBs. I'll take another look, just in case.

Additional evidence:
1. In the logs, I see jobs killed with slurm reporting 20GBs of memory. This is an impossibility given the nature of this job.
2. You did not comment on this bizarre slurm behavior:
==> 60970508_1.err <==
exit_code: 0
slurmstepd: error: StepId=60970509.batch exceeded memory limit (5952811008 > 3145728000), being killed
slurmstepd: error: Exceeded job memory limit

My script exits with success ("exit code: 0" is the last thing it prints). Then slurm kills the step. Notice that after the script exits with "exit code 0", it is not consuming any memory anymore. Why does slurm think that it is using 6GB?

> 
> /proc/mounts is a file which says which filesystems are mounted. You need to
> 'cat /proc/mounts'.

Yes, I see them in /proc/mounts. I also asked our admins and they verified that we have cgroups.
Comment 24 Marios Hadjieleftheriou 2021-01-07 08:04:34 MST
I will try to do this soon.

Just to re-iterate:
1. The total input is 40MBs
2. I profiled *all* R invocations using 1 sec samples of 'ps -p $pid -o rss,vsz,%mem,%cpu'. None of them reported more than a few KBs of memory.


Question:
Could slurm be killing the job due to an earlier part of the script that has already finished and released the memory?
Comment 25 Felip Moll 2021-01-07 08:38:44 MST
(In reply to Marios Hadjieleftheriou from comment #23)
> > Have you seen anything in the code that suggests this?
> > 
> I have not looked at the code. I wouldn't know where to start. Also, we pay
> support so that we do not have to spend valuable cycles looking at code that
> we don't have to :)

Indeed, I don't expect you to do so, but from your text I understood that maybe you had seen something.

> > 
> > Are the files you are opening in a remote filesystem like NFS?
> > 
> They are generated locally and stored locally. 

Okay.


> Now, if slurm v20 aggregates memory of all children that have already
> finished and have released that memory, and considers that memory as still
> used, then that is a slurm bug. Although, even in that case I doubt we are
> anywhere close to 3GBs. I 'll take another look, just in case.

That shouldn't be possible, which is why I asked for the experiment.
Slurm aggregates data by reading the list of PIDs belonging to the step every JobAcctGatherFrequency interval. Every iteration re-reads everything with fresh data, so if a pid is dead it is no longer in /proc and we no longer get its stats.
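As a mental model only (a shell sketch, not the plugin source; $STEPD_PID is a hypothetical stand-in for the step's slurmstepd pid, and the real plugin walks the whole process tree rather than just direct children), one polling pass is roughly:

pagesize=$(getconf PAGESIZE)
total=0
for pid in $(ps -o pid= --ppid "$STEPD_PID"); do
    # a pid that has exited has no /proc entry, so it drops out of the sum
    rss=$(awk '{print $24}' "/proc/$pid/stat" 2>/dev/null) || continue
    total=$((total + rss * pagesize))
done
echo "step RSS this interval: $total bytes"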

> Additional evidence:
> 1. In the logs, I see jobs killed with slurm reporting 20GBs of memory. This
> is an impossibility given the nature of this job.
> 2. You did not comment on this bizarre slurm behavior:
> ==> 60970508_1.err <==
> exit_code: 0
> slurmstepd: error: StepId=60970509.batch exceeded memory limit (5952811008 >
> 3145728000), being killed
> slurmstepd: error: Exceeded job memory limit
> 
> My script exits with success ("exit code: 0" is the last thing it prints).
> Then slurm kills the step. Notice that after the script exits with "exit
> code 0", it is not consuming any memory anymore. Why does slurm think that
> it is using 6GB?

It may happen that jobacctgather, which runs from slurmstepd, is still alive even though your process is dead or disappearing. It may still hold the list of pids and do the calculation at that exact moment.


But, as I said, the next step is to do the experiment; then I will have more information and know where to investigate.

Thanks
Comment 26 Marios Hadjieleftheriou 2021-01-07 08:48:34 MST
We are installing v20 right now, using cgroups. Let's see what that reports first.
Comment 27 Marios Hadjieleftheriou 2021-01-07 08:51:45 MST
> 
> It may happen that jobacctgather which is run from slurmstepd is still alive
> even if your process is dead or disappearing. It may have the list of pids
> and do the calculation at that exact time.
> 
> 

If you look at the log carefully you will see that it did that for 14 different jobs in a row. How likely is that?
Comment 28 Marios Hadjieleftheriou 2021-01-07 08:57:22 MST
We are trying to install cgroups and getting this error:
[2021-01-07T10:37:47.866] fatal: Jobs memory is being constrained by both TaskPlugin cgroup and JobAcctGather plugin. This enables two incompatible memory enforcement mechanisms, one of them must be disabled.

What do we need to change in the config (attached)?
Comment 30 Felip Moll 2021-01-07 09:21:07 MST
(In reply to Marios Hadjieleftheriou from comment #28)
> We are trying to install cgroups and getting this error:
> [2021-01-07T10:37:47.866] fatal: Jobs memory is being constrained by both
> TaskPlugin cgroup and JobAcctGather plugin. This enables two incompatible
> memory enforcement mechanisms, one of them must be disabled.
> 
> What do we need to change to the config (attached).

Remove OverMemoryKill from slurm.conf and enable ConstrainRAMSpace=yes in cgroup.conf, so you are constraining with cgroups and not with jobacct_gather/linux.
Comment 31 Marios Hadjieleftheriou 2021-01-07 10:31:33 MST
Quick update: I re-ran a job that was killed by v20 "linux", five times in a row, this time using v20 with cgroups, and it finished without issues. I left the default memory limit of 2GBs for this (and the job should not be anywhere close to that to begin with).

I will revert to "linux" now, re-run in the new installation, and see if I can reproduce the problem on this set of nodes.
Comment 35 Marios Hadjieleftheriou 2021-01-08 07:50:34 MST
I ran the sleep command and everything seems fine, as you already know.

But if you look at the log I sent yesterday, you can see that the gatherer is summing up the memory of child processes that have already terminated and released their memory. Notice that no single child pid consumes more than a few MBs of memory, but in aggregate they do.

The issue seems to be that the gatherer does not subtract the memory of child pids that have already terminated.
Comment 36 Marios Hadjieleftheriou 2021-01-08 08:06:22 MST
I guess I should re-run my process, capture pstree at 1-second intervals, and see which child processes are active at any given moment in time, just to make sure we are not really launching all these processes at the same time.
Comment 39 Marios Hadjieleftheriou 2021-01-08 12:30:47 MST
> 
> Notice that the job is killed *twice*. The first time it is reported as
> having 1272602624 bytes and the second time 2304163840. Weird. If I add up
> all the numbers I get: 1031561216. So neither number is correct.
> 

This is incorrect, as I forgot to add the rss of the parent. When I add the rss of the parent, the memory reported by slurm in both instances is correct. Notice though that in the second message, the rss of the parent is the sum of the rss of all children from the previous iteration.

I believe that this is the bug! The rss of the parent accumulates the rss of all children through time.
Comment 40 Marios Hadjieleftheriou 2021-01-08 13:15:57 MST
I am still looking at the gatherer code, and I have to say I do not see anything wrong there.

I'll see if I can call pstree continuously while this is running to try to capture all those missing pids and see what they are...
Comment 41 Marios Hadjieleftheriou 2021-01-08 13:30:17 MST
I think I am getting closer. The list of pids the gatherer is iterating over is stale. It contains child pids that finished and died long ago.

How/when is that list updated? Please take a look at that code, for differences between v18 and v20.

What I did was run pstree on the ppid in a while loop. Then I looked at all the pids reported by slurm when the ppid died and matched them against the pstree output. It is crystal clear from that output that the pids slurm is working with are completely irrelevant at that point.
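A loop of that kind (with $ppid being the slurm_script pid) might look like:

while kill -0 "$ppid" 2>/dev/null; do
    date '+%T'; pstree -p "$ppid"
    sleep 1
done >> pstree.log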

I also believe that the memory reported by slurm for those dead pids at that point is dubious at best.
Comment 43 Felip Moll 2021-01-11 02:59:43 MST
Hi Marios, I am out until this afternoon; I will respond properly asap. Can you try reducing the JobAcctGatherFrequency value and check whether it changes anything? This should gather the values more often, and thus also update the list of pids more often.

I am also wondering whether processes remain in D state for a while before dying; let me look into this more.
Comment 44 Marios Hadjieleftheriou 2021-01-11 05:56:16 MST
I will try it.

I saw processes like "cat" that ran a minute and a half earlier still being accounted for by the gatherer. cat cannot be in D state, as the script has moved on and has done further processing on that data (the script concatenates a number of files into one, then operates on that one file). Additionally, cat had supposedly consumed 200MBs of memory, which I do not think is possible, as cat itself does not buffer.

Did you notice any differences in how task_list is populated between v18 and v20? Where is that code? I briefly looked for it, but I could not find it.

Comment 45 Felip Moll 2021-01-11 08:56:29 MST
> I saw processes like "cat" that occurred a minute and a half ago being 
> accounted for by the gatherer. cat cannot be in D state, as the script 
> has moved on and has done further processing on that data (the script 
> concatenates a number of files files into one, then operates on that one 
> file). Additionally, cat had supposedly consumed 200MBs of memory, which 
> I do not think is possible as cat itself does not buffer.

Okay

> Did you notice any differences in how task_list is populated between v18 
> and v20? Where is that code? I briefly looked for it, but I could not 
> find it.

Relevant code is in src/plugins/jobacct_gather/common/common_jag.c:jag_common_poll_data()

I am now investigating one change made in bug 9555 that could be related.

commit d86339dc27ed1b95905eb4ee450220fa8e40fc8b

    Making prec_list global to track completed processes.
    
    This makes it so if a script has multiple serial processes in it
    jobacct_gather/linux will not loose track of the processes as they
    finish.
    
    Bug 9555


Will let you know when I have something more.
Comment 46 Marios Hadjieleftheriou 2021-01-11 10:26:03 MST
Code from v20 common_jag.c that does not exist in v18:
    /*
     * We can't use the prec from the list as we need to keep it in
     * the original state without offspring since we reuse this list
     * keeping around precs after they end.
     */
    memcpy(&tmp_prec, prec, sizeof(*prec));
    prec = &tmp_prec;


Comment 47 Felip Moll 2021-01-12 05:02:48 MST
Marios,

According to your logs and explanations, it is also very clear to me that we have an issue. I am working on reproducing it at the moment so I can pin down the exact cause. I am highly suspicious of the change from the bug I mentioned.
Comment 48 Felip Moll 2021-01-12 06:01:35 MST
Marios, I reproduced the issue and can mostly confirm this is a regression introduced in bug 9555.


My reproducer looks like this:

]$ cat run-serial.sh
#!/bin/bash
#SBATCH --job-name=serial
#SBATCH --output=/tmp/sl_%j.out
#SBATCH --error=/tmp/sl_%j.err
#SBATCH --partition=debug
#SBATCH --mem=100M
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

cd /home/lipi/git/memeat
./memeat 10M &
PID1=$!
./memeat 10M &
PID2=$!
./memeat 10M &
PID3=$!
./memeat 10M &

sleep 10
kill -9 $PID1
sleep 10
kill -9 $PID2
sleep 10
kill -9 $PID3
sleep 10

echo ALL DONE

sleep 1000000



]$ sbatch run-serial.sh


Even though my memeat pids are dead after 40 seconds, jobacct_gather/linux (which I set to JobAcctGatherFrequency=3) keeps accounting for all of the pids, even the ones that are gone. So, to conclude, the issue affects commands launched directly inside the batch script.
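For reference, memeat is a local helper that simply allocates the requested amount of memory and holds it until killed; it is not part of Slurm. A rough bash stand-in (GNU userland assumed), if you want to reproduce without it:

#!/bin/bash
# memeat.sh <size> - hold roughly <size> of memory resident until this script is killed
buf=$(head -c "${1:-10M}" /dev/zero | tr '\0' 'x')
sleep infinity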


I will work on a solution and let you know about it.
Comment 49 Felip Moll 2021-01-12 08:43:48 MST
This is confirmed.

Reverting commit d86339dc27ed1b95 makes the issue disappear.

Stay tuned.
Comment 50 Nate Rini 2021-01-12 14:15:30 MST
*** Ticket 10588 has been marked as a duplicate of this ticket. ***
Comment 58 Jason Booth 2021-01-13 10:33:24 MST
*** Ticket 10588 has been marked as a duplicate of this ticket. ***
Comment 60 Nate Rini 2021-01-13 10:48:03 MST
(In reply to Marios Hadjieleftheriou from comment #59)
> Can you please mark as private (or edit out) all messages with *****?

Done. We can't really edit messages (sans a database change) with Bugzilla so they are now marked private.
Comment 64 Marios Hadjieleftheriou 2021-01-15 08:59:05 MST
The bug seems to be private again (I could not access it when not logged in). You might want to make it public.
Comment 68 Felip Moll 2021-01-15 09:32:28 MST
Hi Marios,

Finally, we committed a fix which will be in 20.11.3+, commits 1ba2875649272..8b68aff3dcb7d.

The issue was due to a change in the way we store the list of process statistics (precs) for a step. We no longer remove the precs, because we need some values for each pid at the end of the job in order to accumulate system and user CPU times. The problem was that memory was also being summed for all dead pids, and memory, along with the rest of the TRES, requires different treatment than consumed CPU seconds.

We now invalidate the TRES values (including memory) for past precs.

This release (20.11.3) should be coming out soon (around next week). 

Thank you for reporting.
Comment 69 Marios Hadjieleftheriou 2021-01-15 09:34:38 MST
Thank you!
Comment 70 Marios Hadjieleftheriou 2021-04-14 19:47:27 MDT
We just upgraded from 18.08 to 20.11.5. It's been about a week, and we seem to be having memory issues again (even weirder stuff) since the day we did the install.

Has anyone else complained about jobs being killed due to memory limits being exceeded?

Another issue we are having is that sometimes jobs get killed due to memory limits, and sometimes they end up swapping even though there is plenty of memory available on the machine. I know this sounds crazy, but all of it started the night we installed slurm 20.11.5, and we didn't change anything else. Even crazier, we run exactly the same job on the command line and it finishes without issue (no swapping; virtual and resident memory identical to the runs that swap for no reason). Yet, running via sbatch consistently either causes it to swap or gets it killed for exceeding the memory limit.

I should point out that the memory usage of the process is nowhere near the memory limit or the total memory of the machine (e.g., we are running a 30GB job with a 150GB memory limit on a 1TB machine). It really doesn't make any sense. But my intuition tells me that if we downgrade to slurm 18.08 the problem will go away immediately.
Comment 71 Jason Booth 2021-04-14 19:54:47 MDT
Please open a new bug for this and have us look into it again. When logging the new bug, please reference this one.