Bug 8656 - cgroup account gather
Summary: cgroup account gather
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 19.05.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Nate Rini
QA Contact:
URL:
Duplicates: 6518 8763 10429
Depends on: 9202
Blocks:
Reported: 2020-03-11 01:20 MDT by CSC sysadmins
Modified: 2024-02-14 12:41 MST
CC List: 5 users

See Also:
Site: CSC - IT Center for Science
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11.3,21.08pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm config files (30.00 KB, application/x-tar)
2020-03-12 00:04 MDT, CSC sysadmins
Details
slurmd debug5 logs (127.07 KB, text/x-log)
2020-03-20 03:21 MDT, CSC sysadmins
Details
Slurmd debug log (71.73 KB, text/x-log)
2020-03-30 06:41 MDT, CSC sysadmins
Details
Slurmd log with latest patch applied (8.29 MB, text/x-log)
2020-04-09 02:43 MDT, CSC sysadmins
Details
patch for CSC (1.69 KB, patch)
2020-04-13 14:59 MDT, Nate Rini
Details | Diff
slurmd debug5 logs (48.03 KB, text/x-log)
2020-04-15 03:17 MDT, CSC sysadmins
Details
patch for 19.05 (1.86 KB, patch)
2020-04-15 13:41 MDT, Nate Rini
Details | Diff
debug logs for job 2037852 (48.72 KB, text/x-log)
2020-04-16 04:39 MDT, CSC sysadmins
Details
debug patch for CSC (1.09 KB, patch)
2020-04-17 11:22 MDT, Nate Rini
Details | Diff
debug logs with latest two patch applied (77.74 KB, text/x-log)
2020-04-20 06:22 MDT, CSC sysadmins
Details
patch for 19.05 (3.36 KB, patch)
2020-04-20 09:02 MDT, Nate Rini
Details | Diff
debug logs with patch 13876 (8.29 MB, text/x-log)
2020-04-21 03:32 MDT, CSC sysadmins
Details

Description CSC sysadmins 2020-03-11 01:20:02 MDT
Hi,

This is a duplicate of the bug below, which was reported without a support contract:

https://bugs.schedmd.com/show_bug.cgi?id=6518

We do not use memory cgroups; OverMemoryKill is in use.

This issue is quite harmful for some of our users; it makes it difficult to run e.g. Turbomole NumForce and other, usually commercial, applications.

It's really strange: the 2GB allocation is eventually multiplied by 3x

[2020-02-27T15:39:29.352] [1478441.extern] error: Step 1478441.4294967295 exceeded memory limit (4295303168 > 4294967296), being killed
[2020-02-27T15:39:29.927] [1478441.extern] error: Step 1478441.4294967295 exceeded memory limit (6442954752 > 4294967296), being killed
[2020-02-27T15:39:39.350] [1478441.extern] error: Step 1478441.4294967295 exceeded memory limit (6455267328 > 4294967296), being killed
[2020-02-27T15:39:49.355] [1478441.extern] error: Step 1478441.4294967295 exceeded memory limit (6445830144 > 4294967296), being killed
Comment 1 Nate Rini 2020-03-11 10:04:09 MDT
(In reply to Tommi Tervo from comment #0)
> We do not use memory cgroups, OverMemoryKill is in use
Please attach your slurm.conf and cgroup.conf
 
> This issue is quite harmful for some of our users, makes difficult to run
> e.g. Turbomole NumForce and other usually commercial applications.
Please call the following:
> sacct -j 1478441 -a -p
Comment 2 CSC sysadmins 2020-03-12 00:04:23 MDT
Created attachment 13355 [details]
Slurm config files

sacct -j 1478441 -a -p
JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode|
1478441|bash|interactive|project_2001657|1|CANCELLED by 10007132|0:0|
1478441.extern|extern||project_2001657|1|COMPLETED|0:0|

It's really hard from the user's point of view to see why the job died:

# seff 1478441
Job ID: 1478441
Cluster: puhti
User/Group: tervotom/tervotom
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:01:01
CPU Efficiency: 42.36% of 00:02:24 core-walltime
Job Wall-clock time: 00:02:24
Memory Utilized: 2.00 GB
Memory Efficiency: 50.10% of 4.00 GB
Comment 3 Nate Rini 2020-03-12 13:58:04 MDT
(In reply to Tommi Tervo from comment #2)
> Created attachment 13355 [details]
> Slurm config files
Can you please also attach 'cgroup.conf'.
Comment 4 Nate Rini 2020-03-12 14:38:04 MDT
(In reply to Nate Rini from comment #3)
> (In reply to Tommi Tervo from comment #2)
> > Created attachment 13355 [details]
> > Slurm config files
Can you please also attach 'cgroup.conf'.

(In reply to Tommi Tervo from comment #2)
> Created attachment 13355 [details]
> Slurm config files
> 
> sacct -j 1478441 -a -p
> JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode|
> 1478441|bash|interactive|project_2001657|1|CANCELLED by 10007132|0:0|
> 1478441.extern|extern||project_2001657|1|COMPLETED|0:0|
> 
> It's really hard from user PoW to see why job died:
Did the interactive session trigger the OOM event?
Comment 5 CSC sysadmins 2020-03-13 00:08:59 MDT
> Did the interactive session trigger the OOM event?

Yes. I ran the test case from the original bug https://bugs.schedmd.com/show_bug.cgi?id=6518 

[2020-02-27T15:39:49.355] [1478441.extern] error: Step 1478441.4294967295 exceeded memory limit (6445830144 > 4294967296), being killed


cat /etc/slurm/cgroup.conf
# File managed by puppet in module bullxbm-slurm

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
TaskAffinity=yes
ConstrainRAMSpace=no
Comment 6 Nate Rini 2020-03-13 11:28:31 MDT
*** Bug 6518 has been marked as a duplicate of this bug. ***
Comment 7 Nate Rini 2020-03-13 11:32:35 MDT
(In reply to Tommi Tervo from comment #5)
> > Did the interactive session trigger the OOM event?
> 
> Yes. I did ran test case from the original bug
> https://bugs.schedmd.com/show_bug.cgi?id=6518 
> 
> [2020-02-27T15:39:49.355] [1478441.extern] error: Step 1478441.4294967295
> exceeded memory limit (6445830144 > 4294967296), being killed

In your slurm.conf:
> #PrologFlags=Alloc #disabled by CSC, need to test/think in case of beeond etc.
Can you please call this:
> scontrol show config | grep PrologFlags

This looks like PrologFlags=Alloc,Contain is active and the job memory usage on 1478441.extern is counting towards the job. I will also test locally to see if I can replicate the issue, since I now have your full config.
Comment 9 CSC sysadmins 2020-03-16 10:03:41 MDT
> Can you please call this:
> > scontrol show config | grep PrologFlags
> 
> This looks like PrologFlags=Alloc,Contain is active and the job memory usage
> on the 1478441.extern is counting towards the job. I will also test with
> locally to see if I can replicate the issue since I now have your full
> config.

Hi,

scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain

-Tommi
Comment 10 Nate Rini 2020-03-16 12:57:33 MDT
I assume you have 4edf4a5898a (from bug#8140) applied to your Slurm install.

(In reply to Tommi Tervo from comment #9)
> PrologFlags             = Alloc,Contain
Thanks for confirming that.

(In reply to SafranTech from bug#6518 comment#0)
> It appears a orphaned process is added after each termination of a tracked
> process. And so the memory associated to the orphaned task increments the
> total memory of the job step (here step_extern). This leads to the job step
> being killed by slurm plugin.

This is working as documented with pam_slurm_adopt (with PrologFlags=contain). The objective of using contain is to avoid "orphaned tasks": we use cgroups to catch these processes and their children, and there is no (known) way for users to escape their cgroups without root privileges. Using pam_slurm_adopt captures the new processes created via SSH that would otherwise be orphans unknown to Slurm. pam_slurm_adopt is covered in more detail here: https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf (page 38).
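
For reference, the combination being discussed looks roughly like the sketch below (CSC's actual files are in attachment 13355; the exact PAM stack layout is site-specific):
> # slurm.conf
> PrologFlags=Alloc,Contain
> # /etc/pam.d/sshd on the compute nodes (account stack)
> account    required    pam_slurm_adopt.so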

Looking at reproducer.txt from bug#6518 comment#0:
> [dt921770@rosetta-login01 ~]$ salloc -n1 -t 1:00:00 --mem=4G --qos=wsqlimits -p ws.q -w rosetta-ws16

Job allocated with memory limit.
> [dt921770@rosetta-login01 ~]$ ssh rosetta-ws16 

Job uses SSH to enter the node. pam_slurm_adopt is correctly detecting this ssh session as part of the job and is placing all of the child processes into the extern step. The extern step is subject to all of the job restrictions, including the "--mem=4G" limit.

> [dt921770@rosetta-ws16 ~]$ ./mem.x 2048 &
> [dt921770@rosetta-login01 ~]$ salloc: Exceeded job memory limit
> salloc: Exceeded job memory limit

A test application uses memory over the limit and is getting killed per
> JobAcctGatherParams=UsePss,OverMemoryKill

in the slurm.conf (as given).

(In reply to Tommi Tervo from comment #0)
> This issue is quite harmful for some of our users, makes difficult to run
> e.g. Turbomole NumForce and other usually commercial applications.

I will note that the sacct output could report an exit code of 253 to indicate that the job was killed for an out-of-memory error. This would likely be considered an RFE.
 

(In reply to Tommi Tervo from comment #5)
> Yes. I did ran test case from the original bug

I assume you ran the same mem.f90 from bug#6518.

> It's really strange: the 2GB allocation is eventually multiplied by 3x

It will allocate at least the amount requested (with "UsePSS") per process. It does an allocation and then proceeds to touch all the pages, which slowly builds up the dirty page count. I don't see any control over packing of the "x" array, so I can't comment on how efficient the memory allocation is either.

From the fortran code:
>  real(kind=8), allocatable, dimension(:) :: x

Which uses 8-byte reals.

They allocate:
> x(1024*1024/8 * 2048)

Assuming my math is correct, that is 1024*1024/8 * 2048 = 268,435,456 elements of 8 bytes each, i.e. at least about 2 GiB of memory. Based on how the job is written, it is not too surprising that the size increases over time per your log. As you are not enforcing memory limits by cgroup, the process can take as much memory as the kernel will give it while slurmstepd tries to KILL it. ConstrainRAMSpace=yes (and ConstrainSwapSpace=yes) would cause the cgroup to kill the task before allowing it even 1 byte over the limit.
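
Spelling that arithmetic out (a quick check in shell, not part of the reproducer):
> echo $((1024*1024/8 * 2048))      # 268435456 elements
> echo $((1024*1024/8 * 2048 * 8))  # 2147483648 bytes, i.e. 2 GiB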

Can you please run the same test but with a program that has predictable memory usage? Most Linux distros will come with memtester (https://github.com/jnavila/memtester).
Comment 11 CSC sysadmins 2020-03-17 06:17:22 MDT
(In reply to Nate Rini from comment #10)
> I assume you have 4edf4a5898a (from bug#8140) applied to your Slurm install.

Yes

> Can you please run the same test but with a program that has predictable
> memory usage? Most Linux distros will come with memtester
> (https://github.com/jnavila/memtester).

I tried memhog (it comes from the numactl rpm); the behavior is even worse:

srun --account=project_2001657 -p test   -n1 --mem=4g --time=10:00 --pty $SHELL
ssh node
memhog -r100 2g  # no need to put it in the background

[2020-03-17T14:07:39.028] [1666025.extern] error: Step 1666025.4294967295 exceeded memory limit (4295229440 > 4294967296), being killed
[2020-03-17T14:09:27.118] [1666032.extern] error: Step 1666032.4294967295 exceeded memory limit (4297195520 > 4294967296), being killed
[2020-03-17T14:10:03.713] [1666033.extern] error: Step 1666033.4294967295 exceeded memory limit (4297211904 > 4294967296), being killed

(tried multiple times, seff from the latest run)

seff 1666033
Job ID: 1666033
Cluster: puhti
User/Group: tervotom/tervotom
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:06
CPU Efficiency: 2.27% of 00:04:24 core-walltime
Job Wall-clock time: 00:04:24
Memory Utilized: 2.00 GB
Memory Efficiency: 50.03% of 4.00 GB
Job consumed 0.10 CSC billing units based on following used resources
CPU BU: 0.07
Mem BU: 0.03
Comment 12 Nate Rini 2020-03-17 16:57:23 MDT
(In reply to Tommi Tervo from comment #11)
> I tried memhog (comes from numac

Can you please call the following (outside of slurm):
> ssh node
> /usr/bin/time -v memhog -r100 2g
Comment 13 CSC sysadmins 2020-03-18 02:54:05 MDT
The default RHEL7 time does not support the -v flag.
[tervotom@puhti-login1 ~]$ time ssh r07c06 memhog -r100 2g
.... 


.........Connection to r07c06 closed by remote host.

real    0m5.664s
user    0m0.021s
sys     0m0.017s


And allocation console:
[tervotom@puhti-login1 ~]$ srun --account=project_2001657 -p test   -n1 --mem=4g --time=10:00 --pty $SHELL
srun: job 1675208 queued and waiting for resources
srun: job 1675208 has been allocated resources
[tervotom@r07c06 ~]$ srun: Exceeded job memory limit
Comment 14 CSC sysadmins 2020-03-18 07:07:15 MDT
I installed time; memhog runs for ~5 s until it's kicked out.

[tervotom@r07c06 ~]$ /usr/bin/time -v memhog -r100 2g
[... memhog progress output (rows of dots) trimmed ...]
Connection to r07c06 closed by remote host.
Connection to r07c06 closed.
Comment 16 Nate Rini 2020-03-18 11:15:08 MDT
(In reply to Tommi Tervo from comment #14)
> I installed time, memhog runs ~5s until it's kicked out.

You will need to get the process started outside of Slurm and pam_slurm_adopt to get it to complete on the node. Temporarily disabling pam_slurm_adopt.so in the pam configuration should be sufficient for testing.
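
On most setups that just means commenting out the adopt line in the node's sshd PAM stack, roughly as below (a sketch; the file and exact line depend on how pam_slurm_adopt was enabled on your nodes):
> # /etc/pam.d/sshd
> #account    required    pam_slurm_adopt.so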

I have done the same testing locally and Slurm is reporting the correct (and expected) memory usage.
Comment 17 CSC sysadmins 2020-03-19 03:28:25 MDT
Run outside of slurm:

Command being timed: "memhog -r100 2g"
User time (seconds): 25.19
System time (seconds): 0.96
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2097860
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 524512
Voluntary context switches: 4
Involuntary context switches: 4401
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[tervotom@r07c06 ~]$
Comment 18 Nate Rini 2020-03-19 11:54:12 MDT
(In reply to Tommi Tervo from comment #17)
> Run outside of slurm:
> Maximum resident set size (kbytes): 2097860

Running it locally, I get 2098848kB which is within a page.

We are going to need to see what the cgroups on your kernel are reporting. On your test node, can you please set the following:
> SlurmdDebug=debug5
Please restart slurmd, run memhog and then upload the logs. Please feel free to reverse the change after that.
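
A rough outline of that sequence on the test node (assuming slurmd runs under systemd; the log path follows SlurmdLogFile and may differ on your nodes):
> # after setting SlurmdDebug=debug5 in slurm.conf for the test node
> systemctl restart slurmd
> # reproduce with memhog inside a job, then collect the slurmd log, e.g.
> grep 'memory used' /var/log/slurm/slurmd.log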

Thanks,
--Nate
Comment 19 CSC sysadmins 2020-03-20 03:21:24 MDT
Created attachment 13439 [details]
slurmd debug5 logs
Comment 20 Nate Rini 2020-03-20 14:32:47 MDT
(In reply to Tommi Tervo from comment #19)
> Created attachment 13439 [details]
> slurmd debug5 logs

> spank: opening plugin stack /etc/slurm/plugstack.conf.d/x11.conf
> spank: /usr/lib64/slurm/x11.so: no callbacks in this context
Looks like your X11 plugin needs to be re-compiled or removed if you're moving to the new X11 forwarding features in 19.05.
Comment 21 Nate Rini 2020-03-20 14:39:11 MDT
(In reply to Tommi Tervo from comment #19)
> Created attachment 13439 [details]
> slurmd debug5 logs

> debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
> debug2: xcgroup_get_pids: unable to get pids of '(null)'
This bug is being handled in bug#7888 and should be harmless.
Comment 24 Nate Rini 2020-03-20 15:02:51 MDT
(In reply to Tommi Tervo from comment #19)
> Created attachment 13439 [details]
> slurmd debug5 logs

> debug:  Step 1706948.4294967295 memory used:4297277440 limit:4294967296 B

Considering that is saying your application used ~4PB, I'm thinking the parser from the kernel is having issues on your system.

Can you please provide your kernel version (uname -a) and distro release (lsb_release -a).
Comment 25 CSC sysadmins 2020-03-23 02:33:18 MDT
Compute nodes have standard RHEL 7.7 OS, diskless -> no swap.

[root@r01c01 ~]# uname -a
Linux r01c01.bullx 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@r01c01 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.7 (Maipo)
Comment 26 CSC sysadmins 2020-03-24 03:10:22 MDT
> > debug:  Step 1706948.4294967295 memory used:4297277440 limit:4294967296 B
> 
> Considering that is saying your application used ~4PB, I'm thinking the
> parser from the kernel is having issues on your system.

Where did you get that ~4PB? If used mem is in bytes, then the value is ~4GB.
Comment 28 Nate Rini 2020-03-24 17:47:39 MDT
(In reply to Tommi Tervo from comment #26)
> > > debug:  Step 1706948.4294967295 memory used:4297277440 limit:4294967296 B
> > 
> > Considering that is saying your application used ~4PB, I'm thinking the
> > parser from the kernel is having issues on your system.
> 
> Where did you get that ~4PB, if used mem is in bytes then value is ~4GB?

Correct, I meant "~4GB"; that was a typo. Just to make sure we are on the same page, here are the relevant lines from the logs in comment #19:

> [1706948.extern] debug:  Step 1706948.4294967295 memory used:0 limit:4294967296 B

4294967296 B -> 4GiB

> error: Step 1706948.4294967295 exceeded memory limit (4297277440 > 4294967296), being killed

Used: 4297277440 B -> 4GiB + 2.2MiB
Limit: 4294967296 B -> 4GiB
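
The difference between those two numbers is where the 2.2MiB comes from:
> echo $((4297277440 - 4294967296))           # 2310144 B
> echo $(((4297277440 - 4294967296) / 1024))  # 2256 KiB, roughly 2.2 MiB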

(In reply to Nate Rini from comment #10)
> Looking at reproducer.txt from bug#6518 comment#0:
> > [dt921770@rosetta-login01 ~]$ salloc -n1 -t 1:00:00 --mem=4G --qos=wsqlimits -p ws.q -w rosetta-ws16

Running this with memhog instead, on the same kernel/distro as in comment #25:
> [root@mgmtnode ~]# salloc --mem=4g srun /usr/bin/time -v memhog -r100 2g
> salloc: Granted job allocation 19
> salloc: Waiting for resource configuration
> salloc: Nodes node00 are ready for job
-- skip dots --
>       Command being timed: "memhog -r100 2g"
>       Maximum resident set size (kbytes): 2097856
>       Average resident set size (kbytes): 0
> salloc: Relinquishing job allocation 19

It runs within about one page of the RSS size of the run in comment #17. It also succeeds without issue. Can you please verify that you are requesting "2g" from memhog and not "4g"?
Comment 29 CSC sysadmins 2020-03-25 02:26:11 MDT
> It runs within about 1 page RSS size as job in comment #10. It also succeeds
> without issue. Can you please verify that you are requesting "2g" from
> memhog and not "4g"?

Yes, I requested 2g, as time -v shows.

sacct -j 1706948 -o reqmem,maxrss
    ReqMem     MaxRSS 
---------- ---------- 
       4Gn            
       4Gn   2098280K 
       4Gn       652K
Comment 30 Nate Rini 2020-03-25 14:01:36 MDT
(In reply to Tommi Tervo from comment #29)
> > It runs within about 1 page RSS size as job in comment #10. It also succeeds
> > without issue. Can you please verify that you are requesting "2g" from
> > memhog and not "4g"?
> 
> Yes, I request 2g which time -v shows.

There must be some kernel setting that I don't have on my VM that causes double the memory usage to be reported.

Can you please provide the output of the following:
> grep -Nr /etc/sysctl.*

If possible, it would be helpful to get the output of the cgroup for the job step too. I admit this procedure isn't nearly as friendly as I would prefer:

1. Run the job with a much larger memory constraint (maybe 10G).
> salloc --mem=10g srun /usr/bin/time -v memhog -r9999 2g 
2. Find the cgroup containing memhog:
> [root@localhost ~]# grep memory /proc/$(pgrep memhog)/cgroup
> 8:memory:/slurm_node00/uid_0/job_23/step_0/task_0
3. Enter the cgroup dir
> [root@localhost ~]# cd /sys/fs/cgroup/memory/slurm_node00/uid_0/job_23/step_0/task_0
4. grab all of the values
> grep -nR .
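
Steps 2-4 can also be combined into a short one-liner on the node (a sketch; it assumes the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory as shown above):
> # find memhog's memory cgroup and dump all of its control files
> cg=$(grep memory /proc/$(pgrep -n memhog)/cgroup | cut -d: -f3)
> grep -nR . "/sys/fs/cgroup/memory${cg}"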
Comment 31 Nate Rini 2020-03-25 14:02:10 MDT
Correction to fix args to grep:

(In reply to Nate Rini from comment #30)
> Can you please provide the output of the following:
> > grep -nR /etc/sysctl.*
Comment 32 CSC sysadmins 2020-03-26 08:01:43 MDT
[root@r07c04 job_1779918]# cd step_0/task_0/
[root@r07c04 task_0]# grep -nR .
memory.kmem.tcp.max_usage_in_bytes:1:0
memory.kmem.tcp.failcnt:1:0
memory.kmem.tcp.usage_in_bytes:1:0
memory.kmem.tcp.limit_in_bytes:1:9223372036854771712
memory.memsw.failcnt:1:0
memory.memsw.limit_in_bytes:1:9223372036854771712
memory.memsw.max_usage_in_bytes:1:2147766272
memory.memsw.usage_in_bytes:1:2147729408
grep: memory.kmem.slabinfo: Input/output error
memory.kmem.max_usage_in_bytes:1:0
memory.kmem.failcnt:1:0
memory.kmem.usage_in_bytes:1:0
memory.kmem.limit_in_bytes:1:9223372036854771712
memory.numa_stat:1:total=524348 N0=524348 N1=0
memory.numa_stat:2:file=0 N0=0 N1=0
memory.numa_stat:3:anon=524348 N0=524348 N1=0
memory.numa_stat:4:unevictable=0 N0=0 N1=0
grep: memory.pressure_level: Invalid argument
memory.oom_control:1:oom_kill_disable 0
memory.oom_control:2:under_oom 0
memory.move_charge_at_immigrate:1:0
memory.swappiness:1:10
memory.use_hierarchy:1:1
grep: memory.force_empty: Invalid argument
memory.stat:1:cache 0
memory.stat:2:rss 2147729408
memory.stat:3:rss_huge 0
memory.stat:4:mapped_file 0
memory.stat:5:swap 0
memory.stat:6:pgpgin 524633
memory.stat:7:pgpgout 285
memory.stat:8:pgfault 525740
memory.stat:9:pgmajfault 0
memory.stat:10:inactive_anon 0
memory.stat:11:active_anon 2147729408
memory.stat:12:inactive_file 0
memory.stat:13:active_file 0
memory.stat:14:unevictable 0
memory.stat:15:hierarchical_memory_limit 9223372036854771712
memory.stat:16:hierarchical_memsw_limit 9223372036854771712
memory.stat:17:total_cache 0
memory.stat:18:total_rss 2147729408
memory.stat:19:total_rss_huge 0
memory.stat:20:total_mapped_file 0
memory.stat:21:total_swap 0
memory.stat:22:total_pgpgin 524633
memory.stat:23:total_pgpgout 285
memory.stat:24:total_pgfault 525740
memory.stat:25:total_pgmajfault 0
memory.stat:26:total_inactive_anon 0
memory.stat:27:total_active_anon 2147729408
memory.stat:28:total_inactive_file 0
memory.stat:29:total_active_file 0
memory.stat:30:total_unevictable 0
memory.failcnt:1:0
memory.soft_limit_in_bytes:1:9223372036854771712
memory.limit_in_bytes:1:9223372036854771712
memory.max_usage_in_bytes:1:2147766272
memory.usage_in_bytes:1:2147729408
cgroup.clone_children:1:0
grep: cgroup.event_control: Invalid argument
notify_on_release:1:0
cgroup.procs:1:178048
cgroup.procs:2:178050
tasks:1:178048
tasks:2:178050
Comment 33 CSC sysadmins 2020-03-26 08:03:52 MDT
All settings from sysctl.confs:

[root@r07c04 task_0]# grep -v ^# /etc/sysctl.conf 
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_stale_time = 86400
kernel.shmall = 4294967296
net.ipv4.neigh.default.base_reachable_time_ms = 86400000
kernel.sem = 250 32000 32 256
kernel.shmmax = 68719476736
net.ipv4.conf.default.arp_ignore = 2
net.ipv4.neigh.default.gc_thresh3 = 8192
kernel.unknown_nmi_panic = 1
vm.min_free_kbytes = 1048576

[root@r07c04 task_0]# grep -v ^# /etc/sysctl.d/99-sysctl.conf 
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_stale_time = 86400
kernel.shmall = 4294967296
net.ipv4.neigh.default.base_reachable_time_ms = 86400000
kernel.sem = 250 32000 32 256
kernel.shmmax = 68719476736
net.ipv4.conf.default.arp_ignore = 2
net.ipv4.neigh.default.gc_thresh3 = 8192
kernel.unknown_nmi_panic = 1
vm.min_free_kbytes = 1048576
Comment 42 Nate Rini 2020-03-26 21:04:25 MDT
(In reply to Tommi Tervo from comment #32)
> memory.max_usage_in_bytes:1:2147766272

Your kernel reports the same amount of memory as my test system. Looking at the code, I believe I found a race condition that likely accounts for the 2x memory being reported for the task. Currently working on a patchset for testing.
Comment 44 Nate Rini 2020-03-27 18:44:09 MDT
Created attachment 13509 [details]
WIP patch for CSC only

Tommi,

This patch should fix your issue. Would you like to try it on your test system? This fix is unlikely to go into slurm-19.05; it will more likely land in slurm-20.02 when it is upstreamed.

Thanks,
--Nate
Comment 45 CSC sysadmins 2020-03-30 06:41:04 MDT
Created attachment 13525 [details]
Slurmd debug log

Hi,

I applied the patched slurmd to one compute node and was able to run memhog without issues. Can you say how close that patch is to production quality, or is it still WIP?

Now the debug log reports 0 B memory usage for the extern step; is that correct?

Step 1823373.4294967295 memory used:0 limit:10737418240 B

sacct has correct info:

seff 1823373

Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:01:31 core-walltime
Job Wall-clock time: 00:01:31
Memory Utilized: 5.00 GB
Memory Efficiency: 50.01% of 10.00 GB
Comment 46 Nate Rini 2020-03-30 11:46:29 MDT
(In reply to Tommi Tervo from comment #45)
> I applied patched slurmd to one compute node and I was able to run memhog
> without issues. Dare you say how WIP/production quality that patch is?
Let's figure out why memory use is 0 first.
 
> Now debug log reports 0 B memory usage for extern step , is it correct?
> 
> Step 1823373.4294967295 memory used:0 limit:10737418240 B

Please provide the output to the following:
> sacct -a -j 1823373 -p -o a

Did you run this command for the job (from comment #30)?
> 1. Run the job with a much larger memory constraint (maybe 10G).
> salloc --mem=10g srun /usr/bin/time -v memhog -r9999 2g 


I take it this error is unrelated?
> [1823353.0] error: execve(): /usr/bin/time: No such file or directory
Comment 47 CSC sysadmins 2020-03-31 03:37:33 MDT
> > Step 1823373.4294967295 memory used:0 limit:10737418240 B
> 
> Please provide the output to the following:
> > sacct -a -j 1823373 -p -o a

The job had a 10g allocation and I ran memhog -r100 5g from another terminal via ssh. The sacct usage seems to match.

Account|AdminComment|AllocCPUS|AllocGRES|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqGRES|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|37950|||||||||puhti|||0|0|00:01:31|91|0:0|00:01:31|91|2020-03-30T15:24:33|2020-03-30T15:26:06|0:0|SchedMain|10007132|tervotom|1823373|1823373|bash|||||||||||||||||||||1|1|r07c49||1058|interactive|normal|1|None|Unknown|Unknown|Unknown|Unknown|1||10Gn|1|billing=1,cpu=1,mem=10G,node=1|oc_test|141|00:00:02|00:00:02|2|2020-03-30T15:24:35|COMPLETED|2020-03-30T15:24:33|00:00:00|00:00:00||00:10:00|10|00:00:00|||||||||||||||||10007132|tervotom|00:00:00||0|/users/tervotom/bin|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|37950|00:01:52|0||||5245036K|||puhti|||0|0|00:01:31|91||00:01:31|91|2020-03-30T15:24:35|2020-03-30T15:26:06|0:0||||1823373.extern|1823373.extern|extern|Unknown||||||||||5244076K|r07c49|0|||||00:00:56|r07c49|0|1|1|r07c49|1||||||0|0|0|0|1||10Gn|1|||||||2020-03-30T15:24:35|COMPLETED|2020-03-30T15:24:35|00:00:00|00:00:00||||00:00:00|cpu=00:01:52,energy=0,mem=5245036K|cpu=00:00:56,energy=0,mem=5244076K|cpu=r07c49,energy=r07c49,mem=r07c49|cpu=0,mem=0|cpu=00:00:56,energy=0,mem=5244076K|cpu=r07c49,energy=r07c49,mem=r07c49|cpu=0,mem=0|cpu=00:01:52,energy=0,mem=5245036K|energy=0|energy=0|energy=r07c49||energy=0|energy=r07c49||energy=0|||00:00:00||||


> 
> Did you run this command for the job (from comment #30)?
> > 1. Run the job with a much larger memory constraint (maybe 10G).
> > salloc --mem=10g srun /usr/bin/time -v memhog -r9999 2g 

No, but I ran it now with the patched slurmd:
        Command being timed: "memhog -r200 2g"
        User time (seconds): 40.83
        System time (seconds): 0.92
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.86
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2097864
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 524513
        Voluntary context switches: 10
        Involuntary context switches: 3843
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
salloc: Relinquishing job allocation 1833595
[tervotom@puhti-login2 ~]$ sacct -j 1833595 -o reqmem,maxrss
    ReqMem     MaxRSS 
---------- ---------- 
      10Gn            
      10Gn          0 
      10Gn   2097388K 

 
> I take this error is unrelated?
> > [1823353.0] error: execve(): /usr/bin/time: No such file or directory

Yes.
Comment 49 Nate Rini 2020-03-31 11:41:57 MDT
(In reply to Tommi Tervo from comment #47)
> Job had 10g allocation and I ran memhog -r100 5g from the another terminal
> with ssh.  Sacct usage seems to match 

All of the reported numbers in comment #47 look good.

(In reply to Tommi Tervo from comment #45)
> Dare you say how WIP/production quality that patch is?

Combined with bug#7617, I'm actually going to suggest swapping to
> JobAcctGatherType=jobacct_gather/linux

I believe the patch *should* be production ready, but I also believe it is likely incomplete. As the final patch will target either 20.02 or 20.11, I'm suggesting swapping to jobacct_gather/linux on 19.05.
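
Concretely, the suggested interim change is a one-line edit to slurm.conf (a sketch; the existing accounting parameters stay as they are):
> # slurm.conf
> JobAcctGatherType=jobacct_gather/linux
> # unchanged: JobAcctGatherParams=UsePss,OverMemoryKill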
Comment 50 CSC sysadmins 2020-04-01 04:38:35 MDT
> Combined with bug#7617, I'm actually going to suggest swapping to
> > JobAcctGatherType=jobacct_gather/linux
> 
> I believe the patch *should* be production ready, but I also believe it is
> likely incomplete. As the final patch will target either 20.02 or 20.11, I'm
> suggesting to swap to jobacct_gather/linux on 19.05.

Hi,

I just want to double check:

So the bug is in jobacct_gather/cgroup plugin (just to make sure that we do not need to patch)? 

Is it safe to do cgroup->linux change on the fly?

BR,
Tommi
Comment 54 Nate Rini 2020-04-02 11:15:57 MDT
(In reply to Nate Rini from comment #49)
> I believe the patch *should* be production ready, but I also believe it is
> likely incomplete.
bug#8763 has been opened to review thread safety of jobacct_gather as a whole.

(In reply to Tommi Tervo from comment #50)
> So the bug is in jobacct_gather/cgroup plugin (just to make sure that we do
> not need to patch)? 

The patch is in shared code between the two plugins. Before making any changes, we would like to do some more investigation as a few more questions on our side have popped up. Can you please provide this output:
> cat /proc/mounts

> Is it safe to do cgroup->linux change on the fly?

This will require a reboot of nodes to ensure cgroups are uniform due to bug#7536.
Comment 55 CSC sysadmins 2020-04-03 01:00:50 MDT
> The patch is in shared code between the two plugins. Before making any
> changes, we would like to do some more investigation as a few more questions
> on our side have popped up. Can you please provide this output:
> > cat /proc/mounts

Compute nodes are diskless:

rootfs / rootfs rw 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=98057856k,nr_inodes=24514464,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
/dev/mapper/mpatha1 / ext4 ro,relatime,norecovery 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /boot tmpfs rw,relatime,size=307200k 0 0
tmpfs /etc tmpfs rw,relatime,size=51200k 0 0
tmpfs /usr tmpfs rw,relatime,size=5242880k 0 0
tmpfs /var tmpfs rw,relatime,size=5242880k 0 0
tmpfs /tmp tmpfs rw,relatime,size=102400k 0 0
tmpfs /root tmpfs rw,relatime,size=768000k 0 0
tmpfs /mnt tmpfs rw,relatime,size=1024k 0 0
tmpfs /var/tmp tmpfs rw,relatime,size=102400k 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=21,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=132494 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
dump.srv.bullx:/var/dump /var/dump nfs4 rw,relatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.140.37.219,local_lock=none,addr=10.140.39.11 0 0
10.140.15.146@o2ib:10.140.15.145@o2ib:/scratch/scratch /scratch lustre rw,nosuid,nodev,flock,lazystatfs 0 0
10.140.15.146@o2ib:10.140.15.145@o2ib:/scratch/users /users lustre rw,nosuid,nodev,flock,lazystatfs 0 0
10.140.15.146@o2ib:10.140.15.145@o2ib:/scratch/projappl /projappl lustre rw,nosuid,nodev,flock,lazystatfs 0 0
10.140.15.146@o2ib:10.140.15.145@o2ib:/scratch/appl /appl lustre ro,nosuid,nodev,flock,lazystatfs 0 0
10.140.15.146@o2ib:10.140.15.145@o2ib:/scratch/fmi /fmi lustre rw,nosuid,nodev,flock,lazystatfs 0 0
/dev/ram0 /var/spool/slurmd xfs rw,relatime,attr2,inode64,noquota 0 0

 
> > Is it safe to do cgroup->linux change on the fly?
> 
> This will require a reboot of nodes to ensure cgroups are uniform due to
> bug#7536.

Ok, that I was expecting.

BR,
Tommi
Comment 58 Nate Rini 2020-04-04 13:39:40 MDT
(In reply to Tommi Tervo from comment #55)
> > The patch is in shared code between the two plugins. Before making any
> > changes, we would like to do some more investigation as a few more questions
> > on our side have popped up. Can you please provide this output:
> > > cat /proc/mounts
> 
> Compute nodes are diskless:
This rules out another known issue with cgroups. We are still working on analysis of the issue.
Comment 62 Nate Rini 2020-04-06 12:40:12 MDT
Created attachment 13628 [details]
patch for CSC only

(In reply to Nate Rini from comment #58)
> (In reply to Tommi Tervo from comment #55)
> > > The patch is in shared code between the two plugins. Before making any
> > > changes, we would like to do some more investigation as a few more questions
> > > on our side have popped up. Can you please provide this output:
> > > > cat /proc/mounts
> > 
> > Compute nodes are diskless:
> This rules out another known issue with cgroups. We are still working on
> analysis of the issue.

Tommi,

Can you please try this patch on your test system?

Thanks,
--Nate
Comment 63 CSC sysadmins 2020-04-09 02:43:13 MDT
Created attachment 13682 [details]
Slurmd log with latest patch applied

Hi,

I tried that patch but it does not help with the ssh case.
job: 1964441
salloc --account=project_2001657 -p interactive -w r07c49 --gres=nvme:100 -n1  --mem=4g --time=1:00  srun /usr/bin/time -v memhog -r9999 3g 

It was successful; it just ran out of time.

job: 1964446

srun --account=project_2001657 -p interactive -w r07c49 --gres=nvme:100 -n1  --mem=4g --time=1:00 --pty $SHELL

ssh r07c49 /usr/bin/time -v memhog -r200 3g

[tervotom@r07c49 ~]$ srun: Exceeded job memory limit
srun: Force Terminated job 1964446
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: r07c49: task 0: Killed
srun: Terminating job step 1964446.0
Comment 68 Nate Rini 2020-04-11 12:25:02 MDT
(In reply to Tommi Tervo from comment #63)
> I tried that patch but it does not help to ssh case.
Thanks, that helps isolate possible code locations of the issue.

Can you please call this on your controller:
> scontrol show config |grep -i JobAcct

I want to confirm the config is getting loaded as expected. The logs from comment #19 suggest an energy plugin may also be active.
Comment 77 Nate Rini 2020-04-13 12:41:34 MDT
(In reply to Nate Rini from comment #68)
> (In reply to Tommi Tervo from comment #63)
> > I tried that patch but it does not help to ssh case.
> Thanks, that helps isolate possible code locations of the issue.
We have confirmed the issue locally. We are working on a new patchset currently.
 
> Can you please call this on your controller:
> > scontrol show config |grep -i JobAcct
> 
> I want to confirm the config is getting loaded as expected. The logs from
> comment #19 suggest an energy plugin may also be active.
Please disregard this request.
Comment 80 Nate Rini 2020-04-13 14:59:12 MDT
Created attachment 13759 [details]
patch for CSC

Tommi,

Please try this patch.

Thanks,
--Nate
Comment 82 CSC sysadmins 2020-04-15 03:17:31 MDT
Created attachment 13802 [details]
slurmd debug5 logs

The ssh case still fails with the latest patch applied.

salloc --account=project_2001657 -p test --reservation=8656 -w r07c01  -n1  --mem=10g --time=10:00

jobid: 2020132

ssh r07c01

[tervotom@r07c01 ~]$ /usr/bin/time -v memhog -r100 5g
Comment 84 Nate Rini 2020-04-15 11:50:44 MDT
(In reply to Tommi Tervo from comment #82)
> Created attachment 13802 [details]
> slurmd debug5 logs
> 
> ssh case fails still with latest patch applied.

Interesting, the logs show the same signature of the problem:

> [2020132.extern] debug:  jag_common_poll_data: Task 0 pid 138208 ave_freq = 118707 mem size/max 5369929728/5369929728
> [2020132.extern] debug:  jag_common_poll_data: Task 0 pid 138207 ave_freq = 118707 mem size/max 5369929728/5369929728

Identical memory usage of 5369929728 is reported for both pids because jobacctgather is polling the containing cgroup instead of the pid tree.

> salloc --account=project_2001657 -p test --reservation=8656 -w r07c01  -n1 
> --mem=10g --time=10:00

Was the node (os) restarted before testing the patch?
Comment 85 Nate Rini 2020-04-15 13:41:43 MDT
Created attachment 13818 [details]
patch for 19.05

Updated patch with another race condition fix.
Comment 86 CSC sysadmins 2020-04-16 04:39:47 MDT
Created attachment 13827 [details]
debug logs for job 2037852

Unfortunately the latest patch did not help.

I did not restart the OS, but I verified that the correct slurm was started:
[2020-04-16T13:27:39.346] slurmd version 19.05.5-9 started

# pwd
/home/tommi/rpmbuild/BUILD/slurm-19.05.5-9
# grep 'mutex to protect taskid' src/slurmd/slurmstepd/req.c
/* mutex to protect taskid as this function is threaded */
Comment 88 Nate Rini 2020-04-17 11:22:39 MDT
Created attachment 13858 [details]
debug patch for CSC

(In reply to Tommi Tervo from comment #86)
> Created attachment 13827 [details]
> debug logs forjob  2037852
> 
> Unfortunately the latest patch did not help.
There could always be more than one bug causing issues.

Can you please apply this patch (in addition to the previous one) and attach the debug logs from it?

It will dump logs like this every account gather cycle:
> [8166.0] _prec_extra: calling xcgroup_get_param for pid 15967
> [8166.0] xcgroup_get_param: file_path=/sys/fs/cgroup/memory/slurm_host10/uid_1000/job_8166/step_0/task_0 size=508 content=
Comment 89 CSC sysadmins 2020-04-20 06:22:22 MDT
Created attachment 13875 [details]
debug logs with latest two patch applied
Comment 92 Nate Rini 2020-04-20 09:02:42 MDT
Created attachment 13876 [details]
patch for 19.05
Comment 93 Nate Rini 2020-04-20 09:03:57 MDT
(In reply to Nate Rini from comment #92)
> Created attachment 13876 [details]
> patch for 19.05

Tommi,

The log showed an off-by-one bug. This patch replaces both the debug patch and the last fix patch.

Please give it a try.

Thanks,
--Nate
Comment 94 CSC sysadmins 2020-04-21 03:32:39 MDT
Created attachment 13890 [details]
debug logs with patch 13876

Well, some kind of improvement can be seen. Jobs are not killed anymore but memory accounting does not work correctly.

Steps: first, I allocated job 2079782:
salloc --account=project_2001657 --reservation=ttt   -n1  --mem=10g --time=10:00

Logged in to the node from another terminal and ran memhog -r100 5g.

I was happy that the job was not killed and decided to test what happens if I run srun memhog -r100 4g from the salloc session. Those memhogs were running in parallel, so the combined memory consumption should be 9g, but sacct is reporting only 4g.

Account|AdminComment|AllocCPUS|AllocGRES|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqGRES|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|10890|||||||||puhti|||0|0|00:04:29|269|0:0|00:04:29|269|2020-04-21T12:14:05|2020-04-21T12:18:35|0:0|SchedMain|10007132|tervotom|2079782|2079782|bash|||||||||||||||||||||1|1|r07c54||673|small|normal|1|None|Unknown|Unknown|Unknown|Unknown|1||10Gn|1|billing=1,cpu=1,mem=10G,node=1|ttt|147|00:00:01|00:00:01|1|2020-04-21T12:14:06|COMPLETED|2020-04-21T12:14:05|00:00:00|00:01.874||00:10:00|10|00:43.269|||||||||||||||||10007132|tervotom|00:41.395||0|/users/tervotom|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|10890|00:00:00|2.10G|0.00M|0|0|0|107956K||puhti|||0|0|00:04:30|270||00:04:30|270|2020-04-21T12:14:06|2020-04-21T12:18:36|0:0||||2079782.extern|2079782.extern|extern|Unknown|0.00M|r07c54|0|0|r07c54|0|0|r07c54|0|0|r07c54|0|107956K|r07c54|0||00:00:00|r07c54|0|1|1|r07c54|1||||||0|0|0|0|1||10Gn|1|||||||2020-04-21T12:14:06|COMPLETED|2020-04-21T12:14:06|00:00:00|00:00:00||||00:00:00|cpu=00:00:00,energy=0,fs/disk=2012,mem=0,pages=0,vmem=107956K|cpu=00:00:00,energy=0,fs/disk=2012,mem=0,pages=0,vmem=107956K|cpu=r07c54,energy=r07c54,fs/disk=r07c54,mem=r07c54,pages=r07c54,vmem=r07c54|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=0,pages=0,vmem=107956K|cpu=r07c54,energy=r07c54,fs/disk=r07c54,mem=r07c54,pages=r07c54,vmem=r07c54|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=0,pages=0,vmem=107956K|energy=0,fs/disk=0|energy=0,fs/disk=0|energy=r07c54,fs/disk=r07c54|fs/disk=0|energy=0,fs/disk=0|energy=r07c54,fs/disk=r07c54|fs/disk=0|energy=0,fs/disk=0|||00:00:00||||
project_2001657||1||1|cpu=1,mem=10G,node=1|10890|00:00:42|99.62M|0.02M|0.04M|0|4194440K|4202900K||puhti|||0|0|00:01:41|101||00:01:41|101|2020-04-21T12:16:43|2020-04-21T12:18:24|0:0||||2079782.0|2079782.0|memhog|Cyclic|0.02M|r07c54|0|0.04M|r07c54|0|0|r07c54|0|4194440K|r07c54|0|4202900K|r07c54|0||00:00:42|r07c54|0|1|1|r07c54|1||||||Unknown|Unknown|Unknown|Unknown|1||10Gn|1|||||||2020-04-21T12:16:43|COMPLETED|2020-04-21T12:16:43|00:00:00|00:01.873||||00:43.268|cpu=00:00:42,energy=0,fs/disk=19349,mem=4194440K,pages=0,vmem=4202900K|cpu=00:00:42,energy=0,fs/disk=19349,mem=4194440K,pages=0,vmem=4202900K|cpu=r07c54,energy=r07c54,fs/disk=r07c54,mem=r07c54,pages=r07c54,vmem=r07c54|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:42,energy=0,fs/disk=19349,mem=4194440K,pages=0,vmem=4202900K|cpu=r07c54,energy=r07c54,fs/disk=r07c54,mem=r07c54,pages=r07c54,vmem=r07c54|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:42,energy=0,fs/disk=19349,mem=4194440K,pages=0,vmem=4202900K|energy=0,fs/disk=42880|energy=0,fs/disk=42880|energy=r07c54,fs/disk=r07c54|fs/disk=0|energy=0,fs/disk=42880|energy=r07c54,fs/disk=r07c54|fs/disk=0|energy=0,fs/disk=42880|||00:41.395||||
Comment 96 Nate Rini 2020-04-21 10:24:10 MDT
(In reply to Tommi Tervo from comment #94)
> I was happy that job was not killed and decided to test what happens if I
> run srun memhog -r100 4g from the salloc session. Those memhogs were running
> parallel so that combined memory consumption should be 9g, but sacct is
> reporting only 4g.

Was the second memhog in salloc run with srun or directly in the salloc shell?
Comment 97 CSC sysadmins 2020-04-22 02:08:45 MDT
I ran memhog with srun from the salloc shell. So step 0 is from srun memhog 4g in the salloc session and .extern is from the ssh memhog 5g.

[2020-04-21T12:17:03.065] [2079782.0] debug3: _get_pss: read pss 4295132160 for process /proc/244939/smaps
[2020-04-21T12:17:06.358] [2079782.extern] debug3: _get_pss: read pss 5368869888 for process /proc/244897/smaps
Comment 99 Nate Rini 2020-04-24 14:49:02 MDT
(In reply to Nate Rini from comment #54)
> > Is it safe to do cgroup->linux change on the fly?
> 
> This will require a reboot of nodes to ensure cgroups are uniform due to
> bug#7536.

After consulting internally, this should be safe. The slurmstepd of running jobs will continue to use the previous jobacct plugin while new jobs will convert over.

Current job accounting code does not work with the patch provided and we are still working on a full fix. Swapping to linux should allow your site to run jobs with correct accounting until a fix is complete.
Comment 100 CSC sysadmins 2020-04-30 06:30:37 MDT
I changed to JobAcctGatherType=jobacct_gather/linux on the test nodes, but the behaviour is not what I'd expect.
 
I started job with salloc:

salloc --account=project_2001657 -p test  -w r07c06 -n1  --mem=10g --time=10:00 

Started memhog with: srun memhog -r200 6g

Started ssh session from other terminal:

[tervotom@puhti-login1 ~]$ ssh  r07c06
Last login: Thu Apr 30 12:22:09 2020 from puhti-login1.bullx
[tervotom@r07c06 ~ (PUHTI) ]$ memhog -r200 8g


-> the combined memory consumption is way over the reserved 10g, but both run simultaneously without problems.

[2020-04-30T13:50:45.060] [2178986.extern] debug3: _get_pss: read pss 8590093312 for process /proc/26243/smaps
[2020-04-30T13:50:52.340] [2178986.0] debug3: _get_pss: read pss 6442613760 for process /proc/25739/smaps



[tervotom@puhti-login2 ~]$ sacct -a -j 2178986 -p -o a
Account|AdminComment|AllocCPUS|AllocGRES|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqGRES|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|10889|||||||||puhti|||0|0|00:05:22|322|0:0|00:05:22|322|2020-04-30T13:46:23|2020-04-30T13:52:22|0:0|SchedMain|10007132|tervotom|2178986|2178986|bash|||||||||||||||||||||1|1|r07c06||890|test|normal|1|None|Unknown|Unknown|Unknown|Unknown|1||10Gn|1|billing=1,cpu=1,mem=10G,node=1|||00:00:37|00:00:37|37|2020-04-30T13:47:00|COMPLETED|2020-04-30T13:46:23|00:00:00|00:02.870||00:10:00|10|02:04.951|||||||||||||||||10007132|tervotom|02:02.080||0|/users/tervotom|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|10889|00:00:00|4.17G|5.50M|1.33M|0|6122K|713880K||puhti|||0|0|00:05:22|322||00:05:22|322|2020-04-30T13:47:00|2020-04-30T13:52:22|0:0||||2178986.extern|2178986.extern|extern|Unknown|5.50M|r07c06|0|1.33M|r07c06|0|0|r07c06|0|8394419K|r07c06|0|8838620K|r07c06|0||00:00:00|r07c06|0|1|1|r07c06|1||||||0|0|0|0|1||10Gn|1|||||||2020-04-30T13:47:00|COMPLETED|2020-04-30T13:47:00|00:00:00|00:00:00||||00:00:00|cpu=00:00:00,energy=0,fs/disk=5769966,mem=6122K,pages=0,vmem=713880K|cpu=00:02:45,energy=0,fs/disk=5767954,mem=8394419K,pages=0,vmem=8838620K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=474K,pages=0,vmem=272464K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=5769966,mem=6122K,pages=0,vmem=713880K|energy=0,fs/disk=1395909|energy=0,fs/disk=1395909|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=0|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=1395909|||00:00:00||||
project_2001657||1||1|cpu=1,mem=10G,node=1|10889|00:02:02|26.82M|0.02M|0.12M|0|6291623K|6300052K||puhti|||0|0|00:03:46|226||00:03:46|226|2020-04-30T13:47:12|2020-04-30T13:50:58|0:0||||2178986.0|2178986.0|memhog|Cyclic|0.02M|r07c06|0|0.12M|r07c06|0|0|r07c06|0|6291623K|r07c06|0|6300052K|r07c06|0||00:02:02|r07c06|0|1|1|r07c06|1||||||Unknown|Unknown|Unknown|Unknown|1||10Gn|1|||||||2020-04-30T13:47:12|COMPLETED|2020-04-30T13:47:12|00:00:00|00:02.870||||02:04.950|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|energy=0,fs/disk=122888|energy=0,fs/disk=122888|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=122888|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=122888|||02:02.080||||
Comment 101 Nate Rini 2020-04-30 11:25:15 MDT
(In reply to Tommi Tervo from comment #100)
> -> combined memory consumption is way over reserved 10g but both runs
> simultaneously without problems.
> 
> [2020-04-30T13:50:45.060] [2178986.extern] debug3: _get_pss: read pss
> 8590093312 for process /proc/26243/smaps
> [2020-04-30T13:50:52.340] [2178986.0] debug3: _get_pss: read pss 6442613760
> for process /proc/25739/smaps

The read values look valid considering memory usage overhead. OverMemoryKill appears to only enforce the step memory limit (instead of the job total) currently, which I believe is worthy of a new ticket.

(In reply to Tommi Tervo from comment #94)
> I was happy that job was not killed and decided to test what happens if I
> run srun memhog -r100 4g from the salloc session. Those memhogs were running
> parallel so that combined memory consumption should be 9g, but sacct is
> reporting only 4g.
Still working on a patch to fix the accounting.
Comment 105 Nate Rini 2020-05-19 11:29:57 MDT
(In reply to Nate Rini from comment #101)
> (In reply to Tommi Tervo from comment #94)
> > I was happy that job was not killed and decided to test what happens if I
> > run srun memhog -r100 4g from the salloc session. Those memhogs were running
> > parallel so that combined memory consumption should be 9g, but sacct is
> > reporting only 4g.
> Still working on a patch to fix the accounting.

Can you please provide the following output (again):
> sacct -j 1478441 -a -p -o a
> sacct -a -j 2178986 -p -o a

I noticed locally that the accounting may actually update but only after the job has been removed from slurmctld.
Comment 106 CSC sysadmins 2020-05-19 23:42:57 MDT
> Can you please provide the following output (again):
> > sacct -j 1478441 -a -p -o a
> > sacct -a -j 2178986 -p -o a
> 
> I noticed locally that the accounting may actually update but only after the
> job has been removed from slurmctld.

[root@mslurm1 ~ (PUHTI) ]# sacct -j 1478441 -a -p -o a
Account|AdminComment|AllocCPUS|AllocGRES|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqGRES|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
project_2001657||1||1|billing=1,cpu=1,mem=4G,node=1|37950|||||||||puhti|||0|0|00:02:24|144|0:0|00:02:24|144|2020-02-27T15:38:39|2020-02-27T15:41:03|0:0|SchedMain|10007132|tervotom|1478441|1478441|bash|||||||||||||||||||||1|1|r07c50||1368|interactive|normal|1|None|Unknown|Unknown|Unknown|Unknown|1||4Gn|1|billing=1,cpu=1,mem=4G,node=1|||00:00:00|00:00:00|0|2020-02-27T15:38:39|CANCELLED by 10007132|2020-02-27T15:38:39|00:00:00|00:00:01||01:00:00|60|00:01:01|||||||||||||||||10007132|tervotom|00:01:00||0|/users/tervotom/bug_6518|
project_2001657||1||1|billing=1,cpu=1,mem=4G,node=1|37950|00:01:02|2.17G|2.64M|0.03M|4|3437620K|270816K||puhti|||0|0|00:01:41|101||00:01:41|101|2020-02-27T15:38:39|2020-02-27T15:40:20|0:0||||1478441.extern|1478441.extern|extern|Unknown|2.64M|r07c50|0|0.03M|r07c50|0|2|r07c50|0|2101324K|r07c50|0|162860K|r07c50|0||00:00:01|r07c50|0|1|1|r07c50|1||||||0|0|0|0|1||4Gn|1|||||||2020-02-27T15:38:39|COMPLETED|2020-02-27T15:38:39|00:00:00|00:00:01||||00:01:01|cpu=00:01:02,energy=0,fs/disk=2773113,mem=3437620K,pages=4,vmem=270816K|cpu=00:01:01,energy=0,fs/disk=2771101,mem=2101324K,pages=2,vmem=162860K|cpu=r07c50,energy=r07c50,fs/disk=r07c50,mem=r07c50,pages=r07c50,vmem=r07c50|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:01,energy=0,fs/disk=2012,mem=1336296K,pages=2,vmem=107956K|cpu=r07c50,energy=r07c50,fs/disk=r07c50,mem=r07c50,pages=r07c50,vmem=r07c50|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:01:02,energy=0,fs/disk=2773113,mem=3437620K,pages=4,vmem=270816K|energy=0,fs/disk=36483|energy=0,fs/disk=36483|energy=r07c50,fs/disk=r07c50|fs/disk=0|energy=0,fs/disk=0|energy=r07c50,fs/disk=r07c50|fs/disk=0|energy=0,fs/disk=36483|||00:01:00||||
[root@mslurm1 ~ (PUHTI) ]# sacct -a -j 2178986 -p -o a
Account|AdminComment|AllocCPUS|AllocGRES|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqGRES|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|10889|||||||||puhti|||0|0|00:05:22|322|0:0|00:05:22|322|2020-04-30T13:46:23|2020-04-30T13:52:22|0:0|SchedMain|10007132|tervotom|2178986|2178986|bash|||||||||||||||||||||1|1|r07c06||890|test|normal|1|None|Unknown|Unknown|Unknown|Unknown|1||10Gn|1|billing=1,cpu=1,mem=10G,node=1|||00:00:37|00:00:37|37|2020-04-30T13:47:00|COMPLETED|2020-04-30T13:46:23|00:00:00|00:02.870||00:10:00|10|02:04.951|||||||||||||||||10007132|tervotom|02:02.080||0|/users/tervotom|
project_2001657||1||1|billing=1,cpu=1,mem=10G,node=1|10889|00:00:00|4.17G|5.50M|1.33M|0|6122K|713880K||puhti|||0|0|00:05:22|322||00:05:22|322|2020-04-30T13:47:00|2020-04-30T13:52:22|0:0||||2178986.extern|2178986.extern|extern|Unknown|5.50M|r07c06|0|1.33M|r07c06|0|0|r07c06|0|8394419K|r07c06|0|8838620K|r07c06|0||00:00:00|r07c06|0|1|1|r07c06|1||||||0|0|0|0|1||10Gn|1|||||||2020-04-30T13:47:00|COMPLETED|2020-04-30T13:47:00|00:00:00|00:00:00||||00:00:00|cpu=00:00:00,energy=0,fs/disk=5769966,mem=6122K,pages=0,vmem=713880K|cpu=00:02:45,energy=0,fs/disk=5767954,mem=8394419K,pages=0,vmem=8838620K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=474K,pages=0,vmem=272464K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=5769966,mem=6122K,pages=0,vmem=713880K|energy=0,fs/disk=1395909|energy=0,fs/disk=1395909|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=0|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=1395909|||00:00:00||||
project_2001657||1||1|cpu=1,mem=10G,node=1|10889|00:02:02|26.82M|0.02M|0.12M|0|6291623K|6300052K||puhti|||0|0|00:03:46|226||00:03:46|226|2020-04-30T13:47:12|2020-04-30T13:50:58|0:0||||2178986.0|2178986.0|memhog|Cyclic|0.02M|r07c06|0|0.12M|r07c06|0|0|r07c06|0|6291623K|r07c06|0|6300052K|r07c06|0||00:02:02|r07c06|0|1|1|r07c06|1||||||Unknown|Unknown|Unknown|Unknown|1||10Gn|1|||||||2020-04-30T13:47:12|COMPLETED|2020-04-30T13:47:12|00:00:00|00:02.870||||02:04.950|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|cpu=r07c06,energy=r07c06,fs/disk=r07c06,mem=r07c06,pages=r07c06,vmem=r07c06|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:02:02,energy=0,fs/disk=19341,mem=6291623K,pages=0,vmem=6300052K|energy=0,fs/disk=122888|energy=0,fs/disk=122888|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=122888|energy=r07c06,fs/disk=r07c06|fs/disk=0|energy=0,fs/disk=122888|||02:02.080||||
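(Editorial aside: the '-o a' dump above is hard to scan by eye. The memory columns being compared in the following comments can also be pulled out with a narrower, standard set of sacct format fields; the exact field selection here is only a suggestion:)

> sacct -a -j 2178986 -p -o JobID,JobName,Elapsed,MaxRSS,MaxRSSNode,AveRSS,MaxVMSize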
Comment 107 Nate Rini 2020-05-27 16:34:12 MDT
Tommi,

Still working on a patchset for this issue.

Thanks,
--Nate
Comment 108 Nate Rini 2020-05-29 14:51:51 MDT
(In reply to Tommi Tervo from comment #94)
> I was happy that job was not killed and decided to test what happens if I
> run srun memhog -r100 4g from the salloc session. Those memhogs were running
> parallel so that combined memory consumption should be 9g, but sacct is
> reporting only 4g.

Can you please repeat the same test job from comment #1 with the patch applied and provide the new sacct output (requested in comment #105) for it.

(In reply to Tommi Tervo from comment #100)
> Starter memhog with: srun memhog -r200 6g
> [tervotom@puhti-login1 ~]$ ssh  r07c06
> [tervotom@r07c06 ~ (PUHTI) ]$ memhog -r200 8g
> -> combined memory consumption is way over reserved 10g but both runs
> simultaneously without problems.
This is expected per bug#8986 comment #8: without 'ConstrainRAMSpace=yes', the limits are only enforced against each individual process tree, not against the step/job as a whole.
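(Editorial aside: step/job-wide enforcement comes from task/cgroup memory confinement rather than from the accounting plugin. A minimal sketch of the relevant settings, with illustrative values only and not the site's actual configuration:)

> # slurm.conf (sketch)
> TaskPlugin=task/cgroup
> # cgroup.conf (sketch)
> ConstrainRAMSpace=yes
> AllowedRAMSpace=100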
 
> [2020-04-30T13:50:45.060] [2178986.extern] debug3: _get_pss: read pss
> 8590093312 for process /proc/26243/smaps
> [2020-04-30T13:50:52.340] [2178986.0] debug3: _get_pss: read pss 6442613760
> for process /proc/25739/smaps

The values from sacct in comment #106 match what we would expect:
> 2178986.extern=8394419KB
> 2178986.0=6291623KB
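(Editorial aside: as a quick arithmetic cross-check, the _get_pss values are logged in bytes while sacct reports RSS in K; converted, the two sources agree to within a few MB, which is consistent with sampling at slightly different moments:)

> 8590093312 bytes / 1024 = 8388763 kB  ~  MaxRSS 8394419K (2178986.extern)
> 6442613760 bytes / 1024 = 6291615 kB  ~  MaxRSS 6291623K (2178986.0)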
Comment 117 Nate Rini 2020-08-18 14:53:19 MDT
*** Bug 8763 has been marked as a duplicate of this bug. ***
Comment 176 Nate Rini 2021-01-19 10:52:34 MST
(In reply to Nate Rini from comment #108)
> (In reply to Tommi Tervo from comment #94)
> > I was happy that job was not killed and decided to test what happens if I
> > run srun memhog -r100 4g from the salloc session. Those memhogs were running
> > parallel so that combined memory consumption should be 9g, but sacct is
> > reporting only 4g.
> 
> Can you please repeat the same test job from comment #1 and provide the new
> sacct output (requested in comment #105) against it with the patch.

Tommi,

This issue has now been corrected in slurm-20.11.3, which should be released shortly. While debugging it, we found and corrected other related issues as well.

Please consider swapping back to jobacct_gather/cgroup and testing.
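(Editorial aside: a minimal sketch of the suggested change, assuming the remaining accounting settings are left as they are; a daemon restart would be needed and the exact rollout procedure is site-specific:)

> # slurm.conf (sketch)
> JobAcctGatherType=jobacct_gather/cgroup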

I'm going to close this ticket, but please respond if there are any further related issues or questions.

Thanks,
--Nate
Comment 177 Felip Moll 2024-02-14 12:41:11 MST
*** Bug 10429 has been marked as a duplicate of this bug. ***