Bug 3531

Summary: Fix jobacct_gather/cgroup not reporting some metrics, and collect the cache statistic as well
Product: Slurm Reporter: Sam Gallop <sam.gallop>
Component: ContributionsAssignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex, carlos.fenoy, hrz, kaizaad, minnus
Version: 17.02.0   
Hardware: Other   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4637
https://bugs.schedmd.com/show_bug.cgi?id=6201
Version Fixed: 17.02.7 17.11.0-pre2
Attachments: patch file
SLURM configuration file
Patch to remove references to task directory in jobacct_gather/cgroup
Change to memory calculation in jobacct_gather/cgroup

Description Sam Gallop 2017-03-03 07:41:00 MST
Created attachment 4146 [details]
patch file

Hi,

I noticed when running with JobAcctGatherType=jobacct_gather/cgroup that no RSS is reported by sacct or sstat for either the batch step or the job step ...
# sstat -j 161 -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,AvePages,MaxPages
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize   AvePages MaxPages
------------ ---------- ---------- ---------- ---------- ---------- --------
161.0                 0          0    117484K    138256K          0        0

# sstat -j 161.batch -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,AvePages,MaxPages
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize   AvePages MaxPages
------------ ---------- ---------- ---------- ---------- ---------- --------
161.batch             0          0    117484K    137492K          0        0

# sacct -j 161 -o JobId,AveRss,MaxRss,AveVMSize,MaxVMSize,SystemCPU,UserCPU
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize  SystemCPU    UserCPU
------------ ---------- ---------- ---------- ---------- ---------- ----------
161                                                       00:14.300  46:57.144
161.batch             0          0    117484K    137492K  00:00.080  00:00.023
161.0                 0          0    117484K    138256K  00:14.220  46:57.121

It appears that the jobacct_gather/cgroup plugin attempts to extract data from a non-existent task directory within the memory and cpuacct cgroup hierarchies.  I've generated a patch to address this (see attached).  In addition, I've included the cached memory statistic within the vsize metric (which is collected prior to entering the jobacct_gather/cgroup plugin).

Being aware of the overall memory footprint (rss + cache) is helpful when working with jobs managed within cgroups, as it is effectively these combined values that cause jobs to be killed by the OOM killer once they exceed memory.limit_in_bytes (or memory.memsw.limit_in_bytes).  Including the cache statistic within the vsize should give a better picture of how much memory an application consumes overall, that is heap, stack, data, text, shared objects, cache, etc.
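
As a rough illustration only (not the attached patch; the path handling here is a simplified assumption), reading the rss and cache counters from a step's memory.stat could look something like this:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Sketch: read the "rss" and "cache" counters (in bytes) from a
 * memory.stat file such as
 * /sys/fs/cgroup/memory/slurm/uid_N/job_N/step_N/memory.stat.
 * The real plugin goes through Slurm's cgroup helpers; this is a
 * plain-libc stand-in. */
static int read_rss_and_cache(const char *stat_path,
                              uint64_t *rss, uint64_t *cache)
{
	char key[64];
	uint64_t val;
	FILE *fp = fopen(stat_path, "r");

	if (!fp)
		return -1;
	*rss = 0;
	*cache = 0;
	while (fscanf(fp, "%63s %" SCNu64, key, &val) == 2) {
		if (!strcmp(key, "rss"))
			*rss = val;
		else if (!strcmp(key, "cache"))
			*cache = val;
	}
	fclose(fp);
	return 0;
}

The idea is then to add the cache figure on top of the vsize gathered before entering the plugin, so the reported VMSize reflects heap, stack, etc. plus page cache.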

With the patch applied, the following is reported ...
# sstat -j 183 -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,AvePages,MaxPages
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize   AvePages MaxPages
------------ ---------- ---------- ---------- ---------- ---------- --------
183.0              648K       648K  16096.50M  16486016K          0        0

# sstat -j 183.batch -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,AvePages,MaxPages
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize   AvePages MaxPages
------------ ---------- ---------- ---------- ---------- ---------- --------
183.batch         3412K      3412K    117484K    137480K          0        0

# sacct -j 183 -o JobId,AveRss,MaxRss,AveVMSize,MaxVMSize,SystemCPU,UserCPU
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize  SystemCPU    UserCPU
------------ ---------- ---------- ---------- ---------- ---------- ----------
183                                                       00:14.578  46:56.792
183.batch         3412K      3412K    117488K    137480K  00:00.066  00:00.022
183.0              648K       648K  16096.50M  16486016K  00:14.512  46:56.770

Compared to the values in the memory cgroup ...
# cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_batch/memory.stat | grep -Ew "^rss|^cache"
cache 16758034432      <-- step 0
rss 663552             <-- step 0
cache 0                <-- batch
rss 3493888            <-- batch
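
For reference, these raw byte counts line up with the patched output above: 663552 bytes is 648 KiB and 3493888 bytes is 3412 KiB, matching the reported MaxRSS values, while step 0's ~15.6 GiB of cache added to its ~115 MiB base vsize accounts for the ~16096 MiB AveVMSize.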

This has been tested on 17.02.0 and 16.05.8.

The patch updates the following files ...
doc/man/man5/slurm.conf.5
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.h
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_cpuacct.c
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_memory.c

cheers,
Sam
Comment 1 Sam Gallop 2017-03-03 07:41:53 MST
Created attachment 4147 [details]
SLURM configuration file
Comment 3 Tim Wickberg 2017-03-07 17:26:20 MST
Hi Sam -

Thanks for submitting this, it looks pretty good.

I know it's a bit of work, but would it be possible to separate out the change to the memory calculation from the task directory code? We usually try to structure each logical change as a separate commit to make review and later troubleshooting simpler.

I'm marking this bug as an enhancement, targeting 17.11 as well. I can't make significant changes to how the memory calculations are done in the existing releases, but this does look like something we should explore before the next major release.

thanks,
- Tim
Comment 4 Sam Gallop 2017-03-09 03:03:48 MST
Hi Tim,

I'll do as you say and split out the memory calculation change.  Once done I'll attach both patches to this bugzilla.

thanks,
Sam
Comment 5 Sam Gallop 2017-03-10 09:49:27 MST
Created attachment 4185 [details]
Patch to remove references to task directory in jobacct_gather/cgroup
Comment 6 Sam Gallop 2017-03-10 09:53:39 MST
Created attachment 4186 [details]
Change to memory calculation in jobacct_gather/cgroup
Comment 7 Sam Gallop 2017-03-10 10:15:35 MST
Hi,

I've split the original patch out into two separate patches.  The attachments are ...

jobacct_gather_cgroup_part1.patch - Patch to remove references to task directory in jobacct_gather/cgroup
jobacct_gather_cgroup_part2.patch - Change to memory calculation in jobacct_gather/cgroup

Please let me know if you need anything further from me.

thanks,
Sam
Comment 8 Thomas Opfer 2017-03-21 03:37:34 MDT
Hi Tim,

I agree that attachment 4186 [details] should be delayed until 17.11 as it changes the way consumed memory is calculated, but I think attachment 4185 [details] should be considered a bugfix and added to 17.02.2.

We currently plan to switch from jobacct_gather/linux to jobacct_gather/cgroup, but for some kinds of jobs the current implementation reports AveRSS = 0 and MaxRSS = 0. I think this is a bug and should not be considered "working as intended in this release". Attachment 4185 [details] seems to provide correct results for these jobs.

Just for the sake of completeness: As our own monitoring makes use of MaxRSS, we cannot switch to jobacct_gather/cgroup until this is fixed.

Of course, I would like to thank Sam for providing these improvements.

Best regards,
Thomas
Comment 9 Martins Innus 2017-04-25 13:25:05 MDT
Hi,

    Will the patch in attachment 4185 [details] make it into 17.02? We are seeing the same issue with missing data.  This patch fixes the issue.  Thanks Sam for the fix!

Thanks

Martins
Comment 10 Danny Auble 2017-07-18 14:55:22 MDT
Sam, this patch breaks task/rank-related accounting.  The jobacct_gather plugin is designed to track each task of a step individually so you can identify outliers.  This patch removes that functionality.

It also produces bogus memory figures by multiplying usage by the number of tasks the step runs on the node (meaning in most cases you will get far more memory than was actually used, and the same for CPU time).  If all your jobs run only one task you would not notice this, but any job with more than one task would get exaggerated stats.
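
For example, if a step runs four tasks on a node and the step-level cgroup reports 2 GB of rss, counting that step-level figure once per task would report roughly 8 GB for the step.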

I can easily see the task_* directories.  I am not sure why you aren't seeing them.

I have found that if the tasks end very quickly they can be missed by the final poll, meaning no accounting data is collected.  Perhaps this is what you are seeing?

In any case, could you check whether adding --acctg-freq=1 to your srun line gets you memory stats?  At the moment I believe this is a race condition in the cgroup system more than anything else.  I am looking into other ways to fix it, but I don't believe this patch is the correct fix at the moment.  If you find a magic bullet that always keeps the task cgroup up to date, let me know.

In my testing, setting --acctg-freq=1 seems to always give good results, but I understand that isn't a real fix.
Comment 11 Carlos Fenoy 2017-07-19 09:34:03 MDT
There seems to be a bug/race condition when task/cgroup and jobacct_gather/cgroup are both used.

When submitting a job with just 1 CPU, the process is attached to the step_0 memory cgroup and not to task_0.  However, when requesting more than 1 CPU, it is attached to the right cgroup; see below:

# grep cgroup /etc/slurm/slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

$ srun -c 1 cat /proc/self/cgroup
20:net_cls:/
19:perf_event:/
18:devices:/system.slice/slurm.service
17:cpuacct,cpu:/slurm/uid_82718/job_1349/step_0/task_0
16:cpuset:/slurm/uid_82718/job_1349/step_0
15:memory:/slurm/uid_82718/job_1349/step_0
14:freezer:/slurm/uid_82718/job_1349/step_0
13:blkio:/system.slice/slurm.service
12:hugetlb:/
1:name=systemd:/system.slice/slurm.service

$ srun -c 2 cat /proc/self/cgroup
20:net_cls:/
19:perf_event:/
18:devices:/system.slice/slurm.service
17:cpuacct,cpu:/slurm/uid_82718/job_1348/step_0/task_0
16:cpuset:/slurm/uid_82718/job_1348/step_0
15:memory:/slurm/uid_82718/job_1348/step_0/task_0
14:freezer:/slurm/uid_82718/job_1348/step_0
13:blkio:/system.slice/slurm.service
12:hugetlb:/
1:name=systemd:/system.slice/slurm.service
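
A quick way to check this from inside a step (a standalone diagnostic sketch, not part of the patches in this bug) is to look at the memory-controller line of /proc/self/cgroup and see whether it ends in a task_* directory:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *fp = fopen("/proc/self/cgroup", "r");

	if (!fp) {
		perror("/proc/self/cgroup");
		return 1;
	}
	/* Print the memory controller line and note whether the process
	 * sits in a task_* sub-cgroup, which is where jobacct_gather/cgroup
	 * expects to find it. */
	while (fgets(line, sizeof(line), fp)) {
		if (strstr(line, ":memory:"))
			printf("%s  -> %s a task_* sub-cgroup\n", line,
			       strstr(line, "/task_") ? "inside" : "NOT inside");
	}
	fclose(fp);
	return 0;
}

Launched with srun -c 1 versus srun -c 2, this should reproduce the difference shown above.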
Comment 12 Danny Auble 2017-07-19 15:01:13 MDT
This is fixed in commit e5c0554965712.  We were able to find the race condition and the fix turned out to be fairly straightforward.

As a pid can only be in one cgroup, there was a race between the task plugin putting the pid in the step-level cgroup and the jobacct_gather plugin needing it in the task-level cgroup.  Because the task plugin did this in a forked process, there was a window in which it could run after jobacct_gather had already placed the pid in the task-level cgroup, erroneously moving it back to the step level.  The task plugin call did not need to happen in the fork, so it was moved to just before the jobacct_gather call, removing the race altogether.
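
In very rough terms (the function names below are hypothetical placeholders, not the real Slurm symbols or the contents of the commit), the ordering change amounts to:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-ins for the two attach operations described above. */
static void step_cgroup_attach(pid_t pid)
{
	printf("pid %d -> .../job_X/step_0\n", (int) pid);
}

static void task_cgroup_attach(pid_t pid)
{
	printf("pid %d -> .../job_X/step_0/task_0\n", (int) pid);
}

int main(void)
{
	pid_t pid = getpid();

	/* Before the fix, step_cgroup_attach() ran in a forked helper and
	 * could execute after task_cgroup_attach(), yanking the pid back
	 * up to the step level (the behaviour seen in comment 11).
	 *
	 * After the fix, the two attaches run synchronously in this order,
	 * so the pid always ends up in the task-level cgroup that
	 * jobacct_gather/cgroup reads from. */
	step_cgroup_attach(pid);
	task_cgroup_attach(pid);

	return 0;
}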
Comment 13 Danny Auble 2017-07-19 15:35:01 MDT
*** Bug 3895 has been marked as a duplicate of this bug. ***