Bug 6332 - Wrong CPU times reported by cgroup accounting plugin
Summary: Wrong CPU times reported by cgroup accounting plugin
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
Duplicates: 6095 6315 6344
 
Reported: 2019-01-09 08:57 MST by Jurij Pečar
Modified: 2019-04-01 11:08 MDT

Site: EMBL
Version Fixed: 18.08.5


Attachments:
Two short-lived steps compared against each other (155.63 KB, image/png), 2019-01-16 13:16 MST, Michael Hinton
Two longer-lived steps compared against each other (159.18 KB, image/png), 2019-01-16 13:17 MST, Michael Hinton
AveCPUFreq Discrepancies (81.25 KB, image/png), 2019-01-16 13:18 MST, Michael Hinton

Description Jurij Pečar 2019-01-09 08:57:04 MST
To continue from #6000, and to give #6095 some weight with a bit of whining from a supported site, I'm opening this again.

Relevant config:

JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherParams=NoOverMemoryKill

Problem: CPU usage records in the accounting DB are far too high since the upgrade to 18.08. On 17.11 this worked fine.

Example:

A simple 1-core job with 2 GB of memory, asking for 30 hours of runtime:

$ sacct -j 26232969 -n -X -P --noconvert -o AllocTRES,AllocCPUS,ReqMEM,CputimeRAW,ElapsedRAW
billing=1,cpu=1,energy=17524879,mem=2048M,node=1|1|2048Mc|71246|71246

It ran for a bit less than 20 hours, yet seff reports that it consumed over 21 days of CPU time:

$ seff -d 26232969
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 26232969  pecar users COMPLETED hof 1 1 0 2097152 0 1836162 71246 0 0

Job ID: 26232969
Cluster: hof
User/Group: pecar/users
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 21-06:02:42
CPU Efficiency: 2577.21% of 19:47:26 core-walltime
Job Wall-clock time: 19:47:26
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 2.00 GB (2.00 GB/core)

It gets this number from the DB, so I'm afraid the numbers are wrong there.

I'd like to have two things:
* workaround (switching to jobacct_gather/linux ?)
* understanding how this came to be so we can identify and fix wrong values in the database

Thanks.
Comment 2 Michael Hinton 2019-01-09 09:29:10 MST
Perfect, thanks. I'll look into it!

-Michael
Comment 5 Michael Hinton 2019-01-09 11:02:57 MST
(In reply to Jurij Pečar from comment #0)
> I'd like to have two things:
> * workaround (switching to jobacct_gather/linux ?)
From the user list thread you alluded to in bug 6000 (https://groups.google.com/forum/#!topic/slurm-users/ZRALYg1iKWo), it looks like there is already a proposed workaround of either setting JobAcctGatherFrequency=task=0 or setting JobAcctGatherType=jobacct_gather/linux. If these don't work for you, let me know and I'll try to find a suitable workaround.
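For concreteness, the first workaround looks like this in slurm.conf (a sketch; everything else stays as you have it, and task=0 disables the periodic sampling, so statistics are only gathered when a step finishes):

JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=task=0

The second simply swaps the plugin:

JobAcctGatherType=jobacct_gather/linux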

> * understanding how this came to be so we can identify and fix wrong values
> in the database
I'll work on reproducing it and try to figure out the underlying cause. There has been work done in this area before to try to fix stats for 18.08 (see https://github.com/SchedMD/slurm/commit/5847bd71d0b705).
Comment 6 Michael Hinton 2019-01-09 11:10:50 MST
As mentioned in the user list thread and documented in slurm.conf under JobAcctGatherType, don't switch JobAcctGatherType while you have running job steps:

"NOTE: Changing this configuration parameter changes the contents of the messages between Slurm daemons. Any previously running job steps are managed by a slurmstepd daemon that will persist through the lifetime of that job step and not change its communication protocol. Only change this configuration parameter when there are no running job steps."
Comment 7 Jurij Pečar 2019-01-10 02:55:49 MST
Thanks for the notice. I've scheduled a short maintenance window over the weekend, when I'll make this change, and I'll report back on Monday.
Comment 8 Marshall Garey 2019-01-11 09:13:52 MST
*** Bug 6344 has been marked as a duplicate of this bug. ***
Comment 9 Jurij Pečar 2019-01-14 02:04:44 MST
We've switched to jobacct_gather/linux, and job CPU utilization as reported by seff is back to sane and believable values. However, seff still reports memory utilized as 0.00 MB.
Comment 21 Michael Hinton 2019-01-15 14:26:49 MST
Hey Jurij,

We got it fixed! Here's the commit: https://github.com/SchedMD/slurm/commit/8eb11cf75e3319802ac47d6c3fe165a863de4f59

This will be available in 18.08.5. If you can't wait, here it is in patch form:
https://github.com/SchedMD/slurm/commit/8eb11cf75e3319802ac47d6c3fe165a863de4f59.patch

Just FYI, this broke before slurm-18-08-0-0rc1 (commit 07483564162), which means that versions 18.08.0 through 18.08.4 are all affected.

Thanks,
Michael
Comment 22 Jurij Pečar 2019-01-15 14:35:49 MST
Great, I'll try out the patch tomorrow.

What I don't get from the patch - are the values affected when gathered and stored in db or when fetched from db and interpreted in human units?

If the wrong values are in the db, is there any chance of making them right again?
Comment 23 Michael Hinton 2019-01-15 14:42:07 MST
*** Bug 6095 has been marked as a duplicate of this bug. ***
Comment 24 Michael Hinton 2019-01-15 15:27:22 MST
(In reply to Jurij Pečar from comment #22)
> Great, I'll try out the patch tomorrow.
> 
> What I don't get from the patch - are the values affected when gathered and
> stored in db or when fetched from db and interpreted in human units?
The values are affected as soon as they are read out from the `cpuacct.stat` cgroup subsystem into Slurm on a jobacct_gather poll. So basically it's messed up as early as possible, affecting the values in the db and everything downstream. 
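For illustration, cpuacct.stat holds cumulative user and system CPU time as tick counts in USER_HZ units (the path and numbers here are made up, but the format is real):

$ cat /sys/fs/cgroup/cpuacct/slurm/uid_1000/job_26232969/step_0/cpuacct.stat
user 6953
system 178

The plugin was, in effect, storing those tick counts as if they were seconds, instead of dividing by USER_HZ first.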

> If the wrong values are in the db, is there any chance of making them right
> again?
Well, I think all you need to do is divide the affected DB fields by whatever USER_HZ is on the system, which will likely be 100. For instance, the 1836162 CPU-seconds from your example would become 18361 seconds, or roughly 5 hours of actual CPU time.

From http://man7.org/linux/man-pages/man7/time.7.html:

“The times(2) system call is a special case.  It reports times with a granularity defined by the kernel constant USER_HZ.  User-space applications can determine the value of this constant using sysconf(_SC_CLK_TCK).”

See also "cpuacct.stat" in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpuacct

Figuring out which db fields are affected is more difficult. I'll take a look and see what I can find.
Comment 25 Jurij Pečar 2019-01-16 01:39:45 MST
Looks like our CentOS 7 has CONFIG_HZ=1000 so that would mean division by 1000, right?

Also, I can pin down a fairly exact date when we upgraded to 18.08, and from there the first possible job ID that ran under 18.08. So the SQL query to update the tables would have "where job_id > last_job_under_17.11 and job_id < first_job_after_switching_acctgather". Can you help me with the rest?

Thanks.
Comment 26 Michael Hinton 2019-01-16 09:09:19 MST
I can't guarantee that CONFIG_HZ is what we are looking for. The only reliable way I know of to determine the value of USER_HZ is to call `sysconf(_SC_CLK_TCK)`.

So create a text file called `get_user_hz.c` with the following contents:

#include <stdio.h>
#include <unistd.h>

/* Print the kernel's USER_HZ (clock ticks per second). */
int main(void)
{
    printf("USER_HZ : %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}

Compile it with gcc: `gcc get_user_hz.c`.
Then run it: `./a.out`.
It should tell you the value of USER_HZ for that system.
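Alternatively, if you'd rather skip the compile step, the standard getconf utility reads the same sysconf value:

$ getconf CLK_TCK
100

(100 being the typical value; check it on an actual compute node.)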

Sorry, I prematurely closed this bug. I'll keep it open until we have a migration strategy in place. I'll keep you posted.
Comment 27 Jurij Pečar 2019-01-16 09:26:13 MST
Ok, so the right value is hiding in /lib/modules/`uname -r`/build/include/asm-generic/param.h and is indeed 100.

Had a look around the database, and there is a time_start field in the job_table that is suitable for limiting the fixing query to the subset of affected jobs. But I'm not sure which of the *id* fields corresponds to the real job ID; I'll wait for you to tell me that.

Thanks again.
Comment 28 Michael Hinton 2019-01-16 13:13:00 MST
(In reply to Jurij Pečar from comment #27)
> Ok, so the right value is hiding in /lib/modules/`uname
> -r`/build/include/asm-generic/param.h and is indeed 100.
If you say so. I can’t guarantee it, though.

> Had a look around the database, and there is a time_start field in the
> job_table that is suitable for limiting the fixing query to the subset of
> affected jobs. But I'm not sure which of the *id* fields corresponds to the
> real job ID; I'll wait for you to tell me that.
Actually, it has nothing to do with the job_table and everything to do with the step_table. The affected fields are only in the step_table and include `user_sec`, `sys_sec`, `act_cpufreq`, and `tres_usage_in_[ave|max|min|tot]`. Fields `user_usec` and `sys_usec` shouldn't be affected, because they aren’t coming out of the jobacct_gather/cgroup plugin.

The solution is to _integer_ divide `user_sec` and `sys_sec` by 100 (so if the value is < 100, it will become 0), and to parse `1=<value>` out of the tres strings and integer divide that <value> by 100.  You can target the affected records by querying on `time_start` and `time_end`.

The `act_cpufreq` field (i.e. AveCPUFreq) seems to be 1-2 orders of magnitude lower than normal, and this discrepancy gets worse the longer the step runs. I’m not sure what determines this.

Fixing just `user_sec` and `sys_sec` should be enough to have sacct return the proper value for TotalCPU and for seff to return believable values for CPU Utilized and CPU Efficiency.
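For the two integer fields, that amounts to something like this (a sketch, untested; substitute your cluster name for `xxx` and real timestamps for the placeholders; DIV is MariaDB/MySQL integer division):

update xxx_step_table
   set user_sec = user_sec div 100,
       sys_sec  = sys_sec  div 100
 where time_start > unix_timestamp('<first 18.08 start>')
   and time_start < unix_timestamp('<switch to jobacct_gather/linux>');

The tres_usage_in_* columns are strings, so there the 1=<value> element has to be parsed out and divided separately.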

See the attached pictures.

Could you send us your slurmdbd.conf (with the password removed)? Since this only affects step records, if you purge your step records e.g. every 2 months, you only have 2 months of records to fix.

These affected fields aren't used for anything too important. As far as I understand it, the TRES-related fields are only compared across steps to see if a single step is misbehaving.

> Thanks again.
No problem!
Comment 29 Michael Hinton 2019-01-16 13:16:37 MST
Created attachment 8942 [details]
Two short-lived steps compared against each other
Comment 30 Michael Hinton 2019-01-16 13:17:19 MST
Created attachment 8943 [details]
Two longer-lived steps compared against each other
Comment 31 Michael Hinton 2019-01-16 13:18:52 MST
Created attachment 8944 [details]
AveCPUFreq Discrepancies
Comment 32 Jurij Pečar 2019-01-17 01:55:49 MST
Ok, this is enough info for me to put together a script that will fix affected records. There's just shy of 4M of them so it might take a while. We don't (yet?) purge any job or step info from the db since we find it useful for historical purposes. We might look into this in the future.
Comment 33 Michael Hinton 2019-01-17 10:14:49 MST
(In reply to Jurij Pečar from comment #32)
> Ok, this is enough info for me to put together a script that will fix
> affected records. There's just shy of 4M of them so it might take a while.
> We don't (yet?) purge any job or step info from the db since we find it
> useful for historical purposes. We might look into this in the future.
Great! I'll go ahead and close out the bug then. Please reopen if you have any other issues related to this.

Thanks,
Michael
Comment 34 Michael Hinton 2019-01-18 08:57:10 MST
*** Bug 6315 has been marked as a duplicate of this bug. ***
Comment 35 Steve Ford 2019-02-21 09:40:41 MST
Jurij,

We are currently in the process of fixing this issue in our database. Are you willing to share the script you used to correct your database?

Best,
Steve
Comment 36 Jurij Pečar 2019-03-01 02:34:38 MST
It's nothing fancy. To get the faulty data, I did:

select * from xxx_step_table where time_start > unix_timestamp('2018-11-06:15:30:00') and time_start < unix_timestamp('2019-01-13:20:20:00') into outfile '/tmp/steps';

Then I wrote a simple bash while-read loop to parse the data, fix the faulty values, and update them back in the DB. Something as simple as:

c=1
all=`cat /tmp/steps | wc -l`

# The loop expects '_'-separated fields in /tmp/steps:
# job_db_inx, id_step, user_sec, sys_sec, tres_usage_in_{ave,max,tot}
while IFS=_ read job_db_inx id_step user_sec sys_sec tres_ave tres_max tres_tot
do
        echo -n "fixing $job_db_inx,$id_step ($c of $all) ... "

        # tres strings look like 1=23508383,2=12957013,6=155,7=259102037;
        # only the leading 1=<value> (CPU seconds) needs dividing by 100.
        ave_ok_part="`echo $tres_ave | cut -f2- -d,`"
        ave_val="`echo $tres_ave | cut -f1 -d, | cut -f2 -d=`"

        max_ok_part="`echo $tres_max | cut -f2- -d,`"
        max_val="`echo $tres_max | cut -f1 -d, | cut -f2 -d=`"

        if [ -n "$tres_tot" ]
        then
                tot_ok_part="`echo $tres_tot | cut -f2- -d,`"
                tot_val="`echo $tres_tot | cut -f1 -d, | cut -f2 -d=`"
                echo "update xxx_step_table set user_sec=$(($user_sec/100)), sys_sec=$(($sys_sec/100)), tres_usage_in_ave=\"1=$(($ave_val/100)),$ave_ok_part\", tres_usage_in_max=\"1=$(($max_val/100)),$max_ok_part\", tres_usage_in_tot=\"1=$(($tot_val/100)),$tot_ok_part\" where job_db_inx=$job_db_inx and id_step=$id_step" | $MYSQL
        else
                # No tres_usage_in_tot for this step: skip that column
                # (and don't leave a trailing comma before "where").
                echo "update xxx_step_table set user_sec=$(($user_sec/100)), sys_sec=$(($sys_sec/100)), tres_usage_in_ave=\"1=$(($ave_val/100)),$ave_ok_part\", tres_usage_in_max=\"1=$(($max_val/100)),$max_ok_part\" where job_db_inx=$job_db_inx and id_step=$id_step" | $MYSQL
        fi
        let c+=1
        echo "ok."
done < /tmp/steps

This did fix most of the entries. There were some corner cases that I didn't investigate thoroughly, so if you need 100% accuracy, run this with set -e and investigate each update that fails.
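For reference, the leading 1=<value> element of the tres strings can also be rewritten in pure SQL, under the same assumption my loop makes (the CPU element always comes first). A sketch for one of the three columns, not something I ran myself:

update xxx_step_table
   set tres_usage_in_ave = concat(
           '1=',
           cast(substring_index(substring_index(tres_usage_in_ave, ',', 1), '=', -1) as unsigned) div 100,
           substring(tres_usage_in_ave, locate(',', tres_usage_in_ave)))
 where tres_usage_in_ave like '1=%'
   and time_start > unix_timestamp('2018-11-06:15:30:00')
   and time_start < unix_timestamp('2019-01-13:20:20:00');

(When there is no comma, locate() returns 0 and substring() from position 0 returns the empty string, so single-element strings are handled too.)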