To continue from bug 6000 and to give bug 6095 some weight by way of a complaint from a supported site, I'm opening this again.

Relevant config:

    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/cgroup
    JobAcctGatherParams=NoOverMemoryKill

Problem: CPU usage records in the accounting db are way too high since the upgrade to 18.08. On 17.11 this worked fine.

Example: a simple 1-core, 2GB-memory job asking for 30h of runtime:

    $ sacct -j 26232969 -n -X -P --noconvert -o AllocTRES,AllocCPUS,ReqMEM,CputimeRAW,ElapsedRAW
    billing=1,cpu=1,energy=17524879,mem=2048M,node=1|1|2048Mc|71246|71246

It ran for a bit less than 20h, yet seff reports that it consumed over 21 days of CPU time:

    $ seff -d 26232969
    Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
    Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
    Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
    Slurm data: 26232969  pecar users COMPLETED hof 1 1 0 2097152 0 1836162 71246 0 0
    Job ID: 26232969
    Cluster: hof
    User/Group: pecar/users
    State: COMPLETED (exit code 0)
    Cores: 1
    CPU Utilized: 21-06:02:42
    CPU Efficiency: 2577.21% of 19:47:26 core-walltime
    Job Wall-clock time: 19:47:26
    Memory Utilized: 0.00 MB (estimated maximum)
    Memory Efficiency: 0.00% of 2.00 GB (2.00 GB/core)

seff gets these numbers from the db, so I'm afraid the numbers are wrong there.

I'd like two things:
* a workaround (switching to jobacct_gather/linux?)
* an understanding of how this came to be, so we can identify and fix the wrong values in the database

Thanks.
Perfect, thanks. I'll look into it! -Michael
(In reply to Jurij Pečar from comment #0) > I'd like to have two things: > * workaround (switching to jobacct_gather/linux ?) From the user list thread you alluded to in bug 6000 (https://groups.google.com/forum/#!topic/slurm-users/ZRALYg1iKWo), it looks like there is already a proposed workaround of either setting JobAcctGatherFrequency=task=0 or setting JobAcctGatherType=jobacct_gather/linux. If these don't work for you, let me know and I'll try to find a suitable workaround. > * understanding how this came to be so we can identify and fix wrong values > in the database I'll work on reproducing it and try to figure out the underlying cause. There has been work done in this area before to try to fix stats for 18.08 (see https://github.com/SchedMD/slurm/commit/5847bd71d0b705).
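For concreteness, here are the two proposed workarounds as slurm.conf fragments (my paraphrase of the user list thread; apply one or the other, leaving the rest of the config from comment #0 unchanged):

```
# Option A: switch away from the cgroup gather plugin
JobAcctGatherType=jobacct_gather/linux

# Option B: keep jobacct_gather/cgroup but disable task-level sampling
JobAcctGatherFrequency=task=0
```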
As mentioned in the user list thread and documented in slurm.conf#JobAcctGatherType, don't switch JobAcctGatherType while you have running job steps:

"NOTE: Changing this configuration parameter changes the contents of the messages between Slurm daemons. Any previously running job steps are managed by a slurmstepd daemon that will persist through the lifetime of that job step and not change its communication protocol. Only change this configuration parameter when there are no running job steps."
Thanks for the notice. I've scheduled a short maintenance window over the weekend when I'll make this change, and will report back on Monday.
*** Bug 6344 has been marked as a duplicate of this bug. ***
We've switched to jobacct_gather/linux and job CPU utilization as reported by seff is back to sane and believable values. However, seff still reports memory utilized as 0.00 MB.
Hey Jurij, We got it fixed! Here's the commit: https://github.com/SchedMD/slurm/commit/8eb11cf75e3319802ac47d6c3fe165a863de4f59 This will be available in 18.08.5. If you can't wait, here it is in patch form: https://github.com/SchedMD/slurm/commit/8eb11cf75e3319802ac47d6c3fe165a863de4f59.patch Just FYI, this broke before slurm-18-08-0-0rc1 (commit 07483564162), which means that versions 18.08.0 through 18.08.4 are all affected. Thanks, Michael
Great, I'll try out the patch tomorrow.

What I don't get from the patch: are the values affected when gathered and stored in the db, or when fetched from the db and interpreted in human units?

If the wrong values are in the db, is there any chance of making them right again?
*** Bug 6095 has been marked as a duplicate of this bug. ***
(In reply to Jurij Pečar from comment #22)
> Great, I'll try out the patch tomorrow.
>
> What I don't get from the patch - are the values affected when gathered and
> stored in db or when fetched from db and interpreted in human units?

The values are affected as soon as they are read out from the `cpuacct.stat` cgroup subsystem into Slurm on a jobacct_gather poll. So basically it's messed up as early as possible, affecting the values in the db and everything downstream.

> If the wrong values are in the db, is there any chance of making them right
> again?

Well, I think all you need to do is divide the affected db fields by whatever USER_HZ is on the system, which will likely be 100. From http://man7.org/linux/man-pages/man7/time.7.html:

"The times(2) system call is a special case. It reports times with a granularity defined by the kernel constant USER_HZ. User-space applications can determine the value of this constant using sysconf(_SC_CLK_TCK)."

See also "cpuacct.stat" in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpuacct

Figuring out which db fields are affected is more difficult. I'll take a look and see what I can find.
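To illustrate, a small shell sketch (assuming a Linux system where `getconf CLK_TCK` reports USER_HZ) that divides the raw Cput value from comment #0 back down to seconds:

```shell
# USER_HZ as seen by userspace; getconf CLK_TCK wraps sysconf(_SC_CLK_TCK)
user_hz=$(getconf CLK_TCK)
echo "USER_HZ = $user_hz"

# The raw Cput from the seff output in comment #0 (1836162) is really
# USER_HZ ticks, not seconds; on a USER_HZ=100 system this recovers
# 18361 s, i.e. about 5.1 h of actual CPU time for a 1-core job
raw_cput=1836162
echo "corrected CPU seconds: $(( raw_cput / user_hz ))"
```

Note that 18361 s over the 19:47:26 wall time is about 25.77% efficiency, which matches the reported "2577.21%" divided by 100.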
Looks like our CentOS 7 has CONFIG_HZ=1000, so that would mean division by 1000, right?

Also, I can get a pretty exact date of when we upgraded to 18.08, and from there the first possible job id that ran under 18.08. So the SQL query to update the tables would have "where job_id > last_job_under_17.11 and job_id < first_job_after_switching_acctgather". Can you help me with the rest?

Thanks.
I can't guarantee that CONFIG_HZ is what we are looking for. The only reliable way I know of determining the value of USER_HZ is to call `sysconf(_SC_CLK_TCK)`. So create a text file called `get_user_hz.c` with the following contents:

    #include <stdio.h>
    #include <unistd.h>

    int main()
    {
        fprintf(stdout, "USER_HZ : %ld\n", sysconf(_SC_CLK_TCK));
    }

Compile it with gcc: `gcc get_user_hz.c`. Then run it: `./a.out`. It should tell you the value of USER_HZ for that system.

Sorry, I prematurely closed this bug. I'll keep it open until we have a migration strategy in place. I'll keep you posted.
Ok, so the right value is hiding in /lib/modules/`uname -r`/build/include/asm-generic/param.h and is indeed 100. Had a look around the database and there is a time_start field in the job_table that is suitable for us to limit the fixing query on the subset of jobs. But I'm not sure which of the *id* fields is the one that corresponds to real job id. I'll wait for you to tell me that. Thanks again.
(In reply to Jurij Pečar from comment #27)
> Ok, so the right value is hiding in /lib/modules/`uname
> -r`/build/include/asm-generic/param.h and is indeed 100.

If you say so. I can't guarantee it, though.

> Had a look around the database and there is a time_start field in the
> job_table that is suitable for us to limit the fixing query on the subset of
> jobs. But I'm not sure which of the *id* fields is the one that corresponds
> to real job id. I'll wait for you to tell me that.

Actually, it has nothing to do with the job_table and everything to do with the step_table. The affected fields are only in the step_table and include `user_sec`, `sys_sec`, `act_cpufreq`, and `tres_usage_in_[ave|max|min|tot]`. The `user_usec` and `sys_usec` fields shouldn't be affected, because they aren't coming out of the jobacct_gather/cgroup plugin.

The solution is to _integer_ divide `user_sec` and `sys_sec` by 100 (so if the value is < 100, it will become 0), and to parse `1=<value>` out of the tres strings and integer divide that <value> by 100. You can target the affected records by querying on `time_start` and `time_end`.

The `act_cpufreq` field (i.e. AveCPUFreq) seems to be 1-2 orders of magnitude lower than normal, and this discrepancy gets worse the longer the step runs. I'm not sure what determines this.

Fixing just `user_sec` and `sys_sec` should be enough to have sacct return the proper value for TotalCPU and for seff to return believable values for CPU Utilized and CPU Efficiency. See the attached pictures.

Could you send us your slurmdbd.conf (with the password removed)?

Since this only affects step records, if you purge your step records e.g. every 2 months, you only have 2 months of records to fix. These affected fields aren't used in anything too important. As far as I understand it, the TRES-related fields are only used relative to other steps to see if a single step is misbehaving.

> Thanks again.

No problem!
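For the two plain integer columns, the correction can be expressed in SQL alone. A hypothetical sketch follows; the `hof_` cluster prefix and the epoch window are placeholders, MySQL's integer-division operator `DIV` is assumed, and the `tres_usage_in_*` strings still need the separate string parsing described above. It only prints the statement, so it can be reviewed before piping into mysql:

```shell
# Print a candidate UPDATE for the plain integer tick columns only;
# table prefix "hof_" and the time window are placeholders
cat <<'SQL'
UPDATE hof_step_table
   SET user_sec = user_sec DIV 100,
       sys_sec  = sys_sec  DIV 100
 WHERE time_start > UNIX_TIMESTAMP('2018-11-06 15:30:00')
   AND time_end   < UNIX_TIMESTAMP('2019-01-13 20:20:00');
SQL
```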
Created attachment 8942 [details] Two short-lived steps compared against each other
Created attachment 8943 [details] Two longer-lived steps compared against each other
Created attachment 8944 [details] AveCPUFreq Discrepancies
Ok, this is enough info for me to put together a script that will fix affected records. There's just shy of 4M of them so it might take a while. We don't (yet?) purge any job or step info from the db since we find it useful for historical purposes. We might look into this in the future.
(In reply to Jurij Pečar from comment #32) > Ok, this is enough info for me to put together a script that will fix > affected records. There's just shy of 4M of them so it might take a while. > We don't (yet?) purge any job or step info from the db since we find it > useful for historical purposes. We might look into this in the future. Great! I'll go ahead and close out the bug then. Please reopen if you have any other issues related to this. Thanks, Michael
*** Bug 6315 has been marked as a duplicate of this bug. ***
Jurij, We are currently in the process of fixing this issue in our database. Are you willing to share the script you used to correct your database? Best, Steve
It's nothing fancy. I did

    select * from xxx_step_table
     where time_start > unix_timestamp('2018-11-06:15:30:00')
       and time_start < unix_timestamp('2019-01-13:20:20:00')
      into outfile '/tmp/steps'

to get the faulty data. Then I wrote a simple bash while-read loop to parse the data, fix the faulty values and update them back in the db. Something as simple as

    c=1
    all=`cat /tmp/steps | wc -l`
    while IFS=_ read job_db_inx id_step user_sec sys_sec tres_ave tres_max tres_tot
    do
        echo -n "fixing $job_db_inx,$id_step ($c of $all) ... "
        # tres looks like 1=23508383,2=12957013,6=155,7=259102037
        ave_ok_part="`echo $tres_ave | cut -f2- -d,`"
        ave_val="`echo $tres_ave | cut -f1 -d, | cut -f2 -d=`"
        max_ok_part="`echo $tres_max | cut -f2- -d,`"
        max_val="`echo $tres_max | cut -f1 -d, | cut -f2 -d=`"
        if [ -n "$tres_tot" ]
        then
            tot_ok_part="`echo $tres_tot | cut -f2- -d,`"
            tot_val="`echo $tres_tot | cut -f1 -d, | cut -f2 -d=`"
            echo "update xxx_step_table set user_sec=$(($user_sec/100)), sys_sec=$(($sys_sec/100)), tres_usage_in_ave=\"1=$(($ave_val/100)),$ave_ok_part\", tres_usage_in_max=\"1=$(($max_val/100)),$max_ok_part\", tres_usage_in_tot=\"1=$(($tot_val/100)),$tot_ok_part\" where job_db_inx=$job_db_inx and id_step=$id_step" | $MYSQL
        else
            echo "update xxx_step_table set user_sec=$(($user_sec/100)), sys_sec=$(($sys_sec/100)), tres_usage_in_ave=\"1=$(($ave_val/100)),$ave_ok_part\", tres_usage_in_max=\"1=$(($max_val/100)),$max_ok_part\" where job_db_inx=$job_db_inx and id_step=$id_step" | $MYSQL
        fi
        let c+=1
        echo "ok."
    done < /tmp/steps

(Note the else branch must not have a trailing comma before "where", or the update fails with a SQL syntax error.) This fixed most of the entries. There were some corner cases that I didn't investigate thoroughly, so if you need 100% accuracy, run this with set -e and investigate each one that fails.