Ticket 6541 - sacct totalcpu too large
Summary: sacct totalcpu too large
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-19 10:53 MST by Jenny Williams
Modified: 2021-05-10 01:58 MDT (History)
1 user

See Also:
Site: University of North Carolina at Chapel Hill
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
cgroup.conf (319 bytes, text/plain)
2019-04-01 09:22 MDT, Jenny Williams
Details
slurm.conf (4.92 KB, text/plain)
2019-04-01 09:23 MDT, Jenny Williams
Details
Add debug traces to follow UserCPU (4.80 KB, patch)
2019-05-30 09:13 MDT, Albert Gil
Details | Diff

Description Jenny Williams 2019-02-19 10:53:30 MST
# sacct -j 12524024 --format=elapsed,cputime,totalcpu,reqtres -p 
Elapsed|CPUTime|TotalCPU|ReqTRES|
01:50:58|11:05:48|32-08:56:02|billing=6,cpu=6,mem=10G,node=1|
01:50:58|11:05:48|32-08:56:02||
01:50:58|11:05:48|00:00.002||


How can the TotalCPU time from sacct be ~64x the CPUTime (elapsed x # CPUs)?

Most importantly, might this be affecting users' fairshare calculations when using fair_tree?
Comment 1 Albert Gil 2019-02-20 12:52:03 MST
Hi Jenny,

> How can the totalcpu time from sacct be ~ 64 x the CPU time ( elapsed x # cpus )

It looks like the job accounting gathering is probably not working correctly.
Could you post the output of these commands:

$ sacct -j 12524024 -p --format=elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
$ scontrol show config | grep Gather
$ scontrol show config | grep ProctrackType


> Most importantly, might this be affecting users fairshare calculation when using fair_tree?

No.
Fair-share is based on what the job asked for (the requested/allocated resources), not on what the job "physically" used (the gathered usage).

For example,
We have two users, Bob and Sue, with the same Share and currently no EffectiveUsage (no jobs run so far), so the same FairShare:

$ sshare -u bob,sue

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
  development               bob          1    0.333333           0      0.000000   0.600000
  development               sue          1    0.333333           0      0.000000   0.600000


They each run one job asking for the same TRES, 1 CPU for 1 minute (CPUTime), but their actual tasks differ greatly in CPU consumption (TotalCPU):

bob$ srun sleep 60           # CPUTime should be 60s and TotalCPU should be close to 0s
sue$ srun stress -t 60 -c 1  # CPUTime should be 60s and TotalCPU should be close to 60s

We can see that sacct shows that Sue used much more TotalCPU, but both used the same CPUTime:

$ sacct -u bob,sue --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres

     User        JobID    Elapsed    CPUTime  SystemCPU    UserCPU   TotalCPU    ReqTRES
--------- ------------ ---------- ---------- ---------- ---------- ---------- ----------
      bob 45             00:01:01   00:01:01   00:00:00   00:00:00   00:00:00 billing=1+
          45.extern      00:01:01   00:01:01   00:00:00   00:00:00   00:00:00
          45.0           00:01:01   00:01:01   00:00:00   00:00:00   00:00:00
      sue 46             00:01:01   00:01:01  00:00.380  00:57.904  00:58.284 billing=1+
          46.extern      00:01:01   00:01:01   00:00:00   00:00:00   00:00:00
          46.0           00:01:01   00:01:01  00:00.380  00:57.904  00:58.284

But despite this difference in TotalCPU, if we check sshare again we can see that their EffectiveUsage is the same, and therefore their FairShare factors remain balanced:

$ sshare -u bob,sue

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
  development               bob          1    0.333333          60      0.085959   0.600000
  development               sue          1    0.333333          60      0.085959   0.600000

Finally, if for example Bob does an extra job (sleep):

bob$ srun sleep 60

Then Sue has a higher FairShare:

$ sshare -u bob,sue

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
  development               bob          1    0.333333         120      0.157375   0.400000
  development               sue          1    0.333333          60      0.079243   0.600000


The FairShare value in sshare is a relation between NormShares (the % of TRES that the account is granted/promised) and EffectiveUsage (the % of TRES that the account has consumed so far). Gathered info is not used.
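Reading the numbers above: RawUsage accumulates allocated CPU-seconds (CPUTime), so Bob's two 1-CPU x 60-second jobs add up to roughly 2 x 60 = 120 while Sue's single job gives roughly 60, exactly as sshare shows, regardless of Sue's ~58 s of TotalCPU versus Bob's ~0 s.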


Albert
Comment 2 Jenny Williams 2019-02-20 15:56:53 MST
# sacct -j 12524024 -p --format=elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
Elapsed|CPUTime|SystemCPU|UserCPU|TotalCPU|ReqTRES|
01:50:58|11:05:48|03:11:50|32-05:44:11|32-08:56:02|billing=6,cpu=6,mem=10G,node=1|
01:50:58|11:05:48|03:11:50|32-05:44:11|32-08:56:02||
01:50:58|11:05:48|00:00.001|00:00.001|00:00.002||

# scontrol show config |egrep "Gather|ProctrackType"
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
JobAcctGatherFrequency  = task=15
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = (null)
ProctrackType           = proctrack/cgroup
Comment 3 Albert Gil 2019-02-21 03:46:00 MST
Jenny,

I'm not certain, but the SystemCPU value looks more reasonable than the UserCPU one.
Maybe both are wrong, though.

Is this happening to all jobs or just some?
Is this happening to all nodes or just some?

From your config, these values should be obtained by reading the unnormalized tick counts from /sys/fs/cgroup/cpuacct/slurm/uid_NN/job_NN/cpuacct.stat and dividing/normalizing them by the clock-tick rate obtained by calling sysconf(_SC_CLK_TCK) (usually 100).
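If it helps as a quick manual check (this is just a sketch of that normalization, not Slurm's own code; substitute the real uid and job id for the NN placeholders):

$ HZ=$(getconf CLK_TCK)                # usually 100
$ STAT=/sys/fs/cgroup/cpuacct/slurm/uid_NN/job_NN/cpuacct.stat
$ awk -v hz="$HZ" '{ printf "%s %.2f s\n", $1, $2 / hz }' "$STAT"   # lines are "user <ticks>" / "system <ticks>"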

We can try to run some tests to inspect these values.
You could run these kinds of commands and post the values of their respective cpuacct.stat files:

$ srun -n 1 stress -c 1 -t 60
$ srun -n 2 stress -c 1 -t 60

And maybe also:

$ srun -n 1 sleep 60
$ srun -n 2 sleep 60

What are the values of CPUTime, SystemCPU, UserCPU and TotalCPU of these jobs?
If they look fine, you could try running them on the same node where job 12524024 ran.

You can also build the following code and run it on the failing nodes:

#include <stdio.h>
#include <unistd.h>

/* Print the kernel clock-tick rate (_SC_CLK_TCK) used to normalize
 * the cpuacct tick counters into seconds. */
int main()
{
    fprintf(stdout, "No. of clock ticks per sec : %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}
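For example (the file name is just an illustration), you can compile and run it on a specific node with something like:

$ gcc -o clk_tck clk_tck.c
$ srun -w <failing_node> ./clk_tck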
Comment 4 Albert Gil 2019-02-26 11:04:36 MST
Jenny,
Could you also attach the cgroup.conf?

In some scenarios, if you don't ConstrainCores, then TotalCPU could be higher than CPUTime. Are you constraining cores in cgroup.conf?

Regards,
Albert
Comment 5 Albert Gil 2019-03-06 07:19:28 MST
Hi Jenny,

How is it going?
Have you been able to try what we mentioned in comment 3?


Albert
Comment 6 Albert Gil 2019-03-19 05:58:19 MDT
Hi Jenny,
Just pinging to see whether we can still help you here.
Comment 7 Albert Gil 2019-03-25 10:08:40 MDT
Hi Jenny,
I'm closing this bug as timedout, but please don't hesitate to reopen it again if you still face the issue or want further info.

Albert
Comment 8 Jenny Williams 2019-03-27 13:45:33 MDT
Reopening this case after having updated Slurm on the cluster to 18.08.6-2.
Every compute node in the list below gives the same reply to the clock-tick question:
----------------
b[1001-1027],c[0301-0320,0401-0410,0501-0540,0801-0840,0901-0940,1101-1140],g[0301-0310,0312-0316,0601-0605],m[1001-1002,1004-1006],off[01-04],s[1201-1208],t[0601-0605]
----------------
No. of clock ticks per sec : 100



----------------
b[1001-1027],c[0301-0320,0401-0410,0501-0540,0801-0840,0901-0940,1101-1140],g[0301-0310,0312-0316,0601-0605],m1004,off[01-04],t[0601-0605]
----------------
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--

ConstrainRAMSpace=yes
CgroupReleaseAgentDir=/etc/slurm/cgroup
ConstrainCores=yes
TaskAffinity=yes
CgroupAutomount=yes



$ srun -n 1 stress -c 1 -t 60
srun: job 16781193 queued and waiting for resources
srun: job 16781193 has been allocated resources
stress: info: [21204] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [21204] successful run completed in 60s


$ srun -n 2 stress -c 1 -t 60
srun: job 16781255 queued and waiting for resources
srun: job 16781255 has been allocated resources
stress: info: [176692] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [176693] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [176692] successful run completed in 60s
stress: info: [176693] successful run completed in 60s
Comment 9 Jenny Williams 2019-03-28 08:14:21 MDT
Surveying just the 1-CPU jobs over the last day, I am not seeing any jobs that do not show this behavior.
Comment 10 Albert Gil 2019-04-01 06:18:15 MDT
Hi Jenny,

> Reopening this case after having updated slurm on the cluster to 18.08-6.2
> Listing for every compute node is the same reply for the clock-tick question
> ----------------
> b[1001-1027],c[0301-0320,0401-0410,0501-0540,0801-0840,0901-0940,1101-1140],
> g[0301-0310,0312-0316,0601-0605],m[1001-1002,1004-1006],off[01-04],s[1201-
> 1208],t[0601-0605]
> ----------------
> No. of clock ticks per sec : 100

This is quite expected.
Ok.


> # Slurm cgroup support configuration file
> ConstrainRAMSpace=yes
> CgroupReleaseAgentDir=/etc/slurm/cgroup
> ConstrainCores=yes
> TaskAffinity=yes
> CgroupAutomount=yes

About CgroupReleaseAgentDir:

Please note that it was deprecated and ignored in 17.11.
You can safely remove it, but it's not causing any problem.

About TaskAffinity=yes:

This is the related info that you can see in slurm.conf:

It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.
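In configuration terms, that recommendation looks like this (illustrative snippet, adapt to your setup):

# slurm.conf
TaskPlugin=task/affinity,task/cgroup
# cgroup.conf
TaskAffinity=no
ConstrainCores=yes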


Jenny, if you are already using task/affinity and you also have TaskAffinity=yes here, this could lead to problems. Please double check it.
Anyway, although I'm not totally certain, this shouldn't lead to the issue reported here.

> $ srun -n 1 stress -c 1 -t 60
> srun: job 16781193 queued and waiting for resources
> srun: job 16781193 has been allocated resources
> stress: info: [21204] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
> stress: info: [21204] successful run completed in 60s
> 
> 
> $ srun -n 2 stress -c 1 -t 60
> srun: job 16781255 queued and waiting for resources
> srun: job 16781255 has been allocated resources
> stress: info: [176692] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
> stress: info: [176693] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
> stress: info: [176692] successful run completed in 60s
> stress: info: [176693] successful run completed in 60s


Ok!
Now, what's the value of:

$ sacct -j 16781193,16781255 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres

Please note that version 18.08.6 has a regression (bug 6697), and you will probably need to specify the starttime like this:

$ sacct -j 16781193,16781255 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres --starttime 2019-03-01


> Surveying just 1 cpu jobs over the last day I am not seeing any jobs that are not showing this behavior.

If I'm understanding it correctly, you mean that the above commands are reporting TotalCPU >> CPUTime?
In that case, could you please run the commands again and, while they are running, check the values of /sys/fs/cgroup/cpuacct/slurm/uid_NN/job_NN/cpuacct.stat on the node where they run?
Comment 11 Jenny Williams 2019-04-01 07:42:38 MDT
$ sacct -j 16781193,16781255 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres --starttime 2019-03-01
User|JobID|Elapsed|CPUTime|SystemCPU|UserCPU|TotalCPU|ReqTRES|
jennyw|16781193|00:01:01|00:01:01|00:00.003|01:39:55|01:39:55|billing=1,cpu=1,mem=400M,node=1|
|16781193.extern|00:01:01|00:01:01|00:00:00|00:00.001|00:00.001||
|16781193.0|00:01:00|00:01:00|00:00.003|01:39:55|01:39:55||
jennyw|16781255|00:01:01|00:02:02|00:00.005|03:19:46|03:19:46|billing=2,cpu=2,mem=800M,node=1|
|16781255.extern|00:01:01|00:02:02|00:00:00|00:00:00|00:00:00||
|16781255.0|00:01:00|00:02:00|00:00.004|03:19:46|03:19:46||
Comment 12 Albert Gil 2019-04-01 08:35:32 MDT
Hi Jenny,

From these two commands:
> $ srun -n 1 stress -c 1 -t 60
> $ srun -n 2 stress -c 1 -t 60

These UserCPU values obtained are not correct:
> User   | JobID           | Elapsed  | CPUTime  | SystemCPU | UserCPU   | TotalCPU  | ReqTRES                         |
> jennyw | 16781193        | 00:01:01 | 00:01:01 | 00:00.003 | 01:39:55  | 01:39:55  | billing=1,cpu=1,mem=400M,node=1 |
>        | 16781193.extern | 00:01:01 | 00:01:01 | 00:00:00  | 00:00.001 | 00:00.001 |                                 |
>        | 16781193.0      | 00:01:00 | 00:01:00 | 00:00.003 | 01:39:55  | 01:39:55  |                                 |
> jennyw | 16781255        | 00:01:01 | 00:02:02 | 00:00.005 | 03:19:46  | 03:19:46  | billing=2,cpu=2,mem=800M,node=1 |
>        | 16781255.extern | 00:01:01 | 00:02:02 | 00:00:00  | 00:00:00  | 00:00:00  |                                 |
>        | 16781255.0      | 00:01:00 | 00:02:00 | 00:00.004 | 03:19:46  | 03:19:46  |                                 |

It seems that the job accounting gathering is not working properly.
In comment 2 you said that you are using jobacct_gather/cgroup.
Let's try to see what cgroup is actually reporting.

To do that, first get an interactive allocation on a node:

$ salloc -c 1 -n 1 srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL


Then run a 10-second stress task/step in the background and quickly check the cgroup counters like this:


$ srun -o /dev/null -n1 stress -c 1 --timeout 10 &
$ for ii in {1..10}; do cat /sys/fs/cgroup/cpuacct/slurm_c1/uid_$UID/job_$SLURM_JOB_ID/cpuacct.stat; sleep 1; done
user 502
system 3
user 602
system 3
user 702
system 3
user 802
system 3
user 902
system 3
user 1002
system 4
[1]+  Done                    srun -o /dev/null -n1 stress -c 1 --timeout 10
user 1003
system 4
user 1003
system 4
user 1003
system 4
user 1003
system 4


Note that in my case it is working properly: since I also have _SC_CLK_TCK = 100 and I'm querying cgroups once per second, the reported user CPU value increases by about 100 ticks each time until the srun stops, and then it doesn't increase any more.
What happens in your case?
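For reference, reading the counters above: a 10-second stress on 1 CPU at 100 ticks per second should accumulate roughly 10 x 100 = 1000 user ticks (the counter settles at 1003 once the step finishes), which normalizes to about 10 seconds of UserCPU, so sacct should report something close to 00:00:10 for that step rather than minutes or hours.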

Please also try with 2 CPUs and post the output. Note that in that case the expected increment is 200 ticks per second:

$ salloc -c 1 -n 2 srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL
$ srun -o /dev/null -n2 stress -c 1 --timeout 10 &
$ for ii in {1..10}; do cat /sys/fs/cgroup/cpuacct/slurm_c1/uid_$UID/job_$SLURM_JOB_ID/cpuacct.stat; sleep 1; done
user 340
system 2
user 539
system 2
user 739
system 2
user 939
system 3
user 1139
system 3
user 1339
system 3
user 1540
system 3
user 1740
system 3
user 1939
system 3
[1]+  Done                    srun -o /dev/null -n2 stress -c 1 --timeout 10
user 2000
system 4
Comment 13 Albert Gil 2019-04-01 09:19:43 MDT
I forgot to mention this:

> Jenny, if you are already using task/affinity and you also have
> TaskAffinity=yes here, this could lead to problems. Please double check it.
> Anyway, although I'm not totally certain, this shouldn't lead to the issue
> reported here.

Could you confirm that you are not mixing the task/affinity plugin in slurm.conf with TaskAffinity=yes in cgroup.conf?
In fact, could you attach your slurm.conf and cgroup.conf?
Also, the output of "uname -a" and "ls -lh /sys/fs/cgroup" on a failing node would help us replicate the issue.
Comment 14 Jenny Williams 2019-04-01 09:22:23 MDT
Created attachment 9761 [details]
cgroup.conf
Comment 15 Jenny Williams 2019-04-01 09:23:05 MDT
Created attachment 9762 [details]
slurm.conf
Comment 16 Albert Gil 2019-04-01 11:08:26 MDT
Hi Jenny,

Looking further, this bug looks like a duplicate of bug 6332, which was fixed in 18.08.5.
So either that fix didn't solve it in your case, we have another very similar bug, or maybe we are actually still facing the same one.

Could you double-check that your currently installed lib/slurm/jobacct_gather_cgroup.so has actually been recompiled for 18.08.6?
And that it has been propagated/shared to the nodes that now run slurmd 18.08.6..? Sorry, just to be sure.
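For example (just a suggestion; adjust the path to your install prefix), running something like this on a compute node would show which package owns the plugin and when it was built:

$ rpm -qf /usr/lib64/slurm/jobacct_gather_cgroup.so
$ rpm -qi slurm | grep -E 'Version|Release|Build Date'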

Please note that even with the fix, jobs that ran on versions 18.08.0 to 18.08.4 will still have wrong values in the database.
New jobs shouldn't, though.


Albert
Comment 17 Jenny Williams 2019-04-01 11:29:53 MDT
$ slurmd -V
slurm 18.08.6-2

$ srun -o /dev/null -n1 stress -c 1 --timeout 10

# in another session -

# for ii in {1..40}
> do
> cat /sys/fs/cgroup/cpuacct/slurm/uid_28534/job_17230410/cpuacct.stat
> sleep 1
> done
user 31
system 25
user 31
system 25
user 31
system 25
user 31
system 25
user 31
system 25
user 32
system 26
user 49
system 27
user 149
system 27
user 250
system 27
user 350
system 27
user 450
system 27
user 550
system 27
user 651
system 27
user 751
system 27
user 851
system 27
user 951
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
Comment 18 Jenny Williams 2019-04-01 11:43:34 MDT
A run more similar to yours, with the commands in the same session.

$ date; srun -o /dev/null -n1 stress -c 1 --timeout 10 &  for ii in {1..10}; do cat /sys/fs/cgroup/cpuacct/slurm/uid_28534/job_17230448/cpuacct.stat ; sleep 1; date;  done 
Mon Apr  1 13:42:07 EDT 2019
[1] 157244
user 2023
system 26
Mon Apr  1 13:42:08 EDT 2019
user 2115
system 27
Mon Apr  1 13:42:09 EDT 2019
user 2215
system 27
Mon Apr  1 13:42:10 EDT 2019
user 2315
system 28
Mon Apr  1 13:42:11 EDT 2019
user 2415
system 28
Mon Apr  1 13:42:12 EDT 2019
user 2515
system 29
Mon Apr  1 13:42:13 EDT 2019
user 2615
system 29
Mon Apr  1 13:42:14 EDT 2019
user 2716
system 30
Mon Apr  1 13:42:15 EDT 2019
user 2817
system 30
Mon Apr  1 13:42:16 EDT 2019
user 2917
system 30
Mon Apr  1 13:42:17 EDT 2019
[jennyw@c0501 ~]$ jobs
[1]+  Done                    srun -o /dev/null -n1 stress -c 1 --timeout 10
Comment 19 Jenny Williams 2019-04-01 11:47:11 MDT
(In reply to Albert Gil from comment #16)
> Hi Jenny,
> 
> Looking further this bug looks like a duplicate of bug 6332, but it was
> fixed for 18.08.5.
> So, even we didn't solved in your case, or we have another very similar bug,
> or maybe we are actually facing the same bug?
> 
> Could you double-check that your currently installed
> lib/slurm/jobacct_gather_cgroup.so has been actually recompiled for 18.08.6?
> And propagated/shared to the nodes that now run slurmd 18.08.6..? Sorry,
> just to be sure.
> 
> Please note that even with the fix, the jobs run with versions from 18.08.0
> to 18.08.4 the values in the database will still be wrong.
> But for new jobs it shouldn't.
> 
> 
> Albert

The lib was installed with the rpm update. This is checking against the node on which the test ran.

# ls -lth /usr/lib64/slurm/jobacct_gather_cgroup.so
-rwxr-xr-x 1 root root 339K Mar 11 09:36 /usr/lib64/slurm/jobacct_gather_cgroup.so


# rpm -qi slurm
Name        : slurm
Version     : 18.08.6
Release     : 2.el7
Architecture: x86_64
Install Date: Thu 21 Mar 2019 05:22:23 PM EDT
Group       : System Environment/Base
Size        : 59455659
License     : GPLv2+
Signature   : (none)
Source RPM  : slurm-18.08.6-2.el7.src.rpm
Build Date  : Mon 11 Mar 2019 09:38:47 AM EDT
Build Host  : longleaf-test02.its.unc.edu
Relocations : (not relocatable)
URL         : https://slurm.schedmd.com/
Summary     : Slurm Workload Manager
Description :
Slurm is an open source, fault-tolerant, and highly scalable
cluster management and job scheduling system for Linux clusters.
Components include machine status, partition management,
job management, scheduling and accounting modules
Comment 20 Jenny Williams 2019-04-01 11:48:19 MDT
$ sacct -j 17230448 -S 4/1 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
User|JobID|Elapsed|CPUTime|SystemCPU|UserCPU|TotalCPU|ReqTRES|
jennyw|17230448|00:11:01|00:11:01|00:00.322|00:30.200|00:30.523|billing=1,cpu=1,mem=4000M,node=1|
|17230448.extern|00:11:01|00:11:01|00:00.001|00:00:00|00:00.001||
|17230448.0|00:10:58|00:10:58|00:00.309|00:00.324|00:00.633||
|17230448.1|00:00:11|00:00:11|00:00.002|00:09.998|00:00:10||
|17230448.2|00:00:11|00:00:11|00:00.004|00:09.937|00:09.942||
|17230448.3|00:00:10|00:00:10|00:00.004|00:09.940|00:09.945||
Comment 21 Albert Gil 2019-04-03 03:26:34 MDT
Hi Jenny,
Your data is showing some kind of bug in Slurm.
The info from the kernel cgroups looks fine, but the values reported by sacct are not.

I'm not yet able to replicate it. Could you post the output of "uname -a" and "cat /etc/os-release"?

If I'm not able to replicate it, I will provide you with a patch that adds extra logging about how these values are gathered, accounted and printed, to see where exactly they get corrupted in your cluster.
Comment 24 Jenny Williams 2019-05-16 14:17:04 MDT
# cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.6"

# uname -a
Linux longleaf-sched.its.unc.edu 3.10.0-957.10.1.el7.x86_64 #1 SMP Thu Feb 7 07:12:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Comment 25 Albert Gil 2019-05-22 06:53:34 MDT
Hi Jenny,

I just want to let you know that I'm still trying to reproduce your issue, replicating your environment as closely as possible.

I'll let you know,
Albert
Comment 26 Jenny Williams 2019-05-22 14:34:45 MDT
If it would help we could do a remote session so you could see it.

Comment 27 Albert Gil 2019-05-27 08:33:02 MDT
Hi Jenny,

Although I've replicated your exact environment (kernel, OS, Slurm version and config), I have not been able to reproduce your issue.


I will attach a patch that adds verbose information about these specific values, to see what is going on in your cluster.
Just to double-check, could you confirm that you are still using slurm 18.08.6-2?


Albert
Comment 28 Jenny Williams 2019-05-28 09:08:28 MDT
We are using that Slurm version, yes.

Comment 34 Albert Gil 2019-05-30 09:13:09 MDT
Created attachment 10434 [details]
Add debug traces to follow UserCPU

Hi Jenny,

This small patch adds some extra debug traces to follow how the UserCPU is gathered, stored and printed.
Please apply it to your environment and then run the commands that we already used to quickly replicate the issue:

$ srun -n 1 stress -c 1 -t 60
$ srun -n 2 stress -c 1 -t 60

Then also run sacct with extra -vv for the two jobs submitted previously (jobid_1 and jobid_2):

$ sacct -vvp -j $jobid_1,$jobid_2 --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres

Then please attach the slurmctld.log, slurmdbd.log and the slurmd.log (of the node where the job was run).

You can remove the patch once those commands have been run, but leaving it in won't hurt.
The patch is based on version 18.08.6-2.
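In case it helps, one way to apply it (illustrative only; the directory and patch file names are assumptions, and if you build from the src.rpm you would add the patch to the spec file instead):

$ cd slurm-18.08.6-2
$ patch -p1 < debug_usercpu_traces.patch

Then rebuild and redeploy the affected daemons as usual.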


Thanks,
Albert


PS:
As mentioned in comment 16, from your data this bug really seems to be a duplicate of bug 6332. Your UserCPU seems to be 100 times bigger than it should be, and 100 is your _SC_CLK_TCK. That's exactly what we fixed in 18.08.5: https://github.com/SchedMD/slurm/commit/8eb11cf75e3319802ac47d6c3fe165a863de4f59
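For example, checking against the sacct output in comment 11: step 16781193.0 ran stress for about 60 seconds, but sacct reported a UserCPU of 01:39:55 = 5,995 seconds, which is almost exactly 100 x 60 seconds.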
Really strange... let's see what the extra debug traces tell us!
Comment 35 Albert Gil 2019-06-21 05:35:06 MDT
Hi Jenny,

Have you been able to replicate the issue with the patch with extra debug info of comment 34?

Regards,
Albert
Comment 36 Jenny Williams 2019-06-21 05:35:22 MDT
 I am out of the office. If you have questions that require a reply before then please email research@unc.edu.
 Regards,
Virginia Williams
Systems Administrator
Research Computing
UNC Chapel Hill
Comment 37 Albert Gil 2019-08-01 03:09:28 MDT
Hi Jenny,
Are you back in the office?
I hope all went well.

Have you been able to replicate the issue with the patch with extra debug info of comment 34?

Regards,
Albert
Comment 38 Albert Gil 2019-08-08 08:29:17 MDT
Hi Jenny,

I'm closing this bug as timedout.
Please feel free to reopen when you need it.

Regards,
Albert
Comment 39 Albert Gil 2021-05-10 01:58:37 MDT
Hi Jenny,

In case it helps, I just want to let you know that in bug 10723 we were able to reproduce an issue very similar to this one, using a debug patch very similar to the one attached here in comment 34.
With that, we diagnosed the root cause of the issue and fixed it.
The fix will be included in the next minor release 20.11.7.

Hope it helps,
Albert