Ticket 10723 - cpu time(TotalCPU) exceeded the elapsed time
Summary: cpu time(TotalCPU) exceeded the elapsed time
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting (show other tickets)
Version: 20.02.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-01-28 02:44 MST by issp2020support
Modified: 2021-07-15 12:34 MDT
CC List: 0 users

See Also:
Site: U of Tokyo
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (5.80 KB, text/plain)
2021-01-28 02:44 MST, issp2020support
Details
slurmd.log (3.04 MB, text/plain)
2021-02-04 10:39 MST, issp2020support
Details
output of sacct for 96 jobs (96.94 KB, text/plain)
2021-02-04 10:40 MST, issp2020support
Details
lscpu (1.73 KB, text/plain)
2021-02-12 01:09 MST, issp2020support
Details
slurmd.log (15.35 MB, application/x-gzip-compressed)
2021-03-01 10:59 MST, issp2020support
Details
Patch for 20.02.6-1 with extra debug logs (12.71 KB, patch)
2021-03-10 08:05 MST, Albert Gil
Details | Diff
slurmd.log for jobid 214551 (36.81 MB, application/x-gzip-compressed)
2021-03-15 00:45 MDT, issp2020support
Details
slurmd.log for jobid 214553 (56.97 MB, application/x-gzip-compressed)
2021-03-15 00:45 MDT, issp2020support
Details
slurmdbd.log on acct(acct is slurm servers hostname) (8.29 MB, application/x-gzip-compressed)
2021-03-15 00:46 MDT, issp2020support
Details
slurmd.log for jobid 222362 (5.83 MB, application/x-gzip-compressed)
2021-03-22 08:39 MDT, issp2020support
Details
slurmd.log for jobid 222423 (167.66 MB, application/x-gzip-compressed)
2021-03-22 08:40 MDT, issp2020support
Details
slurmctld.log on sched1(slurm ctl server) (33.66 MB, application/x-gzip-compressed)
2021-03-22 08:41 MDT, issp2020support
Details
slurmdbd.log on acct (35.97 MB, application/x-gzip-compressed)
2021-03-22 08:42 MDT, issp2020support
Details
scontrol show config (8.46 KB, text/plain)
2021-03-22 08:43 MDT, issp2020support
Details
Reproduced jobs list (3.37 KB, text/plain)
2021-03-22 08:51 MDT, issp2020support
Details
joblist (11.30 KB, text/plain)
2021-03-29 09:02 MDT, issp2020support
Details
slurmd.log for jobid 231303 (254.43 MB, application/x-gzip-compressed)
2021-03-29 09:04 MDT, issp2020support
Details

Description issp2020support 2021-01-28 02:44:21 MST
Created attachment 17659 [details]
slurm.conf

Hi, SchedMD support team

As shown below, we found some jobs whose CPU time exceeded the elapsed time. I don't think the CPU time should exceed the elapsed time, so please tell me why this happens. Hyper-threading is turned off in our configuration.

[root@ohtaka1 ~]# sacct -j 108722 -o jobid,account,start,end,cputime%15,totalcpu%15
       JobID    Account               Start                 End         CPUTime        TotalCPU
------------ ---------- ------------------- ------------------- --------------- ---------------
108722          k042300 2020-12-16T05:16:28 2020-12-16T15:36:15   3966-14:43:12   7833-09:07:49
108722.batch    k042300 2020-12-16T05:16:28 2020-12-16T15:36:15     55-02:12:16       00:01.144
108722.0        k042300 2020-12-16T05:16:30 2020-12-16T15:36:14   3966-07:02:24   7833-09:07:48
[root@ohtaka1 ~]# seff 108722
Job ID: 108722
Cluster: ohtaka
User/Group: k042300/k0423
State: COMPLETED (exit code 0)
Nodes: 72
Cores per node: 128
CPU Utilized: 7833-09:07:49
CPU Efficiency: 197.48% of 3966-14:43:12 core-walltime
Job Wall-clock time: 10:19:47
Memory Utilized: 2.70 TB (estimated maximum)
Memory Efficiency: 16.67% of 16.17 TB (230.00 GB/node)
WorkDir: /work/k0423/k042300/work/Polymer/rho0.40/Ns200Np5000/_reT1.60PPwca/_sample_20/run01

(job script)

[root@ohtaka1 ~]# cat /work/k0423/k042300/work/Polymer/rho0.40/Ns200Np5000/_reT1.60PPwca/_sample_20/run01/run.sh
#!/bin/sh
#SBATCH -p B72cpu
#SBATCH -N 72
#SBATCH -n 9216
#SBATCH -J test
#SBATCH -t 11:59:59
#SBATCH --mail-type=BEGIN
srun /home/k0423/k042300/lammps_work/src/lmp_asano -in in.polymer
[root@ohtaka1 ~]#
Comment 1 Albert Gil 2021-01-29 08:20:39 MST
Hi,

> As shown below, some programs were found whose CPU time exceeded the elapsed
> time. I don't think the CPU time will exceed the elapsed time,

Actually, CPUTime typically exceeds the Elapsed time because it counts all the CPUs allocated to the job, and most jobs allocate more than one CPU.
From sacct docs (https://slurm.schedmd.com/sacct.html#OPT_CPUTime):

CPUTime
  Time used (Elapsed time * CPU count) by a job or step in HH:MM:SS format.

> but please
> tell me why this happens. Hyper-threading is turned off in the settings.
> 
> [root@ohtaka1 ~]# sacct -j 108722 -o
> jobid,account,start,end,cputime%15,totalcpu%15
>        JobID    Account               Start                 End        
> CPUTime        TotalCPU
> ------------ ---------- ------------------- -------------------
> --------------- ---------------
> 108722          k042300 2020-12-16T05:16:28 2020-12-16T15:36:15  
> 3966-14:43:12   7833-09:07:49
> 108722.batch    k042300 2020-12-16T05:16:28 2020-12-16T15:36:15    
> 55-02:12:16       00:01.144
> 108722.0        k042300 2020-12-16T05:16:30 2020-12-16T15:36:14  
> 3966-07:02:24   7833-09:07:48
> [root@ohtaka1 ~]# seff 108722
> Job ID: 108722
> Cluster: ohtaka
> User/Group: k042300/k0423
> State: COMPLETED (exit code 0)
> Nodes: 72
> Cores per node: 128
> CPU Utilized: 7833-09:07:49
> CPU Efficiency: 197.48% of 3966-14:43:12 core-walltime
> Job Wall-clock time: 10:19:47
> Memory Utilized: 2.70 TB (estimated maximum)
> Memory Efficiency: 16.67% of 16.17 TB (230.00 GB/node)
> WorkDir:
> /work/k0423/k042300/work/Polymer/rho0.40/Ns200Np5000/_reT1.60PPwca/
> _sample_20/run01
> 
> (job script)
> 
> [root@ohtaka1 ~]# cat
> /work/k0423/k042300/work/Polymer/rho0.40/Ns200Np5000/_reT1.60PPwca/
> _sample_20/run01/run.sh
> #!/bin/sh
> #SBATCH -p B72cpu
> #SBATCH -N 72
> #SBATCH -n 9216
> #SBATCH -J test
> #SBATCH -t 11:59:59
> #SBATCH --mail-type=BEGIN
> srun /home/k0423/k042300/lammps_work/src/lmp_asano -in in.polymer
> [root@ohtaka1 ~]#

This job is requesting 72 nodes, and based on your slurm.conf those nodes have 2 Sockets and 64 CoresPerSocket.
Using rounded numbers, the CPUTime in days that I would expect from this job is:

~10 Elapsed hours * 72 Nodes * 2 Sockets per node * 64 CoresPerSocket / 24h/day = ~3840 days of CPUTime.

Your sacct command is reporting ~3966 days, so it seems fine to me.
Does it make sense to you too?

Note also that the job is requesting the equivalent of 9216 tasks (with the default of 1 core per task).
The same rounded math can also be done using the -n value:

~10 Elapsed hours * 9216 Cores / 24h/day = ~3840 days of CPUTime.
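
For reference, the same expectation can be computed directly from the accounting fields, since CPUTime is just Elapsed * AllocCPUS. A minimal sketch (<jobid> is a placeholder):

$ sacct -j <jobid> -X -P --noheader -o ElapsedRaw,AllocCPUS |
    awk -F'|' '{printf "expected CPUTime: %.1f days\n", $1 * $2 / 86400}'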

Am I answering your question?

Regards,
Albert
Comment 2 issp2020support 2021-01-31 19:30:05 MST
Sorry, my question may have been poorly worded.

My actual question is why the "TotalCPU" exceeds the "CPUTime".

CPUTime : 3966-14:43:12
TotalCPU : 7833-09:07:49
CPU Efficiency : 197.48%

I think the CPU efficiency is usually less than 100%, but this job is 197%. Is this the correct value?
Comment 3 Albert Gil 2021-02-01 04:36:40 MST
Hi,

> My correct question is why the "TotalCPU" exceeds "CPUTime".

Sorry, now I understand the problem better.

> CPUTime : 3966-14:43-12
> TotalCPU : 7833-09:07:49
> CPU Efficiency : 197.48%
> 
> I think the CPU efficiency is usually less than 100%, but this job is 197%.
> Is this the correct value?

I guess that you are aware, but just to provide some context:

CPUTime is computed from the values that the user requested and Slurm allocated for the job, while TotalCPU is a value gathered from the system while your job was running, in your case using the jobacct_gather/cgroup plugin. Note that in some configurations, if jobs are NOT constrained to use only their allocated resources, a job that was allocated N CPUs on a node can escape and use more than that.

So, it could be expected in some configurations, but I'm not yet certain if that's your case.
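
As a quick sanity check on a node of a running job, the cpuset that Slurm creates for a step can be compared against the job's allocation. A sketch, assuming cgroup v1 and the default Slurm cgroup hierarchy (<uid> and <jobid> are placeholders):

$ cat /sys/fs/cgroup/cpuset/slurm/uid_<uid>/job_<jobid>/step_0/cpuset.cpus
$ scontrol show job <jobid> | grep -o "NumCPUs=[0-9]*"

With ConstrainCores=yes in cgroup.conf the step's tasks cannot be scheduled outside that cpuset; without it, they can escape the allocation.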

Can you attach your cgroup.conf?
Also, can you post the output of "slurmd -C" on any node of the partition B72cpu?

I know that we have had some cases of duplicated accounting with jobacct_gather/cgroup (e.g. bug 8656), but I'm not certain that this is your case either.
I'll study your issue in more depth.

Regards,
Albert
Comment 5 issp2020support 2021-02-02 22:32:37 MST
The output of "slurmd -C" and the contents of cgroup.conf are as follows.

[root@tempnode4 ~]# /usr/sbin/slurmd -C
NodeName=tempnode4 slurmd: Considering each NUMA node as a socket
CPUs=128 Boards=1 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257514
UpTime=5-01:36:17
[root@tempnode4 ~]#
[root@tempnode4 ~]# cat /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes

AllowedRAMSpace=98

ConstrainCores=yes
#ConstrainRAMSpace=no
ConstrainRAMSpace=yes

Best regards,
Yuichi
Comment 6 Albert Gil 2021-02-03 09:31:47 MST
Hi Yuichi,

With the information provided I agree that this looks like a bug.
Just to double check, in your last comment the output is from "tempnode4", but that nodename is not part of the partition B72cpu, is it?

I'm trying to reproduce it, are you able to reproduce it?
Do you have the batch script submitted?

I'm wondering if that duplication is only impacting the CPU.
Could you attach the output of:

$ sacct -j 108722 -p -o JobID,SystemCPU,UserCPU,TotalCPU,ConsumedEnergy,NodeList,NTasks,AllocTRES,CPUTime,MaxRSS,AveRSS

Also, do you have the slurmd logs of one of the nodes allocated to 108722?

Anyway, if you are able to reproduce it, could you switch to JobAcctGatherType=jobacct_gather/linux and see if it still happens?
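
For reference, that is a one-line change in slurm.conf (a sketch; note that the jobacct_gather plugin should only be changed when no job steps are running, so a maintenance window is the right moment):

JobAcctGatherType=jobacct_gather/linux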

Thanks,
Albert
Comment 7 issp2020support 2021-02-04 10:39:52 MST
Created attachment 17759 [details]
slurmd.log
Comment 8 issp2020support 2021-02-04 10:40:52 MST
Created attachment 17760 [details]
output of sacct for 96 jobs
Comment 9 issp2020support 2021-02-04 10:41:11 MST
"tempnode4" is a spare node, not a node assigned to B72cpu. However, it has the same configuration as the node assigned to B72cpu.

It is necessary to ask the user for the reproduction test. I will ask the user.

The output of sacct for 108722 is as follows.

[root@ohtaka1 cpu_efficiency]# sacct -j 108722 -p -o JobID,SystemCPU,UserCPU,TotalCPU,ConsumedEnergy,NodeList,NTasks,AllocTRES,CPUTime,MaxRSS,AveRSS
JobID|SystemCPU|UserCPU|TotalCPU|ConsumedEnergy|NodeList|NTasks|AllocTRES|CPUTime|MaxRSS|AveRSS|
108722|8-02:15:51|7825-06:51:58|7833-09:07:49|0|c02u01n[1-4],c02u02n[1-4],c02u03n[1-4],c02u04n[1-4],c02u05n[1-4],c02u06n[1-4],c02u07n[1-4],c02u08n[1-4],c02u09n[1-4],c02u10n[1-4],c02u11n[1-4],c02u12n[1-4],c02u13n[1-4],c02u14n[1-4],c02u15n[1-4],c02u16n[1-4],c02u17n[1-4],c02u18n[1-4]||billing=9216,cpu=9216,mem=16560G,node=72|3966-14:43:12|||
108722.batch|00:00.810|00:00.333|00:01.144|0|c02u01n1|1|cpu=128,mem=230G,node=1|55-02:12:16|9368K|9368K|
108722.0|8-02:15:50|7825-06:51:57|7833-09:07:48|0|c02u01n[1-4],c02u02n[1-4],c02u03n[1-4],c02u04n[1-4],c02u05n[1-4],c02u06n[1-4],c02u07n[1-4],c02u08n[1-4],c02u09n[1-4],c02u10n[1-4],c02u11n[1-4],c02u12n[1-4],c02u13n[1-4],c02u14n[1-4],c02u15n[1-4],c02u16n[1-4],c02u17n[1-4],c02u18n[1-4]|9216|cpu=9216,mem=16560G,node=72|3966-07:02:24|314176K|471962024|

I also attached the slurmd.log for c02u01n1 to which 108722 was assigned.

When I checked B72cpu, F72cpu, B144cpu, and F144cpu from 2020/12 to 2021/01, I found 96 jobs whose CPU efficiency exceeded 100%. I have also attached the sacct output for those 96 jobs.
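
For reference, that kind of scan can be scripted from sacct alone. A rough sketch (the time window is illustrative and the TotalCPU parsing is simplified):

$ sacct -a -X -P --noheader -S 2020-12-01 -E 2021-02-01 --state=COMPLETED \
        -o JobID,ElapsedRaw,AllocCPUS,TotalCPU |
  awk -F'|' '{
      days = 0; hms = $4
      if (hms ~ /-/) { split(hms, d, "-"); days = d[1]; hms = d[2] }   # TotalCPU is [DD-]HH:MM:SS
      n = split(hms, t, ":")
      secs = days * 86400
      if (n == 3)      secs += t[1]*3600 + t[2]*60 + t[3]
      else if (n == 2) secs += t[1]*60 + t[2]                          # short form MM:SS.mmm
      alloc = $2 * $3                                                  # CPUTime in core-seconds
      if (alloc > 0 && secs > alloc)
          printf "%s  TotalCPU/CPUTime = %.1f%%\n", $1, 100 * secs / alloc
  }'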
Comment 11 Albert Gil 2021-02-05 08:24:40 MST
Hi Yuichi,

> "tempnode4" is a spare node, not a node assigned to B72cpu. However, it has
> the same configuration as the node assigned to B72cpu.

Thanks.
Would you attach the output of "lscpu" of that node?

> It is necessary to ask the user for the reproduction test. I will ask the
> user.

Ok, thanks.

> When I checked B72cpu, F72cpu, B144cpu, and F144cpu from 2020/12 to 2021/01,
> 96 jobs that exceeded CPU efficiency by 100% were found. I also attach the
> sacct for 96 jobs.

Then it looks like it is quite easy to reproduce; maybe we can try a simple reproducer.
Could you run these two 10-second jobs and attach their sacct output?

$sbatch -p B72cpu -N72 -n 9216 --wrap "srun stress -c 1 -t 10"
$sbatch -p B72cpu -N72 -n 9216 --wrap "srun stress -c 2 -t 10"

Is the TotalCPU of the second one doubling its CPUTime?
The first one should not.

If you can reproduce it with these short stress jobs, could you try switching to JobAcctGatherType=jobacct_gather/linux and see if you can still reproduce it?

Thanks,
Albert
Comment 12 issp2020support 2021-02-12 01:09:52 MST
Created attachment 17913 [details]
lscpu
Comment 13 issp2020support 2021-02-12 01:15:36 MST
Hi Albert,

Sorry for the late reply.

I attached the result of lscpu. Please check it.

I'm asking the user who has the reproduction program, so please wait a moment.

And I will try running stress.

Regards,
Yuichi
Comment 15 Albert Gil 2021-02-19 04:41:52 MST
Hi Yuichi,

It seems that there is no issue with the number and allocation of CPUs, but rather with how the CPU usage is gathered.

> If you can reproduce it with this short stress jobs, could you try to switch to JobAcctGatherType=jobacct_gather/linux  and see if you still can reproduce it?

Have you been able to reproduce the issue?
Have you verified if it gets fixed/workarounded using jobacct_gather/linux?

Finally, could you build the following code, run it on the failing nodes, and attach your output:

$ cat bug10723-clk.c 
#include <stdio.h>
#include <unistd.h>
int main()
{
    printf("Clock ticks per sec : %ld\n",sysconf(_SC_CLK_TCK));
    return 0;
}

$ gcc -o bug10723-clk bug10723-clk.c 
$ ./bug10723-clk 
Clock ticks per sec : 100

Thanks,
Albert
Comment 16 issp2020support 2021-02-22 01:49:12 MST
Hi Albert,

Thank you for your support.

It could not be reproduced by the stress program.

$ sacct -j 196350,196351 -ojobid,jobname,partition,account,alloccpus,elaps,totalcpu
       JobID    JobName  Partition    Account  AllocCPUS    Elapsed   TotalCPU
------------ ---------- ---------- ---------- ---------- ---------- ----------
196350         core1.sh     F72cpu     dell00       9216   00:00:18 1-10:19:23
196350.batch      batch                dell00        128   00:00:18  00:00.851
196350.0      stress-ng                dell00       9216   00:00:17 1-10:19:22
196351         core2.sh     F72cpu     dell00       9216   00:00:18 1-10:15:38
196351.batch      batch                dell00        128   00:00:18  00:00.889
196351.0      stress-ng                dell00       9216   00:00:18 1-10:15:37

[dell00@ohtaka1 stress]$ cat core1.sh
#!/bin/sh

#SBATCH -N 72
#SBATCH -n 9216

srun /home/system/tool/stress-ng/stress-ng -c 1 -t 10
[dell00@ohtaka1 stress]$ cat core2.sh
#!/bin/sh

#SBATCH -N 72
#SBATCH -n 9216

srun /home/system/tool/stress-ng/stress-ng -c 2 -t 10

F72cpu uses the same resources as B72cpu. The only difference is the priority.

When I checked the tendency of user jobs, I found that the CPU efficiency may or may not exceed 100% even with the same program, so it does not seem to depend on the user's job.

I want to try switching to "JobAcctGatherType=jobacct_gather/linux", but I can't check until the regular maintenance in March, because this system only allows config changes during regular maintenance. Please wait a while.

And here is the result of executing bug10723-clk:

[dell00@c01u01n1 bug10723-clk]$ ./bug10723-clk
Clock ticks per sec : 100
[dell00@c01u01n1 bug10723-clk]$ cat bug10723-clk.c
#include <stdio.h>
#include <unistd.h>
int main()
{
    printf("Clock ticks per sec : %ld\n",sysconf(_SC_CLK_TCK));
    return 0;
}

Best regards,
Yuichi
Comment 17 Albert Gil 2021-02-22 07:34:03 MST
Hi Yuichi,

> It could not be reproduced by the stress program.
> 
> $ sacct -j 196350,196351
> -ojobid,jobname,partition,account,alloccpus,elaps,totalcpu
>        JobID    JobName  Partition    Account  AllocCPUS    Elapsed  
> TotalCPU
> ------------ ---------- ---------- ---------- ---------- ----------
> ----------
> 196350         core1.sh     F72cpu     dell00       9216   00:00:18
> 1-10:19:23
> 196350.batch      batch                dell00        128   00:00:18 
> 00:00.851
> 196350.0      stress-ng                dell00       9216   00:00:17
> 1-10:19:22
> 196351         core2.sh     F72cpu     dell00       9216   00:00:18
> 1-10:15:38
> 196351.batch      batch                dell00        128   00:00:18 
> 00:00.889
> 196351.0      stress-ng                dell00       9216   00:00:18
> 1-10:15:37
> 
> [dell00@ohtaka1 stress]$ cat core1.sh
> #!/bin/sh
> 
> #SBATCH -N 72
> #SBATCH -n 9216
> 
> srun /home/system/tool/stress-ng/stress-ng -c 1 -t 10
> [dell00@ohtaka1 stress]$ cat core2.sh
> #!/bin/sh
> 
> #SBATCH -N 72
> #SBATCH -n 9216
> 
> srun /home/system/tool/stress-ng/stress-ng -c 2 -t 10
> 
> F72cpu uses the same resources as B72cpu.

Thanks for the check.

Just to be 100% sure that we cannot reproduce it, could you try a core16.sh with:

srun /home/system/tool/stress-ng/stress-ng -c 16 -t 10

> I checked the tendency of user jobs, I found that the total CPU may or may
> not exceed 100% even with the same program, so it does not depend on user
> jobs.

So far I would say that the problem is that the gathering is (at least) duplicated, at least sometimes.
So, exceeding 100% or not is not the key part: if the original efficiency is below 50%, the duplicated efficiency will still be under 100%, but it would be just as wrong (e.g. a job that really ran at 45% would show ~90%).
Is the efficiency of those jobs sometimes double/half that of previous runs? Or is it just slightly above and below 100%?

> And, the result of executing bug10723 is shown.
> 
> [dell00@c01u01n1 bug10723-clk]$ ./bug10723-clk
> Clock ticks per sec : 100

Ok, thanks for the check.
This is quite standard.

> I want to try to switch to "JobAcctGatherType = jobacct_gather/linux", but I
> can't check until the regular maintenance in March because this system is
> not allowed to change the config unless it is in regular maintenance.
> Therefore, please wait for a while.

Understood.

The key point is to find a reproducer.
That will allow us to investigate the issue.

At this point I think that this is a duplicate of bug 6541, which never got properly fixed.
We were not able to reproduce it there, and I guess that you are facing the same issue a couple of major versions later.
In both cases the issue seems to be in jobacct_gather/cgroup.
But in that case the issue could be reproduced using stress.
I'm not sure why you cannot.

In bug 8656 we found a very similar issue in jobacct_gather/cgroup, duplicated gathered data (fixed in 20.11.3). But that was only on the .extern steps, and that's not your case.
But it could be similar.

Anyway, those are only references, more important to me than to you.
If it's ok with you, to investigate your case I will provide you with a patch that increases the logging related to jobacct_gather/cgroup. If you can apply it during your regular maintenance we can get more information about what is going on.
Can you confirm the Slurm version that you are planning to work with on the next maintenance?
Is it at the beginning of March, or at the end?

Thanks,
Albert
Comment 18 issp2020support 2021-02-23 23:03:47 MST
Hi Albert,

> The key point is to find a reproducer.
> That will allow us to investigate the issue.

Understood. However, it does not seem reliably reproducible: the same job sometimes exceeds 100% and sometimes does not.

> So far I would say that the problem is that the gathering is (at least) duplicated (sometimes?).
> So, exceeding or not is not the key part because if the original efficiency is below 50% the duplicated efficiency with still be under 100% but it would be also wrong.
> Is the efficiency of those jobs sometimes the double/half of the previous runs? Or is just slightly above and below 100%?

In the following cases, two identical programs (differing only in their parameters) may or may not exceed 100%. The two results differ by roughly a factor of two.

[root@ohtaka1 ~]# cd /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T800/run01
[root@ohtaka1 run01]# ls
fluid  in.swave  in.swave~  left  log.lammps  right  run.sh  slurm-149557.out
[root@ohtaka1 run01]# cat run.sh
#!/bin/sh
#SBATCH -p B72cpu # specify the partition
#SBATCH -N 72
#SBATCH -n 9216
#SBATCH -J _swav
#SBATCH -t 11:59:59
#SBATCH --mail-type=BEGIN
srun /home/k0152/k015204/lmp_asano -in in.swave
[root@ohtaka1 run02]# seff 149557
Job ID: 149557
Cluster: ohtaka
User/Group: k015204/k0152
State: COMPLETED (exit code 0)
Nodes: 72
Cores per node: 128
CPU Utilized: 6844-20:12:56
CPU Efficiency: 197.50% of 3465-16:57:36 core-walltime
Job Wall-clock time: 09:01:31
Memory Utilized: 1.72 TB (estimated maximum)
Memory Efficiency: 10.64% of 16.17 TB (230.00 GB/node)
WorkDir: /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T800/run01
[root@ohtaka1 run01]# cd /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T1000/run02
[root@ohtaka1 run02]# ls
fluid  in.swave  in.swave~  left  log.lammps  right  run.sh  slurm-149558.out  temp.out
[root@ohtaka1 run02]# cat run.sh
#!/bin/sh
#SBATCH -p B72cpu # specify the partition
#SBATCH -N 72
#SBATCH -n 9216
#SBATCH -J _swav
#SBATCH -t 11:59:59
#SBATCH --mail-type=BEGIN
srun /home/k0152/k015204/lmp_asano -in in.swave
[root@ohtaka1 run02]# seff 149558
Job ID: 149558
Cluster: ohtaka
User/Group: k015204/k0152
State: COMPLETED (exit code 0)
Nodes: 72
Cores per node: 128
CPU Utilized: 3827-03:52:50
CPU Efficiency: 99.44% of 3848-20:28:48 core-walltime
Job Wall-clock time: 10:01:23
Memory Utilized: 1.73 TB (estimated maximum)
Memory Efficiency: 10.70% of 16.17 TB (230.00 GB/node)
WorkDir: /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T1000/run02


> Can you confirm the Slurm version that you are planning to work with on the next maintenance?

We will work with slurm 20.02.6.

> Is at the beginning on March, or at the end of it?

12th March.

Regards,
Yuichi
Comment 19 Albert Gil 2021-02-26 04:31:20 MST
Hi Yuichi,

> In the following cases, two identical programs (only with different
> parameters)

Thanks for the information.
I guess that, as they are in different directories, the parameters are different because of that.
Besides that, they are identical.
Do you have the slurmd logs of both executions?

> may or may not exceed 100%. These are about twice as different.

Yes, that duplication is the key part to me.
It seems that, in some cases, jobacct_gather/cgroups is counting the CPU usage twice.

Being able to reproduce it also important.
Do you think that you could be able to reproduce a job that failed running it again from same directory or from a duplicated one or similar?
Or are those jobs too big and that would disturb your cluster too much?

> > Is at the beginning on March, or at the end of it?
> 
> 12th March.

Perfect.
I'll attach a patch to add extra debug logs related to jobacct_gather/cgroups, so after the maintenance we can get better logs.
If you are able to reproduce it before the maintenance, then we will be able to quickly reproduce it with the patch.
If not, we'll wait until it is reproduced by some job from users.

> > Can you confirm the Slurm version that you are planning to work with on the next maintenance?
> 
> We will work with slurm 20.02.6.

The patch will be based on version 20.02.6, then.

Thanks,
Albert
Comment 20 issp2020support 2021-03-01 10:59:38 MST
Created attachment 18174 [details]
slurmd.log
Comment 21 issp2020support 2021-03-01 11:11:45 MST
Hi Albert,

Thank you for your support.

> Do you have the slurmd logs of both executions?

I have attached the slurmd.log for the node where both jobs executed.

[root@sched1 ~]# sacct -j 149557,149558 --allocations -o jobid,jobname,partition,account,start,end
       JobID    JobName  Partition    Account               Start                 End
------------ ---------- ---------- ---------- ------------------- -------------------
149557            _swav     B72cpu    k015204 2021-01-16T17:52:41 2021-01-17T02:54:12
149558            _swav     B72cpu    k015204 2021-01-16T18:01:16 2021-01-17T04:02:39

> Do you think that you could be able to reproduce a job that failed running it 
> again from same directory or from a duplicated one or similar?
> Or are those jobs too big and that would disturb your cluster too much?

This is difficult to reproduce. I have not yet obtained permission from the user to use the program.

> I'll attach a patch to add extra debug logs related to jobacct_gather/cgroups,
> so after the maintenance we can get better logs.
> If you are able to reproduce it before the maintenance, then we will be able
> to quickly reproduce it with the patch.
> If not, we'll wait until it is reproduced by some job from users.

Thank you for creating the patch.
I will apply the patch during the maintenance and wait for the issue to be reproduced.

Regards,
Yuichi
Comment 23 Albert Gil 2021-03-09 11:22:17 MST
Hi Yuichi,

Just to provide you with a quick update.
I've gone through your slurmd logs, but unfortunately I don't see anything that could help us on this issue.
I also have a first version of a patch adding extra debug logs to slurmd, but I want to enhance it to also add extra debug logs for the RPC messages that send the gathered information from the nodes to the DB, just in case the issue is not in the information gathered on each node but in the final aggregation.
I'll attach the final version this week.

Regards,
Albert
Comment 24 issp2020support 2021-03-10 01:10:41 MST
Hi Albert,

Thank you for your support.

I would like the final version by the end of today. After tomorrow it will be difficult to apply it during the maintenance on March 12th, so it would have to wait for the next maintenance (April 1st).

Regards,
Yuichi
Comment 25 Albert Gil 2021-03-10 08:05:18 MST
Created attachment 18340 [details]
Patch for 20.02.6-1 with extra debug logs

Hi Yuichi,

I'm attaching the final version of the debug patch.
This patch only adds extra debug log messages about how the user and system CPU time for each step is gathered from the OS and then sent to the DB.
It won't fix the issue and it doesn't change any other behavior of 20.02.6.

Although they should not be too verbose (maybe a bit on slurmd), I still wanted to let you enable/disable these extra logs as easily as possible.
In newer Slurm versions we have a specific DebugFlags=JobAccountGather [1], but we don't have that on 20.02.6 to enable/disable them easily.
For this reason I've decided to just use the "debug" level of the normal log, and to reuse the DB_STEP flag.
So, to enable the extra logs you need to specify this in slurm.conf:

SlurmctldDebug=debug
SlurmdDebug=debug

And this into slurmdbd.conf:

DebugLevel=debug
DebugFlags=DB_STEP

Most of the new log messages have the "BUG10723: " prefix, so you can grep for it in the logs (especially the slurmd logs) to double-check that they are enabled.
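
For example (assuming the usual log locations; adjust the path to your SlurmdLogFile setting):

$ grep "BUG10723:" /var/log/slurm/slurmd.log | head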

Once you have the patch in place, it would be perfect if you can reproduce the issue as you did in comment 18 and send me the slurmdbd, slurmctld and slurmd logs for both the failing and the non-failing case.

And once you manage to do that, you can disable the extra logs by setting "info" again instead of "debug", which seems to be what you usually use.

Please let me know if you have any question.

Regards,
Albert

[1] https://github.com/SchedMD/slurm/blob/master/NEWS#L20
Comment 26 issp2020support 2021-03-11 11:01:45 MST
Hi Albert,

Thanks for creating the patch. As scheduled, we will apply it and change the slurm.conf settings during tomorrow's maintenance.

Regards,
Yuichi
Comment 27 issp2020support 2021-03-15 00:45:02 MDT
Created attachment 18422 [details]
slurmd.log for jobid 214551
Comment 28 issp2020support 2021-03-15 00:45:40 MDT
Created attachment 18423 [details]
slurmd.log for jobid 214553
Comment 29 issp2020support 2021-03-15 00:46:56 MDT
Created attachment 18424 [details]
slurmdbd.log on acct(acct is slurm servers hostname)
Comment 30 issp2020support 2021-03-15 00:51:12 MDT
During the maintenance on March 12, we applied a patch for debugging and changed the config. Among the jobs executed yesterday, 4 jobs with CPU efficiency exceeding 100% were found.

We sent you information about jobid 214551 and 214553.

[root@ohtaka1 ~]# sacct -j 214551 -o jobid,jobname,partition,account,alloccpus,start,end,elaps,totalcpu%13,node%30
       JobID    JobName  Partition    Account  AllocCPUS               Start                 End    Elapsed      TotalCPU                       NodeList
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ------------- ------------------------------
214551          tinb3up      F4cpu    m003902        512 2021-03-13T22:47:54 2021-03-14T02:19:22   03:31:28  186-02:40:20 c16u12n[2-3],c17u18n3,c19u16n4
214551.batch      batch               m003902        128 2021-03-13T22:47:54 2021-03-14T02:19:22   03:31:28     00:00.114                       c16u12n2
214551.0       vasp_std               m003902        512 2021-03-13T22:47:54 2021-03-14T02:19:21   03:31:27  186-02:40:20 c16u12n[2-3],c17u18n3,c19u16n4

[root@ohtaka1 ~]# seff 214551
Job ID: 214551
Cluster: ohtaka
User/Group: m003902/m0039
State: COMPLETED (exit code 0)
Nodes: 4
Cores per node: 128
CPU Utilized: 186-02:40:21
CPU Efficiency: 247.53% of 75-04:30:56 core-walltime
Job Wall-clock time: 03:31:28
Memory Utilized: 609.96 GB (estimated maximum)
Memory Efficiency: 66.30% of 920.00 GB (230.00 GB/node)
WorkDir: /work/m0039/m003902/calc/pmnpt/slab/7ML/up/TiNb/vol/3

[root@ohtaka1 ~]# sacct -j 214553 -o jobid,jobname,partition,account,alloccpus,start,end,elaps,totalcpu%13,node%30
       JobID    JobName  Partition    Account  AllocCPUS               Start                 End    Elapsed      TotalCPU                       NodeList
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ------------- ------------------------------
214553          mgnb3up      F4cpu    m003902        512 2021-03-13T22:58:29 2021-03-14T04:21:18   05:22:49  199-02:56:42 c17u06n[3-4],c17u07n2,c17u18n4
214553.batch      batch               m003902        128 2021-03-13T22:58:29 2021-03-14T04:21:18   05:22:49     00:00.115                       c17u06n3
214553.0       vasp_std               m003902        512 2021-03-13T22:58:29 2021-03-14T04:21:17   05:22:48  199-02:56:42 c17u06n[3-4],c17u07n2,c17u18n4

[root@ohtaka1 ~]# seff 214553
Job ID: 214553
Cluster: ohtaka
User/Group: m003902/m0039
State: COMPLETED (exit code 0)
Nodes: 4
Cores per node: 128
CPU Utilized: 199-02:56:43
CPU Efficiency: 173.48% of 114-18:42:08 core-walltime
Job Wall-clock time: 05:22:49
Memory Utilized: 726.76 GB (estimated maximum)
Memory Efficiency: 79.00% of 920.00 GB (230.00 GB/node)
WorkDir: /work/m0039/m003902/calc/pmnpt/slab/7ML/up/MgNb/vol/3

The F4cpu partition is a 4-node partition that uses different nodes from the B64cpu partition, but the node configuration is the same as B64cpu.

I have attached slurmdbd.log and slurmd.log with information for jobids 214551 and 214553. For slurmd.log, only the parts related to these jobs are excerpted.
Please let me know if you have any other information to take.
Comment 31 issp2020support 2021-03-15 01:34:06 MDT
B64cpu is a typo; the correct name is B72cpu.
Comment 32 Albert Gil 2021-03-15 06:23:52 MDT
Hi,

> During the maintenance on March 12, we applied a patch for debugging and
> changed the config.

Thanks for that.

> Among the jobs executed yesterday, 4 jobs with CPU
> efficiency exceeding 100% were found.

Good that we are still able to reproduce the issue.

> We sent you information about jobid 214551 and 214553.
>
> Job ID: 214551
> CPU Efficiency: 247.53% of 75-04:30:56 core-walltime
> 
> Job ID: 214553
> CPU Efficiency: 173.48% of 114-18:42:08 core-walltime

I'll look into these logs.
But please note that what could help me even more is something similar to comment 18.
That is, a job presenting the issue and another (very) similar job that went ok.
That would help me to see when/where/how the gathering is not working properly.

> The F4cpu partition is a partition that uses 4 nodes and uses a different
> node from the B72cpu partition, but the node configuration is the same as
> the B72cpu.

Thanks for the info, I see it in slurm.conf.

> I have attached slurmdbd.log and slurmd.log for information on jobid 214551
> and 214553. For slurmd.log, only the parts related to this job are excerpted.
> Please let me know if you have any other information to take.

I've already seen that the logs contain the patch information.
I'll look further into the details, but I've already seen some unexpected numbers, and I would like to quickly rule out a known problem with a buggy glibc version (bug 10824).
Are you using glibc 2.17-322?

Could you build and run this small app on the compute nodes (eg c16u12n2) to see if it prints 0.000 or nan?

$ cat glibctest.c

#include <stdio.h>
void main() {
    long double d = 0;
    printf("%Lg\r\n",d);
    printf("%Lf\r\n",d);
}

$ gcc glibctest.c -o glibctest
$ ./glibctest

Thanks,
Albert
Comment 33 issp2020support 2021-03-15 07:55:44 MDT
Hi Albert,

Thank you for your support.

> Are you using glibc 2.17-322?

[dell00@c16u12n2 ~]$ rpm -q glibc
glibc-2.28-127.el8.x86_64

> Could you build and run this small app on the compute nodes (eg c16u12n2) to 
> see if it prints 0.000 or nan?

It printed 0.000.

[dell00@c16u12n2 ~]$ ./glibctest
0
0.000000

Regards,
Yuichi
Comment 38 Albert Gil 2021-03-16 11:54:11 MDT
Hi Yuichi,

I'm still investigating, but with the logs I've been able to find a first lead to work with.
My current working theory is that the gathered information is correct per task and per node, but it seems that it is sent twice (or even more times) from the N nodes to the batch node, because the RPC message times out and is sent again.
If that's the case, we'll need to fix that.

In the meantime, it would be good to increase your MessageTimeout in slurm.conf.
You are using the default 10s and I can see that it is timing out (probably because the batch node is really busy with the job itself).
Could you try to set "MessageTimeout=50" in your slurm.conf and "scontrol reconfigure" and let me know if you still face the same issue?
If you do, could attach new logs of those jobs?
If you don't face the issue anymore, then we are on the right path to fix this! ;-)

Thanks,
Albert
Comment 39 issp2020support 2021-03-17 09:34:09 MDT
Hi Albert,

Thank you for your support.

> You are using the default 10s and I can see that it is timing out (probably
> because the batch node is really-really busy for job itself).
> Could you try to set "MessageTimeout=50" in your slurm.conf and "scontrol
> reconfigure" and let me know if you still face the same issue?

The system is currently in service; can we make this change even while jobs are running?

Regards,
Yuichi
Comment 40 Albert Gil 2021-03-17 10:17:15 MDT
Hi Yuichi,

> > You are using the default 10s and I can see that it is timing out (probably
> > because the batch node is really-really busy for job itself).
> > Could you try to set "MessageTimeout=50" in your slurm.conf and "scontrol
> > reconfigure" and let me know if you still face the same issue?
> 
> The system is currently in service, can we make this change even if there
> are jobs running?

Yes, no problem at all.
You can verify the current MessageTimeout value with a command like this:

$ scontrol show config | grep MessageTimeout

MessageTimeout          = 10 sec

Then you can edit the slurm.conf adding this line:

MessageTimeout=50

And once you run this:

$ scontrol reconfigure

You can then verify that the parameter has been updated with the initial command:

$ scontrol show config | grep MessageTimeout

MessageTimeout          = 50 sec

The cluster will keep running normally.
Well, if I'm right, I expect all new jobs to start having a proper TotalCPU value and no more impossible efficiency values... ;-)

Regards,
Albert
Comment 41 issp2020support 2021-03-19 06:26:57 MDT
Hi Albert,

Thanks.

Today I set MessageTimeout = 50. We will continue to check if this issue occurs.

Regards,
Yuichi
Comment 44 issp2020support 2021-03-22 08:38:13 MDT
Hi Albert,

I changed the MessageTimeout to 50 on the 19th, but unfortunately this problem has reappeared. I will attach 2 logs out of the 8 reproduced cases. Both are jobs that ran on four nodes.

Regards,
Yuichi
Comment 45 issp2020support 2021-03-22 08:39:24 MDT
Created attachment 18574 [details]
slurmd.log for jobid 222362
Comment 46 issp2020support 2021-03-22 08:40:54 MDT
Created attachment 18575 [details]
slurmd.log for jobid 222423
Comment 47 issp2020support 2021-03-22 08:41:55 MDT
Created attachment 18576 [details]
slurmctld.log on sched1(slurm ctl server)
Comment 48 issp2020support 2021-03-22 08:42:44 MDT
Created attachment 18577 [details]
slurmdbd.log on acct
Comment 49 issp2020support 2021-03-22 08:43:18 MDT
Created attachment 18578 [details]
scontrol show config
Comment 50 issp2020support 2021-03-22 08:51:26 MDT
Created attachment 18579 [details]
Reproduced jobs list
Comment 51 Albert Gil 2021-03-22 10:43:19 MDT
Hi Yuichi,

> I changed the MessageTimeout to 50 on the 19th, but unfortunately this
> problem has reappeared. I will attach 2 logs out of the 8 reproduced cases.
> Both are jobs that run on four nodes.

Thanks for the logs, they help a lot.
Although you have been able to reproduce the issue with a higher MessageTimeout, the new logs still show the same problem I detected before, related to a timeout.
Actually, it seems strange to me, because although the MessageTimeout value is correct in the "scontrol show config" output, in your logs I can see 2 timeouts with only 10s between them (actually 11s, because the code includes 1s extra):

222423/slurmd.c19u16n4.log:[2021-03-20T03:06:39.454] [222423.0] debug:  slurm_recv_timeout at 0 of 4, timeout
222423/slurmd.c19u16n4.log:[2021-03-20T03:06:50.464] [222423.0] debug:  slurm_recv_timeout at 0 of 4, timeout
222423/slurmd.c19u15n4.log:[2021-03-20T03:06:39.431] [222423.0] debug:  slurm_recv_timeout at 0 of 4, timeout
222423/slurmd.c19u15n4.log:[2021-03-20T03:06:50.442] [222423.0] debug:  slurm_recv_timeout at 0 of 4, timeout
222423/slurmd.c19u16n2.log:[2021-03-20T03:06:39.436] [222423.0] debug:  slurm_recv_timeout at 0 of 4, timeout
222423/slurmd.c19u16n2.log:[2021-03-20T03:06:50.446] [222423.0] debug:  slurm_recv_timeout at 0 of 4, timeout

Maybe this job was started before you did the reconfigure?
Note that only jobs started after the "scontrol reconfigure" will be actually using the higher MessageTimeout.

Anyway, if you still see this issue we can try increasing the MessageTimeout to 100 if you want, but I think that's not necessary.

I have other news on the issue:
1) I've been able to reproduce it using MessageTimeout=1 and slowing down some threads to get those timeouts, and I get duplicated (or even triplicated) TotalCPU with the same logs that you get. And that's good news because it allows me to work on a fix.
2) But I've been able to reproduce it only on 20.02, not on 20.11.
We've made some changes to how those messages are handled in 20.11, and it would seem that the problem was (collaterally) solved there, but I need to investigate this a bit more to be sure.

Regards,
Albert
Comment 55 Albert Gil 2021-03-22 12:40:37 MDT
Hi Yuichi,

> 2) But I've been able to reproduce it only on 20.02, not on 20.11.
> We've done some changes on how those messages are handled on 20.11, and it
> would seem that the problem was (collaterally) solved on 20.11, but I need
> to investigate this a bit more to be sure about it.

I can confirm that the issue is NOT solved on 20.11.
The issue is in the communication from the job's nodes to the main node, and an improvement in 20.11 was preventing my reproducer from triggering the issue, because under certain conditions that communication between nodes is avoided and the data is sent directly to the controller:
- https://github.com/SchedMD/slurm/commit/30af622a52a9dfc077381223c381a6fb409d8eaa

But that improvement does not fix the issue; it just avoids it in some cases.

I've been able to reproduce the issue on 20.11 as well by changing my reproducer a bit.
We'll work on a proper fix for it.

Regards,
Albert
Comment 56 issp2020support 2021-03-23 04:16:21 MDT
Hi Albert,

> Maybe this job was started before you did the reconfigure?

I ran reconfig around 03/19 21:20.

[root@ohtaka1 ~]# sacct --allocations -j 222362,222423 -o j,start,end
       JobID               Start                 End
------------ ------------------- -------------------
222362       2021-03-20T01:05:09 2021-03-20T01:39:50
222423       2021-03-20T00:04:16 2021-03-20T15:09:11

As the sacct output shows, these jobs started after the reconfig. Does it take some time for the reconfiguration to take effect? If it recurs in the future, I will collect the logs again.

Regards,
Yuichi
Comment 57 Albert Gil 2021-03-23 04:45:31 MDT
Hi Yuichi,

> As the result of sacct shows, these jobs started after reconfig. Will it
> take some time for the reconfiguration to take effect?

It shouldn't.
I assume that slurmds have access to the same slurm.conf with the new MessageTimeout option, right?

> If it recurs in the
> future, the log will be collected again.

Great.
I'm working on a fix, though.
I'm already quite confident about the root cause.

Thanks,
Albert
Comment 60 issp2020support 2021-03-29 09:00:57 MDT
Hi Albert,

When I checked the jobs since 3/20, I found 140 jobs with CPU efficiency exceeding 100%.
The job list and the log for jobid 231303 are attached.

We are planning maintenance on 3/31. Can I turn off debug mode at that time? The logs have grown too large.

Regards,
Yuichi
Comment 61 issp2020support 2021-03-29 09:02:19 MDT
Created attachment 18719 [details]
joblist
Comment 62 issp2020support 2021-03-29 09:04:33 MDT
Created attachment 18720 [details]
slurmd.log for jobid 231303
Comment 63 Albert Gil 2021-03-29 09:12:23 MDT
Hi Yuichi,

> When I checked the jobs from 3/20, I found 140 jobs with CPU time exceeding
> 100%.
> The joblist and the log of jobid: 231303 is attached.

I guess that you are still using MessageTimeout=50, right?
I think that if you increase the MessageTimeout the number of wrong jobs will decrease.
But that's just a workaround, the actual fix is being reviewed as part of our QA.

> We are planning to do maintenance on 3/31. Can I stop debug mode at that
> time? Because the log has grown too large.

Yes, and you can already do it:
1) Restore your original SlurmctldDebug and SlurmdDebug values in slurm.conf.
2) Run "scontrol reconfigure"
3) Restore your original DebugLevel and DebugFlags in slurmdbd.conf.
4) Restart slurmdbd
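
Concretely, restoring the settings might look like this (example values; use whatever you had before the debug changes, "info" seems to be your usual level):

# slurm.conf
SlurmctldDebug=info
SlurmdDebug=info

# slurmdbd.conf (and remove, or comment out, the DebugFlags=DB_STEP line)
DebugLevel=info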

During your maintenance you can also remove the debug patch.

I'll keep you updated about the actual fix of the issue,
Albert
Comment 64 issp2020support 2021-03-29 09:31:40 MDT
Hi Albert,

> I guess that you are still using MessageTimeout=50, right?

Yes,
[root@c19u09n3 ~]# scontrol show config | grep -i timeout
BatchStartTimeout       = 10 sec
EioTimeout              = 60
GetEnvTimeout           = 2 sec
MessageTimeout          = 50 sec
PrologEpilogTimeout     = 65534
ResumeTimeout           = 60 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 300 sec
SuspendTimeout          = 30 sec
TCPTimeout              = 30 sec
UnkillableStepTimeout   = 60 sec

I will switch back to normal mode at the maintenance.
Thanks for explaining how to disable the debug settings.

Regards,
Yuichi
Comment 66 issp2020support 2021-04-12 18:58:45 MDT
Hi Albert,

The settings were changed to the following during the regular maintenance on April 1.
JobAcctGatherType=jobacct_gather/linux

This issue continued to occur. However, the number of cases is about 30% lower with jobacct_gather/linux than with jobacct_gather/cgroup.

ACCOUNT JOBID  PART    START               END                  ELAPS    CPU_EFFICIENCY
k006115 237414 F4cpu   2021/04/03/23:04:50 2021/04/04/06:27:57  26587sec 173.64%
k015013 237465 F4cpu   2021/04/04/03:33:53 2021/04/04/06:40:14  11181sec 123.53%
k035802 238853 F16cpu  2021/04/05/21:41:05 2021/04/06/00:09:30   8905sec 284.14%
k046303 239555 F4cpu   2021/04/06/16:44:53 2021/04/07/16:23:01  85088sec 174.06%
k046303 240824 F4cpu   2021/04/07/17:44:14 2021/04/08/17:23:28  85154sec 173.96%
k035802 241148 F144cpu 2021/04/08/03:08:06 2021/04/08/19:53:23  60317sec 126.72%
k016202 241536 F4cpu   2021/04/08/08:39:58 2021/04/08/08:58:20   1102sec 170.01%
k016202 241537 F4cpu   2021/04/08/08:39:59 2021/04/08/08:58:09   1090sec 169.61%
k046303 242254 F4cpu   2021/04/08/17:04:59 2021/04/08/18:04:41   3582sec 173.31%
k016202 242412 F4cpu   2021/04/08/18:51:20 2021/04/09/02:47:32  28572sec 173.45%
k016202 242413 F4cpu   2021/04/08/18:51:30 2021/04/09/00:00:26  18536sec 173.38%
k046303 242780 F4cpu   2021/04/09/16:31:06 2021/04/09/17:31:17   3611sec 171.86%
k046303 242336 F4cpu   2021/04/09/16:31:06 2021/04/10/16:10:40  85174sec 173.94%
k006115 242502 F4cpu   2021/04/09/16:31:06 2021/04/10/01:09:43  31117sec 173.77%
k027823 242976 L16cpu  2021/04/09/19:19:02 2021/04/11/15:36:38 159456sec 192.17%
k006115 243246 F4cpu   2021/04/10/16:31:36 2021/04/11/14:49:03  80247sec 173.92%
k046303 244085 F4cpu   2021/04/11/12:22:47 2021/04/11/13:22:41   3594sec 173.31%
k046303 244082 F4cpu   2021/04/11/12:22:47 2021/04/12/12:01:48  85141sec 174.00%
k006115 244362 F4cpu   2021/04/11/18:34:04 2021/04/12/07:30:34  46590sec 173.89%
k046303 244648 F4cpu   2021/04/12/15:42:16 2021/04/12/16:42:09   3593sec 173.00%
k046303 245219 F4cpu   2021/04/12/17:27:43 2021/04/12/18:27:38   3595sec 173.06%

Regards,
Yuichi
Comment 67 Albert Gil 2021-04-13 04:33:00 MDT
Hi Yuichi,

> The settings were changed to the following during the regular maintenance on
> April 1.
> JobAcctGatherType=jobacct_gather/linux
> This issue continued to occur.

Thanks for the information.
The fact that it also happens with jobacct_gather/linux is consistent with the root cause that we've found.

> However, the number of cases is about 30%
> less in linux than in cgroups.

The root problem we found is a race condition that happens when the nodes are so busy with jobs that the Slurm daemons get timeouts while trying to communicate with each other (that's your case; network issues could of course also lead to those timeouts).
Like any race condition it has some random behavior, and unrelated changes in the code can make it happen or (seem to) disappear.
The fact that it depends on how busy the nodes are makes it even more random.
So, this 30% reduction is not significant to me at this point.
Hopefully, the patch under internal QA will bring it down to 0%! ;-)

Regards,
Albert
Comment 68 issp2020support 2021-04-14 17:17:25 MDT
Hi Albert,

I have a question from a customer who manages this supercomputer.
After this issue is fixed, can I fix the wrong CPU time for previous jobs?

Regards,
Yuichi
Comment 69 Albert Gil 2021-04-15 05:32:21 MDT
Hi Yuichi,

> I have a question from a customer who manages this supercomputer.
> After this issue is fixed, can I fix the wrong CPU time for previous jobs?

Unfortunately there isn't a supported way to change wrongly gathered and accounted data, especially because there is no way to know what data should have been gathered in the first place.

Anyway, as far as I've seen from your (extended) logs, you are facing that issue only on jobs that have a really high CPU usage (at least on 1 of the compute nodes).
Maybe assuming that values >100% are actually really close to 100% won't be too far from what really happened.

Regards,
Albert
Comment 80 Albert Gil 2021-05-10 01:50:07 MDT
Hi Yuichi,

I'm glad to let you know that this issue has been fixed and the fix will be released as part of the next minor release 20.11.7.

More details on the commit:
- https://github.com/SchedMD/slurm/commit/c3325f6402d103929e5862a2d4e9ebc7e7f27678

Thanks a lot for your help debugging the issue,
Albert
Comment 81 issp2020support 2021-05-10 02:13:20 MDT
Hi Albert,

Thank you for the information about the bug fix release.

Do you have the bug fix based on 20.02?
This system is running 20.02.6.

Regards,
Yuichi
Comment 82 Albert Gil 2021-05-10 02:43:20 MDT
Hi Yuichi,

> Do you have the bug fix based on 20.02 ?

The bug fix will be released only in 20.11.7 and newer.
Once there is a new major release, bug-fix support for the older one is focused on security or very disruptive issues.

Regards,
Albert
Comment 83 issp2020support 2021-06-02 20:51:18 MDT
Hi Albert,

We will update our systems to 20.11.7 at next maintenance(2021.06.11).

Should we return the following parameters to their default values?

MessageTimeout=50  --> 10
JobAcctGatherType=jobacct_gather/linux  --> jobacct_gather/cgroup

Also, it was recommended to modify the values of the following parameters in (SlurmVersion)/src/common/pack.c in bug 10735, but should I modify them in 20.11.7 as well?

-#define MAX_ARRAY_LEN_SMALL    10000
-#define MAX_ARRAY_LEN_MEDIUM   1000000
-#define MAX_ARRAY_LEN_LARGE    100000000
+#define MAX_ARRAY_LEN_SMALL    100000
+#define MAX_ARRAY_LEN_MEDIUM   10000000
+#define MAX_ARRAY_LEN_LARGE    1000000000

Regards,
Yuichi
Comment 84 Albert Gil 2021-06-03 00:59:07 MDT
Hi Yuichi,

> We will update our systems to 20.11.7 at next maintenance(2021.06.11).

Good.

> Would you like to return the following parameter values to default values?
> 
> MessageTimeout=50  --> 10
> JobAcctGatherType=jobacct_gather/linux  --> jobacct_gather/cgroup

Yes please.
Those changes were only part of the debug/diagnose/workaround steps for this bug, but you don't need them anymore.

> Also, it was recommended to modify the values of the following parameters in
> (SlurmVersion)/src/common/pack.c in bug 10735, but should I modify them in
> 20.11.7 as well?
> 
> -#define MAX_ARRAY_LEN_SMALL    10000
> -#define MAX_ARRAY_LEN_MEDIUM   1000000
> -#define MAX_ARRAY_LEN_LARGE    100000000
> +#define MAX_ARRAY_LEN_SMALL    100000
> +#define MAX_ARRAY_LEN_MEDIUM   10000000
> +#define MAX_ARRAY_LEN_LARGE    1000000000

From your bug 10735 comment 23 I understand that the patch is working well as a workaround for you now, so I would recommend that you keep it a bit longer.
I see that we have some internal work in progress related to that issue, but we are still evaluating the possible solutions, so we do not yet have a final fix, and the issue would most probably reappear on your system if you removed the patch.

Regards,
Albert
Comment 85 issp2020support 2021-07-15 11:56:59 MDT
Hi Albert,

After the version upgrade, I monitored the system for one month, and no jobs with CPU time exceeding the elapsed time occurred.
I reported to the customer that it was fixed.
Thank you for your support.

Regards,
Yuichi
Comment 86 Albert Gil 2021-07-15 12:34:58 MDT
Hi Yuichi,

Thanks for sharing it.
I'm glad that it works as expected.

Regards,
Albert