Created attachment 17659 [details] slurm.conf

Hi, SchedMD support team,

As shown below, some jobs were found whose CPU time exceeded the elapsed time. I don't think the CPU time should be able to exceed the elapsed time, so please tell me why this happens. Hyper-threading is turned off in the settings.

[root@ohtaka1 ~]# sacct -j 108722 -o jobid,account,start,end,cputime%15,totalcpu%15
       JobID    Account               Start                 End         CPUTime        TotalCPU
------------ ---------- ------------------- ------------------- --------------- ---------------
108722          k042300 2020-12-16T05:16:28 2020-12-16T15:36:15   3966-14:43:12   7833-09:07:49
108722.batch    k042300 2020-12-16T05:16:28 2020-12-16T15:36:15     55-02:12:16       00:01.144
108722.0        k042300 2020-12-16T05:16:30 2020-12-16T15:36:14   3966-07:02:24   7833-09:07:48
[root@ohtaka1 ~]# seff 108722
Job ID: 108722
Cluster: ohtaka
User/Group: k042300/k0423
State: COMPLETED (exit code 0)
Nodes: 72
Cores per node: 128
CPU Utilized: 7833-09:07:49
CPU Efficiency: 197.48% of 3966-14:43:12 core-walltime
Job Wall-clock time: 10:19:47
Memory Utilized: 2.70 TB (estimated maximum)
Memory Efficiency: 16.67% of 16.17 TB (230.00 GB/node)
WorkDir: /work/k0423/k042300/work/Polymer/rho0.40/Ns200Np5000/_reT1.60PPwca/_sample_20/run01

(job script)

[root@ohtaka1 ~]# cat /work/k0423/k042300/work/Polymer/rho0.40/Ns200Np5000/_reT1.60PPwca/_sample_20/run01/run.sh
#!/bin/sh
#SBATCH -p B72cpu
#SBATCH -N 72
#SBATCH -n 9216
#SBATCH -J test
#SBATCH -t 11:59:59
#SBATCH --mail-type=BEGIN
srun /home/k0423/k042300/lammps_work/src/lmp_asano -in in.polymer
[root@ohtaka1 ~]#
Hi,

> As shown below, some programs were found whose CPU time exceeded the elapsed
> time. I don't think the CPU time will exceed the elapsed time,

Actually, CPUTime typically exceeds the Elapsed time, since it counts all the CPUs allocated to the job, and a job usually allocates more than one CPU.

From the sacct docs (https://slurm.schedmd.com/sacct.html#OPT_CPUTime):

CPUTime
    Time used (Elapsed time * CPU count) by a job or step in HH:MM:SS format.

> but please
> tell me why this happens. Hyper-threading is turned off in the settings.
>
> [...]

This job is requesting 72 nodes, and based on your slurm.conf those nodes have 2 Sockets and 64 CoresPerSocket. In rounded numbers, the CPUTime in days that I would expect from this job is:

~10 Elapsed hours * 72 Nodes * 2 Sockets per node * 64 CoresPerSocket / 24 h/day = ~3840 days of CPUTime.

Your sacct command is reporting ~3966 days, so it seems fine to me. Does it make sense to you too?

Note also that the job is requesting the equivalent number of CPUs via its 9216 tasks (with the default 1 core per task). The rounded math could also be done using the -n value:

~10 Elapsed hours * 9216 Cores / 24 h/day = ~3840 days of CPUTime.

Am I answering your question?

Regards,
Albert
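The rounded math above can be checked with plain shell arithmetic; this is just an illustrative sketch, with the values taken from job 108722 (the variable names are made up for readability):

```shell
# Expected CPUTime (in days) from the allocation, per the rounded math above.
# Values are taken from job 108722; adjust for your own job.
elapsed_hours=10
nodes=72
sockets_per_node=2
cores_per_socket=64

cpu_days=$(( elapsed_hours * nodes * sockets_per_node * cores_per_socket / 24 ))
echo "Expected CPUTime: ~${cpu_days} days"   # ~3840 days, close to the reported ~3966
```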
Sorry, my question may have been unclear. My correct question is why "TotalCPU" exceeds "CPUTime".

CPUTime : 3966-14:43:12
TotalCPU : 7833-09:07:49
CPU Efficiency : 197.48%

I think CPU efficiency is usually less than 100%, but for this job it is 197%. Is this the correct value?
Hi,

> My correct question is why the "TotalCPU" exceeds "CPUTime".

Sorry, now I understand the problem better.

> CPUTime : 3966-14:43:12
> TotalCPU : 7833-09:07:49
> CPU Efficiency : 197.48%
>
> I think the CPU efficiency is usually less than 100%, but this job is 197%.
> Is this the correct value?

I guess that you are aware of this, but just to provide some context: CPUTime is computed from the values that the user requested and Slurm allocated for the job, while TotalCPU is a value gathered from the system while your job was running, in your case using jobacct_gather/cgroup.

Note that in some configurations, if jobs are NOT constrained to use only the allocated resources, it can happen that a job was allocated N CPUs of a node but escapes and uses more than that. So this could be expected in some configurations, but I'm not yet certain that's your case.

Can you attach your cgroup.conf? Also, can you post the output of "slurmd -C" on any node of the partition B72cpu?

I know that we had some issues of duplicated accounting in some cases with jobacct_gather/cgroup (eg bug 8656), but I'm not certain this is your case either. I'll study your issue in more depth.

Regards,
Albert
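For reference, the seff output quoted earlier reports efficiency as TotalCPU relative to CPUTime. A rough sketch with the durations from job 108722, converted to seconds by hand, reproduces the same 197.48% figure:

```shell
# Rough sketch: CPU efficiency as the ratio TotalCPU / CPUTime,
# using the durations reported for job 108722:
#   TotalCPU = 7833-09:07:49, CPUTime = 3966-14:43:12
totalcpu_s=$(( 7833*86400 + 9*3600 + 7*60 + 49 ))
cputime_s=$(( 3966*86400 + 14*3600 + 43*60 + 12 ))

awk -v t="$totalcpu_s" -v c="$cputime_s" \
    'BEGIN { printf "CPU Efficiency: %.2f%%\n", 100 * t / c }'
# prints "CPU Efficiency: 197.48%", matching the seff output
```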
The output of "slurmd -C" and the contents of cgroup.conf are as follows.

[root@tempnode4 ~]# /usr/sbin/slurmd -C
NodeName=tempnode4
slurmd: Considering each NUMA node as a socket
CPUs=128 Boards=1 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257514
UpTime=5-01:36:17
[root@tempnode4 ~]#
[root@tempnode4 ~]# cat /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
AllowedRAMSpace=98
ConstrainCores=yes
#ConstrainRAMSpace=no
ConstrainRAMSpace=yes

Best regards,
Yuichi
Hi Yuichi,

With the information provided, I agree that this looks like a bug.

Just to double check: in your last comment the output is from "tempnode4", but that nodename is not part of the partition B72cpu, is it?

I'm trying to reproduce it; are you able to reproduce it? Do you have the batch script that was submitted?

I'm wondering if the duplication is only impacting the CPU. Could you attach the output of:

$ sacct -j 108722 -p -o JobID,SystemCPU,UserCPU,TotalCPU,ConsumedEnergy,NodeList,NTasks,AllocTRES,CPUTime,MaxRSS,AveRSS

Also, do you have the slurmd logs of one of the nodes allocated by 108722?

Anyway, if you are able to reproduce it, could you switch to JobAcctGatherType=jobacct_gather/linux and see if it still happens?

Thanks,
Albert
Created attachment 17759 [details] slurmd.log
Created attachment 17760 [details] output of sacct for 96 jobs
"tempnode4" is a spare node, not a node assigned to B72cpu. However, it has the same configuration as the nodes assigned to B72cpu.

It is necessary to ask the user for the reproduction test. I will ask the user.

The output of sacct for 108722 is as follows.

[root@ohtaka1 cpu_efficiency]# sacct -j 108722 -p -o JobID,SystemCPU,UserCPU,TotalCPU,ConsumedEnergy,NodeList,NTasks,AllocTRES,CPUTime,MaxRSS,AveRSS
JobID|SystemCPU|UserCPU|TotalCPU|ConsumedEnergy|NodeList|NTasks|AllocTRES|CPUTime|MaxRSS|AveRSS|
108722|8-02:15:51|7825-06:51:58|7833-09:07:49|0|c02u01n[1-4],c02u02n[1-4],c02u03n[1-4],c02u04n[1-4],c02u05n[1-4],c02u06n[1-4],c02u07n[1-4],c02u08n[1-4],c02u09n[1-4],c02u10n[1-4],c02u11n[1-4],c02u12n[1-4],c02u13n[1-4],c02u14n[1-4],c02u15n[1-4],c02u16n[1-4],c02u17n[1-4],c02u18n[1-4]||billing=9216,cpu=9216,mem=16560G,node=72|3966-14:43:12|||
108722.batch|00:00.810|00:00.333|00:01.144|0|c02u01n1|1|cpu=128,mem=230G,node=1|55-02:12:16|9368K|9368K|
108722.0|8-02:15:50|7825-06:51:57|7833-09:07:48|0|c02u01n[1-4],c02u02n[1-4],c02u03n[1-4],c02u04n[1-4],c02u05n[1-4],c02u06n[1-4],c02u07n[1-4],c02u08n[1-4],c02u09n[1-4],c02u10n[1-4],c02u11n[1-4],c02u12n[1-4],c02u13n[1-4],c02u14n[1-4],c02u15n[1-4],c02u16n[1-4],c02u17n[1-4],c02u18n[1-4]|9216|cpu=9216,mem=16560G,node=72|3966-07:02:24|314176K|471962024|

I have also attached the slurmd.log for c02u01n1, to which 108722 was assigned.

When I checked B72cpu, F72cpu, B144cpu, and F144cpu from 2020/12 to 2021/01, 96 jobs whose CPU efficiency exceeded 100% were found. I also attach the sacct output for those 96 jobs.
Hi Yuichi,

> "tempnode4" is a spare node, not a node assigned to B72cpu. However, it has
> the same configuration as the node assigned to B72cpu.

Thanks. Would you attach the output of "lscpu" from that node?

> It is necessary to ask the user for the reproduction test. I will ask the
> user.

Ok, thanks.

> When I checked B72cpu, F72cpu, B144cpu, and F144cpu from 2020/12 to 2021/01,
> 96 jobs that exceeded CPU efficiency by 100% were found. I also attach the
> sacct for 96 jobs.

Then it looks like it is quite easy to reproduce, so maybe we can try a simple reproducer. Could you run this pair of 10-second jobs and attach their sacct output?

$ sbatch -p B72cpu -N72 -n 9216 --wrap "srun stress -c 1 -t 10"
$ sbatch -p B72cpu -N72 -n 9216 --wrap "srun stress -c 2 -t 10"

Is the TotalCPU of the second one double its CPUTime? The first one should not be.

If you can reproduce it with these short stress jobs, could you try to switch to JobAcctGatherType=jobacct_gather/linux and see if you can still reproduce it?

Thanks,
Albert
Created attachment 17913 [details] lscpu
Hi Albert,

Sorry for the late reply. I attached the result of lscpu; please check it.

I'm asking the user who has the reproduction program, so please wait a moment. And I will try running stress.

Regards,
Yuichi
Hi Yuichi,

It seems that there is no issue regarding the number and allocation of CPUs, but rather in how the CPU usage is gathered.

> If you can reproduce it with this short stress jobs, could you try to switch
> to JobAcctGatherType=jobacct_gather/linux and see if you still can reproduce it?

Have you been able to reproduce the issue? Have you verified whether it gets fixed/worked around using jobacct_gather/linux?

Finally, could you build the following code, run it on the failing nodes and attach your output:

$ cat bug10723-clk.c
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Clock ticks per sec : %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}
$ gcc -o bug10723-clk bug10723-clk.c
$ ./bug10723-clk
Clock ticks per sec : 100

Thanks,
Albert
Hi Albert,

Thank you for your support. It could not be reproduced by the stress program.

$ sacct -j 196350,196351 -ojobid,jobname,partition,account,alloccpus,elaps,totalcpu
       JobID    JobName  Partition    Account  AllocCPUS    Elapsed   TotalCPU
------------ ---------- ---------- ---------- ---------- ---------- ----------
196350         core1.sh     F72cpu     dell00       9216   00:00:18 1-10:19:23
196350.batch      batch                dell00        128   00:00:18  00:00.851
196350.0      stress-ng                dell00       9216   00:00:17 1-10:19:22
196351         core2.sh     F72cpu     dell00       9216   00:00:18 1-10:15:38
196351.batch      batch                dell00        128   00:00:18  00:00.889
196351.0      stress-ng                dell00       9216   00:00:18 1-10:15:37

[dell00@ohtaka1 stress]$ cat core1.sh
#!/bin/sh

#SBATCH -N 72
#SBATCH -n 9216

srun /home/system/tool/stress-ng/stress-ng -c 1 -t 10
[dell00@ohtaka1 stress]$ cat core2.sh
#!/bin/sh

#SBATCH -N 72
#SBATCH -n 9216

srun /home/system/tool/stress-ng/stress-ng -c 2 -t 10

F72cpu uses the same resources as B72cpu; the only difference is the priority.

When I checked the tendency of user jobs, I found that the CPU efficiency may or may not exceed 100% even with the same program, so it does not depend on the user's job.

I want to try switching to "JobAcctGatherType=jobacct_gather/linux", but I can't check until the regular maintenance in March, because this system does not allow config changes outside of regular maintenance. Therefore, please wait for a while.

And here is the result of executing bug10723-clk:

[dell00@c01u01n1 bug10723-clk]$ ./bug10723-clk
Clock ticks per sec : 100
[dell00@c01u01n1 bug10723-clk]$ cat bug10723-clk.c
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Clock ticks per sec : %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}

Best regards,
Yuichi
Hi Yuichi,

> It could not be reproduced by the stress program.
>
> [...]
>
> F72cpu uses the same resources as B72cpu.

Thanks for the check. Just to be 100% sure that we cannot reproduce it, could you try with a core16.sh containing:

srun /home/system/tool/stress-ng/stress-ng -c 16 -t 10

> I checked the tendency of user jobs, I found that the total CPU may or may
> not exceed 100% even with the same program, so it does not depend on user
> jobs.

So far I would say that the problem is that the gathering is (at least) duplicated (sometimes?). So, exceeding 100% or not is not the key part: if the original efficiency is below 50%, the duplicated efficiency will still be under 100%, but it will also be wrong. Is the efficiency of those jobs sometimes double or half that of the previous runs? Or is it just slightly above and below 100%?

> And, the result of executing bug10723 is shown.
>
> [dell00@c01u01n1 bug10723-clk]$ ./bug10723-clk
> Clock ticks per sec : 100

Ok, thanks for the check. This is quite standard.

> I want to try to switch to "JobAcctGatherType = jobacct_gather/linux", but I
> can't check until the regular maintenance in March because this system is
> not allowed to change the config unless it is in regular maintenance.
> Therefore, please wait for a while.

Understood.

The key point is to find a reproducer; that will allow us to investigate the issue.

At this point I think that this is a duplicate of bug 6541, which never got properly fixed. We were not able to reproduce it there, and I guess that you are facing the same issue even a couple of major versions later. In both cases the issue seems to be in jobacct_gather/cgroup. But in that case we were able to reproduce the issue using stress, and I'm not sure why you are not.

In bug 8656 we found a very similar issue in jobacct_gather/cgroup, with duplicated gathered data (fixed in 20.11.3). But that was only on the .extern steps, and that's not your case. It could still be similar, though. Anyway, those are only references, more important to me than to you.

If this is ok for you, to investigate your case I will provide you with a patch that increases the logging related to jobacct_gather/cgroup. If you can apply it in your regular maintenance, we could get more info about what is going on.

Can you confirm the Slurm version that you are planning to work with after the next maintenance? Is it at the beginning of March, or at the end of it?

Thanks,
Albert
Hi Albert,

> The key point is to find a reproducer.
> That will allow us to investigate the issue.

I understand. However, it seems that there is no reproducibility, because the same job sometimes exceeds 100% and sometimes does not.

> So far I would say that the problem is that the gathering is (at least) duplicated (sometimes?).
> So, exceeding or not is not the key part because if the original efficiency is below 50% the
> duplicated efficiency will still be under 100% but it would be also wrong.
> Is the efficiency of those jobs sometimes the double/half of the previous runs? Or is just
> slightly above and below 100%?

In the following cases, two identical programs (only with different parameters) may or may not exceed 100%. These differ by about a factor of two.

[root@ohtaka1 ~]# cd /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T800/run01
[root@ohtaka1 run01]# ls
fluid  in.swave  in.swave~  left  log.lammps  right  run.sh  slurm-149557.out
[root@ohtaka1 run01]# cat run.sh
#!/bin/sh
#SBATCH -p B72cpu   # specify the partition
#SBATCH -N 72
#SBATCH -n 9216
#SBATCH -J _swav
#SBATCH -t 11:59:59
#SBATCH --mail-type=BEGIN
srun /home/k0152/k015204/lmp_asano -in in.swave
[root@ohtaka1 run02]# seff 149557
Job ID: 149557
Cluster: ohtaka
User/Group: k015204/k0152
State: COMPLETED (exit code 0)
Nodes: 72
Cores per node: 128
CPU Utilized: 6844-20:12:56
CPU Efficiency: 197.50% of 3465-16:57:36 core-walltime
Job Wall-clock time: 09:01:31
Memory Utilized: 1.72 TB (estimated maximum)
Memory Efficiency: 10.64% of 16.17 TB (230.00 GB/node)
WorkDir: /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T800/run01

[root@ohtaka1 run01]# cd /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T1000/run02
[root@ohtaka1 run02]# ls
fluid  in.swave  in.swave~  left  log.lammps  right  run.sh  slurm-149558.out  temp.out
[root@ohtaka1 run02]# cat run.sh
#!/bin/sh
#SBATCH -p B72cpu   # specify the partition
#SBATCH -N 72
#SBATCH -n 9216
#SBATCH -J _swav
#SBATCH -t 11:59:59
#SBATCH --mail-type=BEGIN
srun /home/k0152/k015204/lmp_asano -in in.swave
[root@ohtaka1 run02]# seff 149558
Job ID: 149558
Cluster: ohtaka
User/Group: k015204/k0152
State: COMPLETED (exit code 0)
Nodes: 72
Cores per node: 128
CPU Utilized: 3827-03:52:50
CPU Efficiency: 99.44% of 3848-20:28:48 core-walltime
Job Wall-clock time: 10:01:23
Memory Utilized: 1.73 TB (estimated maximum)
Memory Efficiency: 10.70% of 16.17 TB (230.00 GB/node)
WorkDir: /work/k0152/k015204/sound/_CriticalSound/Lx10050Ly100Lz100/T0.95/_A5/T1000/run02

> Can you confirm the Slurm version that you are planning to work with on the next maintenance?

We will work with Slurm 20.02.6.

> Is at the beginning on March, or at the end of it?

12th March.

Regards,
Yuichi
Hi Yuichi,

> In the following cases, two identical programs (only with different
> parameters)

Thanks for the information. I guess that, as they are in different directories, the parameters differ because of that; besides that, they are identical. Do you have the slurmd logs of both executions?

> may or may not exceed 100%. These are about twice as different.

Yes, that duplication is the key part to me. It seems that, in some cases, jobacct_gather/cgroup is counting the CPU usage twice.

Being able to reproduce it is also important. Do you think you could reproduce a failing job by running it again from the same directory, or from a duplicated one or similar? Or are those jobs too big, so that would disturb your cluster too much?

> > Is at the beginning on March, or at the end of it?
>
> 12th March.

Perfect. I'll attach a patch to add extra debug logs related to jobacct_gather/cgroup, so after the maintenance we can get better logs. If you are able to reproduce it before the maintenance, then we will be able to quickly reproduce it with the patch. If not, we'll wait until it is reproduced by some job from the users.

> > Can you confirm the Slurm version that you are planning to work with on the next maintenance?
>
> We will work with slurm 20.02.6.

The patch will be based on version 20.02.6, then.

Thanks,
Albert
Created attachment 18174 [details] slurmd.log
Hi Albert,

Thank you for your support.

> Do you have the slurmd logs of both executions?

I have attached the slurmd.log for the node where both jobs executed.

[root@sched1 ~]# sacct -j 149557,149558 --allocations -o jobid,jobname,partition,account,start,end
       JobID    JobName  Partition    Account               Start                 End
------------ ---------- ---------- ---------- ------------------- -------------------
149557            _swav     B72cpu    k015204 2021-01-16T17:52:41 2021-01-17T02:54:12
149558            _swav     B72cpu    k015204 2021-01-16T18:01:16 2021-01-17T04:02:39

> Do you think that you could be able to reproduce a job that failed running it
> again from same directory or from a duplicated one or similar?
> Or are those jobs too big and that would disturb your cluster too much?

This is difficult to reproduce. I have not yet obtained permission from the user to use the program.

> I'll attach a patch to add extra debug logs related to jobacct_gather/cgroups,
> so after the maintenance we can get better logs.
> If you are able to reproduce it before the maintenance, then we will be able
> to quickly reproduce it with the patch.
> If not, we'll wait until it is reproduced by some job from users.

Thank you for creating the patch. I will apply it during maintenance and wait for the issue to be reproduced.

Regards,
Yuichi
Hi Yuichi,

Just to provide you with a quick update. I've gone through your slurmd logs, but unfortunately I don't see anything that could help us with this issue.

I also have a first version of a patch adding extra debug logs on slurmd, but I want to enhance it to also add extra debug logs on the RPC messages that send the gathered information from the nodes to the DB, just in case the issue is not in the information gathered on each node, but in the final aggregation.

I'll attach the final version this week.

Regards,
Albert
Hi Albert,

Thank you for your support. We need the final version by the end of today; after tomorrow it will be difficult to apply it in time for the maintenance on March 12th, and it would have to wait for the next maintenance (April 1st).

Regards,
Yuichi
Created attachment 18340 [details] Patch for 20.02.6-1 with extra debug logs

Hi Yuichi,

I'm attaching the final version of the debug patch. This patch only adds extra debug log messages related to how the user and system CPU time for each step is gathered from the OS and then stored into the DB. It won't fix the issue, and it doesn't change any other behavior of 20.02.6.

Although they should not be too verbose (maybe a bit on slurmd), I still wanted to allow you to enable/disable these extra logs as easily as possible. In newer Slurm versions we have a specific DebugFlags=JobAccountGather [1], but we don't have that on 20.02.6 to enable/disable them very easily. For this reason I've decided to just use the "debug" level of the normal log, and to reuse the DB_STEP flag.

So, to enable the extra logs you need to specify this in slurm.conf:

SlurmctldDebug=debug
SlurmdDebug=debug

And this in slurmdbd.conf:

DebugLevel=debug
DebugFlags=DB_STEP

Most of the new log messages have the "BUG10723: " prefix, so you can grep for it in the logs (especially the slurmd logs) to double-check that they are enabled.

Once you have the patch in place, it would be perfect if you are able to reproduce the issue as you did in comment 18, and send me the logs of slurmdbd, slurmctld and slurmd for both a failing and a non-failing case.

And once you manage to do that, you can disable the extra logs by using "info" again instead of "debug", as it seems that you usually do.

Please let me know if you have any questions.

Regards,
Albert

[1] https://github.com/SchedMD/slurm/blob/master/NEWS#L20
Hi Albert,

Thanks for creating the patch. As scheduled, we will apply the patch and change the slurm.conf settings during the maintenance tomorrow.

Regards,
Yuichi
Created attachment 18422 [details] slurmd.log for jobid 214551
Created attachment 18423 [details] slurmd.log for jobid 214553
Created attachment 18424 [details] slurmdbd.log on acct(acct is slurm servers hostname)
During the maintenance on March 12, we applied the debug patch and changed the config. Among the jobs executed yesterday, 4 jobs with CPU efficiency exceeding 100% were found. We are sending you information about jobids 214551 and 214553.

[root@ohtaka1 ~]# sacct -j 214551 -o jobid,jobname,partition,account,alloccpus,start,end,elaps,totalcpu%13,node%30
       JobID    JobName  Partition    Account  AllocCPUS               Start                 End    Elapsed      TotalCPU                       NodeList
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ------------- ------------------------------
214551          tinb3up      F4cpu    m003902        512 2021-03-13T22:47:54 2021-03-14T02:19:22   03:31:28  186-02:40:20 c16u12n[2-3],c17u18n3,c19u16n4
214551.batch      batch               m003902        128 2021-03-13T22:47:54 2021-03-14T02:19:22   03:31:28     00:00.114                       c16u12n2
214551.0       vasp_std               m003902        512 2021-03-13T22:47:54 2021-03-14T02:19:21   03:31:27  186-02:40:20 c16u12n[2-3],c17u18n3,c19u16n4
[root@ohtaka1 ~]# seff 214551
Job ID: 214551
Cluster: ohtaka
User/Group: m003902/m0039
State: COMPLETED (exit code 0)
Nodes: 4
Cores per node: 128
CPU Utilized: 186-02:40:21
CPU Efficiency: 247.53% of 75-04:30:56 core-walltime
Job Wall-clock time: 03:31:28
Memory Utilized: 609.96 GB (estimated maximum)
Memory Efficiency: 66.30% of 920.00 GB (230.00 GB/node)
WorkDir: /work/m0039/m003902/calc/pmnpt/slab/7ML/up/TiNb/vol/3
[root@ohtaka1 ~]# sacct -j 214553 -o jobid,jobname,partition,account,alloccpus,start,end,elaps,totalcpu%13,node%30
       JobID    JobName  Partition    Account  AllocCPUS               Start                 End    Elapsed      TotalCPU                       NodeList
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ------------- ------------------------------
214553          mgnb3up      F4cpu    m003902        512 2021-03-13T22:58:29 2021-03-14T04:21:18   05:22:49  199-02:56:42 c17u06n[3-4],c17u07n2,c17u18n4
214553.batch      batch               m003902        128 2021-03-13T22:58:29 2021-03-14T04:21:18   05:22:49     00:00.115                       c17u06n3
214553.0       vasp_std               m003902        512 2021-03-13T22:58:29 2021-03-14T04:21:17   05:22:48  199-02:56:42 c17u06n[3-4],c17u07n2,c17u18n4
[root@ohtaka1 ~]# seff 214553
Job ID: 214553
Cluster: ohtaka
User/Group: m003902/m0039
State: COMPLETED (exit code 0)
Nodes: 4
Cores per node: 128
CPU Utilized: 199-02:56:43
CPU Efficiency: 173.48% of 114-18:42:08 core-walltime
Job Wall-clock time: 05:22:49
Memory Utilized: 726.76 GB (estimated maximum)
Memory Efficiency: 79.00% of 920.00 GB (230.00 GB/node)
WorkDir: /work/m0039/m003902/calc/pmnpt/slab/7ML/up/MgNb/vol/3

The F4cpu partition uses 4 nodes, and it uses different nodes from the B64cpu partition, but the node configuration is the same as B64cpu.

I have attached the slurmdbd.log and slurmd.log with information on jobids 214551 and 214553. For slurmd.log, only the parts related to these jobs are excerpted. Please let me know if you need any other information.
"B64cpu" above is a typo; the correct name is B72cpu.
Hi,

> During the maintenance on March 12, we applied a patch for debugging and
> changed the config.

Thanks for that.

> Among the jobs executed yesterday, 4 jobs with CPU
> efficiency exceeding 100% were found.

Good that we are still able to reproduce the issue.

> We sent you information about jobid 214551 and 214553.
>
> Job ID: 214551
> CPU Efficiency: 247.53% of 75-04:30:56 core-walltime
>
> Job ID: 214553
> CPU Efficiency: 173.48% of 114-18:42:08 core-walltime

I'll look into these logs. Please note, though, that what could help me even more is something similar to comment 18: a job presenting the issue and another (very) similar job that went ok. That would help me see when/where/how the gathering is not working properly.

> The F4cpu partition is a partition that uses 4 nodes and uses a different
> node from the B72cpu partition, but the node configuration is the same as
> the B72cpu.

Thanks for the info, I see it in slurm.conf.

> I have attached slurmdbd.log and slurmd.log for information on jobid 214551
> and 214553. For slurmd.log, only the parts related to this job are excerpted.
> Please let me know if you have any other information to take.

I've already seen that the logs contain the patch information. I'll look further into the details, but I've already seen some unexpected numbers, and I would like to quickly rule out a known problem with a buggy version of glibc (bug 10824). Are you using glibc 2.17-322?

Could you build and run this small app on the compute nodes (eg c16u12n2) to see if it prints 0.000 or nan?

$ cat glibctest.c
#include <stdio.h>

void main() {
    long double d = 0;
    printf("%Lg\r\n", d);
    printf("%Lf\r\n", d);
}
$ gcc glibctest.c -o glibctest
$ ./glibctest

Thanks,
Albert
Hi Albert,

Thank you for your support.

> Are you using glibc 2.17-322?

[dell00@c16u12n2 ~]$ rpm -q glibc
glibc-2.28-127.el8.x86_64

> Could you build and run this small app on the compute nodes (eg c16u12n2) to
> see if it prints 0.000 or nan?

It printed 0.000:

[dell00@c16u12n2 ~]$ ./glibctest
0
0.000000

Regards,
Yuichi
Hi Yuichi,

I'm still investigating, but with the logs I've been able to find a first lead to work with.

My current theory under investigation is that the gathered information is correct per task and per node, but it is sent twice (or even more times) from the N nodes to the batch node, because the RPC message hits a timeout and is sent again. If that's the case, we'll need to fix it.

In the meantime, it would be good if you increased your MessageTimeout in slurm.conf. You are using the default 10s, and I can see that it is timing out (probably because the batch node is really, really busy with the job itself). Could you try to set "MessageTimeout=50" in your slurm.conf, run "scontrol reconfigure", and let me know if you still face the same issue? If you do, could you attach new logs of those jobs? If you don't face the issue anymore, then we are on the right path to fix this! ;-)

Thanks,
Albert
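The retry theory can be illustrated with a toy sketch (this is not Slurm code, just a hypothetical model of the failure mode): if the sender retries the stats message on timeout and the receiver adds every copy it gets without de-duplication, the accumulated total doubles.

```shell
# Toy illustration of the suspected failure mode (not Slurm code):
# the same 100-second CPU sample is delivered twice because of a retry,
# and the aggregator sums every message it receives.
total=0
for msg in 100 100; do        # second "100" is the retried duplicate
    total=$(( total + msg ))
done
echo "aggregated CPU seconds: $total"   # 200: double the real usage
```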
Hi Albert,

Thank you for your support.

> You are using the default 10s and I can see that it is timing out (probably
> because the batch node is really-really busy for job itself).
> Could you try to set "MessageTimeout=50" in your slurm.conf and "scontrol
> reconfigure" and let me know if you still face the same issue?

The system is currently in service; can we make this change even while jobs are running?

Regards,
Yuichi
Hi Yuichi,

> > You are using the default 10s and I can see that it is timing out (probably
> > because the batch node is really-really busy for job itself).
> > Could you try to set "MessageTimeout=50" in your slurm.conf and "scontrol
> > reconfigure" and let me know if you still face the same issue?
>
> The system is currently in service, can we make this change even if there
> are jobs running?

Yes, no problem at all.

You can verify the current MessageTimeout value with a command like this:

$ scontrol show config | grep MessageTimeout
MessageTimeout = 10 sec

Then you can edit slurm.conf, adding this line:

MessageTimeout=50

And once you run this:

$ scontrol reconfigure

you can verify that the parameter has been updated with the initial command:

$ scontrol show config | grep MessageTimeout
MessageTimeout = 50 sec

The cluster will keep running normally. And if I'm right, I expect all new jobs to start having a proper TotalCPU value and no more impossible efficiency values... ;-)

Regards,
Albert
Hi Albert,

Thanks. Today I set MessageTimeout=50. We will continue to check whether this issue occurs.

Regards,
Yuichi
Hi Albert,

I changed the MessageTimeout to 50 on the 19th, but unfortunately this problem has reappeared. I will attach 2 logs out of the 8 reproduced cases. Both are jobs that ran on four nodes.

Regards,
Yuichi
Created attachment 18574 [details] slurmd.log for jobid 222362
Created attachment 18575 [details] slurmd.log for jobid 222423
Created attachment 18576 [details] slurmctld.log on sched1(slurm ctl server)
Created attachment 18577 [details] slurmdbd.log on acct
Created attachment 18578 [details] scontrol show config
Created attachment 18579 [details] Reproduced jobs list
Hi Yuichi, > I changed the MessageTimeout to 50 on the 19th, but unfortunately this > problem has reappeared. I will attach 2 logs out of the 8 reproduced cases. > Both are jobs that run on four nodes. Thanks for the logs, it helps a lot. Although you have been able to reproduce the issue with a higher MessageTimeout, the new logs still show the same problem that I've detected before, related to a timeout. Actually, it seems strange to because althou in the "scontrol show config" the MessageTimeout value is correct, in your logs I can see 2 timeouts with only 10s between them (actually 11s because the code include 1s extra): 222423/slurmd.c19u16n4.log:[2021-03-20T03:06:39.454] [222423.0] debug: slurm_recv_timeout at 0 of 4, timeout 222423/slurmd.c19u16n4.log:[2021-03-20T03:06:50.464] [222423.0] debug: slurm_recv_timeout at 0 of 4, timeout 222423/slurmd.c19u15n4.log:[2021-03-20T03:06:39.431] [222423.0] debug: slurm_recv_timeout at 0 of 4, timeout 222423/slurmd.c19u15n4.log:[2021-03-20T03:06:50.442] [222423.0] debug: slurm_recv_timeout at 0 of 4, timeout 222423/slurmd.c19u16n2.log:[2021-03-20T03:06:39.436] [222423.0] debug: slurm_recv_timeout at 0 of 4, timeout 222423/slurmd.c19u16n2.log:[2021-03-20T03:06:50.446] [222423.0] debug: slurm_recv_timeout at 0 of 4, timeout Maybe this job was started before you did the reconfigure? Note that only jobs started after the "scontrol reconfigure" will be actually using the higher MessageTimeout. Anyway, if you still have this issue, if you want we can try to increase the MessageTimeout to 100, but i think that that's not necessary. I have other news on the issue: 1) I've been able to reproduce it using MessageTimeout=1 and slowing down some threads to get those timeouts, and I get duplicated (or even triplicated) TotalCPU with the same logs that you get. And that's good news because it allows me to work on a fix. 2) But I've been able to reproduce it only on 20.02, not on 20.11. 
We've made some changes to how those messages are handled on 20.11, and it seems the problem was (collaterally) solved there, but I need to investigate this a bit more to be sure.

Regards,
Albert
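The timeout/retry race described above can be sketched as a minimal simulation. This is not Slurm's actual code; the function and variable names are made up for illustration. The only details taken from the thread are the effective timeout (MessageTimeout plus 1 extra second) and the observed outcome (a duplicated TotalCPU):

```python
# Minimal sketch (hypothetical, not Slurm's code) of how a timeout-driven
# retry can double-count a step's accounting data: a busy receiver acks
# too slowly, the sender retries the same record, and the receiver
# processes every delivered copy.

MESSAGE_TIMEOUT = 10  # seconds; per the logs, the code adds 1s extra

def send_with_retry(deliver, ack_delay, timeout=MESSAGE_TIMEOUT + 1):
    """Deliver a message, retrying if the ack misses the timeout."""
    attempts = 0
    while True:
        attempts += 1
        deliver()
        if ack_delay <= timeout:   # ack arrived in time -> done
            return attempts
        ack_delay -= timeout       # timed out: resend the same record

received = []
step_cpu_seconds = 3600

# A receiver that takes 15s to ack, against an 11s effective timeout,
# triggers one retry, so the same record is delivered twice.
attempts = send_with_retry(lambda: received.append(step_cpu_seconds),
                           ack_delay=15)
total_cpu = sum(received)
print(attempts, total_cpu)   # 2 7200 -> TotalCPU doubled
```

With a higher MessageTimeout the ack usually lands inside the window, so no retry happens and the record is counted once, which is why raising the timeout reduced (but could not eliminate) the wrong jobs.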
Hi Yuichi,

> 2) But I've been able to reproduce it only on 20.02, not on 20.11.
> We've made some changes to how those messages are handled on 20.11, and it
> seems the problem was (collaterally) solved there, but I need to
> investigate this a bit more to be sure.

I can confirm that the issue is NOT solved on 20.11. The issue is in the communication from the nodes of the job to the main node, and this improvement on 20.11 was preventing my reproducer from hitting the issue, because under certain conditions we avoid that communication between nodes and send the data directly to the controller:

- https://github.com/SchedMD/slurm/commit/30af622a52a9dfc077381223c381a6fb409d8eaa

But that improvement does not fix the issue, it just avoids it in some cases. By changing my reproducer a bit, I've been able to reproduce the issue on 20.11 as well. We'll work on a proper fix for it.

Regards,
Albert
Hi Albert,

> Maybe this job was started before you did the reconfigure?

I ran the reconfigure around 03/19 21:20.

[root@ohtaka1 ~]# sacct --allocations -j 222362,222423 -o j,start,end
       JobID               Start                 End
------------ ------------------- -------------------
222362       2021-03-20T01:05:09 2021-03-20T01:39:50
222423       2021-03-20T00:04:16 2021-03-20T15:09:11

As the sacct output shows, these jobs started after the reconfigure. Does it take some time for the reconfiguration to take effect? If the issue recurs in the future, I will collect the logs again.

Regards,
Yuichi
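The check being made here can be sketched in a few lines of Python. The reconfigure time and the job start timestamps are taken from the comment; everything else (the row list, the comparison helper) is hypothetical, not a Slurm tool:

```python
# Sketch: did the listed jobs start after the "scontrol reconfigure"
# (run around 2021-03-19 21:20)? Timestamps are the sacct values quoted
# in this comment; the parsing is illustrative only.
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"
reconfig_time = datetime.strptime("2021-03-19T21:20:00", FMT)

sacct_rows = [("222362", "2021-03-20T01:05:09"),
              ("222423", "2021-03-20T00:04:16")]

started_after = {jobid: datetime.strptime(start, FMT) > reconfig_time
                 for jobid, start in sacct_rows}
print(started_after)   # both True: the new MessageTimeout should apply
```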
Hi Yuichi,

> As the sacct output shows, these jobs started after the reconfigure. Does
> it take some time for the reconfiguration to take effect?

It shouldn't. I assume that the slurmds have access to the same slurm.conf with the new MessageTimeout option, right?

> If the issue recurs in the future, I will collect the logs again.

Great. I'm working on a fix, though, and I'm already quite confident about the root cause.

Thanks,
Albert
Hi Albert,

When I checked the jobs from 3/20, I found 140 jobs with CPU time exceeding 100%. The job list and the log for jobid 231303 are attached.

We are planning to do maintenance on 3/31. Can we stop debug mode at that time? The logs have grown too large.

Regards,
Yuichi
Created attachment 18719 [details] joblist
Created attachment 18720 [details] slurmd.log for jobid 231303
Hi Yuichi,

> When I checked the jobs from 3/20, I found 140 jobs with CPU time exceeding
> 100%. The job list and the log for jobid 231303 are attached.

I guess that you are still using MessageTimeout=50, right? I think that if you increase the MessageTimeout, the number of wrong jobs will decrease. But that's just a workaround; the actual fix is being reviewed as part of our QA.

> We are planning to do maintenance on 3/31. Can we stop debug mode at that
> time? The logs have grown too large.

Yes, and you can already do it:

1) Restore your original SlurmctldDebug and SlurmdDebug values in slurm.conf.
2) Run "scontrol reconfigure".
3) Restore your original DebugLevel and DebugFlags in slurmdbd.conf.
4) Restart slurmdbd.

During your maintenance you can also remove the debug patch.

I'll keep you updated about the actual fix of the issue,
Albert
Hi Albert,

> I guess that you are still using MessageTimeout=50, right?

Yes:

[root@c19u09n3 ~]# scontrol show config | grep -i timeout
BatchStartTimeout       = 10 sec
EioTimeout              = 60
GetEnvTimeout           = 2 sec
MessageTimeout          = 50 sec
PrologEpilogTimeout     = 65534
ResumeTimeout           = 60 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 300 sec
SuspendTimeout          = 30 sec
TCPTimeout              = 30 sec
UnkillableStepTimeout   = 60 sec

I will change back to normal mode at the maintenance. Thanks for telling me how to remove the debug settings.

Regards,
Yuichi
Hi Albert,

The settings were changed to the following during the regular maintenance on April 1:

JobAcctGatherType = jobacct_gather/linux

The issue continued to occur. However, the number of cases is about 30% lower with the linux plugin than with cgroup.

ACCOUNT  JOBID   PART     START               END                 ELAPS      CPU_EFFICIENCY
k006115  237414  F4cpu    2021/04/03/23:04:50 2021/04/04/06:27:57 26587sec   173.64%
k015013  237465  F4cpu    2021/04/04/03:33:53 2021/04/04/06:40:14 11181sec   123.53%
k035802  238853  F16cpu   2021/04/05/21:41:05 2021/04/06/00:09:30 8905sec    284.14%
k046303  239555  F4cpu    2021/04/06/16:44:53 2021/04/07/16:23:01 85088sec   174.06%
k046303  240824  F4cpu    2021/04/07/17:44:14 2021/04/08/17:23:28 85154sec   173.96%
k035802  241148  F144cpu  2021/04/08/03:08:06 2021/04/08/19:53:23 60317sec   126.72%
k016202  241536  F4cpu    2021/04/08/08:39:58 2021/04/08/08:58:20 1102sec    170.01%
k016202  241537  F4cpu    2021/04/08/08:39:59 2021/04/08/08:58:09 1090sec    169.61%
k046303  242254  F4cpu    2021/04/08/17:04:59 2021/04/08/18:04:41 3582sec    173.31%
k016202  242412  F4cpu    2021/04/08/18:51:20 2021/04/09/02:47:32 28572sec   173.45%
k016202  242413  F4cpu    2021/04/08/18:51:30 2021/04/09/00:00:26 18536sec   173.38%
k046303  242780  F4cpu    2021/04/09/16:31:06 2021/04/09/17:31:17 3611sec    171.86%
k046303  242336  F4cpu    2021/04/09/16:31:06 2021/04/10/16:10:40 85174sec   173.94%
k006115  242502  F4cpu    2021/04/09/16:31:06 2021/04/10/01:09:43 31117sec   173.77%
k027823  242976  L16cpu   2021/04/09/19:19:02 2021/04/11/15:36:38 159456sec  192.17%
k006115  243246  F4cpu    2021/04/10/16:31:36 2021/04/11/14:49:03 80247sec   173.92%
k046303  244085  F4cpu    2021/04/11/12:22:47 2021/04/11/13:22:41 3594sec    173.31%
k046303  244082  F4cpu    2021/04/11/12:22:47 2021/04/12/12:01:48 85141sec   174.00%
k006115  244362  F4cpu    2021/04/11/18:34:04 2021/04/12/07:30:34 46590sec   173.89%
k046303  244648  F4cpu    2021/04/12/15:42:16 2021/04/12/16:42:09 3593sec    173.00%
k046303  245219  F4cpu    2021/04/12/17:27:43 2021/04/12/18:27:38 3595sec    173.06%

Regards,
Yuichi
Hi Yuichi,

> The settings were changed to the following during the regular maintenance
> on April 1:
> JobAcctGatherType = jobacct_gather/linux
> The issue continued to occur.

Thanks for the information. The fact that it also happens with jobacct_gather/linux is consistent with the root cause that we've found.

> However, the number of cases is about 30% lower with the linux plugin than
> with cgroup.

The root problem is a race condition that happens when the nodes are so busy with jobs that the Slurm daemons hit timeouts while trying to communicate with each other (that's your case; network issues could of course also lead to those timeouts). Like any race condition it has some random behavior, and unrelated changes in the code can make it appear or (seem to) disappear. The fact that it depends on how busy the nodes are makes it even more random. So this 30% reduction is not significant to me at this point. Hopefully the patch under internal QA will lead to 0%! ;-)

Regards,
Albert
Hi Albert,

I have a question from the customer who manages this supercomputer: after this issue is fixed, can we correct the wrong CPU time for previous jobs?

Regards,
Yuichi
Hi Yuichi,

> I have a question from the customer who manages this supercomputer: after
> this issue is fixed, can we correct the wrong CPU time for previous jobs?

Unfortunately there isn't a supported way to change the wrongly gathered and accounted data, especially because there is no way to know what data should have been gathered in the first place.

Anyway, as far as I've seen from your (extended) logs, you are facing this issue only on jobs that have a really high CPU usage (at least on one of the compute nodes). Assuming that values >100% are actually really close to 100% probably won't be too far from what really happened.

Regards,
Albert
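Albert's suggestion, treating values over 100% as "really close to 100%", can be sketched as a clamp on the accounted CPU seconds. This is a reporting heuristic only, not a supported correction of the stored accounting data, and the helper name is made up:

```python
# Heuristic sketch of the suggestion above: when summarizing jobs hit by
# this bug, clamp accounted CPU time to the physical maximum
# (Elapsed * allocated CPUs). This only adjusts a report; it does not
# recover the true usage or modify the accounting database.

def clamp_total_cpu(total_cpu_sec, elapsed_sec, ncpus):
    """Cap TotalCPU at 100% of the job's allocated core-walltime."""
    return min(total_cpu_sec, elapsed_sec * ncpus)

print(clamp_total_cpu(888_000, 1000, 512))   # 512000: capped at 100%
print(clamp_total_cpu(400_000, 1000, 512))   # 400000: unaffected job kept
```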
Hi Yuichi,

I'm glad to let you know that this issue has been fixed, and the fix will be released as part of the next minor release, 20.11.7. More details in the commit:

- https://github.com/SchedMD/slurm/commit/c3325f6402d103929e5862a2d4e9ebc7e7f27678

Thanks a lot for your help debugging the issue,
Albert
Hi Albert,

Thank you for the information about the bug fix release. Do you have the bug fix based on 20.02? This system is running 20.02.6.

Regards,
Yuichi
Hi Yuichi,

> Do you have the bug fix based on 20.02?

The bug fix will be released only on 20.11.7 and newer. Once there is a new major release, bug-fixing support on the older one is focused on security or very disruptive issues.

Regards,
Albert
Hi Albert,

We will update our systems to 20.11.7 at the next maintenance (2021.06.11). Should we return the following parameters to their default values?

MessageTimeout=50 --> 10
JobAcctGatherType=jobacct_gather/linux --> jobacct_gather/cgroup

Also, in bug 10735 it was recommended to modify the values of the following parameters in (SlurmVersion)/src/common/pack.c. Should we apply the same modification in 20.11.7 as well?

-#define MAX_ARRAY_LEN_SMALL 10000
-#define MAX_ARRAY_LEN_MEDIUM 1000000
-#define MAX_ARRAY_LEN_LARGE 100000000
+#define MAX_ARRAY_LEN_SMALL 100000
+#define MAX_ARRAY_LEN_MEDIUM 10000000
+#define MAX_ARRAY_LEN_LARGE 1000000000

Regards,
Yuichi
Hi Yuichi,

> We will update our systems to 20.11.7 at the next maintenance (2021.06.11).

Good.

> Should we return the following parameters to their default values?
>
> MessageTimeout=50 --> 10
> JobAcctGatherType=jobacct_gather/linux --> jobacct_gather/cgroup

Yes, please. Those changes were only part of the debug/diagnose/workaround steps for this bug; you don't need them anymore.

> Also, in bug 10735 it was recommended to modify the values of the following
> parameters in (SlurmVersion)/src/common/pack.c. Should we apply the same
> modification in 20.11.7 as well?
>
> -#define MAX_ARRAY_LEN_SMALL 10000
> -#define MAX_ARRAY_LEN_MEDIUM 1000000
> -#define MAX_ARRAY_LEN_LARGE 100000000
> +#define MAX_ARRAY_LEN_SMALL 100000
> +#define MAX_ARRAY_LEN_MEDIUM 10000000
> +#define MAX_ARRAY_LEN_LARGE 1000000000

From your bug 10735 comment 23 I assume that the patch is working well as a workaround for you now, so I would recommend that you keep it a bit longer. We have some internal work in progress related to that issue, but we are still evaluating the possible solutions, so we do not yet have the final fix, and most probably the issue would reappear on your system if you removed the patch.

Regards,
Albert
Hi Albert,

After the version upgrade, I monitored the system for one month, and no jobs with CPU time exceeding elapsed time occurred. I reported to the customer that the issue is fixed. Thank you for your support.

Regards,
Yuichi
Hi Yuichi,

Thanks for sharing it. I'm glad that it works as expected.

Regards,
Albert