# sacct -j 12524024 --format=elapsed,cputime,totalcpu,reqtres -p
Elapsed|CPUTime|TotalCPU|ReqTRES|
01:50:58|11:05:48|32-08:56:02|billing=6,cpu=6,mem=10G,node=1|
01:50:58|11:05:48|32-08:56:02||
01:50:58|11:05:48|00:00.002||

How can the TotalCPU time from sacct be ~64x the CPU time (elapsed x # CPUs)?

Most importantly, might this be affecting users' fairshare calculation when using fair_tree?
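For reference, the arithmetic behind the "~64x" above:

    CPUTime  = Elapsed x allocated CPUs = 01:50:58 x 6 = 39,948 CPU-seconds = 11:05:48
    TotalCPU = 32-08:56:02 = 32 x 86,400 s + 32,162 s = 2,796,962 s

so TotalCPU is roughly 70x the CPUTime by this calculation, in any case far more CPU than a 6-CPU allocation could physically have used in 01:50:58.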
Hi Jenny,

> How can the totalcpu time from sacct be ~64x the CPU time (elapsed x # cpus)

It looks like job gathering is probably not working correctly. Could you post the output of these commands:

$ sacct -j 12524024 -p --format=elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
$ scontrol show config | grep Gather
$ scontrol show config | grep ProctrackType

> Most importantly, might this be affecting users fairshare calculation when using fair_tree?

No. Fair-share is based on what the job asked for (job requested/consumed), not on what the job "physically" used (job gathered).

For example, say we have two users, Bob and Sue, with the same Share and currently no EffectiveUsage (no jobs run so far), so the same FairShare:

$ sshare -u bob,sue
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
development                 bob          1    0.333333           0      0.000000   0.600000
development                 sue          1    0.333333           0      0.000000   0.600000

They each run one job asking for the same TRES, 1 CPU for 1 minute (CPUTime), but their actual tasks are very different in terms of CPU consumption (TotalCPU):

bob$ srun sleep 60           # CPUTime should be 60s and TotalCPU should be close to 0s
sue$ srun stress -t 60 -c 1  # CPUTime should be 60s and TotalCPU should be close to 60s

We can see that sacct shows Sue using much more TotalCPU, but both using the same CPUTime:

$ sacct -u bob,sue --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
     User        JobID    Elapsed    CPUTime  SystemCPU    UserCPU   TotalCPU    ReqTRES
--------- ------------ ---------- ---------- ---------- ---------- ---------- ----------
      bob           45   00:01:01   00:01:01   00:00:00   00:00:00   00:00:00 billing=1+
             45.extern   00:01:01   00:01:01   00:00:00   00:00:00   00:00:00
                  45.0   00:01:01   00:01:01   00:00:00   00:00:00   00:00:00
      sue           46   00:01:01   00:01:01  00:00.380  00:57.904  00:58.284 billing=1+
             46.extern   00:01:01   00:01:01   00:00:00   00:00:00   00:00:00
                  46.0   00:01:01   00:01:01  00:00.380  00:57.904  00:58.284

But despite this difference in TotalCPU, if we check sshare again we can see that their EffectiveUsage is the same, and therefore their FairShare factors remain balanced:

$ sshare -u bob,sue
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
development                 bob          1    0.333333          60      0.085959   0.600000
development                 sue          1    0.333333          60      0.085959   0.600000

Finally, if for example Bob runs an extra job (sleep):

bob$ srun sleep 60

then Sue has a higher FairShare:

$ sshare -u bob,sue
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
development                 bob          1    0.333333         120      0.157375   0.400000
development                 sue          1    0.333333          60      0.079243   0.600000

The FairShare value from sshare is a relation between NormShares (the % of TRES the account is granted/promised) and EffectiveUsage (the % of TRES the account has consumed so far). Gathered info is not used.

Albert
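For completeness, a quick way to check this on a live cluster (a sketch: CPUTimeRAW and RawUsage are standard sacct/sshare fields, but RawUsage decays over time per PriorityDecayHalfLife and is based on the billing TRES, so expect only rough agreement):

# Sum the allocated CPU-seconds of Bob's recent jobs...
$ sacct -u bob -X -n -p -S 2019-03-01 -o cputimeraw | awk -F'|' '{s+=$1} END {print s}'

# ...and compare with his RawUsage, which should track this allocated time,
# not the gathered TotalCPU.
$ sshare -u bob -n -o user,rawusage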
# sacct -j 12524024 -p --format=elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
Elapsed|CPUTime|SystemCPU|UserCPU|TotalCPU|ReqTRES|
01:50:58|11:05:48|03:11:50|32-05:44:11|32-08:56:02|billing=6,cpu=6,mem=10G,node=1|
01:50:58|11:05:48|03:11:50|32-05:44:11|32-08:56:02||
01:50:58|11:05:48|00:00.001|00:00.001|00:00.002||

# scontrol show config | egrep "Gather|ProctrackType"
AcctGatherEnergyType       = acct_gather_energy/none
AcctGatherFilesystemType   = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq         = 0 sec
AcctGatherProfileType      = acct_gather_profile/none
JobAcctGatherFrequency     = task=15
JobAcctGatherType          = jobacct_gather/cgroup
JobAcctGatherParams        = (null)
ProctrackType              = proctrack/cgroup
Jenny,

I'm not certain, but it seems that the SystemCPU value is more reasonable than UserCPU. But maybe both are wrong.

Is this happening to all jobs or just some? Is this happening on all nodes or just some?

With your config, these values should be obtained by reading the unnormalized tick counts from /sys/fs/cgroup/cpuacct/slurm/uid_NN/job_NN/cpuacct.stat and dividing/normalizing them by the clock tick rate obtained by calling sysconf(_SC_CLK_TCK) (usually 100).

We can try to run some tests to inspect these values. You could run these kinds of commands and post the values of their respective cpuacct.stat:

$ srun -n 1 stress -c 1 -t 60
$ srun -n 2 stress -c 1 -t 60

And maybe also:

$ srun -n 1 sleep 60
$ srun -n 2 sleep 60

What are the values of CPUTime, SystemCPU, UserCPU and TotalCPU for these jobs? If they are fine, you could try running them on the same node where job 12524024 ran.

Also, you can build the following code and run it on the failing nodes:

#include <stdio.h>
#include <unistd.h>

int main()
{
    /* Print the kernel's clock tick rate used to normalize cpuacct values. */
    fprintf(stdout, "No. of clock ticks per sec : %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}
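For example, it can be compiled and run like this (a sketch; the file name clk_tck.c is just for illustration):

$ gcc -o clk_tck clk_tck.c
$ ./clk_tck
No. of clock ticks per sec : 100

With _SC_CLK_TCK = 100, a cpuacct.stat line such as "user 6000" corresponds to 6000 / 100 = 60 seconds of user CPU time.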
Jenny,

Could you also attach the cgroup.conf? In some scenarios, if you don't ConstrainCores, TotalCPU can legitimately be higher than CPUTime, since tasks are then free to run on more cores than the job was allocated. Are you constraining cores in cgroup.conf?

Regards,
Albert
Hi Jenny,

How is it going? Have you been able to try what we mentioned in comment 3?

Albert
Hi Jenny,

Just pinging to see if we can still help you here.
Hi Jenny,

I'm closing this bug as timed out, but please don't hesitate to reopen it if you still face the issue or want further info.

Albert
Reopening this case after having updated Slurm on the cluster to 18.08-6.2.

Every compute node gives the same reply to the clock-tick question:

----------------
b[1001-1027],c[0301-0320,0401-0410,0501-0540,0801-0840,0901-0940,1101-1140],g[0301-0310,0312-0316,0601-0605],m[1001-1002,1004-1006],off[01-04],s[1201-1208],t[0601-0605]
----------------
No. of clock ticks per sec : 100

And every node has the same cgroup.conf:

----------------
b[1001-1027],c[0301-0320,0401-0410,0501-0540,0801-0840,0901-0940,1101-1140],g[0301-0310,0312-0316,0601-0605],m1004,off[01-04],t[0601-0605]
----------------
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
ConstrainRAMSpace=yes
CgroupReleaseAgentDir=/etc/slurm/cgroup
ConstrainCores=yes
TaskAffinity=yes
CgroupAutomount=yes

$ srun -n 1 stress -c 1 -t 60
srun: job 16781193 queued and waiting for resources
srun: job 16781193 has been allocated resources
stress: info: [21204] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [21204] successful run completed in 60s

$ srun -n 2 stress -c 1 -t 60
srun: job 16781255 queued and waiting for resources
srun: job 16781255 has been allocated resources
stress: info: [176692] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [176693] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [176692] successful run completed in 60s
stress: info: [176693] successful run completed in 60s
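(For reference, grouped listings like the ones above can be produced with a gathered remote-execution tool; a sketch, assuming ClusterShell's clush is available and the compiled test binary from comment 3, called clk_tck here for illustration, is on a shared path:

$ clush -ab /shared/path/clk_tck

pdsh piped through dshbak -c produces a similar consolidated output.)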
Surveying just 1-CPU jobs over the last day, I am not seeing any jobs that do not show this behavior.
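(For reference, a sketch of the kind of survey used here; the fields are standard sacct options, and the awk filter keeps 1-CPU allocations from the header-less -n output:

$ sacct -aXnp -S $(date -d yesterday +%F) -o jobid,alloccpus,elapsed,cputime,totalcpu | awk -F'|' '$2 == 1'

Comparing the TotalCPU column against CPUTime on each line makes the inflated values easy to spot.)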
Hi Jenny,

> Reopening this case after having updated slurm on the cluster to 18.08-6.2
> Listing for every compute node is the same reply for the clock-tick question
> ----------------
> b[1001-1027],c[0301-0320,0401-0410,0501-0540,0801-0840,0901-0940,1101-1140],g[0301-0310,0312-0316,0601-0605],m[1001-1002,1004-1006],off[01-04],s[1201-1208],t[0601-0605]
> ----------------
> No. of clock ticks per sec : 100

This is as expected. Ok.

> # Slurm cgroup support configuration file
> ConstrainRAMSpace=yes
> CgroupReleaseAgentDir=/etc/slurm/cgroup
> ConstrainCores=yes
> TaskAffinity=yes
> CgroupAutomount=yes

About CgroupReleaseAgentDir: please note that it was removed/deprecated/ignored in 17.11. You can safely remove it, but it's not causing any problem.

About TaskAffinity=yes, this is the related info from the slurm.conf man page:

  It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.

Jenny, if you are already using task/affinity and you also have TaskAffinity=yes here, this could lead to problems. Please double-check it (a sketch of the recommended combination follows at the end of this comment). Anyway, although I'm not totally certain, this shouldn't lead to the issue reported here.

> $ srun -n 1 stress -c 1 -t 60
> stress: info: [21204] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
> stress: info: [21204] successful run completed in 60s
>
> $ srun -n 2 stress -c 1 -t 60
> stress: info: [176692] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
> stress: info: [176693] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
> stress: info: [176692] successful run completed in 60s
> stress: info: [176693] successful run completed in 60s

Ok! Now, what's the value of:

$ sacct -j 16781193,16781255 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres

Please note that version 18.08.6 has a regression (bug 6697), so you will probably need to specify the start time like this:

$ sacct -j 16781193,16781255 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres --starttime 2019-03-01

> Surveying just 1 cpu jobs over the last day I am not seeing any jobs that are not showing this behavior.

If I'm understanding correctly, you mean that the above commands are reporting TotalCPU >> CPUTime? In that case, could you please replicate the commands and, while they are running, check the values of /sys/fs/cgroup/cpuacct/slurm/uid_NN/job_NN/cpuacct.stat on the node where they run?
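For reference, the recommended combination from that man-page excerpt would look like this (a sketch of config fragments, not your current settings):

# slurm.conf
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf
TaskAffinity=no
ConstrainCores=yes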
$ sacct -j 16781193,16781255 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres --starttime 2019-03-01
User|JobID|Elapsed|CPUTime|SystemCPU|UserCPU|TotalCPU|ReqTRES|
jennyw|16781193|00:01:01|00:01:01|00:00.003|01:39:55|01:39:55|billing=1,cpu=1,mem=400M,node=1|
|16781193.extern|00:01:01|00:01:01|00:00:00|00:00.001|00:00.001||
|16781193.0|00:01:00|00:01:00|00:00.003|01:39:55|01:39:55||
jennyw|16781255|00:01:01|00:02:02|00:00.005|03:19:46|03:19:46|billing=2,cpu=2,mem=800M,node=1|
|16781255.extern|00:01:01|00:02:02|00:00:00|00:00:00|00:00:00||
|16781255.0|00:01:00|00:02:00|00:00.004|03:19:46|03:19:46||
Hi Jenny,

From these two commands:

> $ srun -n 1 stress -c 1 -t 60
> $ srun -n 2 stress -c 1 -t 60

the UserCPU values obtained are not correct:

> User   | JobID           | Elapsed  | CPUTime  | SystemCPU | UserCPU   | TotalCPU  | ReqTRES                         |
> jennyw | 16781193        | 00:01:01 | 00:01:01 | 00:00.003 | 01:39:55  | 01:39:55  | billing=1,cpu=1,mem=400M,node=1 |
>        | 16781193.extern | 00:01:01 | 00:01:01 | 00:00:00  | 00:00.001 | 00:00.001 |                                 |
>        | 16781193.0      | 00:01:00 | 00:01:00 | 00:00.003 | 01:39:55  | 01:39:55  |                                 |
> jennyw | 16781255        | 00:01:01 | 00:02:02 | 00:00.005 | 03:19:46  | 03:19:46  | billing=2,cpu=2,mem=800M,node=1 |
>        | 16781255.extern | 00:01:01 | 00:02:02 | 00:00:00  | 00:00:00  | 00:00:00  |                                 |
>        | 16781255.0      | 00:01:00 | 00:02:00 | 00:00.004 | 03:19:46  | 03:19:46  |                                 |

It seems that job accounting gathering is not working properly. In comment 2 you said that you are using jobacct_gather/cgroup, so let's try to see what cgroup is actually reporting.

To do it, first allocate a shell on a node:

$ salloc -c 1 -n 1 srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL

Then run 10 seconds of a stress task/step in the background and quickly check cgroups like this:

$ srun -o /dev/null -n1 stress -c 1 --timeout 10 &
$ for ii in {1..10}; do cat /sys/fs/cgroup/cpuacct/slurm_c1/uid_$UID/job_$SLURM_JOB_ID/cpuacct.stat; sleep 1; done
user 502
system 3
user 602
system 3
user 702
system 3
user 802
system 3
user 902
system 3
user 1002
system 4
[1]+  Done  srun -o /dev/null -n1 stress -c 1 --timeout 10
user 1003
system 4
user 1003
system 4
user 1003
system 4
user 1003
system 4

Note that in my case it is working properly: as I also have _SC_CLK_TCK = 100 and I'm querying cgroups once per second, the user CPU value reported increases by 100 each time, until the srun stops. After that it is not increased any more.

What happens in your case?

Please also try with 2 CPUs and post the output. Note that in that case the expected increment is 200 ticks per second:

$ salloc -c 1 -n 2 srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL
$ srun -o /dev/null -n2 stress -c 1 --timeout 10 &
$ for ii in {1..10}; do cat /sys/fs/cgroup/cpuacct/slurm_c1/uid_$UID/job_$SLURM_JOB_ID/cpuacct.stat; sleep 1; done
user 340
system 2
user 539
system 2
user 739
system 2
user 939
system 3
user 1139
system 3
user 1339
system 3
user 1540
system 3
user 1740
system 3
user 1939
system 3
[1]+  Done  srun -o /dev/null -n2 stress -c 1 --timeout 10
user 2000
system 4
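As a convenience, here is a small variant of that loop that prints the per-second tick increment directly (a sketch: the hierarchy name may be slurm or slurm_<clustername> depending on the setup, and the first printed delta just reflects whatever had accumulated before the loop started):

$ STAT=/sys/fs/cgroup/cpuacct/slurm/uid_$UID/job_$SLURM_JOB_ID/cpuacct.stat
$ prev=0
$ for ii in {1..10}; do
    # read the cumulative user ticks and print the change since the last sample
    cur=$(awk '/^user/ {print $2}' "$STAT")
    echo "user=$cur (+$((cur - prev)))"
    prev=$cur
    sleep 1
  done

With _SC_CLK_TCK = 100, a healthy 1-CPU stress should print increments of about 100 per line (about 200 for the 2-task case).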
I forgot to mention this:

> Jenny, if you are already using task/affinity and you also have
> TaskAffinity=yes here, this could lead to problems. Please double check it.
> Anyway, although I'm not totally certain, this shouldn't lead to the issue
> reported here.

Could you confirm that you are not mixing the task/affinity plugin in slurm.conf with TaskAffinity=yes in cgroup.conf? In fact, could you attach your slurm.conf and cgroup.conf?

Also, the output of "uname -a" and "ls -lh /sys/fs/cgroup" on a failing node would help us replicate the issue.
Created attachment 9761 [details] cgroup.conf
Created attachment 9762 [details] slurm.conf
Hi Jenny,

Looking further, this bug looks like a duplicate of bug 6332, but that was fixed in 18.08.5. So either the fix didn't solve it in your case, or we have another very similar bug, or maybe we are actually facing the same bug?

Could you double-check that your currently installed lib/slurm/jobacct_gather_cgroup.so has actually been recompiled for 18.08.6? And propagated/shared to the nodes that now run slurmd 18.08.6? Sorry, just to be sure.

Please note that even with the fix, for jobs run with versions 18.08.0 through 18.08.4 the values in the database will still be wrong. But for new jobs they shouldn't be.

Albert
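For example, one quick way to verify this on a node (a sketch; the path assumes a standard RPM install under /usr/lib64/slurm):

# Which package owns the plugin, and does its timestamp match the upgrade?
$ rpm -qf /usr/lib64/slurm/jobacct_gather_cgroup.so
$ ls -lh /usr/lib64/slurm/jobacct_gather_cgroup.so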
$ slurmd -V
slurm 18.08.6-2

$ srun -o /dev/null -n1 stress -c 1 --timeout 10

# in another session:
# for ii in {1..40}
> do
>   cat /sys/fs/cgroup/cpuacct/slurm/uid_28534/job_17230410/cpuacct.stat
>   sleep 1
> done
user 31
system 25
user 31
system 25
user 31
system 25
user 31
system 25
user 31
system 25
user 32
system 26
user 49
system 27
user 149
system 27
user 250
system 27
user 350
system 27
user 450
system 27
user 550
system 27
user 651
system 27
user 751
system 27
user 851
system 27
user 951
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
user 1032
system 27
A run more similar to yours, with the commands in the same session:

$ date; srun -o /dev/null -n1 stress -c 1 --timeout 10 & for ii in {1..10}; do cat /sys/fs/cgroup/cpuacct/slurm/uid_28534/job_17230448/cpuacct.stat ; sleep 1; date; done
Mon Apr  1 13:42:07 EDT 2019
[1] 157244
user 2023
system 26
Mon Apr  1 13:42:08 EDT 2019
user 2115
system 27
Mon Apr  1 13:42:09 EDT 2019
user 2215
system 27
Mon Apr  1 13:42:10 EDT 2019
user 2315
system 28
Mon Apr  1 13:42:11 EDT 2019
user 2415
system 28
Mon Apr  1 13:42:12 EDT 2019
user 2515
system 29
Mon Apr  1 13:42:13 EDT 2019
user 2615
system 29
Mon Apr  1 13:42:14 EDT 2019
user 2716
system 30
Mon Apr  1 13:42:15 EDT 2019
user 2817
system 30
Mon Apr  1 13:42:16 EDT 2019
user 2917
system 30
Mon Apr  1 13:42:17 EDT 2019

[jennyw@c0501 ~]$ jobs
[1]+  Done  srun -o /dev/null -n1 stress -c 1 --timeout 10
(In reply to Albert Gil from comment #16)
> Could you double-check that your currently installed
> lib/slurm/jobacct_gather_cgroup.so has actually been recompiled for 18.08.6?
> And propagated/shared to the nodes that now run slurmd 18.08.6? Sorry,
> just to be sure.

The lib was installed with the rpm update. This is checked against the node on which the test ran:

# ls -lth /usr/lib64/slurm/jobacct_gather_cgroup.so
-rwxr-xr-x 1 root root 339K Mar 11 09:36 /usr/lib64/slurm/jobacct_gather_cgroup.so

# rpm -qi slurm
Name         : slurm
Version      : 18.08.6
Release      : 2.el7
Architecture : x86_64
Install Date : Thu 21 Mar 2019 05:22:23 PM EDT
Group        : System Environment/Base
Size         : 59455659
License      : GPLv2+
Signature    : (none)
Source RPM   : slurm-18.08.6-2.el7.src.rpm
Build Date   : Mon 11 Mar 2019 09:38:47 AM EDT
Build Host   : longleaf-test02.its.unc.edu
Relocations  : (not relocatable)
URL          : https://slurm.schedmd.com/
Summary      : Slurm Workload Manager
Description :
Slurm is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system for Linux clusters. Components
include machine status, partition management, job management, scheduling
and accounting modules.
$ sacct -j 17230448 -S 4/1 -p --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres
User|JobID|Elapsed|CPUTime|SystemCPU|UserCPU|TotalCPU|ReqTRES|
jennyw|17230448|00:11:01|00:11:01|00:00.322|00:30.200|00:30.523|billing=1,cpu=1,mem=4000M,node=1|
|17230448.extern|00:11:01|00:11:01|00:00.001|00:00:00|00:00.001||
|17230448.0|00:10:58|00:10:58|00:00.309|00:00.324|00:00.633||
|17230448.1|00:00:11|00:00:11|00:00.002|00:09.998|00:00:10||
|17230448.2|00:00:11|00:00:11|00:00.004|00:09.937|00:09.942||
|17230448.3|00:00:10|00:00:10|00:00.004|00:09.940|00:09.945||
Hi Jenny,

Your data is showing some kind of bug in Slurm: the info from the kernel cgroups looks fine, but what sacct reports does not.

I'm not yet able to replicate it. Could you post the output of "uname -a" and "cat /etc/os-release"?

If I'm still not able to replicate it, I will provide you with a patch that adds extra logging around gathering, accounting and printing these values, to see where exactly they get corrupted in your cluster.
# cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.6"

# uname -a
Linux longleaf-sched.its.unc.edu 3.10.0-957.10.1.el7.x86_64 #1 SMP Thu Feb 7 07:12:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Hi Jenny,

I just want to let you know that I'm still trying to reproduce your issue, replicating your environment as closely as possible.

I'll let you know,
Albert
If it would help we could do a remote session so you could see it.
Hi Jenny,

Although I've replicated your exact environment (kernel, OS, Slurm version and config), I have not been able to replicate your issue.

I will attach a patch that adds verbose logging of these specific values to see what is going on in your cluster.

Just to double-check: could you confirm that you are still using Slurm 18.08.6-2?

Albert
We are using that Slurm version, yes.
Created attachment 10434 [details]
Add debug traces to follow UserCPU

Hi Jenny,

This small patch adds some extra debug traces to follow how the UserCPU is gathered, stored and printed. Please apply it to your environment and then run the commands we already used to quickly replicate the issue:

$ srun -n 1 stress -c 1 -t 60
$ srun -n 2 stress -c 1 -t 60

Then also run sacct with the extra -vv for the two jobs submitted previously (jobid_1 and jobid_2):

$ sacct -vvp -j $jobid_1,$jobid_2 --format=user,jobid,elapsed,cputime,systemcpu,usercpu,totalcpu,reqtres

Then please attach the slurmctld.log, the slurmdbd.log and the slurmd.log of the node where the jobs ran. You can remove the patch once those commands have run, but keeping it won't hurt. The patch is based on version 18.08.6-2.

Thanks,
Albert

PS: As mentioned in comment 16, from your data this bug really seems a duplicate of bug 6332. Your UserCPU seems 100 times bigger than it should be, and 100 is your _SC_CLK_TCK. That's exactly what we fixed in 18.08.5:
https://github.com/SchedMD/slurm/commit/8eb11cf75e3319802ac47d6c3fe165a863de4f59
Really strange... let's see what the extra debug traces tell us!
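To illustrate with the numbers from your earlier sacct output: step 16781193.0 ran stress for 60 s on 1 CPU, so UserCPU should be about 60 s, but sacct reported 01:39:55 = 5995 s, and 5995 / 60 ≈ 100. Likewise step 16781255.0 (two 60 s tasks, so ~120 s expected) reported 03:19:46 = 11986 s, and 11986 / 120 ≈ 100. Both are inflated by exactly the _SC_CLK_TCK factor.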
Hi Jenny,

Have you been able to replicate the issue with the extra-debug patch from comment 34?

Regards,
Albert
I am out of the office. If you have questions that require a reply before then, please email research@unc.edu.

Regards,
Virginia Williams
Systems Administrator
Research Computing
UNC Chapel Hill
Hi Jenny,

Are you back in the office? I hope that all went well.

Have you been able to replicate the issue with the extra-debug patch from comment 34?

Regards,
Albert
Hi Jenny,

I'm closing this bug as timed out. Please feel free to reopen it when you need to.

Regards,
Albert
Hi Jenny,

In case it helps, I just want to let you know that in bug 10723 we were able to reproduce a very similar issue to this one, using a debug patch very similar to the one added here in comment 34. With that we diagnosed the root cause of the issue and fixed it. The fix will be included in the next minor release, 20.11.7.

Hope it helps,
Albert