To check that jobs are using most of the CPUs and memory they request, we are examining usage data derived from the sacct field TRESUsageInTot. However, we are seeing something that doesn't make sense to us and calls into question the usefulness of the CPU times recorded in that field (and in TRESUsageInAve).

When there is only one task in a job step, we would expect the minimums, maximums, averages and totals of its TRES usage numbers to all be the same. This is apparently true for most of the TRES, but not for CPU. The exact same numbers appear in the shorter MinCPU and AveCPU fields, so I will describe the problem in terms of those fields. For something like 5% of the one-task Slurm job steps on Mahuika we see MinCPU > AveCPU, e.g.:

>$ sacct -j 14092980_8.batch --format elapsed,AveCPU,MinCPU,TotalCPU,AllocCPU%5,NTasks%5,State
>   Elapsed     AveCPU     MinCPU   TotalCPU Alloc NTask      State
>---------- ---------- ---------- ---------- ----- ----- ----------
>  01:16:09   00:08:31   06:47:39   07:54:05    16     1  COMPLETED

Shouldn't AveCPU, MinCPU and TotalCPU all be the same there?
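For context, the check we are scripting boils down to pulling the cpu entry out of TRESUsageInTot from parsable sacct output. A minimal sketch of that extraction (the record below is a made-up example for illustration, not real sacct output, and the JobID|TRESUsageInTot|State field layout is assumed):

```shell
# Hypothetical sacct -p style record (field layout assumed for illustration).
rec='14092980_8.batch|cpu=06:47:39,mem=1200M|COMPLETED'

# Pull out the TRES list (second pipe-delimited field),
# then the cpu= entry from the comma-separated list.
tres=$(printf '%s' "$rec" | cut -d'|' -f2)
cpu=$(printf '%s\n' "$tres" | tr ',' '\n' | sed -n 's/^cpu=//p')
echo "cpu time: $cpu"    # -> cpu time: 06:47:39
```

This is the number we then compare against AveCPU and MinCPU for one-task steps, expecting all three to agree.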
Tony,

I am going to need more information to try reproducing the issue. Could you send me your slurm.conf, slurmdbd.conf, and logs from around the time that these jobs ran? Do you see any patterns to these jobs: are they coming from a specific user or type of request? Does this error always accompany the issue?

>Unable to find source-code formatter for language: shell. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml

Any other info that could help me to reproduce the issue?

Thanks,
Scott
Please disregard this:

>Unable to find source-code formatter for language: shell. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml

That is just a JIRA formatting error message, not output related to the issue.
Uploaded:
- slurm.conf
- slurmdbd.conf
- slurm controller logs for 05-06 Aug 2020, where this occurred

>$ TZ=UTC sacct -X -j 14092980_8 --format JobID,JobIDRaw,start,end
>     JobID   JobIDRaw          Start            End
>---------- ---------- -------------- --------------
>14092980_8   14093015   Aug 05 23:30   Aug 06 00:47
Created attachment 15425 [details] slurm.conf
Created attachment 15426 [details] slurmdbd.conf
Created attachment 15427 [details] slurm controller log for 05-06 Aug 2020
Any update on this please? Thanks.
Tony - Scott is out of the office today but will be back in tomorrow. I will have his reply to you then.
Tony,

Could you also send the output of:

>sacct --starttime 2020-08-05T23:00:00 --endtime 2020-08-06T01:00:00 --format elapsed,AveCPU,MinCPU,TotalCPU,AllocCPU%5,NTasks%5,State,jobid

-Scott
Scott: Attached. output_sacct.out Cheers, Tony
Created attachment 15500 [details] sacct output
Tony,

Actually, I think I need more info. Can you send me this?

>sacct --starttime 2020-08-05T23:00:00 --endtime 2020-08-06T00:30:00 --format ALL -p

-Scott
Scott: Uploaded. sacct_output2.out Tony
Created attachment 15518 [details] Sacct Output 2
Scott: Have you found anything? Thanks, Tony
Tony,

I'm working on it. If you know any way to reproduce the issue, it would be quite helpful. I believe that slurm is acting as expected in this instance, but I am not sure. I'll get back to you when I can give you a more complete response.

-Scott
Tony,

Good news: I was able to reproduce the issue and consequently understand what is going on here. This is an issue with jobacct_gather/linux. Each program in a batch script is counted separately instead of being summed into an actual total. Example:

>#!/bin/bash
>stress -c 1 -t 9
>stress -c 1 -t 5
>exit

>    AveCPU     MinCPU   TotalCPU    Elapsed        JobID   NTasks
>---------- ---------- ---------- ---------- ------------ --------
>                       00:13.990   00:00:14           57
>  00:00:04   00:00:08  00:13.990   00:00:14     57.batch        1

What is happening here is that /proc/&lt;pid&gt;/stat is not tallying up the numbers as slurm expects it to; it only holds the CPU time of each program instance. So when the first stress finishes, the count restarts from 0. Because of internal slurm logic, AveCPU (or TRESUsageInTot) takes its number from the last program run in the script, and MinCPU takes its number from the longest-running program. However, jobacct_gather/cgroup counts them properly:

>    AveCPU     MinCPU   TotalCPU    Elapsed        JobID   NTasks
>---------- ---------- ---------- ---------- ------------ --------
>                       00:13.989   00:00:15           59
>  00:00:13   00:00:13  00:13.989   00:00:15     59.batch        1

We may or may not fix this issue. For now, if you really want to avoid it, I would suggest using jobacct_gather/cgroup, or simply requiring that each batch script wrap each separate process in srun, so that there aren't multiple processes in a step, i.e.:

>#!/bin/bash
>srun stress -c 1 -t 9
>srun stress -c 1 -t 5
>exit

-Scott
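The per-process counter reset described above can be seen without Slurm at all. A minimal sketch (the loop size is arbitrary, and reading utime as field 14 of /proc/&lt;pid&gt;/stat is an illustrative assumption about the stat layout on Linux):

```shell
#!/bin/bash
# Run two sequential CPU-burning children, as a batch script would.
# Each child reads its own utime (field 14 of /proc/<pid>/stat, in
# clock ticks) just before exiting. Both report a small, similar
# count: the kernel's counter starts from 0 for every new process,
# so there is no running total for a per-process poller to read.
for i in 1 2; do
  bash -c '
    n=0
    while [ $n -lt 200000 ]; do n=$((n+1)); done
    echo "child $0 utime ticks: $(cut -d" " -f14 /proc/$$/stat)"
  ' "$i"
done
```

Each iteration prints a fresh, low tick count, which is why a poller that only ever sees the current process's counter reports the last program's time rather than the step's total.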
Thanks, Scott. Hopefully this can be fixed in a future release or via a patch.
To add: we are also in a bind about switching to jobacct_gather/cgroup, as we had hit an issue with it before. After some more testing: it seems JobAcctGatherType=jobacct_gather/cgroup does not gather accounting data from the extern job step (which exists to adopt any ssh sessions into the job). Switching slurm.conf back to the default of JobAcctGatherType=jobacct_gather/linux fixes that. So we face an interesting tradeoff. We'll leave things as they are for now, in the hope that the jobacct_gather/linux issue will be fixed in a future release or a patch.
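The tradeoff above comes down to a single slurm.conf line; the comments summarise this thread's observations, not official guidance:

```
# slurm.conf -- the setting at issue in this ticket:
JobAcctGatherType=jobacct_gather/linux
# extern step gathered, but per-step CPU totals wrong when a batch
# step runs several processes in sequence (this ticket).

#JobAcctGatherType=jobacct_gather/cgroup
# per-step CPU totals correct, but in our testing the extern job
# step's accounting data was not gathered.
```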
Tony,

It is unexpected that you would see JobAcctGatherType=jobacct_gather/cgroup fail to gather accounting data from the extern job step. Could you explain more about that issue?

-Scott
Scott: Will get back to you on that. Tony
Tony,

We fixed the issue with jobacct_gather/linux. It should be released in 20.11. If you want to take a look at it or use it now, the commit ids are d46722fff35..d86339dc27ed1.

It does sound like you have an issue with jobacct_gather/cgroup. If you would like help with that, please open a new ticket and reference this ticket in it.

I am going to go ahead and close this ticket; if you have questions related to the original issue or this fix, feel free to reopen it.

-Scott