Ticket 9555 - Slurm sometimes reports AveCPU < MinCPU
Summary: Slurm sometimes reports AveCPU < MinCPU
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.3
Hardware: Linux Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-08-10 23:33 MDT by Tony Racho
Modified: 2021-01-11 08:56 MST

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: NIWA/WELLINGTON
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name:
CLE Version:
Version Fixed: 20.11.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (6.67 KB, text/plain)
2020-08-12 18:19 MDT, Tony Racho
Details
slurmdbd.conf (652 bytes, text/plain)
2020-08-12 18:19 MDT, Tony Racho
Details
slurm controller log for 05-06 Aug 2020 (11.03 MB, text/plain)
2020-08-12 18:19 MDT, Tony Racho
Details
sacct output (1.13 MB, text/plain)
2020-08-18 17:39 MDT, Tony Racho
Details
Sacct Output 2 (10.90 MB, text/plain)
2020-08-19 15:00 MDT, Tony Racho
Details

Description Tony Racho 2020-08-10 23:33:01 MDT
To check that jobs are using most of the CPUs and memory they request, we are examining usage data derived from the sacct field TRESUsageInTot. However, we are seeing something which doesn't make sense to us and calls into question the usefulness of the CPU times recorded in that field (and in TRESUsageInAve).

When there is only one task in a job step we would expect the minimums, maximums, averages and totals of its TRES usage numbers to all be the same. This is apparently true for most of the TRES, but not for CPU.

The exact same numbers appear in the shorter MinCPU and AveCPU fields, so I will describe the problem in terms of those fields.

For something like 5% of the one-task Slurm job steps on Mahuika, we see MinCPU > AveCPU, e.g.:

$ sacct -j 14092980_8.batch --format elapsed,AveCPU,MinCPU,TotalCPU,AllocCPU%5,NTasks%5,State
   Elapsed     AveCPU     MinCPU   TotalCPU Alloc NTask      State 
---------- ---------- ---------- ---------- ----- ----- ---------- 
  01:16:09   00:08:31   06:47:39   07:54:05    16     1  COMPLETED 

Shouldn't AveCPU, MinCPU and TotalCPU all be the same there?
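
For reference, here is a minimal sketch of the kind of check we run (illustrative only: it uses TotalCPU rather than parsing TRESUsageInTot, and the job ID defaults to the example above):

#!/bin/bash
# Rough CPU-efficiency check: compare CPU time actually used against
# the CPU time that was available (Elapsed x AllocCPUS).
jobid=${1:-14092980_8.batch}

# Convert Slurm time strings ([DD-][HH:]MM:SS[.mmm]) to whole seconds.
to_sec() {
    local t=${1%%.*} d=0 sec=0 p parts
    [[ $t == *-* ]] && { d=${t%%-*}; t=${t#*-}; }
    IFS=: read -ra parts <<< "$t"
    for p in "${parts[@]}"; do sec=$(( sec * 60 + 10#$p )); done
    echo $(( d * 86400 + sec ))
}

sacct -j "$jobid" --noheader --parsable2 \
      --format=JobID,Elapsed,AllocCPUS,TotalCPU |
while IFS='|' read -r id elapsed alloc total; do
    wall=$(to_sec "$elapsed")
    used=$(to_sec "$total")
    (( wall > 0 && alloc > 0 )) || continue
    echo "$id: CPU efficiency ~ $(( 100 * used / (wall * alloc) ))%"
done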
Comment 1 Scott Hilton 2020-08-12 15:44:42 MDT
Tony,

I am going to need more information to try reproducing the issue. Could you send me your slurm.conf, slurmdbd.conf, and logs from around the time that these jobs ran?

Do you see any patterns to these jobs? Are they coming from a specific user or type of request?

Does this error always accompany the issue?
>Unable to find source-code formatter for language: shell. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml

Any other info that could help me to reproduce the issue?

Thanks,

Scott
Comment 2 Tony Racho 2020-08-12 18:06:00 MDT
Please disregard this:

Unable to find source-code formatter for language: shell. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml

This is just a JIRA formatting issue message.
Comment 3 Tony Racho 2020-08-12 18:18:17 MDT
Uploaded:

slurm.conf
slurmdbd.conf
slurm controller logs for 05-06 Aug 2020 where this occurred.

$ TZ=UTC sacct -X -j 14092980_8 --format JobID,JobIDRaw,start,end
     JobID   JobIDRaw          Start            End 
---------- ---------- -------------- -------------- 
14092980_8   14093015   Aug 05 23:30   Aug 06 00:47
Comment 4 Tony Racho 2020-08-12 18:19:00 MDT
Created attachment 15425 [details]
slurm.conf
Comment 5 Tony Racho 2020-08-12 18:19:23 MDT
Created attachment 15426 [details]
slurmdbd.conf
Comment 6 Tony Racho 2020-08-12 18:19:49 MDT
Created attachment 15427 [details]
slurm controller log for 05-06 Aug 2020
Comment 7 Tony Racho 2020-08-17 15:59:04 MDT
Any update on this please?

Thanks.
Comment 8 Jason Booth 2020-08-17 16:02:24 MDT
Tony - Scott is out of the office today but will be back tomorrow. I will have his reply to you then.
Comment 9 Scott Hilton 2020-08-18 12:56:15 MDT
Tony, 

Could you also send the output to:

>sacct --starttime 2020-08-05T23:00:00  --endtime 2020-08-06T01:00:00 --format elapsed,AveCPU,MinCPU,TotalCPU,AllocCPU%5,NTasks%5,State,jobid

-Scott
Comment 10 Tony Racho 2020-08-18 17:39:30 MDT
Scott:

Attached. output_sacct.out

Cheers,
Tony
Comment 11 Tony Racho 2020-08-18 17:39:52 MDT
Created attachment 15500 [details]
sacct output
Comment 12 Scott Hilton 2020-08-19 10:33:40 MDT
Tony,

Actually, I think I need more info. Can you send me this?
>sacct --starttime 2020-08-05T23:00:00  --endtime 2020-08-06T00:30:00 --format ALL -p
-Scott
Comment 13 Tony Racho 2020-08-19 14:59:33 MDT
Scott:

Uploaded. sacct_output2.out

Tony
Comment 14 Tony Racho 2020-08-19 15:00:01 MDT
Created attachment 15518 [details]
Sacct Output 2
Comment 15 Tony Racho 2020-08-23 20:35:02 MDT
Scott:

Have you found anything?

Thanks,
Tony
Comment 17 Scott Hilton 2020-08-24 17:26:12 MDT
Tony, 

I'm working on it. If you know of any way to reproduce the issue, it would be quite helpful.

I believe that slurm is acting as expected in this instance but I am not sure.

I'll get back to you when I can give you a more complete response.

-Scott
Comment 19 Scott Hilton 2020-08-27 10:01:31 MDT
Tony, 

Good news: I was able to reproduce the issue and now understand what is going on here.

This is an issue with jobacct_gather/linux.

Each program in a batch script is counted separately, instead of contributing to a running total.

Example:
>#!/bin/bash
>stress -c 1 -t 9
>stress -c 1 -t 5
>exit
> 
>    AveCPU     MinCPU   TotalCPU    Elapsed        JobID   NTasks 
>---------- ---------- ---------- ---------- ------------ -------- 
>                       00:13.990   00:00:14 57                    
>  00:00:04   00:00:08  00:13.990   00:00:14 57.batch            1 

What is happening here is that /proc/<pid>/stat does not accumulate the numbers the way Slurm expects. It only reflects each program instance on its own, so when the first stress finishes, the counters effectively restart from 0 for the next program.
Because of internal Slurm logic, AveCPU (or TRESUsageInTot) takes its number from the last program run in the script, and MinCPU takes its number from the program with the most CPU time.
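
As a standalone sketch (not Slurm code), you can see that each process gets its own counters in /proc/<pid>/stat, and that a second process starts from zero again:
>#!/bin/bash
># Standalone illustration: each process has its own utime/stime counters
># in /proc/<pid>/stat, so a gatherer that only sees the currently running
># process starts over at 0 when a new program begins.
>yes > /dev/null &          # first CPU-burning process
>pid=$!
>sleep 3
># Fields 14 and 15 of /proc/<pid>/stat are utime and stime in clock ticks.
>awk '{print "first:  utime=" $14, "stime=" $15}' "/proc/$pid/stat"
>kill "$pid"
> 
>yes > /dev/null &          # second process: its counters begin at 0 again
>pid=$!
>sleep 1
>awk '{print "second: utime=" $14, "stime=" $15}' "/proc/$pid/stat"
>kill "$pid"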

However, it properly counts them with jobacct_gather/cgroup:
>   AveCPU     MinCPU   TotalCPU    Elapsed        JobID   NTasks 
>---------- ---------- ---------- ---------- ------------ -------- 
>                       00:13.989   00:00:15 59                    
>  00:00:13   00:00:13  00:13.989   00:00:15 59.batch            1

We may or may not fix this issue. For now, if you really want to work around it, I would suggest using jobacct_gather/cgroup, or simply requiring that each batch script wrap each separate process in srun so that there aren't multiple processes in a step.
i.e.
>#!/bin/bash
>srun stress -c 1 -t 9
>srun stress -c 1 -t 5
>exit

-Scott
Comment 22 Tony Racho 2020-08-27 15:50:41 MDT
Thanks Scott.

Hopefully this can be fixed in a future release or via a patch.
Comment 24 Tony Racho 2020-08-27 17:39:37 MDT
To add:

We are also in a bind about switching to jobacct_gather/cgroup, because of an issue we found earlier after some more testing: it seems JobAcctGatherType=jobacct_gather/cgroup does not gather accounting data from the extern job step (which exists to adopt any ssh sessions into the job). Switching slurm.conf back to the default of JobAcctGatherType=jobacct_gather/linux fixes that.

So we face an interesting tradeoff. 

We'll leave things as is for now and hope that the issue with jobacct_gather/linux will be fixed in a future release or via a patch.
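
For reference, the sort of check we used is roughly the following (job ID is a placeholder); with jobacct_gather/cgroup the extern step's usage fields came back empty, while with jobacct_gather/linux they were populated:

$ sacct -j <jobid> --format=JobID%20,JobName,TRESUsageInTot%80 | grep extern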
Comment 27 Scott Hilton 2020-08-31 10:48:30 MDT
Tony, 

It is unexpected that you would have trouble with JobAcctGatherType=jobacct_gather/cgroup not gathering accounting data from the extern job step. Could you explain more about the issue?

-Scott
Comment 28 Tony Racho 2020-08-31 20:51:04 MDT
Scott:

Will get back to you on that.

Tony
Comment 38 Scott Hilton 2020-09-18 16:15:15 MDT
Tony, 

We fixed the issue with jobacct_gather/linux. It should be released in 20.11.

If you want to take a look at it or use it now, the commit IDs are d46722fff35..d86339dc27ed1.
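
For example, from a clone of the Slurm source repository, something like this should list the relevant changes:
>git log --oneline d46722fff35..d86339dc27ed1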

It does sound like you have an issue with jobacct_gather/cgroup. If you would like help with that, please open a new ticket and reference this ticket in it.

I am going to go ahead and close this ticket; if you have questions related to the original issue or this fix, feel free to reopen it.

-Scott