Ticket 6315 - Sacct reports incorrect usercpu value for job array tasks
Summary: Sacct reports incorrect usercpu value for job array tasks
Status: RESOLVED DUPLICATE of ticket 6332
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.4
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2019-01-04 08:34 MST by Christopher Coffey
Modified: 2019-01-18 08:57 MST

Site: -Other-


Attachments
slurm conf file (5.07 KB, text/plain)
2019-01-04 08:34 MST, Christopher Coffey

Description Christopher Coffey 2019-01-04 08:34:53 MST
Created attachment 8843 [details]
slurm conf file

Hello,

We do not have a support contract, but I thought I'd file this bug anyway. Since upgrading from the 17.11 series to 18.08 (18.08.4, to be exact), we have noticed that the UserCPU value reported by sacct is wrong for job array tasks.

This is repeatable by running a multicore job array like so:

------------
#!/bin/bash
#SBATCH --job-name=array_test                  
#SBATCH --workdir=/scratch/cbn35/bigdata          
#SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
#SBATCH --time=20:00  
#SBATCH --array=1-5
#SBATCH -c2
       
srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s  # 2 CPU workers + 1 vm worker (500 MB) for 65 s
-------------

This results in the following stats from sacct:
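(The exact query below is my reconstruction; the --format list simply matches the columns shown:)

    sacct -j 15730924 --format=JobID,ReqCPUS,UserCPU,Timelimit,Elapsed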

           JobID  ReqCPUS    UserCPU  Timelimit    Elapsed
    ------------ -------- ---------- ---------- ----------
    15730924_5          2   02:30:14   00:20:00   00:01:08
    15730924_5.+        2  00:00.004              00:01:08
    15730924_5.+        2   00:00:00              00:01:09
    15730924_5.0        2   02:30:14              00:01:05
    15730924_1          2   02:30:48   00:20:00   00:01:08
    15730924_1.+        2  00:00.013              00:01:08
    15730924_1.+        2   00:00:00              00:01:09
    15730924_1.0        2   02:30:48              00:01:05
    15730924_2          2   02:15:52   00:20:00   00:01:07
    15730924_2.+        2  00:00.007              00:01:07
    15730924_2.+        2   00:00:00              00:01:07
    15730924_2.0        2   02:15:52              00:01:06
    15730924_3          2   02:30:20   00:20:00   00:01:08
    15730924_3.+        2  00:00.010              00:01:08
    15730924_3.+        2   00:00:00              00:01:09
    15730924_3.0        2   02:30:20              00:01:05
    15730924_4          2   02:30:26   00:20:00   00:01:08
    15730924_4.+        2  00:00.006              00:01:08
    15730924_4.+        2   00:00:00              00:01:09
    15730924_4.0        2   02:30:25              00:01:05

Note that UserCPU should be roughly 2x the Elapsed value, since both cores run at ~100% for the elapsed time (e.g. for task 5, 2 x 00:01:08 = 00:02:16). Yet the reported UserCPU is off by orders of magnitude.
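As a rough sanity check, the bad values can be flagged mechanically by comparing each step's UserCPU against ReqCPUS x Elapsed. This is only a sketch I threw together (the field list, time parsing, and 10% slack are my own choices):

------------
#!/bin/bash
# Flag steps whose UserCPU exceeds ReqCPUS * Elapsed (plus 10% slack).
# Assumes sacct's default [DD-]HH:MM:SS[.fff] / MM:SS.fff time strings.
sacct -j 15730924 --parsable2 --noheader \
      --format=JobID,ReqCPUS,UserCPU,Elapsed |
awk -F'|' '
  # Convert [DD-]HH:MM:SS[.fff] (or MM:SS.fff) to seconds.
  function secs(t,   d, a, n, i, s) {
    d = 0
    if (t ~ /-/) { d = t; sub(/-.*/, "", d); sub(/^[^-]*-/, "", t) }
    n = split(t, a, ":")
    s = 0
    for (i = 1; i <= n; i++) s = s * 60 + a[i]
    return d * 86400 + s
  }
  {
    cap = $2 * secs($4)                # ReqCPUS * Elapsed
    if (secs($3) > cap * 1.1)
      printf "%s: UserCPU %.0fs > %.0fs\n", $1, secs($3), cap
  }'
------------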

If you run seff on this job, you can see the UserCPU issue easily:

 seff 15730924
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
Job ID: 15730924
Array Job ID: 15730924_5
Cluster: monsoon
User/Group: cbn35/clusterstu
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 03:19:15
CPU Efficiency: 8790.44% of 00:02:16 core-walltime
Job Wall-clock time: 00:01:08
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)

Note above that seff reports a CPU efficiency of 8790%, a value that should of course be at most 100%. The arithmetic follows directly from the bad accounting data: the CPU Utilized of 03:19:15 is 11955 s, the core-walltime of 00:02:16 (2 cores x 00:01:08) is 136 s, and 11955 / 136 = 87.90, i.e. the reported 8790.44%.

Something else possibly worth mentioning: we are noticing the following in some Slurm job array logs:

slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
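
In case it helps with triage, the accounting-gather and proctrack plugins in play can be confirmed with a standard scontrol query like:

    scontrol show config | grep -iE 'JobAcctGather|ProctrackType|TaskPlugin'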

This bug is annoying, as we have been trying to train our users to be efficient with their resources. We will look into tracking the bug down, but I thought I'd mention it here.

I've attached our slurm.conf.

Thanks,
Chris
Comment 1 Christopher Coffey 2019-01-04 15:47:10 MST
Correction: this bug occurs in any type of job, not just array jobs.
Comment 2 Michael Hinton 2019-01-18 08:57:10 MST
Hi Chris,

This was recently fixed. Check the linked bug for more information.

Thanks,
Michael

*** This ticket has been marked as a duplicate of ticket 6332 ***