Created attachment 8843 [details] slurm conf file Hello, We do not have a support contract, but I thought I'd file this bug anyways. Since upgrading from 17.11 series to 18.08 (18.08.4 to be accurate) we notice that the usercpu value reported from sacct is wrong for job array tasks. This is repeatable by running a multicore job array job as so: ------------ #!/bin/bash #SBATCH --job-name=array_test #SBATCH --workdir=/scratch/cbn35/bigdata #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log #SBATCH --time=20:00 #SBATCH --array=1-5 #SBATCH -c2 srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s ------------- This results in the following stats from sacct: JobID ReqCPUS UserCPU Timelimit Elapsed ------------ -------- ---------- ---------- ---------- 15730924_5 2 02:30:14 00:20:00 00:01:08 15730924_5.+ 2 00:00.004 00:01:08 15730924_5.+ 2 00:00:00 00:01:09 15730924_5.0 2 02:30:14 00:01:05 15730924_1 2 02:30:48 00:20:00 00:01:08 15730924_1.+ 2 00:00.013 00:01:08 15730924_1.+ 2 00:00:00 00:01:09 15730924_1.0 2 02:30:48 00:01:05 15730924_2 2 02:15:52 00:20:00 00:01:07 15730924_2.+ 2 00:00.007 00:01:07 15730924_2.+ 2 00:00:00 00:01:07 15730924_2.0 2 02:15:52 00:01:06 15730924_3 2 02:30:20 00:20:00 00:01:08 15730924_3.+ 2 00:00.010 00:01:08 15730924_3.+ 2 00:00:00 00:01:09 15730924_3.0 2 02:30:20 00:01:05 15730924_4 2 02:30:26 00:20:00 00:01:08 15730924_4.+ 2 00:00.006 00:01:08 15730924_4.+ 2 00:00:00 00:01:09 15730924_4.0 2 02:30:25 00:01:05 Note that usercpu should be ~2 x the elapsed value due to having two cores running at 100% for the elapsed time. Yet the usercpu is off by orders of magnitude. If you run seff on this job you can see the usercpu issue easily: seff 15730924 Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624. Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624. Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624. Job ID: 15730924 Array Job ID: 15730924_5 Cluster: monsoon User/Group: cbn35/clusterstu State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 2 CPU Utilized: 03:19:15 CPU Efficiency: 8790.44% of 00:02:16 core-walltime Job Wall-clock time: 00:01:08 Memory Utilized: 0.00 MB (estimated maximum) Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core) Note above that seff reports efficiency for cpu was 8790%. This value should be something less than 100% of course. Possibly something else worth mentioning is that we are noticing the following in some slurm job array logs: slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen This bug is annoying as we have been trying to train our users to be efficient with their resources. We will look at tracking the bug down but thought I'd mention it here. I've attached our slurm.conf. Thanks, Chris
Correction: this bug occurs in any type of job, not just array jobs.
Hi Chris, This was recently fixed. Check the linked bug for more information. Thanks, Michael *** This ticket has been marked as a duplicate of ticket 6332 ***