Ticket 7617

Summary: slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
Product: Slurm Reporter: Steve Ford <fordste5>
Component: slurmstepd Assignee: Gavin D. Howard <gavin>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: aptivhpcsupport, cinek, marc.caubet, marshall, nate
Version: 19.05.1   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8656
https://bugs.schedmd.com/show_bug.cgi?id=8763
Site: MSU Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 20.02.3 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: lac-014 slurmd log
cgroup.conf

Description Steve Ford 2019-08-22 10:49:22 MDT
Created attachment 11323 [details]
lac-014 slurmd log

We are seeing some unusual errors on one of our nodes:

slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen

Any idea what could cause these?
Comment 3 Marshall Garey 2019-08-22 11:36:22 MDT
Hey Steve,

"This should never happen" bugs are always fun.

These errors mean that some of the accounting information from cgroups isn't being gathered by jobacct_gather, but it's not impacting the actual job in any way. From your slurmd log file, it's mostly the extern and batch steps, but a handful of cases are step 0 of some job.
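
If you want to see what did get recorded for an affected job, something like this will dump the per-step accounting once the job completes. A rough sketch (the job id is a placeholder, and it just shells out to sacct):

# Print the recorded per-step accounting for one job.
# Assumes sacct is on PATH and accounting storage is enabled.
import subprocess

jobid = "123456"  # placeholder: use a job id that logged the error
result = subprocess.run(
    ["sacct", "-j", jobid, "--format=JobID,MaxRSS,TotalCPU", "--parsable2"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # inspect the batch/extern step rows here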

I haven't been able to reproduce it yet. I suspect a race condition, possibly triggered by some configuration-specific thing. Can you upload your cgroup.conf? I have your slurm.conf file from a recent ticket (7580), so I don't need that.
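
For reference, the settings I'm most interested in are the cgroup constraint/tracking ones; a fairly typical cgroup.conf looks roughly like this (a generic example, not a guess at your actual file):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes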

- Marshall
Comment 4 Steve Ford 2019-08-22 11:38:34 MDT
Created attachment 11324 [details]
cgroup.conf
Comment 9 Marshall Garey 2019-08-23 09:49:10 MDT
Just a quick update - I can reproduce this occasionally simply by submitting a bunch of jobs. I still don't know why it's happening, though.

The good news: it only appears to happen at the start of the job, it is almost always the extern or batch step, only the cgroup accounting information for that first attempt is missed, and that information is gathered successfully on every subsequent attempt. So you aren't really losing any data. Feel free to ignore these error messages; I'll keep trying to track down and fix the bug.
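
Nothing fancy is needed to reproduce it, just a burst of short jobs. A minimal sketch of the kind of load I'm using (the job itself doesn't matter; this simply calls sbatch in a loop):

# Submit a burst of trivial jobs to try to trigger the race at step start.
# Assumes sbatch is on PATH and the default partition accepts the jobs.
import subprocess

for _ in range(500):
    subprocess.run(
        ["sbatch", "--wrap", "sleep 1"],  # throwaway one-second job
        check=True,
        stdout=subprocess.DEVNULL,
    )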
Comment 12 Gavin D. Howard 2019-09-12 13:33:42 MDT
Hello,

I just wanted to let you know that I have taken charge of this bug, and I am looking into it. We haven't forgotten.
Comment 15 Gavin D. Howard 2019-11-19 10:29:41 MST
Just a note that I am still working on this bug. It has proven hard for me to reproduce since it is a race condition, but I have not forgotten.
Comment 19 Marc Caubet Serrabou 2020-03-03 02:13:28 MST
We are seeing the same problem in our cluster at PSI. Just let us know if you need any extra information about configuration or logs. The relevant settings in slurm.conf are:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup
Comment 28 Gavin D. Howard 2020-05-06 14:03:47 MDT
Apologies for the long delay, but this has now been fixed and committed. See https://github.com/SchedMD/slurm/commit/4c030c03778e65534178449cedca9bbe483bd0ec.