Ticket 7617

Summary: slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
Product: Slurm Reporter: Steve Ford <fordste5>
Component: slurmstepd Assignee: Gavin D. Howard <gavin>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: aptivhpcsupport, cinek, marc.caubet, marshall, nate
Version: 19.05.1   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8656
https://bugs.schedmd.com/show_bug.cgi?id=8763
Site: MSU Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 20.02.3 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: lac-014 slurmd log
cgroup.conf

Description Steve Ford 2019-08-22 10:49:22 MDT
Created attachment 11323 [details]
lac-014 slurmd log

We are seeing some unusual errors on one of our nodes:

slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen

Any idea what could cause these?
Comment 3 Marshall Garey 2019-08-22 11:36:22 MDT
Hey Steve,

"This should never happen" bugs are always fun.

These errors mean that some of the accounting information from cgroups isn't being gathered by jobacct_gather, but it's not impacting the actual job in any way. From your slurmd log file, it's mostly the extern and batch steps, but a handful of cases are step 0 of some job.
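
If you want to see what did get recorded for an affected job, something like this will dump the per-step accounting once the job completes. A rough sketch (the job id is a placeholder, and it just shells out to sacct):

# Print the recorded per-step accounting for one job.
# Assumes sacct is on PATH and accounting storage is enabled.
import subprocess

jobid = "123456"  # placeholder: use a job id that logged the error
result = subprocess.run(
    ["sacct", "-j", jobid, "--format=JobID,MaxRSS,TotalCPU", "--parsable2"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # inspect the batch/extern step rows here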

I haven't been able to reproduce it yet. I suspect a race condition, possibly triggered by some configuration-specific thing. Can you upload your cgroup.conf? I have your slurm.conf file from a recent ticket (7580), so I don't need that.
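
For reference, the settings I'm most interested in are the cgroup constraint/tracking ones; a fairly typical cgroup.conf looks roughly like this (a generic example, not a guess at your actual file):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes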

- Marshall
Comment 4 Steve Ford 2019-08-22 11:38:34 MDT
Created attachment 11324 [details]
cgroup.conf
Comment 9 Marshall Garey 2019-08-23 09:49:10 MDT
Just a quick update - I can reproduce this occasionally simply by submitting a bunch of jobs. I still don't know why it's happening, though.

The good news: it only appears to happen at the start of the job, it is almost always the extern or batch step, only the cgroup accounting information for that first attempt is missed, and that information is gathered successfully on every subsequent attempt. So you aren't really losing any data. Feel free to ignore these error messages; I'll keep trying to track down and fix the bug.
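
Nothing fancy is needed to reproduce it, just a burst of short jobs. A minimal sketch of the kind of load I'm using (the job itself doesn't matter; this simply calls sbatch in a loop):

# Submit a burst of trivial jobs to try to trigger the race at step start.
# Assumes sbatch is on PATH and the default partition accepts the jobs.
import subprocess

for _ in range(500):
    subprocess.run(
        ["sbatch", "--wrap", "sleep 1"],  # throwaway one-second job
        check=True,
        stdout=subprocess.DEVNULL,
    )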
Comment 12 Gavin D. Howard 2019-09-12 13:33:42 MDT
Hello,

I just wanted to let you know that I have taken charge of this bug, and I am looking into it. We haven't forgotten.
Comment 15 Gavin D. Howard 2019-11-19 10:29:41 MST
Just a note that I am still working on this bug. It has proven hard for me to reproduce since it is a race condition, but I have not forgotten.
Comment 19 Marc Caubet Serrabou 2020-03-03 02:13:28 MST
We are seeing the same problem in our cluster at PSI. Just let us know if you need any extra information about configuration or logs. The relevant settings in slurm.conf are:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup
Comment 28 Gavin D. Howard 2020-05-06 14:03:47 MDT
Apologies for the long delay, but this has now been fixed and committed. See https://github.com/SchedMD/slurm/commit/4c030c03778e65534178449cedca9bbe483bd0ec.