Ticket 7617 - slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
Summary: slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 19.05.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Gavin D. Howard
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-08-22 10:49 MDT by Steve Ford
Modified: 2020-05-06 14:03 MDT
CC List: 5 users

See Also:
Site: MSU
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02.3
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
lac-014 slurmd log (9.38 MB, application/x-gzip)
2019-08-22 10:49 MDT, Steve Ford
cgroup.conf (184 bytes, text/plain)
2019-08-22 11:38 MDT, Steve Ford

Description Steve Ford 2019-08-22 10:49:22 MDT
Created attachment 11323
lac-014 slurmd log

We are seeing some unusual errors on one of our nodes:

slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen

Any idea what could cause these?
Comment 3 Marshall Garey 2019-08-22 11:36:22 MDT
Hey Steve,

"This should never happen" bugs are always fun.

These errors mean that some of the accounting information from cgroups isn't being gathered by jobacct_gather, but it's not impacting the actual job in any way. From your slurmd log file, it's mostly the extern and batch steps, but a handful of cases are step 0 of some job.
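
For context, jobacct_gather/cgroup polls per-task accounting counters out of the cpuacct and memory cgroup hierarchies, and these errors fire when the expected task cgroup isn't there at poll time. Below is a minimal Python sketch of the kind of read involved, assuming a cgroup v1 layout of slurm/uid_<uid>/job_<jobid>/step_<stepid>/task_<taskid>; the path layout and function name are illustrative assumptions, not Slurm's actual C implementation.

# Sketch: read the cgroup v1 accounting counters a jobacct-style poll consumes.
# The slurm/uid_*/job_*/step_*/task_* layout below is an assumed example.
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumed cgroup v1 mount point

def read_task_usage(uid, job, step, task):
    """Return (cpu_ns, mem_bytes) for one task cgroup, or None if the
    cgroup does not exist yet (the situation behind the errors above)."""
    rel = f"slurm/uid_{uid}/job_{job}/step_{step}/task_{task}"
    cpu_path = os.path.join(CGROUP_ROOT, "cpuacct", rel, "cpuacct.usage")
    mem_path = os.path.join(CGROUP_ROOT, "memory", rel, "memory.usage_in_bytes")
    try:
        with open(cpu_path) as f:
            cpu_ns = int(f.read())     # cumulative CPU time, nanoseconds
        with open(mem_path) as f:
            mem_bytes = int(f.read())  # current memory usage, bytes
    except FileNotFoundError:
        # Corresponds to "Could not find task_cpuacct_cg / task_memory_cg":
        # the poll ran before the task cgroup was created (or after removal).
        return None
    return cpu_ns, mem_bytes

A poll that races with cgroup creation fails once and succeeds on later attempts, which is consistent with a startup race.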

I haven't been able to reproduce it yet. I suspect a race condition, possibly triggered by some configuration-specific thing. Can you upload your cgroup.conf? I have your slurm.conf file from a recent ticket (7580), so I don't need that.

- Marshall
Comment 4 Steve Ford 2019-08-22 11:38:34 MDT
Created attachment 11324
cgroup.conf
Comment 9 Marshall Garey 2019-08-23 09:49:10 MDT
Just a quick update - I can reproduce this occasionally simply by submitting a bunch of jobs. I still don't know why it's happening, though.

The good news is that it only appears to happen at the start of the job, it's almost always the extern or batch step, only the cgroup accounting information is missed, and that information is gathered on every subsequent poll. So you aren't really losing any data. Feel free to ignore these error messages; I'll keep trying to track down and fix the bug.
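
As a side note, a quick way to confirm nothing was lost for a job that logged these errors is to look at the usual accounting fields afterwards, for example:

sacct -j <jobid> --format=JobID,MaxRSS,TotalCPU

If MaxRSS and TotalCPU are populated for the affected steps, the later polls picked up the cgroup data as described above. (Generic check; <jobid> is a placeholder for the job in question.)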
Comment 12 Gavin D. Howard 2019-09-12 13:33:42 MDT
Hello,

I just wanted to let you know that I have taken charge of this bug, and I am looking into it. We haven't forgotten.
Comment 15 Gavin D. Howard 2019-11-19 10:29:41 MST
Just a note that I am still working on this bug. It has proven hard for me to reproduce since it is a race condition, but I have not forgotten.
Comment 19 Marc Caubet Serrabou 2020-03-03 02:13:28 MST
We are seeing the same problem on our cluster at PSI. Just let us know if you need any extra information about configurations or logs. Relevant settings in slurm.conf:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup
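
For completeness, the cgroup side of such a setup lives in cgroup.conf; a minimal example of that file (illustrative only, assuming common constraint settings, not the cgroup.conf attached to this ticket) would be:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes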
Comment 28 Gavin D. Howard 2020-05-06 14:03:47 MDT
Apologies for the long delay, but this has now been fixed and committed. See https://github.com/SchedMD/slurm/commit/4c030c03778e65534178449cedca9bbe483bd0ec .