Ticket 3749 - cg group error
Summary: cg group error
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 16.05.4
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-04-28 15:20 MDT by mengxing cheng
Modified: 2018-04-26 10:40 MDT

See Also:
Site: University of Chicago


Description mengxing cheng 2017-04-28 15:20:40 MDT
Slurm reports a cgroup error on some nodes.

[pdeperio@midway2-login1 170219-thread_test]$ srun -w midway2-0414 --account=pi-lgrandi --partition=xenon1t hostname
slurmstepd-midway2-0414: error: task/cgroup: unable to add task[pid=34313] to memory cg '(null)'
midway2-0414.rcc.local

There is no error on another node:
[pdeperio@midway2-login1 170219-thread_test]$ srun -w midway2-0421 --account=pi-lgrandi --partition=xenon1t hostname
midway2-0421.rcc.local

Do you know what is wrong with the node reporting the error? Thank you!

Mengxing
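
For anyone checking which nodes show this message, here is a minimal sketch that picks the node name, PID, and cgroup subsystem out of captured srun output. The regular expression is written against the exact wording quoted above; other Slurm versions may phrase the message differently, and the function names (parse_cg_error, affected_nodes) are hypothetical helpers, not part of Slurm.

```python
import re

# Matches the slurmstepd message exactly as quoted in this ticket.
# Assumption: other Slurm versions may word the message differently.
ERROR_RE = re.compile(
    r"slurmstepd-(?P<node>[^:]+): error: task/cgroup: unable to add "
    r"task\[pid=(?P<pid>\d+)\] to (?P<subsystem>\w+) cg '(?P<cgroup>[^']*)'"
)


def parse_cg_error(line):
    """Return a dict describing the cgroup error on this line, or None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info


def affected_nodes(lines):
    """Return the sorted, de-duplicated node names that reported the error."""
    parsed = (parse_cg_error(line) for line in lines)
    return sorted({info["node"] for info in parsed if info})
```

Feeding it the saved output of a test command run against each node (for example, the srun hostname calls above) gives a quick list of nodes to compare against the healthy ones.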
Comment 1 Tim Wickberg 2017-04-28 15:29:20 MDT
This appears to be similar to an earlier bug 3364. I believe the underlying problem identified there is an issue in the Linux kernel itself, not in Slurm.

I'm guessing that the afflicted node has run more jobs than the other okay nodes? If you reboot midway2-0414 does that clear up the error?

If it does, then I think we can conclude that the Linux kernel bug discussed in bug 3364 is still present and causing problems; in that case, you'd need to upgrade to a fixed Linux kernel to resolve it.
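
One way to compare the afflicted node with a healthy one before rebooting is to look at how many memory cgroups the kernel reports on each. The sketch below reads the num_cgroups column for the memory controller from /proc/cgroups. It is an assumption that this counter is informative for the kernel issue mentioned above; this ticket does not confirm that, and the function names are hypothetical.

```python
def memory_cgroup_count(proc_cgroups_text):
    """Return num_cgroups for the memory controller, or None if not listed.

    Expects the text of /proc/cgroups, whose columns are:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header line and blank lines
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "memory":
            return int(fields[2])
    return None


def read_memory_cgroup_count(path="/proc/cgroups"):
    """Read the file on the local node and return the memory cgroup count."""
    with open(path) as handle:
        return memory_cgroup_count(handle.read())
```

Running read_memory_cgroup_count() on midway2-0414 and on midway2-0421, then again after a reboot, would show whether the two nodes differ; a large gap would be a hint worth reporting, not proof of the cause.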
Comment 2 Tim Wickberg 2017-05-03 07:28:36 MDT
(In reply to Tim Wickberg from comment #1)
> This appears to be similar to an earlier bug 3364. I believe the underlying
> problem identified there is an issue in the Linux kernel itself, not in
> Slurm.
> 
> I'm guessing that the afflicted node has run more jobs than the other okay
> nodes? If you reboot midway2-0414 does that clear up the error?

Hey Mengxing - 

Have you had a chance to test this out as a work-around?

Unfortunately, as I described before, I believe this is a problem with the Linux kernel, and not something Slurm can directly resolve. Knowing whether the reboot clears things up, and whether the node had run more jobs than average, would help verify that as the cause.

- Tim
Comment 3 Tim Wickberg 2017-05-23 18:43:13 MDT
I'm marking this resolved/timedout as I still haven't seen a response to comment 1 or comment 2.

Please re-open if you'd like to continue to pursue this.

- Tim
Comment 4 mengxing cheng 2017-05-24 15:09:37 MDT
Tim, thank you for your support!

Mengxing
Comment 5 Marshall Garey 2018-04-26 10:40:31 MDT
*** Ticket 5082 has been marked as a duplicate of this ticket. ***