Summary: | Jobs escaping cgroup device controls after some amount of time. | ||
---|---|---|---|
Product: | Slurm | Reporter: | sabobbin |
Component: | Other | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | ||
Version: | 17.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | UMIACS | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
sabobbin
2018-06-12 15:21:53 MDT
I'm guessing you're on a RHEL7-based distribution? Do you have something that may be triggering 'systemctl daemon-reload' or similar overnight?

This is likely caused by systemd undoing the cgroup hierarchy that Slurm had established - bug 5292 is tracking some discussion with BYU related to this at the moment.

- Tim

Yes, specifically RHEL 7.5.

And looking for occurrences of "systemd: Reloading." in the syslog on these hosts, it seems like it. Usually triggered when puppet makes a change to a service or updates a package.

(In reply to sabobbin from comment #2)
> Yes, specifically RHEL 7.5
>
> And looking for occurrences of "systemd: Reloading." in the syslog on these
> hosts, it seems like it. Usually triggered when puppet makes a change to a
> service or updates a package.

That matches up with exactly what Levi's described thus far. There's a preliminary patch that may help with this, alongside a change to the slurmd.service file, all described on that bug.

I'm going to close this as a duplicate of that, and move further discussion over there.

- Tim

*** This bug has been marked as a duplicate of bug 5302 ***

*** This bug has been marked as a duplicate of bug 5061 ***
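
For anyone wanting to repeat the syslog check described in this thread, here is a minimal sketch that pulls out the "systemd: Reloading." lines so their timestamps can be compared against when jobs lost their device restrictions. The function names `find_reloads` and `scan_file` are hypothetical, and the `/var/log/messages` path is an assumption (the usual rsyslog default on RHEL 7); only the marker string comes from the discussion above.

```python
from typing import Iterable, List

# Marker string quoted in the thread; systemd logs it on each daemon reload.
RELOAD_MARKER = "systemd: Reloading."


def find_reloads(lines: Iterable[str]) -> List[str]:
    """Return the log lines that record a systemd reload, newline stripped."""
    return [line.rstrip("\n") for line in lines if RELOAD_MARKER in line]


def scan_file(path: str = "/var/log/messages") -> List[str]:
    """Scan a syslog file for reload events.

    The default path is an assumption; pass the actual syslog location
    for the host being checked.
    """
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as fh:
        return find_reloads(fh)
```

Calling `scan_file()` on an affected node (with read access to the log) returns the matching lines; if their timestamps line up with puppet runs, that would be consistent with the explanation given in this thread.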