Bug 5300 - Jobs escaping cgroup device controls after some amount of time.
Status: RESOLVED DUPLICATE of bug 5292
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.11.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
 
Reported: 2018-06-12 15:21 MDT by sabobbin
Modified: 2018-06-12 16:36 MDT
Site: UMIACS


Description sabobbin 2018-06-12 15:21:53 MDT
This is a resubmission of bug 5061 now that we're under support.


We're running Slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup device controls on GPUs.


For example we have the following steps running on a single node:

# ps auxn | grep [s]lurmstepd
       0  2380  0.0  0.0 538436  3700 ?        Sl   07:22   0:02 slurmstepd: [46609.0]
       0  5714  0.0  0.0 472136  3952 ?        Sl   Apr11   0:03 slurmstepd: [46603.0]
       0 17202  0.0  0.0 538448  3724 ?        Sl   Apr11   0:03 slurmstepd: [46596.0]
       0 28673  0.0  0.0 538380  3696 ?        Sl   Apr10   0:39 slurmstepd: [46262.0]
       0 44832  0.0  0.0 538640  3964 ?        Sl   Apr11   1:12 slurmstepd: [46361.0]


But not all of those are reflected in the cgroup device hierarchy:

# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0


This issue only seems to happen after a job has been running for a while; when a job first starts, the cgroup controls work as expected. In this example, the jobs that have escaped the controls (46262 and 46361) have been running for over a day:

# squeue -j 46609,46603,46596,46262,46361
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             46596     dpart     bash     yhng  R   10:56:00      1 vulcan14
             46609 scavenger     bash    yaser  R    1:52:37      1 vulcan14
             46603 scavenger     bash  jxzheng  R    9:47:26      1 vulcan14
             46361     dpart     bash  jxzheng  R 1-08:31:14      1 vulcan14
             46262     dpart Weighted  umahbub  R 1-18:07:07      1 vulcan14
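
The manual comparison above (slurmstepd processes versus entries in the devices hierarchy) could be automated along these lines. This is only a sketch: the function names are made up for illustration, and it assumes the `ps` and `lscgroup` output formats shown in this report.

```python
import re


def running_job_ids(ps_output):
    """Job IDs taken from 'slurmstepd: [JOBID.STEP]' process titles."""
    return set(re.findall(r"slurmstepd: \[(\d+)\.", ps_output))


def device_cgroup_job_ids(lscgroup_output):
    """Job IDs that still have a devices:/slurm/uid_*/job_* cgroup."""
    pattern = r"^devices:/slurm/uid_\d+/job_(\d+)"
    return set(re.findall(pattern, lscgroup_output, flags=re.MULTILINE))


def escaped_jobs(ps_output, lscgroup_output):
    """Jobs with a running slurmstepd but no devices cgroup, sorted by ID."""
    missing = running_job_ids(ps_output) - device_cgroup_job_ids(lscgroup_output)
    return sorted(missing, key=int)
```

Fed the `ps auxn` and `lscgroup` listings from this report, `escaped_jobs` would return jobs 46262 and 46361, the two steps that have no entry under `devices:/slurm`.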


So it seems that at some point Slurm, or something else, comes in and modifies the cgroup hierarchy, but we haven't had much luck tracking down what.

We're seeing this happen on multiple different clusters, and it has been occurring since at least 17.02.
Comment 1 Tim Wickberg 2018-06-12 15:29:06 MDT
I'm guessing you're on a RHEL7-based distribution?

Do you have something that may be triggering 'systemctl daemon-reload' or similar overnight?

This is likely caused by systemd undoing the cgroup hierarchy that Slurm had established - bug 5292 is tracking some discussion with BYU related to this at the moment.
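
One way to check whether a given process has been moved out of the Slurm hierarchy is to read `/proc/<pid>/cgroup`, which on cgroup v1 lists one `ID:controllers:path` line per hierarchy. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that an escaped process shows up with a devices path outside `/slurm`, which this report does not confirm.

```python
def devices_cgroup_path(proc_cgroup_text):
    """Return the devices-controller path from /proc/<pid>/cgroup text.

    Expects cgroup v1 lines of the form 'ID:controller[,controller]:path'.
    Returns None when no line mentions the devices controller.
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and "devices" in parts[1].split(","):
            return parts[2]
    return None


def is_under_slurm(path):
    """True if the path is /slurm itself or somewhere below it."""
    return path is not None and (path == "/slurm" or path.startswith("/slurm/"))
```

For a step's PID, something like `is_under_slurm(devices_cgroup_path(open(f"/proc/{pid}/cgroup").read()))` returning False would suggest the process is no longer confined by Slurm's device cgroup.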

- Tim
Comment 2 sabobbin 2018-06-12 15:42:09 MDT
Yes, specifically RHEL 7.5

And looking for occurrences of "systemd: Reloading." in the syslog on these hosts, it seems like it. Usually triggered when puppet makes a change to a service or updates a package.
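
The syslog search described above can be sketched as a small filter. The function name and the default marker argument are illustrative; the marker string is the one quoted in this comment, and the log path in the usage note is an assumption.

```python
def reload_events(syslog_lines, marker="systemd: Reloading."):
    """Return the syslog lines that record a systemd reload.

    syslog_lines can be any iterable of strings, such as an open file.
    """
    return [line.rstrip("\n") for line in syslog_lines if marker in line]
```

Assuming the log lives at `/var/log/messages`, `reload_events(open("/var/log/messages"))` would list each reload, and the timestamps can then be compared against the start times of the jobs that lost their device cgroups.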
Comment 3 Tim Wickberg 2018-06-12 16:26:16 MDT
(In reply to sabobbin from comment #2)
> Yes, specifically RHEL 7.5
> 
> And looking for occurrences of "systemd: Reloading." in the syslog on these
> hosts, it seems like it. Usually triggered when puppet makes a change to a
> service or updates a package.

That matches up with exactly what Levi has described thus far. There's a preliminary patch that may help with this, alongside a change to the slurmd.service file, all described on that bug. I'm going to close this as a duplicate of that, and move further discussion over there.

- Tim

*** This bug has been marked as a duplicate of bug 5302 ***
Comment 4 Tim Wickberg 2018-06-12 16:33:51 MDT

*** This bug has been marked as a duplicate of bug 5061 ***
Comment 5 Tim Wickberg 2018-06-12 16:36:53 MDT
Sorry for the extra noise. Bug 5292 is what I should have indicated as a duplicate originally.

*** This bug has been marked as a duplicate of bug 5292 ***