We're running Slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup controls on GPU devices. For example, we have the following steps running on a single node:

# ps auxn | grep [s]lurmstepd
0   2380  0.0  0.0 538436  3700 ?  Sl  07:22  0:02 slurmstepd: [46609.0]
0   5714  0.0  0.0 472136  3952 ?  Sl  Apr11  0:03 slurmstepd: [46603.0]
0  17202  0.0  0.0 538448  3724 ?  Sl  Apr11  0:03 slurmstepd: [46596.0]
0  28673  0.0  0.0 538380  3696 ?  Sl  Apr10  0:39 slurmstepd: [46262.0]
0  44832  0.0  0.0 538640  3964 ?  Sl  Apr11  1:12 slurmstepd: [46361.0]

But not all of those are reflected in the devices cgroup hierarchy:

# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0

This only seems to happen after a job has been running for a while; when a job first starts, the cgroup controls work as expected. In this example, the jobs that have escaped the controls (46262 and 46361) have been running for over a day:

# squeue -j 46609,46603,46596,46262,46361
 JOBID  PARTITION      NAME     USER  ST        TIME  NODES  NODELIST(REASON)
 46596      dpart      bash     yhng   R    10:56:00      1  vulcan14
 46609  scavenger      bash    yaser   R     1:52:37      1  vulcan14
 46603  scavenger      bash  jxzheng   R     9:47:26      1  vulcan14
 46361      dpart      bash  jxzheng   R  1-08:31:14      1  vulcan14
 46262      dpart  Weighted  umahbub   R  1-18:07:07      1  vulcan14

So it seems that at some point Slurm, or something else, comes in and modifies the cgroup hierarchy, but we haven't had much luck tracking down what. We're seeing this happen on multiple different clusters, and it has been occurring since at least 17.02.
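As a quick sanity check (not part of the original report), one way to confirm which cgroup a running step actually sits in is to read `/proc/<pid>/cgroup` for each slurmstepd and pull out the devices-controller path. This is a sketch assuming cgroup v1; the sample lines below are illustrative, standing in for a real `/proc/<pid>/cgroup`:

```shell
# Simulated contents of /proc/<pid>/cgroup for a confined step
# (format per cgroup v1: hierarchy-id:controller:path).
sample='4:devices:/slurm/uid_2093/job_46609/step_0
3:memory:/slurm/uid_2093/job_46609/step_0'

# Extract the devices-controller path for the process.
printf '%s\n' "$sample" | awk -F: '$2 == "devices" {print $3}'
# -> /slurm/uid_2093/job_46609/step_0
```

For a step that has escaped, the devices line presumably shows `/` (the root cgroup) rather than the slurm job path, which would line up with its cgroup directory having vanished from the `lscgroup` output above.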
*** Bug 5300 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 5292 ***