I'm running into a problem with slurm and cgroups, where systemd seems to be messing me up, pulling my processes out of the slurm-created cgroups. I'm wondering if you're aware of the issue, know of a workaround, etc.
We're working on deploying a new RHEL7 based cluster system, which will be our first cluster using systemd. We're using several cgroup plugins, and slurmd seems to put the processes into uid- and jobid-specific cgroups, as expected (output abbreviated):
[lbrown@m9g-2-20 ~]$ salloc --time=10:00 --mem=1G -C rhel7
bash-4.2$ cat /proc/self/cgroup
But when I go to another window, log into that same host as root, run "systemctl daemon-reload" to refresh all the systemd unit files, and then look at the cgroups again, I see this (abbreviated again):
bash-4.2$ cat /proc/self/cgroup
So, while some (cpuset, freezer) stayed in place, others (memory, and "cpuacct,cpu") did not. Note that we won't be running "daemon-reload" often, but I have reason to suspect that it, or something like it, is happening periodically anyway. On an interactive node (not managed by slurm), I've seen similar symptoms hit my existing sessions every few days.
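For reference, the before/after comparison can be scripted. Here's a minimal sketch (assuming the usual cgroup v1 "hierarchy-ID:controller,list:path" line format in /proc/self/cgroup) that maps each controller to its path, so two snapshots can be diffed to see exactly which controllers got pulled out:

```python
# Minimal sketch: map each cgroup v1 controller to its path, so that
# /proc/<pid>/cgroup snapshots taken before and after a
# "systemctl daemon-reload" can be compared. Assumes the usual
# "hierarchy-ID:controller,list:path" line format.

def parse_cgroup_file(text):
    """Return a {controller: cgroup_path} mapping."""
    mapping = {}
    for line in text.strip().splitlines():
        _hier_id, controllers, path = line.split(":", 2)
        for ctrl in controllers.split(","):
            mapping[ctrl] = path
    return mapping

def moved_controllers(before, after):
    """Controllers whose cgroup path changed between two snapshots."""
    return sorted(c for c in before if after.get(c) != before[c])
```

Reading /proc/self/cgroup twice, once before and once after the daemon-reload, and feeding both snapshots through moved_controllers() shows the damage directly.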
In theory, this could be handled by some form of cgroup subtree delegation (https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=), which I'm experimenting with for interactive logins. But, of course, there's a just-resolved-today bug in systemd (https://github.com/systemd/systemd/issues/8364), that will take a while to percolate down to the Linux distributions, I'm sure.
In the meantime, I'm just wondering if SchedMD has, or knows of, any kind of workaround or re-adoption mechanism. We can work on one, but if someone else has already built it, I'm happy to just use it.
Have you tried Delegate=yes in the service file? It's not clear to me whether systemd will actually respect that, and I'm also seeing a lot of chatter that the pam_systemd module can impact this as well.
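For reference, this is the kind of drop-in I'd expect to use to test that, so the packaged unit file stays untouched (a sketch; the drop-in path assumes the service is named slurmd.service):

```ini
# /etc/systemd/system/slurmd.service.d/delegate.conf  (hypothetical drop-in)
[Service]
# Ask systemd to leave this service's cgroup subtree alone.
Delegate=yes
```

After creating it, a "systemctl daemon-reload" and a slurmd restart would be needed for it to take effect. Note that the list form of Delegate= may behave differently from the boolean form on older systemd versions, though I'm not certain of the version cutoff.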
I don't have a quick fix, but it does appear we need to do some work on our end. Systemd is continuing to push forward with their agenda of exclusively owning the cgroup hierarchy, which obviously conflicts with Slurm's long-standing use.
Up until this morning, I had only done testing in an interactive environment, and assumed that slurm would act the same way under delegation. In that testing, when I successfully delegated a cgroup subtree (e.g. "/system.slice/sshd.service"), put processes into special-purpose subtrees of that delegated subtree (e.g. "/system.slice/sshd.service/user_lbrown"), and subsequently did a daemon-reload, the processes ended up back in the root of the delegated subtree (e.g. "/system.slice/sshd.service"). When I did the same thing without the delegated subtree, they just ended up in the global cgroup root ("/").
I'm seeing some inconsistent behavior, especially between defining "Delegate=yes" and "Delegate=cpuset,memory,cpuacct,cpu,freezer" in the unit file. I will try to nail down the behavior more thoroughly and get back to you.
Having said that, if we assume that a delegated subtree is the right answer going forward, is there currently a good way to define that in cgroup.conf? I currently have 'CgroupMountpoint="/sys/fs/cgroup"' in there, but the delegated subtrees seem to have paths like "/sys/fs/cgroup/CONSTRAINTTYPE/system.slice/slurmd.service", where CONSTRAINTTYPE is "memory", "cpu,cpuacct", etc. I'm not sure how to express that in this file. Is there a placeholder I can use in the CgroupMountpoint definition, à la 'CgroupMountpoint="/sys/fs/cgroup/%C/system.slice/slurmd.service"' or similar?
Created attachment 7066 [details]
slurm/systemd/cgroup scenarios and assigned cgroups
I just attached a text file containing 6 different combinations of delegations, slurm cgroups enabled/disabled, etc. I'm still evaluating them, but I do have a couple of observations.
The scenarios that have the slurm cgroup plugins disabled (AFAICT) help me determine whether the delegation worked or not. For some unknown reason, "Delegate=yes" seemed to work, where "Delegate=list,of,constraints" did not.
The scenarios where slurm's cgroup plugins were enabled make it clear that the slurm-assigned cgroups are outside the delegated subtree; they start with "/slurm" instead of the delegated "/system.slice/slurmd.service". That's expected, since I don't know how to define the delegated subtree inside slurm's cgroup.conf.
Scenario 4 is particularly interesting. I've done the test several times, and it looks like it's not being affected by the bug that moves the process back to the cgroup root (or delegated subtree root) when I do a daemon-reload. I wish I knew why, so I could apply it to my interactive node scenario, too.
If I have any other insights to add here, I'll let you know.
FYI, I am also working on a simple python script to re-adopt processes back into the appropriate cgroups. It's quick and dirty, but I'll be happy to contribute it back once it's working.
Created attachment 7070 [details]
Change cgroup_prepend to point into slurmd.service area.
(In reply to lloyd_brown_schedmdbugzilla from comment #4)
> I just attached a text file containing 6 different combinations of
> delegations, slurm cgroups enabled/disabled, etc. I'm still evaluating
> them, but I do have a couple of observations.
> The scenarios that have slurm cgroup plugins disabled (AFAICT), help me
> determine whether the delegation worked or not. For some unknown reason,
> the "delegate=yes" seemed to work, where the "delegate=list,of,constraints"
> did not.
> The scenarios where slurm's cgroup plugins were enabled, make it clear that
> the slurm-assigned cgroups are outside the delegated subtree; they start
> with "/slurm", instead of the delegated "/system.slice/slurmd.service".
> That's expected, since I don't know how to define the delegated subtree
> inside slurm's cgroup.conf.
We don't have an option to set this currently, it's hard coded.
I'm attaching a patch which changes this to the location where systemd wants us to be cooped up; let me know if you're willing to test it out.
> Scenario 4 is particularly interesting. I've done the test several times,
> and it looks like it's not being affected by the bug that moves the process
> back to the cgroup root (or delegated subtree root) when I do a
> daemon-reload. I wish I knew why, so I could apply it to my interactive
> node scenario, too.
> If I have any other insights to add here, I'll let you know.
> FYI, I am also working on a simple python script to re-adopt processes back
> into the appropriate cgroups. It's quick and dirty, but I'll be happy to
> contribute it back once it's working.
I wouldn't mind seeing it as an attachment here for reference, but do not want to ship such a creation at any point.
There's obviously some work to do here with deciding how best to cooperate. I think we've lost the fight over systemd-as-init, and they've moved on and are executing on their stated goal of systemd-as-sole-userspace-cgroup-manipulator which obviously poses a conflict.
If the attached patch, alongside the service file change, appears to get things to cooperate for now, that'll likely be the 18.08 approach. Longer-term, I need to sort out whether we need to bite the bullet and build task/systemd and proctrack/systemd plugins that use systemd's APIs, rather than working directly against the cgroup hierarchy, lest we end up with no access to this functionality.
*** Bug 5300 has been marked as a duplicate of this bug. ***
Just one warning when testing that patch out - I haven't tested this locally, but based on some other research, it may lead to a problem where job processes are terminated if the slurmd is restarted.
So... YMMV. I'm looking into this further, there are some subtle interactions that would change here by building out underneath the systemd hierarchy instead of our own.
I will continue testing today, but preliminarily I can say that using Delegate=yes seems to let us avoid any problems with a daemon-reload operation, even when slurm is not within the delegated cgroup. I can't explain that. But I'm willing to live with it for now, since I have enough other things to do. I will try to find time to test your patch, but I can't guarantee anything.
I can say that restarting slurmd doesn't seem to cancel jobs currently, though I haven't applied your patch, so that may make a difference.
As far as the readoption tool goes, I can't blame you. My coding skills aren't great; frankly, anyone with a small amount of coding and regex skills could've done what I have. I'm not sure it's worth sharing.
In the end, it's just parsing the output of "scontrol listpids", getting the UID of the process from /proc/PID/status for the specific PID, and then echoing the PID back into the corresponding cgroup path, e.g. "echo PID > /sys/fs/cgroup/memory/slurm/uid_UID/step_STEPID/task_TASKID/tasks". The only thing I really need to figure out is where to grab the TASKID from.
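For what it's worth, the logic just described can be sketched roughly like this. It's untested against a real cluster; the "scontrol listpids" column layout and the cgroup path template are assumptions from my setup, and the task_TASKID level is omitted since that's still the open question:

```python
# Rough sketch of the re-adoption idea: parse `scontrol listpids`,
# look up each PID's UID in /proc/<pid>/status, and write the PID back
# into the matching cgroup tasks file. Assumptions: the listpids
# column layout (PID JOBID STEPID LOCALID GLOBALID) and the cgroup
# path template, which mirrors the quoted path minus the task_TASKID
# component (still an open question).

def parse_listpids(output):
    """Parse `scontrol listpids` output into (pid, jobid, stepid) tuples."""
    entries = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        entries.append((int(fields[0]), fields[1], fields[2]))
    return entries

def parse_uid(status_text):
    """Extract the real UID from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            return int(line.split()[1])
    raise ValueError("no Uid: line found")

def tasks_path(uid, stepid, controller="memory"):
    """Assumed tasks-file path; the task_TASKID level is omitted."""
    return "/sys/fs/cgroup/%s/slurm/uid_%d/step_%s/tasks" % (
        controller, uid, stepid)

def readopt(pid, path):
    """Echo the PID into the cgroup's tasks file (requires root)."""
    with open(path, "a") as f:
        f.write("%d\n" % pid)
```

The driver loop would just run parse_listpids() over the live command output, read each /proc/PID/status, and call readopt() for any PID whose current cgroup doesn't match the expected path.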
And even if it works, at best it's something that would run periodically (e.g. via cron or a systemd timer), so there would still be brief windows where the processes are outside the cgroup before they get readopted. It's not a great approach, but sometimes that's what you have to do. I keep hoping it won't be necessary this time.
Commit cecb39ff087731d2 adds Delegate=yes to the slurmd.service file, and will be included in 17.02.8 when released later today.
I'm dropping this one level to Sev-4, and keeping this open for now until we can get a fix in 18.08 to use the systemd-preferred location to build our hierarchies rather than our own defaults.
Updating to mark this as resolved by the prior commit to add Delegate=yes to the slurmd.service file.
I'd looked into behaving "properly" and limiting slurmd's cgroup usage to assigned portions of the hierarchy (https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/), but since systemd doesn't create equivalent structures in the freezer or cpuset controllers, it doesn't appear to be worth the effort at this time.
Longer-term we are looking into alternatives to better cooperate with systemd on the compute node, but that's being tracked elsewhere.
*** Bug 5061 has been marked as a duplicate of this bug. ***
*** Bug 6474 has been marked as a duplicate of this bug. ***