Bug 5920 - Work-around for systemd-logind memory cgroup issues
Summary: Work-around for systemd-logind memory cgroup issues
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.11.8
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Felip Moll
QA Contact: Tim Wickberg
URL:
Duplicates: 6019 11154
Depends on:
Blocks:
 
Reported: 2018-10-25 07:20 MDT by hpc-admin
Modified: 2023-01-25 03:47 MST
CC List: 21 users

See Also:
Site: Ghent
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.2+
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
pam_slurm_adopt-19.05.01-2 patch to correct pam_systemd coincident usage (7.60 KB, patch)
2019-08-14 10:41 MDT, mike coyne
Details | Diff
bug5920_customer_workaround_20.11.patch - not supported - (6.78 KB, patch)
2021-02-16 06:14 MST, Felip Moll
Details | Diff

Description hpc-admin 2018-10-25 07:20:19 MDT
Hi,


The slurm-users list had a thread about systemd-logind needing to be disabled on the compute nodes when pam_slurm_adopt is used, to prevent the adopted processes from being moved back out of the memory cgroup.

I took a look at the pam_slurm_adopt code and I was wondering if you'd be OK with the following changes.

- move all the code from pam_sm_acct_mgmt into a separate function, so this function can also be called from pam_sm_open_session.
- add a new option, say action_adopt, which does _not_ adopt but only checks for existing jobs when set to check_only. If it is not set, the default behaviour is what currently exists in the pam_slurm_adopt code.

This allows the following changes in the sshd pam config, which effectively provide a workaround for the issue described above.

We can now use two lines, e.g.:

account sufficient pam_slurm_adopt.so action_no_jobs=deny,action_adopt_failure=deny,debug,action_adopt=check_only

session sufficient pam_slurm_adopt.so action_no_jobs=ignore,action_adopt_failure=deny,debug

Placing the last line at the end of the session stack adopts the process after systemd-logind has run, and places it in the correct Slurm-controlled cgroups.
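For illustration, a simplified sketch of how the relevant /etc/pam.d/sshd entries could be ordered under this proposal (other modules omitted; the exact stack is of course site-specific):

account  sufficient  pam_slurm_adopt.so action_no_jobs=deny,action_adopt_failure=deny,action_adopt=check_only
# ... remaining account/auth modules ...
session  optional    pam_systemd.so
# ... any other session modules ...
session  sufficient  pam_slurm_adopt.so action_no_jobs=ignore,action_adopt_failure=deny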

The original behaviour is still available when using only the first line without the action_adopt=check_only flag, so nothing needs to change for legacy behaviour.

But using both lines, we now get:

vsc40075@node2801 (banette) ~> cat /proc/self/cgroup
11:hugetlb:/
10:pids:/
9:freezer:/slurm/uid_2540075/job_67119295/step_extern
8:perf_event:/
7:cpuset:/slurm/uid_2540075/job_67119295/step_extern
6:devices:/
5:net_prio,net_cls:/
4:memory:/slurm/uid_2540075/job_67119295/step_extern/task_0
3:blkio:/
2:cpuacct,cpu:/slurm/uid_2540075/job_67119295/step_extern/task_0
1:name=systemd:/user.slice/user-2540075.slice/session-9963.scope

Rather than

vsc40075@node2801 (banette) ~> cat /proc/self/cgroup
11:hugetlb:/
10:pids:/
9:freezer:/slurm/uid_2540075/job_67119295/step_extern
8:perf_event:/
7:cpuset:/slurm/uid_2540075/job_67119295/step_extern
6:devices:/
5:net_prio,net_cls:/
4:memory:/
3:blkio:/
2:cpuacct,cpu:/
1:name=systemd:/user.slice/user-2540075.slice/session-9965.scope

when ssh'ing into a compute node that has a job running.


Current code is written against 17.11.8, but I see no reason why it cannot be easily ported (by us) to master. A patch added here will be against the 19.05 (master) branch in any case.

Before I submit a patch, I'd like to discuss this some more. Current changes are at https://github.com/hpcugent/slurm/pull/28

Granted, it needs some more documentation at this point.

Kind regards,
-- Andy
Comment 2 Marshall Garey 2018-10-25 10:23:56 MDT
For posterity, here is the link to the slurm-users email thread referenced:

https://groups.google.com/forum/#!topic/slurm-users/3vZUoZ_DXJI
Comment 8 Felip Moll 2019-02-15 10:26:14 MST
*** Bug 6019 has been marked as a duplicate of this bug. ***
Comment 11 Felip Moll 2019-02-20 03:49:47 MST
*** Bug 6540 has been marked as a duplicate of this bug. ***
Comment 17 Ryan Novosielski 2019-03-20 15:07:47 MDT
Can someone from SchedMD comment on this situation? It will inform our plans going forward. While this is informative, I don't see anyplace either in this bug or the GitHub where SchedMD has replied.
Comment 18 Felip Moll 2019-03-21 03:26:55 MDT
(In reply to Ryan Novosielski from comment #17)
> Can someone from SchedMD comment on this situation? It will inform our plans
> going forward. While this is informative, I don't see anyplace either in
> this bug or the GitHub where SchedMD has replied.

I have tested and checked your approach and I think that splitting the current module into a session and an account one is the way to go and would fix our issues. I also created some modifications and adapted your code to 19.05 for testing, but in the end we will probably need to do a major refactor of pam_slurm_adopt to avoid iterating over the same steps and functions in both the account and session modules. All this work is pending review by our development team; right now I cannot give you an estimate of when this will be accepted.

I will inform you as soon as we get a conclusion on how we will proceed with this issue.
Comment 20 Ryan Novosielski 2019-03-21 09:33:23 MDT
Thanks, Felip. I'm a watcher on this bug (and creator of the related/duplicate 6540 -- I'm going to reopen that to ask an additional question, as it's not relevant to this bug) because I have similar questions. Thank you for the update -- the actual work was Andy's.
Comment 24 Felip Moll 2019-04-16 08:33:56 MDT
Hi,

This is a quick update about this bug.

The situation is still the same as in comment 18, but since we are now busy with the new 19.05 release, things have been delayed until at least the end of May.

We are not forgetting about this issue.

I will keep you posted.
Comment 26 mike coyne 2019-08-14 10:37:39 MDT
We have been following the progress of this ticket for some time; I reworked your patch for 19.05.0.
Comment 27 mike coyne 2019-08-14 10:41:14 MDT
Created attachment 11231 [details]
pam_slurm_adopt-19.05.01-2 patch to correct pam_systemd coincident usage
Comment 31 hpc-admin 2019-09-10 03:22:46 MDT
Hi Mike,

Thanks for the fix!

We built and deployed 19.05.2 with the above patch on some of our machines. When logging into a node where a job is running, I see

[ageorges@snorlax quattor]$ ssh -A vsc40075@node3300.joltik.os
Warning: Permanently added 'node3300.joltik.os,10.141.8.1' (ECDSA) to the list of known hosts.
Last login: Tue Sep 10 10:35:07 2019
vsc40075@node3300 ~> echo $$
255175
vsc40075@node3300 ~> cat /proc/255175/cgroup
11:blkio:/user.slice
10:memory:/slurm/uid_2540075/job_40000667/step_extern/task_0
9:hugetlb:/
8:freezer:/slurm/uid_2540075/job_40000667/step_extern
7:cpuset:/slurm/uid_2540075/job_40000667/step_extern
6:cpuacct,cpu:/slurm/uid_2540075/job_40000667/step_extern/task_0
5:perf_event:/
4:pids:/user.slice
3:net_prio,net_cls:/
2:devices:/slurm/uid_2540075/job_40000667/step_extern
1:name=systemd:/user.slice/user-2540075.slice/session-4133.scope

So this seems to be doing what is expected.

I think this can be closed. I'll leave it to you to do that and indicate the version the code landed in :)

Kind regards,
-- Andy
Comment 32 Marco Induni 2019-09-10 03:23:15 MDT
I'm out of office.
I will be back on September 16th 2019

Kind regards,
Marco Induni
Comment 33 Felip Moll 2019-09-10 05:16:08 MDT
Hi,

This bug will remain open a bit longer since we're evaluating whether to include this patch, or a modification of it, in an official release.

Thanks
Comment 34 Matt Jay 2019-09-11 09:29:29 MDT
I just wanted to chime in that this issue impacts our site (UW) directly, and we'd very much like to see this (or something to address this issue) as part of an official release.
Comment 41 Felip Moll 2019-09-26 03:53:21 MDT
Hello to everybody following this bug.

I've recently been working on a better solution than the one provided in the patch here. The public patch adding the 'check_only' feature makes the pam_slurm_adopt module duplicate actions in the account and session modules, for example checking the user or the jobs in the system.

The better solution I currently have in place (unpublished) is to completely refactor the plugin and separate the account and session functionality. In account it would check the user and find the job, and in session it would receive the job selected by the account module and add the pid to the corresponding cgroup.

So SchedMD is not going to include this public patch as-is as a supported solution. Having said that, the QA team has also raised some questions here that I want to discuss with you. These are the basic questions:

First of all, systemd's intention is to take complete control of cgroups. If we steal processes from systemd, there is no guarantee that it will not reclaim them at some point. This is supposed to be avoided by setting Delegate=yes in a unit file, which in our situation means setting it in /usr/lib/systemd/system/user-.slice.d/10-defaults.conf.
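For reference, the drop-in I am talking about would look roughly like this (sketch only; systemd documents Delegate= for scope and service units, so whether a slice honours it depends on the version, and as I explain below it did not help in my tests):

# /usr/lib/systemd/system/user-.slice.d/10-defaults.conf (or an override under /etc/systemd/system/)
[Slice]
Delegate=yes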

The way it works now is that pam_systemd.so contacts the logind daemon over D-Bus and asks it to create a session. The logind daemon creates the session and passes the information back to the pam_systemd module, which in turn sets the XDG_SESSION_ID variable. Creating the runtime directories is also part of this exchange with logind. What I have seen so far is that setting Delegate=yes in the user slice does not stop pam_systemd.so from modifying the memory cgroup of a process that is logging in (ssh in our case), which confirms that we are fighting against systemd even if we play with the order of the modules in the PAM configuration.
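As a side note, the session that logind creates for an ssh login can be inspected with loginctl, for example (replace <ID> with the session id that shows up in the name=systemd line of /proc/self/cgroup):

$ loginctl list-sessions              # one entry per logind session on the node
$ loginctl show-session <ID> -p Scope # the session-<ID>.scope unit the login was placed in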

The two bad points here are: setting Delegate=yes as a default for all login sessions, and fighting with systemd over cgroup control.

Secondly, as I already mentioned: the public patch in this bug is not efficient and duplicates operations.

Thirdly, the QA team is wondering:

- Why do you really need pam_systemd.so? This module just creates a session in systemd, which provides a unique SESSION_ID, and creates the user runtime directories; when the session finishes it removes the directories. Couldn't this be done directly by a prolog, or by a slight modification of the current pam_slurm_adopt module? I know there are applications which require XDG_SESSION_ID; which apps are those?

- Isn't it possible to run the mentioned applications through srun instead of ssh'ing into the node? Maybe just exporting the environment from the login node is enough.

- The pam_systemd.so module only takes effect on ssh sessions; Slurm daemons don't have the mentioned variables.

The intent behind these questions is to find out whether pam_systemd.so is needed at all, in an attempt to stop fighting back and forth with systemd over stolen processes.

I can still see the benefits of splitting the module in account and session. For example, I know that when keyboard-interactive authentication is used in ssh, a subprocess of ssh is launched to run the account modules, which causes pam_slurm_adopt to adopt the wrong pid: once the account module actions complete, the parent process is the one that ends up with the user, so adopting in session is what must be done instead.
 
Your feedback will be very valuable in deciding what to do here.
Comment 42 hpc-admin 2019-09-26 05:49:10 MDT
hi felip,

thanks for looking into a proper patch (although the remark "I can still see the benefits of splitting the module in account and session" is a bit odd, you're supposed to split it no, isn't that what pam assumes?)

wrt the why, although i don't think we have a concrete case, i think proper support to live together with systemd is more future proof (who knows what else will be added; or what we otherwise would need to put in pro/epilogue scripts).

the more problematic issues are indeed fighting with systemd (we have had no issues with systemd reclaiming processes though) or the system wide default login settings (the fact that you cannot set slice defaults for groups or users with uid higher than x is imho a bug/missing feature in systemd; we ran into that somewhere else, so i'm claiming that systemd has no issues ;)

i do hope that you still split the pam module, so we for now have the option to keep pam_systemd (if stuff goes haywire, worst case, we can always remove pam_systemd, but i consider it worst case; not the default solution)

stijn
Comment 43 Felip Moll 2019-09-27 06:38:30 MDT
(In reply to hpc-admin from comment #42)
> hi felip,
> 
> thanks for looking into a proper patch (although the remark "I can still see
> the benefits of splitting the module in account and session" is a bit odd,
> you're supposed to split it no, isn't that what pam assumes?)
> 

The current pam_slurm_adopt is not split into account and session.

With that sentence I meant that I see the benefit of separating the actions the pam_slurm_adopt module currently does into 'account' and 'session'. Currently it does everything in 'account'. The benefits I see are a more logical design and the flexibility to reorder actions in the PAM configuration, but these are not strong enough reasons to split it at the moment.


> wrt the why, although i don't think we have a concrete case, i think proper
> support to live together with systemd is more future proof (who knows what
> else will be added; or what we otherwise would need to put in pro/epilogue
> scripts).

Unfortunately we cannot know which path systemd will follow. It seems it will take control of the full cgroup tree with the inclusion of cgroup v2, which would leave us unable to modify the cgroups directly the way the current or the split adopt module does.

> the more problematic issues are indeed fighting with systemd (we have had no
> issues with systemd reclaiming processes though) or the system wide default
> login settings (the fact that you cannot set slice defaults for groups or
> users with uid higher than x is imho a bug/missing feature in systemd; we
> ran into that somewhere else, so i'm claiming that systemd has no issues ;)
> 
> i do hope that you still split the pam module, so we for now have the option
> to keep pam_systemd (if stuff goes haywire, worst case, we can always remove
> pam_systemd, but i consider it worst case; not the default solution)
> 
> stijn

First, we need a strong reason to split it. What is the reason you want to keep the pam_systemd module enabled?

I am also waiting for feedback from the other sites CC'ed on this bug.
Comment 44 Kilian Cavalotti 2019-09-27 08:48:12 MDT
Hi Felip and Stijn,

(In reply to Felip Moll from comment #43)
> First, we need a strong reason to split it. What is the reason you want to
> keep the pam_systemd module enabled?
> 
> I am also waiting for feedback from the other sites CC'ed on this bug.

We've been disabling pam_systemd on compute nodes for at least the last two years, and manually defining XDG_RUNTIME_DIR in users' profiles when necessary.

It's been working great for us; we have never found any downside to this approach in practice.
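For reference, the manual definition is just a few lines in a profile script, something like the following (the path is only an example, not necessarily what we actually use):

# e.g. /etc/profile.d/xdg_runtime.sh -- illustrative sketch only
if [ -z "$XDG_RUNTIME_DIR" ]; then
    export XDG_RUNTIME_DIR="/tmp/xdg-$USER"
    mkdir -p "$XDG_RUNTIME_DIR"
    chmod 700 "$XDG_RUNTIME_DIR"
fi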

Cheers,
-- 
Kilian
Comment 45 aseishas 2019-09-27 10:05:20 MDT
Is the issue with keyboard-interactive authentication not reason enough for the split? Right now, pam_slurm_adopt is broken in environments where two-factor authentication is required, whether pam_systemd is enabled or not.

--
Adam Seishas
Comment 47 Felip Moll 2019-09-30 02:43:08 MDT
(In reply to aseishas from comment #45)
> Is the issue with keyboard-interactive authentication not reason enough for
> the split? Right now, pam_slurm_adopt is broken in environments where
> two-factor authentication is required, whether pam_systemd is enabled or not.
> 
> --
> Adam Seishas

There's some discussion here too.

One Slurm equivalent command to ssh into a node is:

srun --jobid <job> --pty /bin/bash

so we're wondering why ssh is really required. Secondly, while two-factor on the login nodes makes sense, why is two-factor required on the compute nodes? Can you specify the use case?
Comment 48 hpc-admin 2019-09-30 05:01:53 MDT
Hi Felip,


I am not quite following what you meant by this:


> - Isn't it possible to run the mentioned applications through srun instead
> of ssh'ing into the node? Maybe just exporting the environment from the login
> node is enough.

> - The pam_systemd.so module only takes effect on ssh sessions; Slurm daemons
> don't have the mentioned variables.


What is required (at least for us) is that ssh sessions are gathered into the correct cgroups, which means that for memory and cpu they should be in the cgroup of the job, to ensure these sshd processes do not affect other users running jobs on the same node.

Obviously, the applications run in the job are started using Slurm itself and not by ssh'ing into the node. However, quite often a user wants to peek at the node to see what is happening.

So we definitely need pam_systemd.so, but we also need the external task to be gathered in the slurm job cgroup.

I might be missing something here.

Kind regards,
-- Andy
Comment 49 Felip Moll 2019-09-30 05:08:07 MDT
(In reply to hpc-admin from comment #48)
> Hi Felip,
> 
> 
> I am not quite following what you meant by this:
> 
> 
> > - Isn't it possible to run the mentioned applications through srun instead
> > of ssh'ing into the node? Maybe just exporting the environment from the login
> > node is enough.
> 
> > - The pam_systemd.so module only takes effect on ssh sessions; Slurm daemons
> > don't have the mentioned variables.
> 
> 
> What is required (at least for us) is that ssh sessions are gathered into the
> correct cgroups, which means that for memory and cpu they should be in the
> cgroup of the job, to ensure these sshd processes do not affect other users
> running jobs on the same node.
> 
> Obviously, the applications run in the job are started using Slurm itself
> and not by ssh'ing into the node. However, quite often a user wants to peek
> at the node to see what is happening.

I meant that by:

srun --jobid <job> --pty /bin/bash

You get access to the node and do not need ssh. 

> So we definitely need pam_systemd.so

Why? Even if you decide to use ssh, pam_systemd.so just opens a session in logind, sets the XDG_* variables, and creates/destroys a temporary directory for the session; it is an optional component that is not necessary at all for ssh'ing.
Comment 50 aseishas 2019-09-30 18:56:48 MDT
> so we're wondering why ssh is really required. Secondly, while two-factor on the login nodes makes sense, why is two-factor required on the compute nodes? Can you specify the use case?

Two-factor is required as a general policy by our Information Security Office, in particular for interactive logins. Another common ssh use-case on our systems is port forwarding, for applications that may be listening on ports that would not otherwise be exposed to the network. I don’t think srun can forward an arbitrary port?
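For example, something along these lines is a typical pattern here (node name and port are made up):

# forward local port 8888 to a server listening on the compute node
ssh -L 8888:localhost:8888 node1234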

--
Adam Seishas
Comment 52 Felip Moll 2019-10-01 02:44:47 MDT
(In reply to aseishas from comment #50)
> > so we're wondering why ssh is really required. Secondly, while two-factor on the login nodes makes sense, why is two-factor required on the compute nodes? Can you specify the use case?
> 
> Two-factor is required as a general policy by our Information Security
> Office, in particular for interactive logins. Another common ssh use-case on
> our systems is port forwarding, for applications that may be listening on
> ports that would not otherwise be exposed to the network. I don’t think srun
> can forward an arbitrary port?
> 
> --
> Adam Seishas

I can understand that enabling two-factor on the interactive login nodes is required and in fact a best practice. Nevertheless, enabling two-factor on compute nodes does not provide any improved security, since users are already able to run arbitrary commands through srun/sbatch on the nodes associated with their jobs, so the two-factor layer on *compute nodes* only provides a false sense of security.

I can also understand how port forwarding may be the reason you need to keep ssh enabled on compute nodes. Note that I am not saying you should disable ssh; I am questioning the utility of two-factor authentication on compute nodes, which is one of the pieces that makes pam_slurm_adopt "incompatible" right now.

Just to be clear: on login nodes I can perfectly see why you would want to enable two-factor, but remember that pam_slurm_adopt *shouldn't* be enabled on those nodes.
Comment 53 Felip Moll 2019-10-08 07:33:02 MDT
To all, please read:

https://github.com/systemd/systemd/issues/13535

Official systemd response: "Sorry but this is not and will not be supported."

I'm not entirely sure Lennart read my words carefully, but assuming he did, splitting the module while keeping pam_systemd.so enabled is not safe; stealing a process from a subtree that doesn't have Delegate=yes is not a valid solution.

(Though direct, it's an expected and not new response.)
Comment 55 peter.georg 2019-11-07 05:09:14 MST
(In reply to Felip Moll from comment #49)
> One Slurm equivalent command to ssh into a node is:
> 
> srun --jobid <job> --pty /bin/bash

This might solve the problem for us. However, I do have one question about this alternative to ssh (it is not equivalent; some subtle differences exist): is there any way to make it work with a multi-cluster configuration without having to (partially) copy the slurm.conf from the slurmctld host to the (shared) login nodes?

I.e. is there any way to make the following command work (currently fails on 18.08.5):
srun --clusters <cluster> --jobid <job> --pty /bin/bash
As the manpage correctly states, --clusters is simply ignored in this case.

My current workaround is to copy all cluster configs to the login nodes and run:
SLURM_CONF=/etc/slurm/slurm.<cluster> srun --jobid <job> --pty /bin/bash

I understand why the compute nodes must be accessible to and from the submission host (i.e. login nodes), but I do not understand why it is required to copy the configuration files.

This question is probably a little bit off-topic, but it would solve this issue for us.

Thanks!
Comment 56 Felip Moll 2019-11-07 05:42:59 MST
> I.e. is there any way to make the following command work (currently fails on
> 18.08.5):
> srun --clusters <cluster> --jobid <job> --pty /bin/bash
> As the manpage correctly states, --clusters is simply ignored in this case.

I cannot see where the manpage says it is ignored.

Theoretically if you have a single slurmdbd, srun will talk to the database (slurmdb_get_info_cluster()) to get the address and port of the requested cluster and will send the command there directly.

The slurmdbd keeps records of each cluster's address and port when the cluster registers with the slurmdbd.
e.g: sacctmgr show clusters.

This seems to be a different issue from the one discussed here.

> My current workaround is to copy all cluster configs to the login nodes and
> run:
> SLURM_CONF=/etc/slurm/slurm.<cluster> srun --jobid <job> --pty /bin/bash

That workaround is correct if multiple slurmdbds are used, though as you have seen it is tricky to manage.

> I understand why the compute nodes must be accessible to and from the
> submission host (i.e. login nodes), but I do not understand why it is
> required to copy the configuration files.
> 
> This question is probably a little bit off-topic, but it would solve this
> issue for us.

This question should go into a different bug; please open a separate one.

Thanks
Comment 57 peter.georg 2019-11-08 04:14:17 MST
> This question should go into a different bug; please open a separate one.

Sorry for posting here again: Just wanted to let you, and probably other affected users reading this, know that after reading your response and searching for more specifics I found my bug to be a duplicate of https://bugs.schedmd.com/show_bug.cgi?id=7863

Thanks!
Comment 58 Felip Moll 2019-11-12 19:44:39 MST
(In reply to Peter.Georg from comment #57)
> > This question should go into a different bug; please open a separate one.
> 
> Sorry for posting here again: Just wanted to let you, and probably other
> affected users reading this, know that after reading your response and
> searching for more specifics I found my bug to be a duplicate of
> https://bugs.schedmd.com/show_bug.cgi?id=7863
> 
> Thanks!

Ok, so you were running the command from within an allocation. Glad you found it.
Comment 59 Felip Moll 2020-01-10 07:05:17 MST
For the moment I haven't received any strong argument for having to split the module, though I understand it would be a cleaner design.

If there are no more opinions on this I will mark this bug as an enhancement to do a cleanup and restructuring of pam_slurm_adopt.

I currently have a proposal that must be reviewed by our QA team.

I'll report any progress in this bug.
Comment 63 Felip Moll 2021-02-16 06:14:05 MST
Created attachment 17948 [details]
bug5920_customer_workaround_20.11.patch - not supported -
Comment 64 Mikael Öhman 2021-03-23 03:53:45 MDT
*** Bug 11154 has been marked as a duplicate of this bug. ***
Comment 65 aseishas 2023-01-25 03:47:03 MST
I am out of the office until Monday, January 30, and will respond to your
message when I return. If you need immediate assistance please contact:
srcc-support@stanford.edu.

--
Adam Seishas
Research Computing Specialist
Stanford | University IT
aseishas@stanford.edu | 650.725.7490