Ticket 3912 - When using pam_slurm_adopt in systemd, ssh is not containerized in any job container
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Contributions
Version: 17.02.3
Hardware: Linux Linux
: --- 4 - Minor Issue
Assignee: Director of Support
Reported: 2017-06-21 02:03 MDT by Felip Moll
Modified: 2018-10-26 02:25 MDT
Site: BSC-MN4
Attachments
systemd cgroup tree (3.59 KB, text/plain)
2017-06-21 02:34 MDT, Felip Moll
Description Felip Moll 2017-06-21 02:03:13 MDT
We switched from pam_slurm to pam_slurm_adopt because we will have interactive nodes. We are using SLES 12 SP2, which ships with changes in systemd. There seems to be a problem with the plugin and cgroups: when a user connects to a machine via ssh, the ssh session is not placed in the job's cgroup container, so there is no resource containment for that session.

s01r1b01:/etc/pam.d # cat sshd
#%PAM-1.0
auth        requisite   pam_nologin.so
auth        include     common-auth
account     requisite   pam_nologin.so
account     include     common-account
password    include     common-password
session     required    pam_loginuid.so
session     include     common-session
session  optional       pam_lastlog.so   silent noupdate showfailed
account    sufficient   pam_access.so
account    required     pam_slurm.so

- pam_slurm_adopt options: defaults.
- PrologFlags=Alloc,Contain
Comment 1 Felip Moll 2017-06-21 02:34:23 MDT
Created attachment 4799 [details]
systemd cgroup tree
Comment 2 Felip Moll 2017-06-21 02:43:47 MDT
This is the relevant log fragment with a single job running on node s01r1b02 while the same user connects through ssh.

As you can see, access is denied by pam_access but then granted by pam_slurm_adopt.

2017-06-21T10:28:40.724896+02:00 s01r1b02 sshd[153840]: pam_access(sshd:account): access denied for user `bsc99968' from `10.2.8.230'
2017-06-21T10:28:40.738637+02:00 s01r1b02 pam_slurm_adopt[153840]: Connection by user bsc99968: user has only one job 3510
2017-06-21T10:28:40.752534+02:00 s01r1b02 pam_slurm_adopt[153840]: Process 153840 adopted into job 3510
2017-06-21T10:28:40.752826+02:00 s01r1b02 sshd[153840]: Accepted publickey for bsc99968 from 10.2.8.230 port 44690 ssh2: DSA SHA256:km+Gtd3ncSq+4UO6Y9ifepPBKcDmqw66aISFC0nK6Kg
2017-06-21T10:28:40.754282+02:00 s01r1b02 sshd[153840]: pam_unix(sshd:session): session opened for user bsc99968 by (uid=0)
2017-06-21T10:28:40.763083+02:00 s01r1b02 systemd[1]: Created slice User Slice of bsc99968.
2017-06-21T10:28:40.765170+02:00 s01r1b02 systemd[1]: Starting User Manager for UID 1109...
2017-06-21T10:28:40.767303+02:00 s01r1b02 systemd-logind[1972]: New session 813 of user bsc99968.
2017-06-21T10:28:40.768470+02:00 s01r1b02 systemd[1]: Started Session 813 of user bsc99968.
2017-06-21T10:28:40.780091+02:00 s01r1b02 systemd: pam_unix(systemd-user:session): session opened for user bsc99968 by (uid=0)
2017-06-21T10:28:40.810746+02:00 s01r1b02 systemd[153845]: Reached target Timers.
2017-06-21T10:28:40.811001+02:00 s01r1b02 systemd[153845]: Reached target Sockets.
2017-06-21T10:28:40.811191+02:00 s01r1b02 systemd[153845]: Reached target Paths.
2017-06-21T10:28:40.811396+02:00 s01r1b02 systemd[153845]: Reached target Basic System.
2017-06-21T10:28:40.811600+02:00 s01r1b02 systemd[153845]: Reached target Default.
2017-06-21T10:28:40.811866+02:00 s01r1b02 systemd[153845]: Startup finished in 22ms.
2017-06-21T10:28:40.812062+02:00 s01r1b02 systemd[1]: Started User Manager for UID 1109.
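
When going through many logins, the adoption messages can be pulled out of the syslog mechanically. The following is a minimal sketch, assuming the message format is exactly the one shown in the excerpt above ("Process <pid> adopted into job <jobid>"); the function name `adopted_processes` is hypothetical and not part of Slurm.

```python
import re

# Pattern taken from the log excerpt in this ticket; other Slurm versions
# may word the message differently (assumption).
ADOPT_RE = re.compile(
    r"pam_slurm_adopt\[(\d+)\]: Process (\d+) adopted into job (\d+)"
)


def adopted_processes(log_text):
    """Return (pid, job_id) pairs for every adoption message in log_text."""
    return [
        (int(match.group(2)), int(match.group(3)))
        for match in ADOPT_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2017-06-21T10:28:40.752534+02:00 s01r1b02 pam_slurm_adopt[153840]: "
        "Process 153840 adopted into job 3510"
    )
    print(adopted_processes(sample))
```

The resulting PIDs can then be compared against the cgroup tree to see where each adopted process actually ended up.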

After this login, the cgroup tree (attached as a file) looks as follows. The relevant lines are:

Control group /:
-.slice
├─system.slice
...
│ ├─slurmd.service
│ │ ├─  2570 /usr/sbin/slurmd -M
│ │ ├─153820 slurmstepd: [3510.4294967295
│ │ ├─153824 sleep 1000000
│ │ ├─153827 slurmstepd: [3510.0]
│ │ └─153833 /usr/bin/sleep 3600
....
└─user.slice
  ....
  └─user-1109.slice
    ├─user@1109.service
    │ └─init.scope
    │   ├─153845 /usr/lib/systemd/systemd --user
    │   └─153851 (sd-pam)  
    └─session-813.scope
      ├─153840 sshd: bsc99968 [priv
      ├─153854 sshd: bsc99968@pts/2
      └─153855 -bash

At this point we have pam.d configured this way:

pam.d/sshd:

#%PAM-1.0
auth        requisite   pam_nologin.so
auth        include     common-auth
account     requisite   pam_nologin.so
account     include     common-account
password    include     common-password

account	    sufficient 	pam_access.so
account	    required	pam_slurm_adopt.so

session     required    pam_loginuid.so
session     include     common-session
session     optional	pam_lastlog.so   silent noupdate showfailed


pam.d/common-session:
#%PAM-1.0
#
# This file is autogenerated by pam-config. All changes
# will be overwritten.
#
# Session-related modules common to all services
#
# This file is included from other service-specific PAM config files,
# and should contain a list of modules that define tasks to be performed
# at the start and end of sessions of *any* kind (both interactive and
# non-interactive
#
session	required	pam_limits.so	
session	required	pam_unix.so	try_first_pass 
session	optional	pam_umask.so	
session	optional	pam_systemd.so
session	optional	pam_env.so	


We also tried commenting out pam_systemd.so in common-session, but the ssh session still was not placed in the job container:

2017-06-21T10:39:37.694830+02:00 s01r1b02 sshd[154135]: pam_access(sshd:account): access denied for user `bsc99968' from `10.2.8.230'
2017-06-21T10:39:37.709097+02:00 s01r1b02 pam_slurm_adopt[154135]: Connection by user bsc99968: user has only one job 3510
2017-06-21T10:39:37.724393+02:00 s01r1b02 pam_slurm_adopt[154135]: Process 154135 adopted into job 3510
2017-06-21T10:39:37.724622+02:00 s01r1b02 sshd[154135]: Accepted publickey for bsc99968 from 10.2.8.230 port 44962 ssh2: DSA SHA256:km+Gtd3ncSq+4UO6Y9ifepPBKcDmqw66aISFC0nK6Kg
2017-06-21T10:39:37.726256+02:00 s01r1b02 sshd[154135]: pam_unix(sshd:session): session opened for user bsc99968 by (uid=0)

Control group /:
-.slice
├─init.scope
│ └─1 /sbin/init
├─system.slice
...
│ ├─slurmd.service
│ │ ├─  2570 /usr/sbin/slurmd -M
│ │ ├─153820 slurmstepd: [3510.4294967295]
│ │ ├─153824 sleep 1000000
│ │ ├─153827 slurmstepd: [3510.0]
│ │ └─153833 /usr/bin/sleep 3600
...
│ ├─sshd.service
│ │ ├─  2762 /usr/sbin/sshd -D
│ │ ├─154135 sshd: bsc99968 [priv]
│ │ ├─154140 sshd: bsc99968@pts/1
│ │ └─154141 -bash
...
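
A complementary check to reading the tree by eye is to inspect /proc/<pid>/cgroup for the sshd or shell PID, which lists the cgroup path in every hierarchy. The sketch below is illustrative only: it assumes the cgroup v1 path layout /slurm/uid_<uid>/job_<jobid>/... for Slurm job cgroups, and the helper names are hypothetical.

```python
import re

# Assumed layout of Slurm job cgroups on cgroup v1:
#   /slurm/uid_<uid>/job_<jobid>/step_<step>
JOB_CGROUP_RE = re.compile(r"/slurm/uid_\d+/job_(\d+)(?:/|$)")


def job_ids_from_cgroup_text(text):
    """Return the set of Slurm job IDs found in /proc/<pid>/cgroup content."""
    jobs = set()
    for line in text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = JOB_CGROUP_RE.search(parts[2])
        if match:
            jobs.add(int(match.group(1)))
    return jobs


def job_ids_for_pid(pid):
    """Read /proc/<pid>/cgroup and report which job cgroups the PID is in."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return job_ids_from_cgroup_text(handle.read())
```

For example, `job_ids_for_pid(154141)` run on the node for the bash process above would return an empty set if the shell is in no job cgroup, and a set containing 3510 if the adoption took effect in at least one hierarchy.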
Comment 3 Alejandro Sanchez 2017-06-21 03:21:31 MDT
We've disabled pam_systemd from common-session which was in conflict with the pam_slurm_adopt module and now it works. Marking as resolved/infogiven.
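
To confirm that pam_systemd is really inactive in a given PAM file, the uncommented entries can be listed. This is a rough sketch, not a full PAM parser: it assumes the usual "type control module [args]" line layout, does not follow include directives, and the function name `active_pam_modules` is hypothetical.

```python
def active_pam_modules(config_text, module_type="session"):
    """List module names on uncommented lines of the given PAM type."""
    modules = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue
        # A leading "-" on the type only silences missing-module errors.
        if fields[0].lstrip("-") != module_type:
            continue
        idx = 1
        # Bracketed controls such as "[success=1 default=ignore]" may
        # contain spaces, so skip ahead to the field that closes them.
        if fields[1].startswith("["):
            while idx < len(fields) and not fields[idx].endswith("]"):
                idx += 1
        idx += 1
        if idx < len(fields):
            modules.append(fields[idx])
    return modules


if __name__ == "__main__":
    with open("/etc/pam.d/common-session") as handle:
        names = active_pam_modules(handle.read())
    print("pam_systemd.so active:", "pam_systemd.so" in names)
```

On SLES the file is generated by pam-config, so a manual edit may be overwritten; rerunning the check after package or configuration updates is a cheap safeguard.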