Bug 4098

Summary: setting up slurm pam module on RHEL7
Product: Slurm
Reporter: Hadrian <hxd58>
Component: Configuration
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: Ole.H.Nielsen, ruth.a.braun
Version: 17.02.2
Hardware: Linux
OS: Linux
Site: Case

Description Hadrian 2017-08-22 10:07:10 MDT
Hi,

Do you have a quick "how-to" instruction to set up the pam_slurm module to protect the compute nodes from random ssh access on RHEL7?

So the questions:

1. We have installed the pam_slurm rpm and are wondering where to put the additional PAM lines (/etc/pam.d/sshd, /etc/pam.d/system-auth, or password-auth). We want the hpcadmin and root groups to be able to access the compute nodes as well, in addition to users that have running jobs.

2. Is "UsePAM=1" the only setting needed in slurm.conf to enable pam_slurm?

3. What is the difference between pam_slurm.so and pam_slurm_adopt.so? Do we need to have both?

Thanks,
Hadrian
Comment 1 Tim Wickberg 2017-08-22 23:00:22 MDT
(In reply to Hadrian from comment #0)
> Hi,
> 
> Do you have a quick "how-to" instruction to set up the pam_slurm module to
> protect the compute nodes from random ssh access on RHEL7?

The best documentation for it is the README that's provided. (See https://github.com/SchedMD/slurm/tree/master/contribs/pam_slurm_adopt .)

Unfortunately we haven't produced an internal version - we do have a bug open tracking that, but haven't gotten it done just yet.

> So the questions:
> 
> 1. We have installed the pam_slurm rpm, and just wondering what/where to put
> the additional pam lines (/etc/pam.d/sshd or /etc/pam.d/system-auth or
> password-auth). We want to have hpcadmin and root group also able to access
> the compute nodes besides the users that have running jobs.

sshd is usually the easiest place to adjust.

The one other caveat is that you will need to comment out any 'pam_systemd' module lines in the config.
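As a minimal sketch of that step, the helper below comments out every active pam_systemd line in a PAM file and keeps a backup copy. The function name disable_pam_systemd is hypothetical, and which files actually carry the line is an assumption to verify on your own nodes (for example with grep -r pam_systemd /etc/pam.d).

```shell
#!/bin/sh
# Comment out active pam_systemd lines in the given PAM file.
# Lines that are already commented are left alone; the original file
# is preserved next to it with a .bak suffix.
# Usage (path is an example): disable_pam_systemd /etc/pam.d/password-auth
disable_pam_systemd() {
    sed -i.bak '/^[[:space:]]*#/!s/^.*pam_systemd\.so/#&/' "$1"
}
```

Running it on a copy of the file first makes it easy to diff the result before touching the live configuration.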

Root is always permitted by the module, but hpcadmin would need a special exemption. The documentation file discusses how to whitelist a group by altering /etc/security/access.conf in conjunction with the pam_access module.

Please let me know if the documentation isn't sufficient, and I'll get a RHEL7 system to test on and get an explicit set of directions for you.
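One possible shape for the group exemption described above, as a sketch rather than a tested RHEL7 recipe: pam_access runs first as "sufficient", so members of hpcadmin are admitted immediately, and everyone else falls through to pam_slurm_adopt. The ordering, the scratch-directory approach, and the file names sshd.account and access.conf under $stage are assumptions for illustration; where the two account lines belong relative to the existing lines in /etc/pam.d/sshd needs checking on a test node.

```shell
#!/bin/sh
# Sketch: write candidate fragments to a scratch directory for review.
# Nothing under /etc is touched; copy the lines into place by hand.
stage=$(mktemp -d)

# Candidate lines for the account section of /etc/pam.d/sshd.
# "sufficient" means a pam_access success ends the account checks,
# while a pam_access failure is ignored and pam_slurm_adopt decides.
cat > "$stage/sshd.account" <<'EOF'
account    sufficient    pam_access.so
account    required      pam_slurm_adopt.so
EOF

# Candidate /etc/security/access.conf: admit the hpcadmin group
# (parentheses mark a group name) and reject everyone else at this step.
# Root is not listed because pam_slurm_adopt itself permits root.
cat > "$stage/access.conf" <<'EOF'
+:(hpcadmin):ALL
-:ALL:ALL
EOF

echo "Review the fragments in $stage"
```

Keeping a root shell open on the node while testing avoids being locked out by a mistake in the PAM stack.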

> 2. Is "UsePAM=1" the only thing required in the slurm.conf needed to enable
> pam slurm?

You can ignore that setting - it won't have any impact on either of these modules.

It's an alternative way of enforcing resource limits, and we obviously need to clarify the documentation here or move to remove that older deprecated functionality.

> 3. What is the difference between pam_slurm.so and pam_slurm_adopt.so? Do we
> need to have both?


pam_slurm only prevents a user from logging in if they have no jobs on the node. It does not "attach" processes they launch from that connection to the job itself, set any resource limits, or ensure those processes are cleaned up.

pam_slurm_adopt works alongside the Slurm cgroup support to ensure any processes launched are contained correctly, and cleaned up on job termination. I highly recommend using it, especially if you permit multiple jobs per node.
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2017-09-06 06:06:29 MDT
I'm trying to set up pam_slurm_adopt as well, and I appreciate Tim's explanation.  I'd like to pose additional questions:

I'm confused about "UsePAM=1" in slurm.conf: Tim writes that it isn't required any more, but what does it do then?  I guess I don't fully understand the definition in the slurm.conf man-page.

Question 1: Should "UsePAM=1" be added or not in slurm.conf when configuring pam_slurm_adopt?

In https://hpcworks.wordpress.com/2017/05/29/setup-slurm-pam-plugin-on-centos-7/ it is stated (near the bottom) that one must also create a file /etc/pam.d/slurm, otherwise PAM error messages are encountered.

Question 2: Is the /etc/pam.d/slurm file still required with the latest Slurm 17.02?  Or is it solely needed together with "UsePAM=1"?

The PrologFlags=contain must be set in slurm.conf before enabling the pam_slurm_adopt module (according to the pam_slurm_adopt README).

Question 3: Is there a safe procedure to enable PrologFlags=contain and then pam_slurm_adopt on a production cluster?  Or must the entire cluster be taken down first?

Thanks,
Ole
Comment 3 Tim Wickberg 2017-09-06 11:48:40 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #2)
> I'm trying to set up pam_slurm_adopt as well, and I appreciate Tim's
> explanation.  I'd like to pose additional questions:
> 
> I'm confused about "UsePAM=1" in slurm.conf: Tim writes that it isn't
> required any more, but what does it do then?  I guess I don't fully
> understand the definition in the slurm.conf man-page.

It uses PAM to set up a user's environment on the compute nodes, rather than just restoring a version of their profile that was captured on the login nodes.

The behavior is unrelated to pam_slurm_adopt, and that option is not needed.

> Question 1: Should "UsePAM=1" be added or not in slurm.conf when configuring
> pam_slurm_adopt?
> 
> In
> https://hpcworks.wordpress.com/2017/05/29/setup-slurm-pam-plugin-on-centos-7/
> it is stated (near the bottom) that one must also create a file
> /etc/pam.d/slurm, otherwise PAM error messages are encountered.

No, you do not need to set that.

I have no idea who wrote that guide, but they're mistaken.

> Question 2: Is the /etc/pam.d/slurm file still required with the latest
> Slurm 17.02?  Or is it solely needed together with "UsePAM=1"?

It's only needed with UsePAM=1. I do not recommend using that.

> The PrologFlags=contain must be set in slurm.conf before enabling the
> pam_slurm_adopt module (according to the pam_slurm_adopt README).
> 
> Question 3: Is there a safe procedure to enable PrologFlags=contain and then
> pam_slurm_adopt on a production cluster?  Or must the entire cluster be
> taken down first?

PrologFlags=contain is safe to enable; you will need to run 'scontrol reconfigure' to make the change take effect.

Only after that's been done should you roll out changes to the PAM configuration.
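A small pre-flight check along those lines, as a sketch: the function below succeeds only if a slurm.conf-style file has an active PrologFlags line that includes "contain". The name has_contain_flag and the example path are assumptions. It inspects the file only, so it does not prove the running daemons have picked the change up; the output of 'scontrol show config' reflects the running configuration.

```shell
#!/bin/sh
# Succeed if the given slurm.conf-style file sets PrologFlags to a value
# that includes "contain". Matching is case-insensitive, and commented
# lines (starting with #) are ignored.
# Usage (path is an example): has_contain_flag /etc/slurm/slurm.conf
has_contain_flag() {
    grep -Eiq '^[[:space:]]*PrologFlags[[:space:]]*=.*contain' "$1"
}
```

A rollout script could call this first and refuse to touch the PAM files when it fails.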

One caveat here - if you have jobs that are already running and may still need to make new SSH connections to complete, those jobs would fail if you enable this at a later point in time. It'd be safest to drain the nodes, then make the PAM config changes, then mark the nodes as back in service.
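The drain, change, resume sequence can be sketched per node as follows. The function only prints the steps so they can be reviewed before anything is run; rollout_steps and the node name node001 are hypothetical, and the PAM change itself is left as a placeholder because it depends on how the site distributes configuration.

```shell
#!/bin/sh
# Print (do not run) the per-node steps for enabling pam_slurm_adopt.
# Lines starting with "#" in the output are manual steps.
rollout_steps() {
    node=$1
    printf '%s\n' \
        "scontrol update NodeName=$node State=DRAIN Reason=\"pam_slurm_adopt rollout\"" \
        "# wait until 'sinfo -n $node' shows no running jobs on the node" \
        "# install the PAM configuration changes on $node" \
        "scontrol update NodeName=$node State=RESUME"
}

rollout_steps node001
```

Looping over a node list with this function gives a rolling rollout, one drained node (or batch of nodes) at a time.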
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2017-09-06 12:57:32 MDT
Thanks a lot Tim:

(In reply to Tim Wickberg from comment #3)
> > I'm confused about "UsePAM=1" in slurm.conf: Tim writes that it isn't
> > required any more, but what does it do then?  I guess I don't fully
> > understand the definition in the slurm.conf man-page.
> 
> It uses PAM to set up a user's environment on the compute nodes, rather than
> just restoring a version of their profile that was captured on the login
> nodes.
> 
> The behavior is unrelated to pam_slurm_adopt, and that option is not needed.

Thanks so much for this clarification! Many people write that "UsePAM=1" is required in slurm.conf for pam_slurm_adopt, and that's just plain wrong!

> > In
> > https://hpcworks.wordpress.com/2017/05/29/setup-slurm-pam-plugin-on-centos-7/
> > it is stated (near the bottom) that one must also create a file
> > /etc/pam.d/slurm, otherwise PAM error messages are encountered.
> 
> No, you do not need to set that.
> 
> I have no idea who wrote that guide, they're mistaken.

Thanks!  I guess this advice derives from the slurm.conf man-page in the section explaining UsePAM (which we shouldn't use). 

Perhaps the man-page should contain a warning against UsePAM, especially together with pam_slurm_adopt.

> > Question 2: Is the /etc/pam.d/slurm file still required with the latest
> > Slurm 17.02?  Or is it solely needed together with "UsePAM=1"?
> 
> It's only needed with UsePAM=1. I do not recommend using that.
> 
> > The PrologFlags=contain must be set in slurm.conf before enabling the
> > pam_slurm_adopt module (according to the pam_slurm_adopt README).
> > 
> > Question 3: Is there a safe procedure to enable PrologFlags=contain and then
> > pam_slurm_adopt on a production cluster?  Or must the entire cluster be
> > taken down first?
> 
> PrologFlags=contain is safe to enable; you will need to run 'scontrol
> reconfigure' to make the change take effect.
> 
> Only after that's been done should you roll out changes to the PAM
> configuration.

Thanks a lot!  It's great to know that PrologFlags=contain should be done *before* changing the PAM setup, and that it's non-disruptive.

> One caveat here - if you have jobs that are already running and may still
> need to make new SSH connections to complete, those jobs would fail if you
> enable this at a later point in time. It'd be safest to drain the nodes,
> then make the PAM config changes, then mark the nodes as back in service.

OK, it makes sense to modify the PAM setup only on drained nodes.  This might be done in a rolling fashion (somehow).
Comment 5 Tim Wickberg 2017-09-26 15:30:24 MDT
Marking closed as resolved/infogiven.

FYI - bug 3567 tracks our progress on adding documentation for pam_slurm_adopt.