Ticket 9355 - pam_slurm_adopt with ConstrainRAMSpace=no
Summary: pam_slurm_adopt with ConstrainRAMSpace=no
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 19.05.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim McMullan
 
Reported: 2020-07-07 14:51 MDT by Juergen Salk
Modified: 2020-10-28 09:09 MDT

Site: Ulm University
Version Fixed: 20.02.6, 20.11.0pre1


Attachments
cgroup.conf (154 bytes, text/x-matlab)
2020-07-08 06:51 MDT, Juergen Salk
slurm.conf (3.20 KB, text/plain)
2020-07-08 06:52 MDT, Juergen Salk
bug9355 patch (2.12 KB, patch)
2020-10-09 09:45 MDT, Tim McMullan

Description Juergen Salk 2020-07-07 14:51:11 MDT
Hi,

I have just noticed an unexpected behaviour with pam_slurm_adopt.

With ConstrainRAMSpace=no set in cgroup.conf, pam_slurm_adopt works fine when exactly one of my jobs is running on a node, but refuses the SSH login as soon as more than one job is running there.
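
For reference, this is the relevant part of our cgroup.conf (a minimal sketch; the full file is attached to this ticket, and the lines other than ConstrainRAMSpace are assumptions):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no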

This is a transcript from the terminal:

[username@login01 jobs]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[username@login01 jobs]$ ssh n0111
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.1.11 port 22

[username@login01 jobs]$ sbatch  --wrap "sleep 300"
Submitted batch job 194344
[username@login01 jobs]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            194344  standard     wrap username  R       0:01      1 n0111
[username@login01 jobs]$ ssh n0111
Last login: Tue Jul  7 22:03:02 2020 from login01
[username@n0111 ~]$ exit
logout
Connection to n0111 closed.

[username@login01 jobs]$ sbatch  --wrap "sleep 300"
Submitted batch job 194345
[username@login01 jobs]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            194345  standard     wrap username  R       0:03      1 n0111
            194344  standard     wrap username  R       0:18      1 n0111
[username@login01 jobs]$ ssh n0111
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.1.11 port 22
[username@login01 jobs]$

This is from /var/log/secure on node n0111:

Jul  7 22:19:14 n0111 sshd[2590239]: userauth_hostbased: key type ECDSA not in HostbasedAcceptedKeyTypes [preauth]
Jul  7 22:19:14 n0111 sshd[2590239]: userauth_hostbased: key type ED25519 not in HostbasedAcceptedKeyTypes [preauth]
Jul  7 22:19:14 n0111 pam_slurm_adopt[2590239]: Connection by user username: user has only one job 194344
Jul  7 22:19:14 n0111 pam_slurm_adopt[2590239]: Process 2590239 adopted into job 194344
Jul  7 22:19:14 n0111 sshd[2590239]: Accepted hostbased for username from 10.0.101.1 port 58752 ssh2: RSA SHA256:vlsc1v39WBmrGHFz6YuBiWOkUk0qh3r7nEGtVxuJhwM
Jul  7 22:19:14 n0111 sshd[2590239]: pam_unix(sshd:session): session opened for user username by (uid=0)
Jul  7 22:19:17 n0111 sshd[2590244]: Received disconnect from 10.0.101.1 port 58752:11: disconnected by user
Jul  7 22:19:17 n0111 sshd[2590244]: Disconnected from user username 10.0.101.1 port 58752
Jul  7 22:19:17 n0111 sshd[2590239]: pam_unix(sshd:session): session closed for user username
Jul  7 22:19:27 n0111 sshd[2590332]: userauth_hostbased: key type ECDSA not in HostbasedAcceptedKeyTypes [preauth]
Jul  7 22:19:27 n0111 sshd[2590332]: userauth_hostbased: key type ED25519 not in HostbasedAcceptedKeyTypes [preauth]
Jul  7 22:19:27 n0111 pam_slurm_adopt[2590332]: From 10.0.101.1 port 58762 as username: unable to determine source job
Jul  7 22:19:27 n0111 pam_slurm_adopt[2590332]: Couldn't stat path '/sys/fs/cgroup/memory/slurm/uid_900020/job_194345': No such file or directory
Jul  7 22:19:27 n0111 pam_slurm_adopt[2590332]: Couldn't stat path '/sys/fs/cgroup/memory/slurm/uid_900020/job_194344': No such file or directory
Jul  7 22:19:27 n0111 pam_slurm_adopt[2590332]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Jul  7 22:19:27 n0111 sshd[2590332]: pam_access(sshd:account): access denied for user `username' from `login01'
Jul  7 22:19:27 n0111 sshd[2590332]: fatal: Access denied for user username by PAM account configuration [preauth]
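
The paths pam_slurm_adopt tries to stat can also be checked by hand. Assuming the standard Slurm cgroup layout under /sys/fs/cgroup (and proctrack/cgroup, which pam_slurm_adopt relies on), the per-job directories exist under the freezer controller but, consistent with the "No such file or directory" errors above, not under the memory controller:

$ ls /sys/fs/cgroup/freezer/slurm/uid_900020/
$ ls /sys/fs/cgroup/memory/slurm/uid_900020/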

Is this expected behaviour?

With ConstrainRAMSpace=yes, pam_slurm_adopt works as expected for me, i.e. SSH connections always succeed with one or more jobs running on the node.
Can we get the same behaviour with ConstrainRAMSpace=no as well?

In case it matters, this is /etc/pam.d/sshd on the compute node:

#%PAM-1.0
auth       substack     password-auth
auth       include      postlogin
account    required     pam_sepermit.so
account    required     pam_nologin.so
account    include      password-auth
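# The leading "-" tells PAM to skip this line silently if the module is missing;
# action_adopt_failure=deny refuses connections whose processes cannot be adopted into a job.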
-account   sufficient   pam_slurm_adopt.so action_adopt_failure=deny
account    required     pam_access.so
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    optional     pam_motd.so
session    include      password-auth
session    include      postlogin


Thank you in advance.

Best regards
Jürgen
Comment 1 Tim McMullan 2020-07-08 06:37:38 MDT
Hi Jürgen,

That doesn't sound like expected behavior, but when I tried to replicate it in my environment, it seemed to work just fine with ConstrainRAMSpace=no set.  Would you be able to attach your slurm.conf and cgroup.conf files so I can better reproduce your setup?  What OS are you running?

Thanks!
--Tim
Comment 2 Juergen Salk 2020-07-08 06:51:30 MDT
Created attachment 14947 [details]
cgroup.conf
Comment 3 Juergen Salk 2020-07-08 06:52:19 MDT
Created attachment 14948 [details]
slurm.conf
Comment 4 Juergen Salk 2020-07-08 06:54:24 MDT
Hi Tim,

I have attached our slurm.conf and cgroup.conf files. This is running on CentOS 8.2.

Best regards
Jürgen
Comment 6 Tim McMullan 2020-07-08 08:52:30 MDT
(In reply to Juergen Salk from comment #4)
> Hi Tim,
> 
> I have attached our slurm.conf and cgroup.conf files. This is running on
> CentOS 8.2.
> 
> Best regards
> Jürgen

Thank you!  I was able to reproduce the behavior with that information.  I'll update you when I have a fix!

Thanks again!
--Tim
Comment 9 Juergen Salk 2020-07-23 03:46:46 MDT
Hi Tim,

is there any news on that?

Best regards
Jürgen
Comment 10 Tim McMullan 2020-07-23 05:03:36 MDT
(In reply to Juergen Salk from comment #9)
> Hi Tim,
> 
> is there any news on that?
> 
> Best regards
> Jürgen

Hi Jürgen!

Yes, sorry about that!  I've written and tested a patch that fixes this issue; it's currently just awaiting review!

Thanks,
--Tim
Comment 11 Juergen Salk 2020-09-28 03:19:26 MDT
Hi Tim,

we are about to upgrade to Slurm version 20.02.5 during our next scheduled cluster maintenance. Is this issue fixed in version 20.02.5? I've looked through the announcements for versions 20.02.4 and 20.02.5 but could not find any indication.

Best regards
Jürgen
Comment 12 Tim McMullan 2020-10-01 07:09:27 MDT
Hi Jürgen,

The patch for this didn't quite make it into 20.02.5 unfortunately.  I'm working on getting the patch in as soon as possible.  If you need it, I can provide a patch to you that should be close to what ends up landing!  Let me know if this would be helpful for you!

Thanks!
-Tim
Comment 13 Juergen Salk 2020-10-01 07:09:36 MDT
Thank you for your e-mail.

I am out of the office until Oct 2nd, 2020. I will have limited access to my e-mail during this period but will answer your message as soon as possible.

If you have immediate questions or concerns, please contact kiz.hpc-admin@uni-ulm.de.

Best regards,

Juergen Salk
Comment 14 Juergen Salk 2020-10-09 06:28:33 MDT
(In reply to Tim McMullan from comment #12)

> The patch for this didn't quite make it into 20.02.5 unfortunately.  I'm
> working on getting the patch in as soon as possible.  If you need it, I can
> provide a patch to you that should be close to what ends up landing!  Let me
> know if this would be helpful for you!

Hi Tim,

yes, it would probably be useful for us to get the patch beforehand, unless version 20.02.6 is going to be released within the next couple of days and will then include your patch.

Best regards
Jürgen
Comment 15 Tim McMullan 2020-10-09 09:45:57 MDT
Created attachment 16179 [details]
bug9355 patch

Here is the patch!  Let me know if you have any issues with it!
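
It should apply at the top of the Slurm source tree with the usual routine, something along these lines (a sketch; substitute the file name you saved the attachment under and your usual configure options):

$ cd slurm-20.02.5
$ patch -p1 < bug9355.patch
$ ./configure && make && make install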

Thanks!
--Tim
Comment 18 Tim McMullan 2020-10-28 06:41:30 MDT
Hi Jürgen,

I'm happy to report that this patch was merged ahead of 20.02.6, so it should be in the next release!

Thank you for your patience on this one.  I'm going to resolve this ticket for now, but please let me know if you have any other questions!

Thanks!
--Tim
Comment 19 Juergen Salk 2020-10-28 09:09:58 MDT
Thank you Tim. 

We are right in the middle of our scheduled cluster maintenance and have just updated from 19.05.5 to 20.02.5. However, we have backported your patch to 20.02.5 and it seems to work very well.

Thanks again.

Best regards
Jürgen