Hi, I have just noticed some unexpected behaviour with pam_slurm_adopt. With ConstrainRAMSpace=no set in cgroup.conf, pam_slurm_adopt works fine with exactly one running job on a node, but refuses ssh logins when more than one job is running on the node. This is a transcript from the terminal:

[username@login01 jobs]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[username@login01 jobs]$ ssh n0111
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.1.11 port 22
[username@login01 jobs]$ sbatch --wrap "sleep 300"
Submitted batch job 194344
[username@login01 jobs]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 194344  standard     wrap username  R       0:01      1 n0111
[username@login01 jobs]$ ssh n0111
Last login: Tue Jul 7 22:03:02 2020 from login01
[username@n0111 ~]$ exit
logout
Connection to n0111 closed.
[username@login01 jobs]$ sbatch --wrap "sleep 300"
Submitted batch job 194345
[username@login01 jobs]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 194345  standard     wrap username  R       0:03      1 n0111
 194344  standard     wrap username  R       0:18      1 n0111
[username@login01 jobs]$ ssh n0111
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.1.11 port 22
[username@login01 jobs]$

This is from /var/log/secure on node n0111:

Jul 7 22:19:14 n0111 sshd[2590239]: userauth_hostbased: key type ECDSA not in HostbasedAcceptedKeyTypes [preauth]
Jul 7 22:19:14 n0111 sshd[2590239]: userauth_hostbased: key type ED25519 not in HostbasedAcceptedKeyTypes [preauth]
Jul 7 22:19:14 n0111 pam_slurm_adopt[2590239]: Connection by user username: user has only one job 194344
Jul 7 22:19:14 n0111 pam_slurm_adopt[2590239]: Process 2590239 adopted into job 194344
Jul 7 22:19:14 n0111 sshd[2590239]: Accepted hostbased for username from 10.0.101.1 port 58752 ssh2: RSA SHA256:vlsc1v39WBmrGHFz6YuBiWOkUk0qh3r7nEGtVxuJhwM
Jul 7 22:19:14 n0111 sshd[2590239]: pam_unix(sshd:session): session opened for user username by (uid=0)
Jul 7 22:19:17 n0111 sshd[2590244]: Received disconnect from 10.0.101.1 port 58752:11: disconnected by user
Jul 7 22:19:17 n0111 sshd[2590244]: Disconnected from user username 10.0.101.1 port 58752
Jul 7 22:19:17 n0111 sshd[2590239]: pam_unix(sshd:session): session closed for user username
Jul 7 22:19:27 n0111 sshd[2590332]: userauth_hostbased: key type ECDSA not in HostbasedAcceptedKeyTypes [preauth]
Jul 7 22:19:27 n0111 sshd[2590332]: userauth_hostbased: key type ED25519 not in HostbasedAcceptedKeyTypes [preauth]
Jul 7 22:19:27 n0111 pam_slurm_adopt[2590332]: From 10.0.101.1 port 58762 as username: unable to determine source job
Jul 7 22:19:27 n0111 pam_slurm_adopt[2590332]: Couldn't stat path '/sys/fs/cgroup/memory/slurm/uid_900020/job_194345': No such file or directory
Jul 7 22:19:27 n0111 pam_slurm_adopt[2590332]: Couldn't stat path '/sys/fs/cgroup/memory/slurm/uid_900020/job_194344': No such file or directory
Jul 7 22:19:27 n0111 pam_slurm_adopt[2590332]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
Jul 7 22:19:27 n0111 sshd[2590332]: pam_access(sshd:account): access denied for user `username' from `login01'
Jul 7 22:19:27 n0111 sshd[2590332]: fatal: Access denied for user username by PAM account configuration [preauth]

Is this expected behaviour? With ConstrainRAMSpace=yes, pam_slurm_adopt works as expected for me, i.e. ssh connections always succeed with one or more jobs running on the node. Can we get the same behaviour with ConstrainRAMSpace=no as well?
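The "Couldn't stat path" messages suggest that, when the source job cannot be determined directly, pam_slurm_adopt falls back to looking for per-job memory cgroup directories, which may only exist when RAM is actually constrained. As a quick check on an affected node, a sketch like the following shows whether those directories are present (the uid and job IDs are taken from the log excerpt above and are illustrative only; a cgroup v1 layout is assumed):

```shell
#!/bin/sh
# Sketch: probe the per-job memory cgroup paths that pam_slurm_adopt
# appears to stat in the log above. uid/job values are from the log
# excerpt and are assumptions; substitute your own.
uid=900020
for job in 194344 194345; do
    path="/sys/fs/cgroup/memory/slurm/uid_${uid}/job_${job}"
    if [ -d "$path" ]; then
        echo "found:   $path"
    else
        echo "missing: $path"
    fi
done
```

If these directories are absent with ConstrainRAMSpace=no, that would be consistent with the fallback job lookup failing only when more than one job makes a direct determination impossible.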
In case it matters, this is /etc/pam.d/ssh on the compute node:

#%PAM-1.0
auth       substack     password-auth
auth       include      postlogin
account    required     pam_sepermit.so
account    required     pam_nologin.so
account    include      password-auth
-account   sufficient   pam_slurm_adopt.so action_adopt_failure=deny
account    required     pam_access.so
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    optional     pam_motd.so
session    include      password-auth
session    include      postlogin

Thank you in advance.

Best regards
Jürgen
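As an aside on that stack: to the best of my knowledge, the pam_slurm_adopt documentation describes options such as action_unknown (what to do when the source job cannot be determined) and action_adopt_failure (what to do when adoption into a job fails), alongside action_no_jobs and action_generic_failure. A minimal variant of the relevant line, with what I believe are the documented defaults spelled out explicitly, might look like this (treat the values as assumptions and check them against your Slurm version's man page):

```
-account   sufficient   pam_slurm_adopt.so action_unknown=newest action_adopt_failure=allow
```

Note that action_unknown=newest is the code path implicated by the "unable to determine source job" message above, since picking the newest job involves inspecting the per-job cgroup directories.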
Hi Jürgen, That doesn't sound like expected behavior, and when I went to replicate it in my environment it seemed to work just fine with ConstrainRAMSpace=no set. Would you be able to attach your slurm.conf and cgroup.conf files so I can better replicate what you have? What OS are you running on? Thanks! --Tim
Created attachment 14947 [details] cgroup.conf
Created attachment 14948 [details] slurm.conf
Hi Tim, I have attached our slurm.conf and cgroup.conf files. This is running on CentOS 8.2. Best regards Jürgen
(In reply to Juergen Salk from comment #4)
> Hi Tim,
>
> I have attached our slurm.conf and cgroup.conf files. This is running on
> CentOS 8.2.
>
> Best regards
> Jürgen

Thank you! I was able to reproduce this behavior with that information. I'll update you when I have a fix for you!

Thanks again!
--Tim
Hi Tim, is there any news on that? Best regards Jürgen
(In reply to Juergen Salk from comment #9)
> Hi Tim,
>
> is there any news on that?
>
> Best regards
> Jürgen

Hi Jürgen! Yes, sorry about that! I've written and tested a patch that fixes this issue; it is currently just awaiting review!

Thanks,
--Tim
Hi Tim,

we are about to upgrade to Slurm version 20.02.5 during our next scheduled cluster maintenance. Is this issue fixed in version 20.02.5? I've looked through the announcements for versions 20.02.4 and 20.02.5 but could not find any indication of it.

Best regards
Jürgen
Hi Jürgen, The patch for this didn't quite make it into 20.02.5 unfortunately. I'm working on getting the patch in as soon as possible. If you need it, I can provide a patch to you that should be close to what ends up landing! Let me know if this would be helpful for you! Thanks! -Tim
Thank you for your e-mail. I am out of office until Oct 2nd 2020. I will have limited access to my e-mail during this period but will answer your message as soon as possible. If you have immediate questions or concerns, please contact kiz.hpc-admin@uni-ulm.de

Best regards,
Juergen Salk
(In reply to Tim McMullan from comment #12)
> The patch for this didn't quite make it into 20.02.5 unfortunately. I'm
> working on getting the patch in as soon as possible. If you need it, I can
> provide a patch to you that should be close to what ends up landing! Let me
> know if this would be helpful for you!

Hi Tim,

yes, it would probably be useful for us to get the patch beforehand, unless version 20.02.6 is going to be released in the next couple of days and will then include your patch.

Best regards
Jürgen
Created attachment 16179 [details] bug9355 patch Here is the patch! Let me know if you have any issues with it! Thanks! --Tim
Hi Jürgen, I'm happy to report that this patch was merged ahead of 20.02.6, so it should be in the next release! Thank you for your patience on this one. I'm going to resolve this ticket for now, but please let me know if you have any other questions! Thanks! --Tim
Thank you Tim. We are right in the middle of our scheduled cluster maintenance and have just updated from 19.05.5 to 20.02.5. However, we have backported your patch to 20.02.5 and it seems to work very well. Thanks again.

Best regards
Jürgen