Summary: | Some users not able to run jobs when sssd enumerate is off on compute nodes | ||
|---|---|---|---|
Product: | Slurm | Reporter: | Bill Marmagas <zorba> |
Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | ||
Version: | - Unsupported Older Versions | ||
Hardware: | Cray CS | ||
OS: | Linux | ||
Site: | VTech BI | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Bill Marmagas
2020-07-09 14:33:46 MDT
One other difference on the 19.05.5 cluster that does not exhibit the problem is that I have implemented pam_slurm_adopt on it, but not on the others. I am not sure if that has an effect; I can probably test it on one of the older Slurm systems.

Marcin Stolarek

Bill,

>[...]because they do not get their SLURM_JOB_USER set properly

It looks like both SLURM_JOB_USER and the bash prompt are set incorrectly because of an issue in the environment. I have a hypothesis that we can confirm with the following sequence of commands on the compute node (substitute USER_NAME with an appropriate user name):

```
# getent passwd 986791
# getent passwd USER_NAME
# getent passwd 986791
```

The SLURM_JOB_USER variable set when calling srun is based on the uid. The Slurm code here is rather simple: it calls the glibc getpwuid_r() function, and if that fails to return a user name, it sets the name to the hardcoded value "nobody".

>issue fixed in 19.05 that may be directly related:
>-- srun - do not continue with job launch if --uid fails. CVE-2019-19728.

This is not related to the issue you described.

cheers,
Marcin

Bill Marmagas

I logged in as root to a compute node on which I had just had trouble launching a job with my regular user account, and ran those commands:

```
[root@tc219 ~]# getent passwd 986791
[root@tc219 ~]# getent passwd zorba
zorba:*:986791:986791:William Gregory Marmagas:/home/zorba:/bin/bash
[root@tc219 ~]# getent passwd 986791
zorba:*:986791:986791:William Gregory Marmagas:/home/zorba:/bin/bash
```

Thanks.

Marcin Stolarek

Bill,

As you can see, uid->username resolution in your configuration works only after the first username->uid call. This is outside of SchedMD's expertise; however, the two most probable reasons are:

-> Your backend (IAM) database configuration doesn't correctly handle queries originated by sssd when a user name is specified directly (enumeration disabled).
-> You're using sssd-ldap with algorithmic mapping and users in different slices.
In the latter case, mapping from uid to user name will only work after the first username->uid resolution for that slice has been performed, since that resolution assigns the slice.

Looking back at comment 0:

>We recently had to turn off the sssd enumerate option on our compute nodes while implementing our latest cluster because the large number of nodes

My recommendation would be to try nss_slurm[1], a Name Service Switch module that can be used on top of other sources (like sssd) to reduce the load on the backend IAM databases. Simply speaking, it answers "getent passwd"-like queries happening inside the job step from a Slurm cache instead of querying over the network. It was added in the Slurm 19.05 release. This way you could re-enable enumeration, which as I understand was a workaround for high load rather than an intentional configuration change.

Checking the Slurm code for potential workarounds, I have an idea that may or may not work depending on configuration details, version, your workload specifics, and... the root-cause details. Could you please check whether SLURM_JOB_USER is correctly set in a job prolog? If it is, you can execute a query like `getent passwd $SLURM_JOB_USER` in the prolog script before Slurm execv's the user's job process, to "pre-load" the information into sssd. This effectively requires the following configuration:

>PrologFlags=Alloc

cheers,
Marcin

[1] https://slurm.schedmd.com/nss_slurm.html

Bill Marmagas

Thanks for the detailed reply.

>-> Your backend (IAM) database configuration doesn't correctly handle queries originated by sssd when a user name is specified directly (enumeration disabled).

Yes, the issue seems to be that individuals can have their user name suppressed in our IAM, and those are the accounts having issues with sssd enumeration disabled. I'm not sure why it is not a problem on our Slurm 19.05 cluster.
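For reference, the nss_slurm setup recommended above looks roughly like the following. This is a sketch based on the nss_slurm documentation; exact file contents and source ordering depend on your Slurm version and site configuration:

```
# slurm.conf (compute nodes) -- have slurmstepd serve NSS lookups
# for the job's user from its cached credential:
LaunchParameters=enable_nss_slurm

# /etc/nsswitch.conf (compute nodes) -- consult Slurm's job-local
# cache before sssd for lookups made inside a job step:
passwd: slurm sss files
group:  slurm sss files
```

With this in place, `getent passwd` calls made from within a job step for the job's own user are answered locally rather than hitting the backend IAM database.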
>My recommendation would be to try nss_slurm[1], a Name Service Switch module that can be used on top of other sources (like sssd) to reduce the load on the backend IAM databases.

Thank you for that tip! I did not know about that new module. That's a good reason for us to ask the integrator to perform the upgrade to at least 19.05.

>Checking the Slurm code for potential workarounds, I have an idea that may or may not work depending on configuration details, version, your workload specifics, and... the root-cause details. Could you please check whether SLURM_JOB_USER is correctly set in a job prolog?

Yes, I had the same thought about using a prolog, but SLURM_JOB_USER was set to "nobody", so it did not work when I first tried it. However, I did not try it with PrologFlags=Alloc, so I will set up a new test with that configured and let you know how it works.

It appears that the prolog with PrologFlags=Alloc set worked!

-> The test prolog:

```bash
#!/bin/bash
GETENT=/usr/bin/getent
if [ -n "$SLURM_JOB_USER" ]
then
    echo "$SLURM_JOB_USER"
    $GETENT passwd "$SLURM_JOB_USER"
else
    echo "The SLURM_JOB_USER variable is empty."
    exit 101
fi
```

-> The job on a node that had just previously failed:

```
[zorba@tinkercliffs2 ~]$ srun --pty --reservation=slurm_testing $SHELL

Inactive Modules:
  1) DefaultModules

[zorba@tc001 ~]$ id
uid=986791(zorba) gid=986791(zorba) groups=986791(zorba),16521(arc.arcadm),7228715(arc.openondemand),7863760(arc.sysadmin),7937010(arc.haswell)
[zorba@tc001 ~]$ echo $SLURM_JOB_USER
zorba
[zorba@tc001 ~]$
```

I'm going to do some repeat testing to verify, but I'm pretty sure that was it. Thanks!

Marcin Stolarek

Bill,

I'm going to go ahead and close the case as "info given". Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
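The resolution fallback at the heart of this ticket can be sketched in shell. This is a hypothetical illustration, not Slurm's actual code: Slurm calls the glibc getpwuid_r() function and substitutes the hardcoded name "nobody" when resolution fails; here `getent passwd` stands in for getpwuid_r(), and the example uid 4294000000 is assumed to be unassigned:

```shell
#!/bin/bash
# Resolve a numeric uid to a user name, falling back to "nobody"
# when the lookup returns nothing -- mirroring the behavior that
# produced SLURM_JOB_USER=nobody on the affected nodes.
uid_to_name() {
    local name
    name=$(getent passwd "$1" | cut -d: -f1)
    if [ -z "$name" ]; then
        echo "nobody"   # hardcoded fallback, as in Slurm
    else
        echo "$name"
    fi
}

uid_to_name 0           # resolvable uid -> its user name
uid_to_name 4294000000  # unresolvable uid -> nobody
```

On the broken nodes, the first `getent passwd <uid>` returned nothing until a name-based lookup primed sssd, so every uid-based resolution inside the job step fell into the "nobody" branch.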