In an attempt to "get around" a node's occasional, and temporary, inability to do an LDAP lookup, and the consequent failure to access PATHs that were being constructed using $USER from the OS, we'd suggested that a user make use of $SLURM_JOB_USER, as that gets populated as part of the job payload that arrives on the nodes.

Our user has since seen one job where $SLURM_JOB_USER got populated with the value "nobody", as in that is what appeared in the PATH.

I'm trying to get my head around a) why it would, b) when it would, and c) why it wouldn't stop the job from being sent to a node, or whether it could be made to.

Any clues/thoughts?
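For context, a minimal sketch of the sort of PATH construction involved; the /software prefix here is made up purely for illustration:

# Hypothetical sketch: build the per-user path from SLURM_JOB_USER,
# which the controller populates in the job payload, rather than from
# $USER, which depends on a name lookup succeeding on the compute node.
export PATH="/software/${SLURM_JOB_USER}/bin:${PATH}"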
(In reply to Kevin Buckley from comment #0)
> Our user has since seen one job where $SLURM_JOB_USER got populated
> with the value "nobody", as in that appeared in the PATH.

When Slurm can't look up a user name by user id, it will usually resolve the user as "nobody". Generally, Slurm considers this to be a temporary error.

Is it possible to get your slurmctld logs for the time of this job? There should be an error to verify.
magnus-smw:~ 11:44:07# grep 5126374 /var/opt/cray/log/slurmctld-202006*
/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T14:56:23.260003+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 - _slurm_rpc_submit_batch_job: JobId=5126374 InitPrio=5259 usec=7521
/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T22:00:52.142265+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 - backfill: Started JobId=5126374 in workq on nid00947
/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T22:00:55.669612+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 - _job_complete: JobId=5126374 WEXITSTATUS 127
/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T22:00:55.671225+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 - _job_complete: JobId=5126374 done
Created attachment 14852 [details]
Slurm CTLD log from 2020-06-28

Trimmed file after last mention of job that saw the issue.
Thank you for the log. I am going to start looking at it, but I would also like to have your slurm.conf.
Created attachment 14874 [details]
Slurm config file

Should be the same as the file provided for SchedMD Bug 9250.
After looking at the code, there are no log messages I can look for. However, the possible errors are quite limited, so maybe we can troubleshoot this.

First thing would be to check that the uid exists on the node where the slurmd ran. If so, the next thing would be to check that the uid is actually that of the user in question; it could be that the uids got mixed up.

Please let me know the results of those checks.
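For example, something along these lines, run on the node that hosted the job (nid00947, according to the log), should confirm both; the user name "someuser" and uid 12345 below are placeholders:

# Placeholders only: substitute the real user name and expected uid.
getent passwd someuser   # does the name resolve on this node at all?
getent passwd 12345      # does the uid map back to the expected name?
id someuser              # cross-check the uid and group memberships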
On 2020/07/07 06:54, bugs@schedmd.com wrote:
>
> First thing would be to check that the uid exists on the node where the slurmd
> ran. If so, the next thing would be to check that the uid is actually that of
> the user in question; it could be that the uids got mixed up.
>
> Please let me know the results of those checks.

It's all done through LDAP.
How many LDAP servers do you use? If there is more than one, do you have some sort of replication?
On 2020/07/08 00:16, bugs@schedmd.com wrote:
> https://bugs.schedmd.com/show_bug.cgi?id=9318
>
> --- Comment #9 from Gavin D. Howard <gavin@schedmd.com> ---
> How many LDAP servers do you use? If there is more than one, do you have some
> sort of replication?

There are three LDAP servers. The configuration points to a list of the three.
> --- Comment #9 from Gavin D. Howard <gavin@schedmd.com> ---
> How many LDAP servers do you use? If there is more than one, do you have some
> sort of replication?

I've since been informed as follows, re the replication:

master:master - any one of them can have new info and it replicates to the other 2
A slightly "wider" question for me here would be: why does Slurm think it's "OK" to create a "job payload" containing the nobody user, when that indicates a FAILURE to look up the user's details? One would have thought that would be a configurable action, so as to give the site a chance to investigate?
You have a point, and I have been investigating how to do that.
It turns out that this is expected behavior. There are two reasons for this:

1. Slurm is attempting, in this situation, to provide job resiliency. Sites do not want mass job failures when there is an LDAP outage, and have asked for this type of resiliency.

2. We do not intend to add further logic around this code to trigger different scheduling behaviors. Instead, we recommend fixing the LDAP issues, which are outside of Slurm.

You can use some type of monitoring tool to flag the affected server, or the appearance of the "nobody" user, as a way to know that action is needed. Slurm should still run the job, just as "nobody". It is not the most desirable outcome or message, but neither is a failed LDAP server which cannot return a valid user to the scheduler.
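If it helps, one possible (untested) mitigation on your side would be a guard at the top of the batch script, so the failure is visible rather than the job quietly running with a mis-built PATH; this is only a sketch:

# Untested sketch: fail fast if the controller resolved the submitting user
# as "nobody" rather than silently building paths from the wrong name.
if [ "${SLURM_JOB_USER}" = "nobody" ]; then
    echo "SLURM_JOB_USER is 'nobody': LDAP lookup failed, aborting" >&2
    exit 1
fi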
Thanks for the clarification.

There's a lovely irony in that we were looking to use the SLURM_JOB_x variables, set at job creation time on the slurmctld node, as a way to avoid LDAP issues being seen at job start-up time on the compute nodes, but, in doing so, have merely moved the issue in both time and node.

We do have ideas as to how the end-users might avoid the effects of failed LDAP lookups, but they all feel a bit of a kludge and, more importantly, rely on the users to do something. Indeed, the least kludgey idea would seem to be to do away with LDAP and just use a fully populated /etc/passwd, or to have the users make use of their UID, which would not be subject to an LDAP lookup (a rough sketch of that follows below).

Having said that, a user testing for both $SLURM_JOB_USER and $USER would have to be very unlucky to have both fail to resolve through per-job-synchronised intermittent LDAP lookup failures.

Just for completeness, and in case it helps inform anyone who comes across this ticket in future, the underlying issue, aside from the intermittent LDAP lookup failures, is that one researcher is writing a job submission script that a load of other folk will use, and is trying to branch on $USER as a way to make the generic script more specific to the user running it.

Thanks for looking into this.
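For the record, the rough UID-based idea mentioned above looks something like the following; the uids and module-set names are invented, and it assumes $SLURM_JOB_UID is populated (falling back to id -u, which returns the numeric uid without any name lookup):

# Rough sketch only: uids and module-set names below are made up.
RUN_UID="${SLURM_JOB_UID:-$(id -u)}"
case "${RUN_UID}" in
    12345) MODULE_SET="chem"    ;;
    67890) MODULE_SET="physics" ;;
    *)     MODULE_SET="default" ;;
esac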
I apologize. Yes, it is ironic. I am going to close this bug, but feel free to reopen if you have more questions about this topic.