Bug 9318 - SLURM_JOB_USER set to "nobody"
Summary: SLURM_JOB_USER set to "nobody"
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.5
Hardware: Cray XC Linux
Importance: --- 3 - Medium Impact
Assignee: Gavin D. Howard
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-06-30 21:26 MDT by Kevin Buckley
Modified: 2020-07-14 10:13 MDT
CC: 1 user

See Also:
Site: Pawsey
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: Other
Machine Name: magnus
CLE Version: 6 UP07
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm CTLD log from 2020-06-28 (1.97 MB, application/x-bzip)
2020-06-30 22:57 MDT, Kevin Buckley
Slurm config file (4.91 KB, text/plain)
2020-07-01 19:40 MDT, Kevin Buckley

Description Kevin Buckley 2020-06-30 21:26:52 MDT
In an attempt to "get around" a node's occasional, and temporary,
inability to do an LDAP lookup, and the resulting failure to access
PATHs that were being constructed using $USER from the OS, we had
suggested that a user make use of $SLURM_JOB_USER, as that gets
populated as part of the job payload that arrives on the nodes.

Our user has since seen one job where $SLURM_JOB_USER got populated
with the value "nobody", i.e. that value appeared in the PATH.
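
Purely as an illustration of the sort of construct in play (not the
user's exact script, and the directory layout here is made up):

  # inside the user's batch script
  MYTOOLS=/group/projectXY/${SLURM_JOB_USER}/bin   # was ${USER} before
  export PATH=${MYTOOLS}:${PATH}
  some_tool --input data.dat

So if the lookup fails and SLURM_JOB_USER comes through as "nobody",
the PATH ends up pointing at a .../nobody/bin directory that does not
exist, and the tool is not found.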

I'm trying to get my head around

a) why it would happen
b) when it would happen
c) why it wouldn't stop the job from being sent to a node,
   or whether it could be made to

Any clues/thoughts?
Comment 1 Nate Rini 2020-06-30 21:39:06 MDT
(In reply to Kevin Buckley from comment #0)
> Our user has since seen one job where $SLURM_JOB_USER got populated
> with the value "nobody", as in that appeared in the PATH.

When Slurm can't look up a user name by user id, it will usually resolve the user as "nobody". Generally, Slurm considers this to be a temporary error.

Is it possible to get your slurmctld logs for the time of this job? There should be an error to verify.
Comment 2 Kevin Buckley 2020-06-30 21:46:49 MDT
magnus-smw:~ 11:44:07# grep 5126374 /var/opt/cray/log/slurmctld-202006*

/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T14:56:23.260003+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 -  _slurm_rpc_submit_batch_job: JobId=5126374 InitPrio=5259 usec=7521

/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T22:00:52.142265+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 -  backfill: Started JobId=5126374 in workq on nid00947

/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T22:00:55.669612+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 -  _job_complete: JobId=5126374 WEXITSTATUS 127

/var/opt/cray/log/slurmctld-20200628:<30>1 2020-06-28T22:00:55.671225+08:00 c0-0c0s1n1 slurmctld 16087 p0-20200602t134201 -  _job_complete: JobId=5126374 done
Comment 4 Kevin Buckley 2020-06-30 22:57:27 MDT
Created attachment 14852 [details]
Slurm CTLD log from 2020-06-28

Trimmed the file after the last mention of the job that saw the issue.
Comment 5 Gavin D. Howard 2020-07-01 15:00:25 MDT
Thank you for the log. I am going to start looking at it, but I would also like to have your slurm.conf.
Comment 6 Kevin Buckley 2020-07-01 19:40:48 MDT
Created attachment 14874 [details]
Slurm config file

Should be the same as the file provided for SchedMD Bug 9250
Comment 7 Gavin D. Howard 2020-07-06 16:54:31 MDT
After looking at the code, I found there are no log messages I can look for. However, the possible errors are quite limited, so maybe we can troubleshoot this.

First thing would be to check that the uid exists on the node where the slurmd ran. If so, the next thing would be to check that the uid is actually that of the user in question; it could be that the uids got mixed up.

Please let me know the results of those checks.
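
For example, on the compute node where the job ran, something along these lines (with the real username and uid substituted in; "jdoe" and 12345 are just placeholders):

  getent passwd 12345    # does that uid resolve to a name on this node?
  getent passwd jdoe     # does the name resolve, and to the expected uid?
  id jdoe                # uid/gids as seen by this node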
Comment 8 Kevin Buckley 2020-07-06 18:34:18 MDT
On 2020/07/07 06:54, bugs@schedmd.com wrote:
> 
> First thing would be to check that the uid exists on the node where the slurmd
> ran. If so, the next thing would be to check that the uid is actually that of
> the user in question; it could be that the uid's got mixed up.
> 
> Please let me know the results of those checks.

It's all done through LDAP.
Comment 9 Gavin D. Howard 2020-07-07 10:16:35 MDT
How many LDAP servers do you use? If there is more than one, do you have some sort of replication?
Comment 11 Kevin Buckley 2020-07-07 19:56:06 MDT
On 2020/07/08 00:16, bugs@schedmd.com wrote:
> https://bugs.schedmd.com/show_bug.cgi?id=9318
> 
> --- Comment #9 from Gavin D. Howard <gavin@schedmd.com> ---
> How many LDAP servers do you use? If there is more than one, do you have some
> sort of replication?

There are three LDAP servers.

The configuration points to a list of the three.
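
Roughly of this shape, i.e. a multi-server failover list in the client config (hostnames made up here; whether the node image uses sssd, nslcd or something else does not change the idea):

  ldap_uri = ldap://ldap1.example.org, ldap://ldap2.example.org, ldap://ldap3.example.org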
Comment 12 Kevin Buckley 2020-07-07 21:07:38 MDT
> --- Comment #9 from Gavin D. Howard <gavin@schedmd.com> ---
> How many LDAP servers do you use? If there is more than one, do you have some
> sort of replication?

I've since been informed as follows, re replication:

master:master - any one of them can have new info, and it replicates to the other two.
Comment 13 Kevin Buckley 2020-07-07 22:01:44 MDT
Slightly "wider" question for me here would be why Slurm thinks it's "OK"
to create a "job payload" with the nobody user, suggesting a FAILURE to
do a lookup of user details, in it?

One would have thought that would be a configurable action, so as to give
the site a chance to investigate ?
Comment 14 Gavin D. Howard 2020-07-08 09:49:25 MDT
You have a point, and I have been investigating how to do that.
Comment 15 Gavin D. Howard 2020-07-13 13:52:06 MDT
It turns out that this is expected behavior.

There are two reasons for this:

1. Slurm is attempting, in this situation, to have job resiliency. Sites do not want mass job failures when there is an LDAP outage and have asked for this type of resiliency.
2. We do not intend to add further logic around this code to trigger different scheduling behaviors.

Instead, we recommend fixing the LDAP issues, which are outside of Slurm. You can use some type of monitoring tool to flag the failing server, or the appearance of the nobody user, as a way to know action is needed. Slurm should still run the job, just as "nobody". It is not the most desirable outcome or message, but neither is a failed LDAP server that cannot return a valid user to the scheduler.
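
As a very rough sketch of the kind of check I mean (nothing Slurm-specific, and the probe user name is a placeholder):

  #!/bin/bash
  # periodic probe: alert if an LDAP-backed lookup stops resolving
  PROBE_USER=jdoe                      # placeholder: any account that must exist
  if ! getent passwd "${PROBE_USER}" > /dev/null; then
      logger -p user.err "lookup of ${PROBE_USER} failed - check the LDAP servers"
  fi
  # and/or flag any running job whose user has come through as "nobody"
  squeue -h -o '%i %u' | awk '$2 == "nobody" {print "JobId", $1, "is running as nobody"}'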
Comment 16 Kevin Buckley 2020-07-13 23:34:28 MDT
Thanks for the clarification.

There's a lovely irony in that we were looking to use the
SLURM_JOB_x variables, set at job creation time on the
SlurmCTLD node, as a way to avoid LDAP issues being
seen at job start-up time on the compute nodes, but, in
doing so, have merely moved the issue in both time and
node.

We do have ideas as to how the end-users might avoid the
effects of failed LDAP lookups, but they all feel a bit
of a kludge, and, more importantly, rely on the users to
do something.

Indeed, the least kludgey idea would seem to be to do away 
with LDAP and just use a fully populated /etc/passwd, or to 
have the users make use of their UID, which then would not 
be subject to an LDAP lookup.

Having said that, a user testing for both $SLURM_JOB_USER and $USER
would have to be very unlucky to have both left unresolved by
per-job-synchronised intermittent LDAP lookup failures.

Just for completeness, and in case it helps inform anyone
who comes across this ticket in future, the underlying 
issue, aside from the intermittent LDAP lookup failures, 
is that one researcher is writing a job submission script 
that a load of other folk will use, and is trying to branch
on $USER as a way to make the generic script more specific
to the user coming to run it.
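
For the record, the belt-and-braces shape we'll probably suggest for
that script is something like the following (sketch only, the user
names and options are made up):

  # work out "who am I" with fallbacks, inside the shared batch script
  JOBUSER=${SLURM_JOB_USER:-${USER:-$(id -un 2>/dev/null)}}
  if [ -z "${JOBUSER}" ] || [ "${JOBUSER}" = "nobody" ]; then
      # last resort: the numeric uid needs no name lookup at all
      JOBUSER=$(id -u)
  fi
  case "${JOBUSER}" in
      alice|1001) EXTRA_OPTS="--fast" ;;
      *)          EXTRA_OPTS=""       ;;
  esac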

Thanks for looking into this.
Comment 17 Gavin D. Howard 2020-07-14 10:13:57 MDT
I apologize. Yes, it is ironic.

I am going to close this bug, but feel free to reopen if you have more questions about this topic.