Bug 5510

Summary: X11 Forwarding: can't open display
Product: Slurm
Reporter: Bill Abbott <babbott>
Component: Scheduling
Assignee: Jason Booth <jbooth>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: ledfordd, tim
Version: 17.11.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3647
Site: Rutgers
Attachments: x11 slurmctld log, x11 slurmd log

Description Bill Abbott 2018-07-31 14:24:30 MDT
We're seeing exactly the same behavior as bug 4691: after upgrading to 17.11.7 via the OpenHPC packages, --x11 doesn't work:

babbott@amarel4:~$ srun --reservation=slurm_upgrade --x11 xterm              
X11 connection rejected because of wrong authentication.
/usr/bin/xterm: Xt error: Can't open display: localhost:46.0
srun: error: slepner010: task 0: Exited with exit code 1


RSA keys are fine, slurm.conf has PrologFlags=x11, and the previous SPANK rpm was removed.  Bug 4691 was closed as "Info given", but was the underlying issue ever resolved?
Comment 1 Jason Booth 2018-08-01 09:18:28 MDT
Hi Bill,

  I will look into this and see what I can find. What version did you upgrade from?

Best regards,
Jason
Comment 2 Bill Abbott 2018-08-01 09:26:53 MDT
16.05.10
Comment 4 Jason Booth 2018-08-01 10:13:08 MDT
Hi Bill,

Starting at 17.11 we use a built-in X11 feature based on libssh2.
 https://slurm.schedmd.com/faq.html#x11

If you attach a copy of the slurmd logs, this might give us some additional details as to what is going on. Also, it is possible to go back to the SPANK mode, but you would have to build with "--disable-x11", as mentioned in the FAQ link above.

Best regards,
Jason
Comment 5 Bill Abbott 2018-08-01 13:20:03 MDT
Created attachment 7485
x11 slurmctld log
Comment 6 Bill Abbott 2018-08-01 13:20:26 MDT
Created attachment 7486
x11 slurmd log
Comment 7 Bill Abbott 2018-08-01 13:23:15 MDT
I've attached the relevant slurmd and slurmctld logs with debug level 5.  From the login node it looks like this:

babbott@nixon:~$ ssh -X perceval1.hpc.rutgers.edu
babbott@perceval1:~$ srun --reservation=slurm_upgrade --x11 xterm
X11 connection rejected because of wrong authentication.
/usr/bin/xterm: Xt error: Can't open display: localhost:49.0
srun: error: node131: task 0: Exited with exit code 1
babbott@perceval1:~$ srun --reservation=slurm_upgrade --x11 --pty bash -i
babbott@node131:~$ xterm
X11 connection rejected because of wrong authentication.
xterm: Xt error: Can't open display: localhost:87.0
babbott@node131:~$ exit
exit
srun: error: node131: task 0: Exited with exit code 1
babbott@perceval1:~$ exit
logout
Connection to perceval1.hpc.rutgers.edu closed.



Running xterm from the login node (perceval1) works fine, and the RSA keys seem to be set up correctly.  I set StrictHostKeyChecking=no in ssh_config, with no change.  Libssh2 is installed on all nodes.  We didn't compile this ourselves; this is via the OpenHPC rpms.  The slurm.conf file has PrologFlags=x11.
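
The PrologFlags setting mentioned above could be verified on each node with a small helper. This is a minimal sketch, not a Slurm tool: has_x11_prolog_flag is a hypothetical name, and the path /etc/slurm/slurm.conf in the usage line is an assumption about where the config lives.

```shell
# has_x11_prolog_flag FILE   (hypothetical helper name)
# Succeeds (exit 0) if FILE sets PrologFlags to a comma-separated
# list that contains x11, compared case-insensitively.
has_x11_prolog_flag() {
    sed 's/#.*//' "$1" |                                  # drop comments
        grep -i '^[[:space:]]*PrologFlags[[:space:]]*=' | # keep the PrologFlags line
        sed 's/^[^=]*=//' |                               # keep only the value
        tr ',' '\n' |                                     # one flag per line
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |      # trim whitespace
        grep -qix 'x11'                                   # whole-line match
}
```

Usage, assuming the config path: has_x11_prolog_flag /etc/slurm/slurm.conf && echo "x11 flag set". On a running cluster, "scontrol show config" also lists the PrologFlags value that slurmctld actually loaded.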
Comment 8 Jason Booth 2018-08-01 17:07:20 MDT
Hi Bill,

The error is not specifically generated by SLURM.

"X11 connection rejected because of wrong authentication."

The issue seems tied to the ".Xauthority" as outlined by the following two sites.

https://www.cyberciti.biz/faq/x11-connection-rejected-because-of-wrong-authentication/

https://access.redhat.com/solutions/1473133

Best regards,
Jason
Comment 9 Bill Abbott 2018-08-02 07:52:29 MDT
I'll investigate, thanks.
Comment 10 Jason Booth 2018-08-09 11:44:35 MDT
Hi Bill,

 Were you able to look into the ".Xauthority" and did that help resolve the issue with "X11 connection rejected because of wrong authentication."?

Best regards,
Jason
Comment 11 Bill Abbott 2018-08-14 09:32:47 MDT
Hi Jason,

I haven't been able to do testing on this.  Please set the importance to minor until we can.

Bill
Comment 12 Donald Ledford 2018-08-14 14:57:38 MDT
I can confirm we're seeing the same thing here. 

We had a test OpenHPC 1.3.3 cluster set up with the OHPC Slurm 17.02.9 package and a custom-compiled SPANK X11 plugin. Everything "just worked" on that setup with regard to X11 forwarding.

We encountered the bug described here while building the production server using OpenHPC 1.3.5 with OHPC SLURM 17.11.7.

Launching an X11 program on the compute node via "srun --x11 --pty xterm" results in the "X11 connection rejected because of wrong authentication." error.

Opening a compute node shell using "srun --x11 --pty /bin/bash" shows the following:

$DISPLAY=localhost:99.0

xauth listing:

headnode.full.domain.name/unix:10 MIT-MAGIC-COOKIE-1 AABBCCDDEE
compute-node/unix:99 MIT-MAGIC-COOKIE-1 AABCCDDEE

Running "xterm" from the prompt gives the authentication rejection error.

I have found that manually adding an xauth cookie for localhost:99 on the compute node gets things working, i.e.:

xauth add localhost:99 MIT-MAGIC-COOKIE-1 AABBCCDDEE

but attempting to do that automatically via, say, a TaskProlog script causes an xauth allocation timeout and a downed compute node.
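
For interactive use inside the job shell (not from a TaskProlog, given the timeout noted above), the manual workaround might be scripted roughly as follows. This is a sketch under assumptions: display_number and localhost_xauth_entry are hypothetical helper names, and the parsing assumes "xauth list" prints lines of the form "host/unix:N PROTOCOL HEXKEY" as in the listing above.

```shell
# display_number DISPLAY
# "localhost:99.0" -> "99": strip everything up to the last colon,
# then strip the ".screen" suffix.
display_number() {
    printf '%s\n' "$1" | sed 's/^.*://; s/\..*$//'
}

# localhost_xauth_entry DISPLAY < output-of-xauth-list
# Find the first entry whose display number matches DISPLAY and print
# "localhost:N PROTOCOL HEXKEY", i.e. the arguments for "xauth add".
localhost_xauth_entry() {
    awk -v n="$(display_number "$1")" '
        {
            count = split($1, parts, ":")
            if (count == 2 && parts[2] == n) {
                printf "localhost:%s %s %s\n", n, $2, $3
                exit
            }
        }'
}
```

Usage from the compute-node shell: xauth add $(xauth list | localhost_xauth_entry "$DISPLAY"). If no entry matches the display number, the helper prints nothing and the xauth add call fails without changing the authority file.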

Starting a non-X11 srun job, then doing "ssh -X cluster-node" to the node allows X forwarding to work fine from the SSH session.

Both head and compute nodes are CentOS 7.5.1804 with the latest updates.

We are using "X11Forwarding yes" and "X11UseLocalhost no" on both the head node and compute node. We are also using user PubKey authentication and RSA keys.
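
The sshd settings quoted above could be compared across nodes with a small helper. This is a sketch only: sshd_option is a hypothetical name, /etc/ssh/sshd_config in the usage line is an assumed path, and the parser handles only whitespace-separated "Keyword value" lines, ignoring Match blocks and Include files.

```shell
# sshd_option FILE KEYWORD   (hypothetical helper name)
# Print the first value set for KEYWORD in FILE. Keywords are matched
# case-insensitively and the first occurrence wins, as in sshd itself.
sshd_option() {
    awk -v key="$2" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        tolower($1) == tolower(key) { print $2; exit }
    ' "$1"
}
```

Usage, assuming the config path: sshd_option /etc/ssh/sshd_config X11UseLocalhost. Running "sshd -T" as root prints the effective configuration and is the more reliable check where it is available.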

The .Xauthority files are at $HOME/.Xauthority, which is on an NFS mount shared between the head node and the compute nodes. We have not encountered xauth locking issues.

Our head node has 2 DNS names, an FQDN for our campus network and a cluster-specific name for the compute nodes, if that has any bearing on the issue. The "/etc/hosts" file created and distributed by Warewulf has the cluster-side IP/DNS names set up properly.

If there is a way to script around this until SLURM gets updated by OHPC, that would be fantastic.
Comment 15 Tim Wickberg 2018-12-05 16:02:04 MST
Hey Bill -

I'm closing this as a duplicate of the original X11 forwarding plugin bug, and will be updating that as additional configuration entries are added to support, for example, different hostname patterns in the xauth file.

These changes will only happen on the newer 18.08 release, however. We do not expect to make any further 17.11 maintenance releases at this time, and that release is also missing the X11Parameters configuration option, which will give you control over these settings.

If you have further questions, please add them over on bug 3647.

thanks,
- Tim

*** This bug has been marked as a duplicate of bug 3647 ***