Bug 5729 - Slurm 18.08.0 - X11 connection rejected because of wrong authentication.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.0
Hardware: Linux
Priority: ---
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2018-09-13 08:22 MDT by Lee Hobson
Modified: 2019-03-22 10:25 MDT

Site: -Other-


Description Lee Hobson 2018-09-13 08:22:48 MDT
X11 forwarding itself works just fine: if I SSH into the cluster with the -X or -Y flag, I can then forward X windows from any other compute node within the cluster without any issues.

The exact same applies when I connect to the cluster using the X2Go client: I can open up a terminal and run ssh -Y <node> 'xterm' without any problems at all.

Right now I am running version 18.08.0-1 of Slurm, but I was previously running 17.11.9-2, which had this exact same issue. On both occasions the Slurm RPMs were compiled on the headnode, which had libssh2 and libssh2-devel present. The libssh2 package is installed on all nodes throughout the cluster.

This cluster currently has CentOS Linux release 7.5.1804 (Core) installed throughout with kernel 3.10.0-862.11.6. 


The issues start when I attempt to execute a job in Slurm with X11 support:

srun  --x11 'xterm'
srun: job 200141 queued and waiting for resources
srun: job 200141 has been allocated resources
X11 connection rejected because of wrong authentication.
/usr/bin/xterm: Xt error: Can't open display: localhost:88.0
srun: error: rjm-compute001: task 0: Exited with exit code 1
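
For reference, here is how I compare the X authority cookies on both ends (a rough check; the display numbers below are illustrative, not taken from the output above):

# In the forwarded SSH session on the login node:
echo $DISPLAY                 # e.g. rjm-mgmt01:50.0
xauth list $DISPLAY           # the cookie the X server will accept

# Inside a job step on the compute node:
srun --x11 bash -c 'echo $DISPLAY; xauth list $DISPLAY'
# The rejection above normally means the cookie presented from the compute node
# does not match the one the X server expects.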

The verbose version... 

srun -vvv --x11 'xterm'
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user           : `rjm_test'
srun: uid            : 2800008
srun: gid            : 2800010
srun: cwd            : /home/rjm_test
srun: ntasks         : 1 (default)
srun: nodes          : 1 (default)
srun: jobid          : 4294967294 (default)
srun: partition      : default
srun: profile        : `NotSet'
srun: job name       : `xterm'
srun: reservation    : `(null)'
srun: burst_buffer   : `(null)'
srun: wckey          : `(null)'
srun: cpu_freq_min   : 4294967294
srun: cpu_freq_max   : 4294967294
srun: cpu_freq_gov   : 4294967294
srun: switches       : -1
srun: wait-for-switches : -1
srun: distribution   : unknown
srun: cpu-bind       : default (0)
srun: mem-bind       : default (0)
srun: verbose        : 3
srun: slurmd_debug   : 0
srun: immediate      : false
srun: label output   : false
srun: unbuffered IO  : false
srun: overcommit     : false
srun: threads        : 60
srun: checkpoint_dir : /var/slurm/checkpoint
srun: wait           : 0
srun: nice           : -2
srun: account        : (null)
srun: comment        : (null)
srun: dependency     : (null)
srun: exclusive      : false
srun: bcast          : false
srun: qos            : (null)
srun: constraints    :
srun: reboot         : yes
srun: preserve_env   : false
srun: network        : (null)
srun: propagate      : NONE
srun: prolog         : (null)
srun: epilog         : (null)
srun: mail_type      : NONE
srun: mail_user      : (null)
srun: task_prolog    : (null)
srun: task_epilog    : (null)
srun: multi_prog     : no
srun: sockets-per-node  : -2
srun: cores-per-socket  : -2
srun: threads-per-core  : -2
srun: ntasks-per-node   : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core   : -2
srun: plane_size        : 4294967294
srun: core-spec         : NA
srun: power             :
srun: cpus-per-gpu      : 0
srun: gpus              : (null)
srun: gpu-bind          : (null)
srun: gpu-freq          : (null)
srun: gpus-per-node     : (null)
srun: gpus-per-socket   : (null)
srun: gpus-per-task     : (null)
srun: mem-per-gpu       : 0
srun: remote command    : `xterm'
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=100000
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=36447
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 33993
srun: debug:  Entering _msg_thr_internal
srun: debug:  Munge authentication plugin loaded
srun: debug2: Pending job allocation 200140
srun: job 200140 queued and waiting for resources
srun: debug2: got message connection from rjm-mgmt01-ib:42836
srun: debug2: resource allocation response received
srun: job 200140 has been allocated resources
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: debug:  Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: Nodes rjm-compute001 are ready for job
srun: jobid 200140: nodes(1):`rjm-compute001', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 200140, user 2800008, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name xterm, relative 65534
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  Using mpi/openmpi
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 33185
srun: debug:  Started IO server thread (47043913246464)
srun: debug:  Entering _launch_tasks
srun: launching 200140.0 on host rjm-compute001, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 15
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 192.168.102.30, node rank 0, sd=16
srun: debug2: eio_message_socket_accept: got message connection from 192.168.102.30:58866 17
srun: debug2: received task launch
srun: Node rjm-compute001, 1 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
X11 connection rejected because of wrong authentication.
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/bin/xterm: Xt error: Can't open display: localhost:74.0
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 192.168.102.30:58870 16
srun: debug2: received task exit
srun: Received task exit notification for 1 task of step 200140.0 (status=0x0100).
srun: error: rjm-compute001: task 0: Exited with exit code 1
srun: debug:  task 0 done
srun: debug2:   false, shutdown
srun: debug2:   false, shutdown
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2:   false, shutdown
srun: debug:  IO thread exiting
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug2:   false, shutdown
srun: debug:  Leaving _msg_thr_internal


The interesting thing here is that when I run the exact same command from the X2Go Xfce4 desktop (where X11 is tried and tested), I get a completely different error:

srun -vvvvv --pty --x11 'xterm'
srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.


sshd_config: 

AllowAgentForwarding yes
AllowTcpForwarding yes
#GatewayPorts no
X11Forwarding yes
X11DisplayOffset 50
X11UseLocalhost no
#PermitTTY yes
#PrintMotd yes
#PrintLastLog yes
TCPKeepAlive no
#UseLogin no
UsePrivilegeSeparation sandbox          # Default for new installations.
#PermitUserEnvironment no
#Compression delayed
ClientAliveInterval 30
ClientAliveCountMax 240
#ShowPatchLevel no
#UseDNS yes
#PidFile /var/run/sshd.pid
#MaxStartups 10:30:100
#PermitTunnel no
#ChrootDirectory none
#VersionAddendum none
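
For reference, given X11UseLocalhost no and X11DisplayOffset 50 above, a plain forwarded SSH session should receive a TCP display. A quick sanity check looks like this (example output, not captured from this cluster):

echo $DISPLAY               # expect something like rjm-mgmt01:50.0 rather than :0
netstat -tlnp | grep 6050   # the X11 TCP port is 6000 + the display number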


Could anyone please point me in the right direction?

Thanks, 
Lee.
Comment 1 Lee Hobson 2018-09-13 08:43:27 MDT
Just a quick follow-up comment on this.


For reference, taken from the bottom post by Marcus Wagner at https://groups.google.com/forum/#!topic/slurm-users/cpzsGqoqcCI: if I run the following in a forwarded SSH session (not X2Go), the job does start.

salloc srun --x11 'xterm'

Is there any reasoning behind this, or is this the intended way for X11 support in Slurm to work?

Interestingly, if I run the same command in an X2Go remote desktop session, I then see the following:

salloc: Pending job allocation 200145
salloc: job 200145 queued and waiting for resources
salloc: job 200145 has been allocated resources
salloc: Granted job allocation 200145
salloc: Waiting for resource configuration
salloc: Nodes rjm-compute001 are ready for job
srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.
salloc: Relinquishing job allocation 200145
Comment 2 Jason Booth 2018-09-13 09:11:23 MDT
Hi Lee,

 Do you work with Chris Hardacre at OCF? We noticed that you did not tag a site when you opened this issue, which has put this ticket into an unsupported status. We are trying to confirm support for this issue, and your response will help us expedite that process.

-Jason
Comment 3 Lee Hobson 2018-09-13 09:14:00 MDT
Hi, 

Yes, I do; apologies for not updating that properly.


Thanks, 
Lee (lhobson@ocf.co.uk)
Comment 4 Jacob Jenson 2018-09-13 10:19:56 MDT
Lee, 

Can you please verify which Site this request is from? 

Thanks,
Jacob
Comment 5 Lee Hobson 2018-09-13 10:24:13 MDT
I assume you are referring to which OCF site; this request has come from the Sheffield office.
Comment 6 Lee Hobson 2018-10-25 09:07:13 MDT
I'm currently on annual leave until 29th October 2018, with very limited access to email.


Please contact Chris Devine/Faye Exton (project manager), Chris Hardacre (Support) or Russel Slack (Operations Director) should your query be urgent.
Comment 7 Luca Capello 2018-10-25 09:30:31 MDT
Hi there,

At the University of Geneva (Switzerland) we are experiencing the very same issue with 18.08.1, but without the --x11 option.

I have just seen that the latest stable version is 18.08.3; I will build that and come back with the results.

Thx, bye,
Luca
Comment 8 Luca Capello 2018-11-15 09:08:33 MST
Hi there,

(In reply to Luca Capello from comment #7)
> I have just seen that the latest stable version is 18.08.3, I will build
> that and come back with the results.

No changes: X11 forwarding is disabled without the --x11 option, and with it I get the same error message as in comment #0:
=====
srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.
=====

Thx, bye,
Luca
Comment 9 Jason Booth 2019-03-22 10:25:32 MDT
Hi Luca and Lee,

There are a number of changes coming in 19.05 that address issues like this with X11.

Here are some of the commits related to the rework.

9c8be2689e078756d020d19d8fb9ab2c09a88be5
91170a04641d28d8020d1e4708af080ceb1e3279
f2da4d7c174a0baf4e15301b947e5625fb747c56
c97284691b6a0df57493a13132787a1a908a749f
2a58e3e228c4b0b589e2d6456159fe725e21d32d
3b7d1625c470d479d1c5d8cb492ae8918d551d7f
6985ccbac42a442c73fe91d5ee6146fe901058f1

We have tracked this via Bug #3647 as well. 

Regarding the error message:
> srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.


Our current X11 forwarding implementation cannot connect to unix sockets at this time.
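
A quick way to tell which case a session is in (example values, not taken from this ticket):

echo $DISPLAY
# :0 or unix:0    -> unix-socket display; the built-in forwarding cannot use it
# localhost:50.0  -> network (TCP) display; srun --x11 can forward it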

Two options if you plan to continue using this version are (both are sketched below):

- Use "ssh -X localhost", then run "srun --x11" within that SSH session. SSH itself will handle the translation between the local unix socket and a TCP socket that Slurm's implementation can use.

- Disable our built-in integration and use the SPANK X11 plugin instead. Due to differences in how it forwards traffic, it can accommodate the use of a unix socket instead of a network socket.
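
To illustrate the first option (the xterm is just an example client):

ssh -X localhost     # sshd converts the local unix-socket display into a TCP display
srun --x11 xterm     # run inside that inner SSH session, where $DISPLAY is now TCP

For the second option, a rough sketch only: the plugin path, the plugstack.conf location, and rebuilding with --disable-x11 are assumptions based on a typical SPANK X11 plugin setup, so adjust them for your installation:

# Rebuild Slurm without the built-in X11 code so the SPANK plugin can take over:
#   ./configure --disable-x11 ...
# Then register the plugin, e.g. in /etc/slurm/plugstack.conf:
optional /usr/lib64/slurm/x11.so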

-Jason