I am able to use X11 forwarding just fine: if I SSH into the cluster with the -X or -Y flag, I can forward an X window from any compute node within the cluster without any issues. The same applies when I connect to the cluster using the x2Go client; I can open a terminal and run ssh -Y <node> 'xterm' without any problems.

I am currently using Slurm 18.08.0-1, but was previously running 17.11.9-2, which had this same issue. On both occasions the Slurm RPMs were compiled on the headnode, which had libssh2 and libssh2-devel present. The libssh2 package is installed on all nodes throughout the cluster. The cluster runs CentOS Linux release 7.5.1804 (Core) throughout, with kernel 3.10.0-862.11.6.

The issues start when I attempt to execute a job in Slurm with X11 support:

srun --x11 'xterm'
srun: job 200141 queued and waiting for resources
srun: job 200141 has been allocated resources
X11 connection rejected because of wrong authentication.
/usr/bin/xterm: Xt error: Can't open display: localhost:88.0
srun: error: rjm-compute001: task 0: Exited with exit code 1

The verbose version...
srun -vvv --x11 'xterm'
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user : `rjm_test'
srun: uid : 2800008
srun: gid : 2800010
srun: cwd : /home/rjm_test
srun: ntasks : 1 (default)
srun: nodes : 1 (default)
srun: jobid : 4294967294 (default)
srun: partition : default
srun: profile : `NotSet'
srun: job name : `xterm'
srun: reservation : `(null)'
srun: burst_buffer : `(null)'
srun: wckey : `(null)'
srun: cpu_freq_min : 4294967294
srun: cpu_freq_max : 4294967294
srun: cpu_freq_gov : 4294967294
srun: switches : -1
srun: wait-for-switches : -1
srun: distribution : unknown
srun: cpu-bind : default (0)
srun: mem-bind : default (0)
srun: verbose : 3
srun: slurmd_debug : 0
srun: immediate : false
srun: label output : false
srun: unbuffered IO : false
srun: overcommit : false
srun: threads : 60
srun: checkpoint_dir : /var/slurm/checkpoint
srun: wait : 0
srun: nice : -2
srun: account : (null)
srun: comment : (null)
srun: dependency : (null)
srun: exclusive : false
srun: bcast : false
srun: qos : (null)
srun: constraints :
srun: reboot : yes
srun: preserve_env : false
srun: network : (null)
srun: propagate : NONE
srun: prolog : (null)
srun: epilog : (null)
srun: mail_type : NONE
srun: mail_user : (null)
srun: task_prolog : (null)
srun: task_epilog : (null)
srun: multi_prog : no
srun: sockets-per-node : -2
srun: cores-per-socket : -2
srun: threads-per-core : -2
srun: ntasks-per-node : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core : -2
srun: plane_size : 4294967294
srun: core-spec : NA
srun: power :
srun: cpus-per-gpu : 0
srun: gpus : (null)
srun: gpu-bind : (null)
srun: gpu-freq : (null)
srun: gpus-per-node : (null)
srun: gpus-per-socket : (null)
srun: gpus-per-task : (null)
srun: mem-per-gpu : 0
srun: remote command : `xterm'
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=4096
srun: debug: propagating RLIMIT_NOFILE=100000
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug2: srun PMI messages to port=36447
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 33993
srun: debug: Entering _msg_thr_internal
srun: debug: Munge authentication plugin loaded
srun: debug2: Pending job allocation 200140
srun: job 200140 queued and waiting for resources
srun: debug2: got message connection from rjm-mgmt01-ib:42836
srun: debug2: resource allocation response received
srun: job 200140 has been allocated resources
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: debug: Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: Nodes rjm-compute001 are ready for job
srun: jobid 200140: nodes(1):`rjm-compute001', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug: requesting job 200140, user 2800008, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name xterm, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: Using mpi/openmpi
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 33185
srun: debug: Started IO server thread (47043913246464)
srun: debug: Entering _launch_tasks
srun: launching 200140.0 on host rjm-compute001, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 15
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving io_init_msg_validate
srun: debug2: Validated IO connection from 192.168.102.30, node rank 0, sd=16
srun: debug2: eio_message_socket_accept: got message connection from 192.168.102.30:58866 17
srun: debug2: received task launch
srun: Node rjm-compute001, 1 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
X11 connection rejected because of wrong authentication.
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/bin/xterm: Xt error: Can't open display: localhost:74.0
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 192.168.102.30:58870 16
srun: debug2: received task exit
srun: Received task exit notification for 1 task of step 200140.0 (status=0x0100).
srun: error: rjm-compute001: task 0: Exited with exit code 1
srun: debug: task 0 done
srun: debug2: false, shutdown
srun: debug2: false, shutdown
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: false, shutdown
srun: debug: IO thread exiting
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug2: false, shutdown
srun: debug: Leaving _msg_thr_internal

The interesting thing here is that when I run this same command on the x2Go Xfce4 desktop (where X11 is tried and tested), I get a completely different error:

srun -vvvvv --pty --x11 'xterm'
srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.

sshd_config:

AllowAgentForwarding yes
AllowTcpForwarding yes
#GatewayPorts no
X11Forwarding yes
X11DisplayOffset 50
X11UseLocalhost no
#PermitTTY yes
#PrintMotd yes
#PrintLastLog yes
TCPKeepAlive no
#UseLogin no
UsePrivilegeSeparation sandbox # Default for new installations.
#PermitUserEnvironment no
#Compression delayed
ClientAliveInterval 30
ClientAliveCountMax 240
#ShowPatchLevel no
#UseDNS yes
#PidFile /var/run/sshd.pid
#MaxStartups 10:30:100
#PermitTunnel no
#ChrootDirectory none
#VersionAddendum none

Could anyone please help point me in the right direction?

Thanks,
Lee.
Just a quick follow-up comment on this. Taken from https://groups.google.com/forum/#!topic/slurm-users/cpzsGqoqcCI for reference (bottom post by Marcus Wagner): if I run the following in a forwarded SSH session (not x2Go), the job does start.

salloc srun --x11 'xterm'

Is there any reasoning behind this, or is this the intended way X11 support in Slurm should work?

Interestingly, if I run the same command in an X2Go SSH remote desktop session, I see the following:

salloc: Pending job allocation 200145
salloc: job 200145 queued and waiting for resources
salloc: job 200145 has been allocated resources
salloc: Granted job allocation 200145
salloc: Waiting for resource configuration
salloc: Nodes rjm-compute001 are ready for job
srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.
salloc: Relinquishing job allocation 200145
Hi Lee, Do you work with Chris Hardacre at OCF? We noticed that you did not tag a site when you opened this issue and this has put this ticket into an unsupported status. We are trying to confirm support for this issue and your response will help us expedite that process. -Jason
Hi, Yes I do, apologies for not updating that properly. Thanks, Lee (lhobson@ocf.co.uk)
Lee, Can you please verify which Site this request is from? Thanks, Jacob
I assume you are asking which OCF site this is. This request has come from the Sheffield office.
I'm currently on annual leave until the 29th October 2018 with very limited access to email. Please contact Chris Devine/Faye Exton (project manager), Chris Hardacre (Support) or Russel Slack (Operations Director) should your query be urgent.
Hi there, at the University of Geneva (Switzerland) we are experiencing the very same issue with 18.08.1, but without the --x11 option. I have just seen that the latest stable version is 18.08.3, I will build that and come back with the results. Thx, bye, Luca
Hi there,

(In reply to Luca Capello from comment #7)
> I have just seen that the latest stable version is 18.08.3, I will build
> that and come back with the results.

No changes: X11 forwarding is disabled without the --x11 option, and with it I get the same error message as in comment #0:

=====
srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.
=====

Thx, bye,
Luca
Hi Luca and Lee,

There are a number of changes coming in 19.05 that address issues like this with X11. Here are some of the commits related to the rework:

9c8be2689e078756d020d19d8fb9ab2c09a88be5
91170a04641d28d8020d1e4708af080ceb1e3279
f2da4d7c174a0baf4e15301b947e5625fb747c56
c97284691b6a0df57493a13132787a1a908a749f
2a58e3e228c4b0b589e2d6456159fe725e21d32d
3b7d1625c470d479d1c5d8cb492ae8918d551d7f
6985ccbac42a442c73fe91d5ee6146fe901058f1

We have tracked this via Bug #3647 as well.

Regarding the error message:

> srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.

Our current X11 forwarding implementation cannot connect to unix sockets at this time. If you plan to continue using this version, there are two options:

- Use "ssh -X localhost", then run "srun --x11" within that SSH session. SSH itself will handle translation between a TCP socket that Slurm's implementation can use and the local unix socket.
- Disable our built-in integration, and use the SPANK X11 plugin instead. Due to differences in how it forwards traffic, it can accommodate use of a unix socket instead of a network socket.

-Jason
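
As a rough illustration of the distinction described above, the shell sketch below classifies a DISPLAY value as a local (unix socket) display or a network (TCP) display, and prints which approach would apply. The function names display_kind and x11_hint are hypothetical, and the matching rules are an assumption based on common DISPLAY forms (":0", "unix:0", "host:10.0"); this is not Slurm's actual check.

```shell
#!/bin/sh
# Sketch only: classify an X11 DISPLAY string.
# Assumption: an empty host part (":0") or the host "unix" ("unix:0")
# means a local unix socket; any other "host:display" form is treated
# as a network (TCP) display.
display_kind() {
    case "$1" in
        "")          echo "unset" ;;
        :*|unix:*)   echo "local" ;;
        *:*)         echo "network" ;;
        *)           echo "invalid" ;;
    esac
}

# Hypothetical helper: suggest how to start an X11 job for the
# current DISPLAY, following the first workaround in this thread.
x11_hint() {
    case "$(display_kind "${DISPLAY:-}")" in
        network)
            echo "network display: run srun --x11 xterm directly" ;;
        local)
            echo "local display: run ssh -X localhost first, then srun --x11 xterm inside that session" ;;
        *)
            echo "DISPLAY is unset or unrecognised: no X11 forwarding possible" ;;
    esac
}
```

Under these assumptions, calling display_kind with a value such as ":50.0" prints "local", while a value such as "localhost:88.0" (as set by SSH X11 forwarding in the logs above) prints "network".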