Bug 4691

Summary: X11 Forwarding: can't open display
Product: Slurm    Reporter: Kilian Cavalotti <kilian>
Component: Other    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
Version: 17.11.2
Hardware: Linux
OS: Linux
Site: Stanford

Description Kilian Cavalotti 2018-01-26 13:09:48 MST
I'm trying to see if we could use the native X11 forwarding feature in 17.11, and although everything appears to start OK, the display is not actually usable from within a job.

We have compiled Slurm 17.11 with libssh2, disabled the SPANK X11 plugin, and set PrologFlags=X11.
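
For reference, a minimal sketch of the relevant configuration (the plugstack.conf detail is illustrative; exact paths and plugin names are site-specific):

# slurm.conf (relevant line)
PrologFlags=X11
# plugstack.conf: previous SPANK X11 plugin entry removed or commented out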


"srun --x11" does set $DISPLAY, but the display is apparently not usable:

On the submission host:
$ xauth list
sh-ln02.stanford.edu/unix:13  MIT-MAGIC-COOKIE-1  2c56e44e2182224b4496d78a11c55120

Starting the job:
$ srun --x11 -p test -w sh-101-59 --pty bash
$ hostname
sh-101-59
$ xeyes
Error: Can't open display: localhost:29.0
$ xauth list
sh-ln02.stanford.edu/unix:13  MIT-MAGIC-COOKIE-1  2c56e44e2182224b4496d78a11c55120
sh-101-59/unix:29  MIT-MAGIC-COOKIE-1  2c56e44e2182224b4496d78a11c55120
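
For context, a quick comparison of the display the error refers to with the cookies available inside the job (the echo output below is inferred from the error message and the slurmd log further down, not captured separately):

$ echo $DISPLAY
localhost:29.0
$ xauth list | grep ':29'
sh-101-59/unix:29  MIT-MAGIC-COOKIE-1  2c56e44e2182224b4496d78a11c55120

So the cookie is stored under the hostname/unix form, while $DISPLAY points at "localhost".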



On the compute node, slurmd logs this (at debug5):

-- 8< -------------------------------------------------------------------
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug3: state for jobid 6471832: ctime:1516995764 revoked:0 expires:2147483647
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug3: state for jobid 6472104: ctime:1516996852 revoked:1516996923 expires:1516997044
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug3: state for jobid 6472116: ctime:1516996925 revoked:1516996933 expires:1516997054
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug:  Checking credential with 384 bytes of sig data
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug2: _insert_job_state: we already have a job state for job 6472123.  No big deal, just an FYI.
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug:  Calling /usr/sbin/slurmstepd spank prolog
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug:  [job 6472123] attempting to run prolog [/etc/slurm/scripts/prolog.sh]
Jan 26 12:03:12 sh-101-59 slurmd[123347]: _run_prolog: run job script took usec=88931
Jan 26 12:03:12 sh-101-59 slurmd[123347]: _run_prolog: prolog with lock for job 6472123 ran for 0 seconds
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug3: slurmstepd rank 0 (sh-101-59), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Jan 26 12:03:12 sh-101-59 slurmstepd[123547]: task/cgroup: /slurm/uid_215845/job_6472123: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
Jan 26 12:03:12 sh-101-59 slurmstepd[123547]: task/cgroup: /slurm/uid_215845/job_6472123/step_extern: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
Jan 26 12:03:12 sh-101-59 slurmstepd[123551]: X11 forwarding established on DISPLAY=sh-101-59:29.0
Jan 26 12:03:12 sh-101-59 slurmd[123347]: debug3: _spawn_prolog_stepd: return from _forkexec_slurmstepd 0
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug3: in the service_connection
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug2: got this type of message 6001
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
Jan 26 12:03:13 sh-101-59 slurmd[123347]: launch task 6472123.0 request from 215845.32264@10.10.0.62 (port 33515)
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug:  Checking credential with 384 bytes of sig data
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug:  Leaving stepd_get_x11_display
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug2: _setup_x11_display: setting DISPLAY=localhost:29:0 for job 6472123 step 0
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug:  Waiting for job 6472123's prolog to complete
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug:  Finished wait for job 6472123's prolog to complete
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug3: slurmstepd rank 0 (sh-101-59), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Jan 26 12:03:13 sh-101-59 slurmd[123347]: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
Jan 26 12:03:13 sh-101-59 slurmstepd[123558]: task/cgroup: /slurm/uid_215845/job_6472123: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
Jan 26 12:03:13 sh-101-59 slurmstepd[123558]: task/cgroup: /slurm/uid_215845/job_6472123/step_0: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
Jan 26 12:03:13 sh-101-59 slurmstepd[123558]: in _window_manager
Jan 26 12:03:20 sh-101-59 slurmstepd[123551]: error: _handle_channel: remote disconnected
Jan 26 12:03:20 sh-101-59 slurmstepd[123551]: error: _handle_channel: exiting thread
-- 8< -------------------------------------------------------------------


Any suggestions on what to check next?

Thanks!
-- 
Kilian
Comment 1 Tim Wickberg 2018-01-26 14:58:02 MST
What distro and release are you on? RHEL6 or some close variant by any chance?

The logs indicate the tunnel setup worked properly, as did propagation of the xauth cookie.

There are some subtle differences in how the xauth cookies are handled that I'm still working out.

One thing you could test would be adding a few variants of the xauth cookie manually - changing out the host for "localhost:(display number)" or "sh-101-59.stanford.edu/unix:(display number)" may reveal a working combination on your systems.
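
For illustration, a sketch of what adding those variants could look like from inside the job, reusing the display number and cookie from the xauth listing above (values are specific to this job; adjust as needed):

$ xauth add localhost:29 MIT-MAGIC-COOKIE-1 2c56e44e2182224b4496d78a11c55120
$ xauth add sh-101-59.stanford.edu/unix:29 MIT-MAGIC-COOKIE-1 2c56e44e2182224b4496d78a11c55120
$ xeyes    # retry the X client after adding each variant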
Comment 2 Kilian Cavalotti 2018-01-26 15:04:30 MST
(In reply to Tim Wickberg from comment #1)
> What distro and release are you on? RHEL6 or some close variant by any
> chance?

CentOS 7.4.

> The logs indicate the tunnel setup worked properly, as did propagation of
> the xauth cookie.

Yes, that's the disturbing part. :)

> There are some subtle differences in how the xauth cookies are handled that
> I'm still working out.
> 
> One thing you could test would be adding a few variants of the xauth cookie
> manually - changing out the host for "localhost:(display number)" or
> "sh-101-59.stanford.edu/unix:(display number)" may reveal a working
> combination on your systems.

OK, I'll try that. I've reverted to the SPANK X11 plugin for now, as it's been working reliably for years.

Cheers,
-- 
Kilian
Comment 3 Tim Wickberg 2018-02-06 22:32:47 MST
Updating to resolved/infogiven. If you do get a chance to test the variants quoted below, I'd love to know if one of them works.

For 18.08 I'll be extending this with an X11Parameters option to provide some ways to tweak the format and better accommodate some of these subtle differences, but unfortunately there's not much I can do in 17.11.

- Tim

> One thing you could test would be adding a few variants of the xauth cookie
> manually - changing out the host for "localhost:(display number)" or
> "sh-101-59.stanford.edu/unix:(display number)" may reveal a working
> combination on your systems.