Bug 6532 - Error parsing DISPLAY environment variable. Cannot use X11
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.4
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Nate Rini
Reported: 2019-02-15 12:34 MST by Daniel P Davis
Modified: 2019-02-27 10:10 MST
Site: EM
Version Fixed: 19.05.0pre2, 18.08.6

Attachments
Xorg log (18.70 KB, text/plain)
2019-02-18 09:00 MST, Daniel P Davis

Description Daniel P Davis 2019-02-15 12:34:58 MST
Getting X11 working natively has been on our backlog for a long time.  We recently upgraded to 18.08.4 and tested again.  Getting the following error:

$ srun -p interactive --pty --x11 bash
srun: error: Error parsing DISPLAY environment variable. Cannot use X11 forwarding.
$ echo $DISPLAY
172.20.0.13:65

I don't see anything in the slurmd or slurmctld logs when this happens.

$ scontrol show config | grep X11
PrologFlags             = Alloc,Contain,X11
X11Parameters           = (null)

I have tried manually setting DISPLAY to different hostnames, but without success.

Thoughts on how to debug?
Comment 2 Nate Rini 2019-02-15 13:41:22 MST
(In reply to Daniel P Davis from comment #0)
> $ srun -p interactive --pty --x11 bash
> srun: error: Error parsing DISPLAY environment variable. Cannot use X11
> forwarding.
> $ echo $DISPLAY
> 172.20.0.13:65

The DISPLAY variable lacks a screen number:
> 172.20.0.13:65.0

Does putting ".0" at the end work as a functional workaround?

--Nate
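
The workaround above can be sketched as a small shell helper. This is only an illustration, assuming DISPLAY has the usual host:display[.screen] form; normalize_display is a hypothetical name, not part of Slurm.

```shell
# Hypothetical helper (not part of Slurm): append the default screen
# number ".0" to a DISPLAY value that lacks one.
normalize_display() {
    case "$1" in
        *:*) ;;                              # has a host:display separator
        *)   printf '%s\n' "$1"; return 1 ;; # not a DISPLAY value, leave as-is
    esac
    display_part=${1##*:}    # text after the last colon, e.g. "65" or "65.0"
    case "$display_part" in
        *.*) printf '%s\n' "$1" ;;     # screen number already present
        *)   printf '%s.0\n' "$1" ;;   # add screen 0
    esac
}

# Possible usage before calling srun:
#   export DISPLAY="$(normalize_display "$DISPLAY")"
#   srun -p interactive --pty --x11 bash
```

Values that already carry a screen number, such as localhost:10.0, pass through unchanged.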
Comment 3 Daniel P Davis 2019-02-15 14:21:06 MST
Adding .0 to the end lets me land on a node. (Thanks!)

However, I get an error when launching an app now.

$ xcalc
No protocol specified
Error: Can't open display: localhost:60.0
$ echo $DISPLAY
localhost:60.0

If I ssh -Y to this same node from another terminal, I can launch the app. The DISPLAY variable looks similar, just with a different display number: localhost:10.0
Comment 4 Nate Rini 2019-02-15 15:53:53 MST
(In reply to Daniel P Davis from comment #0)
> Getting X11 working natively has been on our backlog for a long time.  We
> recently upgraded to 18.08.4 and tested again.

Please note that X11 support is under active development for 19.05 and that work is being tracked through https://bugs.schedmd.com/show_bug.cgi?id=3647.
Comment 5 Nate Rini 2019-02-15 16:01:06 MST
(In reply to Nate Rini from comment #2)
> (In reply to Daniel P Davis from comment #0)
> > $ srun -p interactive --pty --x11 bash
> > srun: error: Error parsing DISPLAY environment variable. Cannot use X11
> > forwarding.
> > $ echo $DISPLAY
> > 172.20.0.13:65
> 
> The DISPLAY variable lacks a screen number:
> > 172.20.0.13:65.0

Can you please provide the output of "xauth list" on the source node and from inside the job? Please censor the magic cookie hex values (replace them with XXXX).

Please also provide the Xorg.log of the forwarded display. Please make sure to replace any keys or other private information with XXXX. That error is likely an incompatibility between the clients.

Please also provide ldd of xclock or xterm on all the hosts.

> If I ssh -Y to this same node from another terminal I can launch the app.
Are you connecting to the calling node using -X or -Y in ssh? Are there any errors in the SSH log on the client?

--Nate
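
One way to censor the cookies before posting is a short awk filter. This is a sketch, assuming each line of "xauth list" output has three whitespace-separated fields (display name, protocol, hex cookie); censor_cookies is a hypothetical name.

```shell
# Hypothetical filter: replace the last field (the hex cookie) of each
# three-field "xauth list" line with XXXX. Other lines pass through.
censor_cookies() {
    awk 'NF >= 3 { $NF = "XXXX" } { print }'
}

# Possible usage:
#   xauth list | censor_cookies
```

Because awk rebuilds each modified line, runs of spaces between fields collapse to a single space.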
Comment 6 Daniel P Davis 2019-02-18 08:56:02 MST
$ echo $DISPLAY
172.20.0.13:67
$ export DISPLAY=172.20.0.13:67.0
$ xauth list | cut -f1,3 -d' '
login3.descartes:65 MIT-MAGIC-COOKIE-1
$ srun -p test --pty --x11 bash
[SLURM]$ xauth list | cut -d' ' -f1   
159.70.70.203:2
rambo.na.xom.com/unix:2
159.70.70.203:1
rambo.na.xom.com/unix:1
login2-eth1.descartes:11
login1-eth1.descartes:11
login1-eth1.descartes:16
login2-eth1.descartes:12
login3-eth1.descartes:27
login1-eth1.descartes:10
login1-eth1.descartes:1
clnhpc01/unix:1
login1-eth1.descartes:4
clnhpc01/unix:4
login1-eth1.descartes:5
clnhpc01/unix:5
login2-eth1.descartes:6
clnhpc02/unix:6
login4-eth1.descartes:1
clnhpc04/unix:1
clnhpc03/unix:26
clnxcat01/unix:12
159.70.88.217:1
clndnode25.na.xom.com/unix:1
clnhpc04/unix:23
clnhpc03/unix:12
clnxcat02/unix:10
clnhpc02/unix:12
clnhpc01/unix:10
159.70.71.201:1
clndnode15.na.xom.com/unix:1
clnsand02.na.xom.com:1
clnsand02.na.xom.com/unix:1
clnhpc02/unix:16
clnhpc02/unix:11
clnhpc02/unix:17
clnhpc03/unix:17
clndnode02.na.xom.com/unix:10
clnhpc01/unix:19
clnhpc02/unix:15
n0106/unix:10
clnhpc02/unix:18
clndnode11.na.xom.com/unix:10
login2.descartes:2
clnhpc02/unix:2
login2.descartes:4
clnhpc02/unix:4
login2.descartes:6
login4.descartes:1
n0101/unix:10
n0103/unix:10
gpu1.descartes/unix:11
gpu1.descartes/unix:10
clnxcat01/unix:14
login2-eth1.descartes:5
clnhpc02/unix:5
clnsand01.na.xom.com/unix:10
clndnode01.na.xom.com/unix:10
clnxcat01/unix:18
clnxcat01/unix:19
clnxcat01/unix:22
clnxcat01/unix:20
clnhpc03/unix:15
n0103/unix:84
n0102/unix:60
n0102.descartes/unix:10
n0102/unix:56
n0102/unix:48
n0102/unix:91
n0102/unix:49
[SLURM]$ xcalc
No protocol specified
Error: Can't open display: localhost:49.0
[SLURM]$ ldd /usr/bin/xcalc
	linux-vdso.so.1 =>  (0x00007ffc3bd91000)
	libXaw.so.7 => /lib64/libXaw.so.7 (0x00007ffb27977000)
	libXt.so.6 => /lib64/libXt.so.6 (0x00007ffb2770f000)
	libX11.so.6 => /lib64/libX11.so.6 (0x00007ffb273d1000)
	libm.so.6 => /lib64/libm.so.6 (0x00007ffb270cf000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffb26d01000)
	libXext.so.6 => /lib64/libXext.so.6 (0x00007ffb26aef000)
	libXmu.so.6 => /lib64/libXmu.so.6 (0x00007ffb268d4000)
	libXpm.so.4 => /lib64/libXpm.so.4 (0x00007ffb266c1000)
	libSM.so.6 => /lib64/libSM.so.6 (0x00007ffb264b9000)
	libICE.so.6 => /lib64/libICE.so.6 (0x00007ffb2629d000)
	libxcb.so.1 => /lib64/libxcb.so.1 (0x00007ffb26074000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007ffb25e70000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb27beb000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007ffb25c6b000)
	libXau.so.6 => /lib64/libXau.so.6 (0x00007ffb25a66000)


FYI, my xauth list on the client has a ton of old entries.  Not sure how that would be persisting, as these are stateless nodes.
Comment 7 Daniel P Davis 2019-02-18 09:00:04 MST
Created attachment 9211
Xorg log
Comment 8 Daniel P Davis 2019-02-18 09:08:03 MST
Trying again after clearing xauth entries on both login and client node.

$ echo $DISPLAY
172.20.0.13:65
$ xauth list | cut -d' ' -f1
login3.descartes:65
$ export DISPLAY=172.20.0.13:65.0
$ srun -p test --x11 --pty bash
srun: job 31850 queued and waiting for resources
srun: job 31850 has been allocated resources
[SLURM]$ xauth list
n0102/unix:97  MIT-MAGIC-COOKIE-1  dd778dc51948e78de6c3aa994b93a67c
[SLURM]$ xcalc
No protocol specified
Error: Can't open display: localhost:97.0
Comment 9 Daniel P Davis 2019-02-18 09:08:50 MST
Forgot to remove my magic cookie from the log, but I have cleared it on my side now.
Comment 10 Nate Rini 2019-02-20 14:46:10 MST
(In reply to Daniel P Davis from comment #8)
> $ echo $DISPLAY
> 172.20.0.13:65

Are you calling 'ssh -X' into the login node to generate this DISPLAY? I would expect it to point to localhost.

--Nate
Comment 11 Daniel P Davis 2019-02-21 06:22:27 MST
I use ssh -Y
Comment 12 Daniel P Davis 2019-02-21 06:39:12 MST
Here is a look at the ssh -Y approach that works:

$ echo $DISPLAY
172.20.0.13:65
$ ssh -Y n0101
[n0101]$ echo $DISPLAY
localhost:10.0
[n0101]$ xcalc

Runs as expected.  Also, I did not need to update my DISPLAY with a screen number.

ssh -X also works:

$ ssh -X n0101
Last login: Thu Feb 21 08:34:19 2019 from login3.descartes
[n0101]$ echo $DISPLAY
localhost:11.0
[n0101]$ xcalc
Comment 13 Daniel P Davis 2019-02-21 06:41:26 MST
I do notice that my magic cookie entries use the FQDN in the ssh -Y/-X cases:

[n0101]$ xauth list | cut -d' ' -f1
n0101/unix:88
n0101/unix:57
n0101.descartes/unix:10
n0101.descartes/unix:11
Comment 14 Daniel P Davis 2019-02-21 06:52:03 MST
OK, adding the fqdn entry to Xauthority fixes this issue:

[SLURM]$ xauth list | cut -d' ' -f1
n0101/unix:88
n0101/unix:57
n0101.descartes/unix:10
n0101.descartes/unix:11
***n0101.descartes/unix:57***

I can run xcalc after adding that entry.
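
The manual step of adding the FQDN entry could be sketched as follows. It assumes the short-hostname entry already exists in the "xauth list" output and that the FQDN is known; fqdn_xauth_args is a hypothetical helper, and the thread does not show the exact command that was run.

```shell
# Hypothetical helper: given one line of "xauth list" output and the
# node's FQDN, print the arguments for an "xauth add" call that
# duplicates the entry under the FQDN.
fqdn_xauth_args() {
    entry=$1   # e.g. "n0101/unix:57  MIT-MAGIC-COOKIE-1  <hex cookie>"
    fqdn=$2    # e.g. "n0101.descartes"
    set -- $entry    # split into: display name, protocol, cookie
    printf '%s/unix:%s %s %s\n' "$fqdn" "${1##*:}" "$2" "$3"
}

# Possible usage inside the job:
#   xauth add $(fqdn_xauth_args "$(xauth list n0101/unix:57)" n0101.descartes)
```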

Sooo... where does this leave us?
Comment 15 Nate Rini 2019-02-21 10:22:10 MST
(In reply to Daniel P Davis from comment #14)
> Sooo... where does this leave us?

A code review to determine how best to move forward with handling FQDNs.

--Nate
Comment 16 Nate Rini 2019-02-21 12:37:35 MST
Daniel,

Are both hostnames resolvable?

> getent hosts n0101
> getent hosts n0101.descartes

--Nate
Comment 18 Daniel P Davis 2019-02-21 13:04:43 MST
$ getent hosts n0101
172.20.1.1      n0101.descartes
$ getent hosts n0101.descartes
172.20.1.1      n0101.descartes
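
A comparison of the two lookups can be scripted. This is a minimal sketch, assuming "getent hosts" prints the address first and separates fields with spaces, as in the output above; same_address is a hypothetical helper.

```shell
# Hypothetical helper: succeed when two lines of "getent hosts" output
# start with the same, non-empty address.
same_address() {
    addr1=${1%% *}    # everything before the first space
    addr2=${2%% *}
    [ -n "$addr1" ] && [ "$addr1" = "$addr2" ]
}

# Possible usage:
#   same_address "$(getent hosts n0101)" "$(getent hosts n0101.descartes)" \
#       && echo "both names resolve to the same address"
```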
Comment 21 Daniel P Davis 2019-02-22 11:09:48 MST
Can we also determine why I need to add the screen number (.0) to the DISPLAY variable?  My regular ssh -X/Y tests do not require this.
Comment 23 Nate Rini 2019-02-22 11:20:53 MST
(In reply to Daniel P Davis from comment #21)
> Can we also determine why I need to add the screen number (.0) to the
> DISPLAY variable?  My regular ssh -X/Y tests do not require this.

Yes. Xorg does not require the screen number to be present. I am doing QA on the patch now.

--Nate
Comment 44 Nate Rini 2019-02-27 10:10:57 MST
Daniel,

These commits should fix the issues:
https://github.com/SchedMD/slurm/commit/db14d9472eeddd925a25cb58afb13ec514ba891c
https://github.com/SchedMD/slurm/commit/55d1927e97f57165fca803c4a98af8e15ce8e719

Please add this to your slurm.conf:
> X11Parameters=use_raw_hostname

Please reply to this ticket if you have any more issues or questions.

--Nate
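
A quick check that the setting is present in the configuration file might look like the sketch below. The helper name has_use_raw_hostname and the file path are assumptions; the location of slurm.conf varies by site, and the thread does not say whether the daemons need a restart or a reconfigure to pick up the change.

```shell
# Hypothetical check: succeed when the given slurm.conf contains an
# X11Parameters line that includes use_raw_hostname.
has_use_raw_hostname() {
    # $1: path to slurm.conf (site-specific)
    grep -Eiq '^[[:space:]]*X11Parameters[[:space:]]*=.*use_raw_hostname' "$1"
}

# Possible usage:
#   has_use_raw_hostname /etc/slurm/slurm.conf && echo "setting present"
```

Once the new configuration is active, the "scontrol show config | grep X11" command from the original report shows the value the running daemons are using.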