Bug 6543 - x11 support sending wrong ip address
Summary: x11 support sending wrong ip address
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 18.08.5
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-19 13:17 MST by Michael DiDomenico
Modified: 2019-04-01 13:15 MDT
CC List: 0 users

See Also:
Site: IDACCR
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
test patch (1.02 KB, patch)
2019-03-04 11:12 MST, Nate Rini

Description Michael DiDomenico 2019-02-19 13:17:09 MST
i'm trying to get native x11 support working.  following a smattering of help from the web and mailing lists, it looks like i have everything set up correctly.  however, when i kick off an srun/salloc, the DISPLAY variable on the compute node is set to the wrong IP address for the login server where the job was launched.

to clarify, our login nodes are gateways to the compute nodes.  so the login node has two addresses

login1 connection a is 10.0.0.1/24 hostname login1.domain.com
login1 connection b is 10.0.1.1/24 hostname login1.connb.domain.com
compu1 connection a is 10.0.1.2/24 hostname compu1.domain.com

when i look at DISPLAY on the compute node it's showing up as 10.0.0.1 instead of 10.0.1.1.

my understanding is that slurm tries to make a reverse ssh connection from the compute node back to the address in the DISPLAY variable.  this doesn't work for me; it needs to use the b connection.

is the address settable somewhere in the slurm config?
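
for concreteness, here's the mismatch as seen from the compute node (addresses as above; the output is illustrative):

compu1$ getent hosts login1.domain.com
10.0.0.1   login1.domain.com          <- the a-side address, which compu1 can't use
compu1$ getent hosts login1.connb.domain.com
10.0.1.1   login1.connb.domain.com    <- the address compu1 can actually reach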
Comment 1 Nate Rini 2019-02-20 14:44:38 MST
(In reply to Michael DiDomenico from comment #0)
> to clarify, our login nodes are gateways to the compute nodes.  so the login
> node has two addresses
It is normal to have multiple subnets in a cluster.

> is the address settable somewhere in the slurm config?
Can you please provide the raw value of DISPLAY on your login node before calling salloc/srun, and then the value inside of the job?

Please note that X11 support is under active development for 19.05 and that work is being tracked through https://bugs.schedmd.com/show_bug.cgi?id=3647.

--Nate
Comment 2 Michael DiDomenico 2019-02-21 07:02:29 MST
login1$ echo $DISPLAY
10.0.0.1:11.0
login1$ xterm
-- xterm works --

login1$ salloc -n 1
-- long pause, seems to be stuck doing ssh to 10.0.0.1, which fails --
-- can ctrl-c out --
cpu1$ echo $DISPLAY
10.0.0.1:11.0
cpu1$ xterm
-- xterm fails --

from a dns perspective, login1.domain corresponds to 10.0.0.1

however, from cpu1's perspective it needs to be 10.0.1.1
Comment 3 Nate Rini 2019-02-21 09:28:52 MST
(In reply to Michael DiDomenico from comment #2)
> login1$ echo $DISPLAY
> 10.0.0.1:11.0

Do you have X11UseLocalhost set to no in your sshd_config?

Can you please attach the slurmd log from the node you're connecting to, then run the following and attach that log:
> salloc -vvvvv --x11=first -n 1 xcalc

Can you also check your sshd logs for any errors?
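
A quick way to confirm the effective sshd settings (assuming OpenSSH; run as root on the login node):
> sshd -T | grep -iE 'x11forwarding|x11uselocalhost'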

--Nate
Comment 4 Michael DiDomenico 2019-02-21 10:10:58 MST
(In reply to Nate Rini from comment #3)
> 
> Do you have X11UseLocalhost set to no in your sshd_config?

Yes
 
> Can you please attach the slurmd log from the node you're connecting to,
> then run the following and attach that log:
> > salloc -vvvvv --x11=first -n 1 xcalc

i can't attach the logs, but the salloc log doesn't report much; the one oddity:

salloc: debug: waiting for resource configuration
salloc: debug: still waiting
salloc: debug: still waiting
salloc: debug: still waiting
 
> Can you also check your sshd logs for any errors?

the slurmd on the compute node has these, which probably correlate to the above

[6893.extern] error: Failed to connect to login4 port 22
-- a SYN_SENT entry shows up in netstat -natp for slurm, going to 10.0.0.1 instead of the desired 10.0.1.1 address
[6893.extern] error: x11 port forwarding setup failed
[6893.extern] error: _spawn_job_container: failed retrieving x11 display value: No such file or directory
[6893.extern] error: _spawn_job_container: failed retrieving x11 authority value: No such file or directory
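
for reference, the path the compute node picks toward each address can be checked with something like (assuming iproute2 and nc are available):

compu1$ ip route get 10.0.0.1    # shows which interface/gateway carries traffic to the a-side address
compu1$ nc -vz -w5 10.0.1.1 22   # confirms sshd is reachable on the b-side address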
Comment 5 Nate Rini 2019-02-21 15:10:35 MST
(In reply to Michael DiDomenico from comment #4)
> salloc: debug: waiting for resource configuration
> salloc: debug: still waiting
This is salloc just waiting for nodes to become available.
  
> > Can you also check your sshd logs for any errors?
> 
> the slurmd on the compute node has these, which probably correlate to the
> above
> 
> [6893.extern] error: Failed to connect to login4 port 22
> -- a SYN_SENT entry shows up in netstat -natp for slurm, going to 10.0.0.1
> instead of the desired 10.0.1.1 address
> [6893.extern] error: x11 port forwarding setup failed
> [6893.extern] error: _spawn_job_container: failed retrieving x11 display
> value: No such file or directory
> [6893.extern] error: _spawn_job_container: failed retrieving x11 authority
> value: No such file or directory

What is the value of these commands on your login node?
> uname -n
> getent hosts $(uname -n)

--Nate
Comment 6 Michael DiDomenico 2019-02-22 06:11:20 MST
(In reply to Nate Rini from comment #5)
> What is the value of these commands on your login node?
> > uname -n

login1.domain

> > getent hosts $(uname -n)

10.0.0.1 login1.domain
Comment 7 Nate Rini 2019-02-22 10:05:14 MST
(In reply to Michael DiDomenico from comment #6)
> (In reply to Nate Rini from comment #5)
> > What is the value of these commands on your login node?
> > > uname -n
> 
> login1.domain
> 
> > > getent hosts $(uname -n)
> 
> 10.0.0.1 login1.domain

Looks like the same issue as #6532. We are currently doing QA on a patch to use the FQDN with xauth.

--Nate
Comment 8 Michael DiDomenico 2019-02-22 11:11:13 MST
(In reply to Nate Rini from comment #7)
> (In reply to Michael DiDomenico from comment #6)
> > (In reply to Nate Rini from comment #5)
> > > What is the value of these commands on your login node?
> > > > uname -n
> > 
> > login1.domain
> > 
> > > > getent hosts $(uname -n)
> > 
> > 10.0.0.1 login1.domain
> 
> Looks like the same issue as #6532. We are currently doing QA on a patch to
> use the FQDN with xauth.

i think my issue is a little different.  the dns name, whether fqdn or not, is still going to resolve to the bad ip address.

if i change DNS to point to the good IP, i suspect it'll work (adding it to /etc/hosts on the client doesn't seem to fix it).  i suspect slurm is doing the lookup on the login server and sending the ip over to the client.

and there must be some xauth magic not taking place, since i can't just change the DISPLAY variable by hand and have it work (it gives me a "can't start xterm" error, which looks xauth related)

---

ideally (though it probably isn't possible), slurm would look at the IP of the interface the outgoing packets will leave from and use that as the address passed to the client

in my case the login node has two addresses

(desktop)--(A)--(login)--(B)--(compute)

when you do the dns resolution from the desktop, login1.domain gets address A
when you do the dns resolution from the compute, login1.domain also gets address A

however, if you try to ssh from the compute to address A, the system doesn't respond.  i can see the SYN packets leaving the compute and hitting the login server, but it never answers back, despite the LISTEN port being tied to 0.0.0.0.  there's probably some linux safety/security magic in there i'm running up against, but i can't locate an answer on the web
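
to narrow down where the reply dies, something like this could show whether the SYN arrives and whether a SYN-ACK ever leaves (ethA/ethB are placeholders for the two login-node interfaces):

login1# tcpdump -ni ethB 'tcp port 22 and host 10.0.1.2'   # incoming SYNs from the compute
login1# tcpdump -ni ethA 'tcp port 22 and host 10.0.1.2'   # whether replies are (wrongly) routed out the a side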

if i ssh -p 6010 <addressB> i do at least get a response from the login node

unfortunately, DNS is set up from the perspective of the desktop and not the compute.  so any lookups are going to return address A, and not address B, which is what's needed from the compute nodes

an alternative that might be nice is if --x11=first had a colon option, where --x11=first:ib0 would indicate the interface to use; slurm could pull the IP directly from that interface for the DISPLAY variable on the compute node.  or a slurm.conf variable would suffice too, i suppose
Comment 9 Nate Rini 2019-02-22 12:52:51 MST
(In reply to Michael DiDomenico from comment #8)
> i think my issue is a little different.  the dns name, whether fqdn or not,
> is still going to resolve to the bad ip address.
> 
> if i change DNS to point to the good IP, i suspect it'll work (adding it to
> /etc/hosts on the client doesn't seem to fix it).  i suspect slurm is doing
> the lookup on the login server and sending the ip over to the client.
Slurm resolves the hostname (basically uname -n) and currently only uses the first subdomain.

> and there must be some xauth magic not taking place, since i can't just
> change the DISPLAY variable by hand and have it work (it gives me a "can't
> start xterm" error, which looks xauth related)
The xauth cookie value on the compute node is cloned from the user's matching magic key for the DISPLAY when srun/salloc is called. 
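
Roughly this, expressed as shell (a sketch; the cookie value is illustrative):

> login1$ xauth list "$DISPLAY"
> 10.0.0.1:11  MIT-MAGIC-COOKIE-1  abcd1234ef567890
> compu1$ xauth add 10.0.0.1:11 MIT-MAGIC-COOKIE-1 abcd1234ef567890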

> ideally (though it probably isn't possible), slurm would look at the IP of
> the interface the outgoing packets will leave from and use that as the
> address passed to the client
This is why zones exist in IPv6.

> unfortunately, DNS is set up from the perspective of the desktop and not the
> compute.  so any lookups are going to return address A, and not address B,
> which is what's needed from the compute nodes
This is why I believe the patch should help: Slurm and Xorg both use the equivalent of uname -n to determine the FQDN hostname used in the xauth magic key, which should be unique on your cluster.
 
> an alternative that might be nice is if --x11=first had a colon option,
> where --x11=first:ib0 would indicate the interface to use; slurm could pull
> the IP directly from that interface for the DISPLAY variable on the compute
> node.  or a slurm.conf variable would suffice too, i suppose
I can swap this ticket to a feature request, as that is adding a new feature to Slurm instead of fixing an existing issue.

--Nate
Comment 10 Michael DiDomenico 2019-02-22 12:58:49 MST
(In reply to Nate Rini from comment #9)
> > unfortunately, DNS is set up from the perspective of the desktop and not
> > the compute.  so any lookups are going to return address A, and not
> > address B, which is what's needed from the compute nodes
> This is why I believe the patch should help: Slurm and Xorg both use the
> equivalent of uname -n to determine the FQDN hostname used in the xauth
> magic key, which should be unique on your cluster.

can you send me the patch pre-QA?  i'll give it a whirl and see if it helps any.
Comment 11 Nate Rini 2019-02-25 17:46:49 MST
Created attachment 9307 [details]
test patch

(In reply to Michael DiDomenico from comment #10)
> can you send me the patch pre-QA?  i'll give it a whirl and see if it helps
> any.

The test patch is attached. You can call the following to add it to your source:
> git am /path/to/bug6532.slurm-18.08.patch

You will then need to reinstall your binaries:
> make -j install

Then you will need to either set or append X11Parameters in slurm.conf:
> X11Parameters=xhost_use_fqdn

If it doesn't work, please attach your slurmd logs and slurm.conf, with at least SlurmdDebug=debug3 set in your slurm.conf.

Please note that this is a test patch and has not been fully tested. Please do not deploy it to a production system.

--Nate
Comment 12 Michael DiDomenico 2019-02-27 10:00:50 MST
the good news is that it didn't break anything; the bad news is it didn't help either.  the same address is passed along through to the compute node

interestingly enough, if i change

$ echo $DISPLAY
10.0.0.1:10.0

to

$ export DISPLAY=10.0.1.1:10.0

and then do salloc -n1

$ echo $DISPLAY
10.0.1.1:10.0

i still can't run xterm, but that's an xauth error i'll have to figure out

one option that would allow me to fix this is if the salloc command supported --export like srun does.  is there a reason --export isn't permitted with salloc?
Comment 14 Nate Rini 2019-02-27 10:32:12 MST
(In reply to Michael DiDomenico from comment #12)
> the good news is that it didn't break anything; the bad news is it didn't
> help either.  the same address is passed along through to the compute node
Please try the new patches from https://bugs.schedmd.com/show_bug.cgi?id=6543#c13.

> i still can't run xterm, but that's an xauth error i'll have to figure out
Can you share the xauth error?

> one option that would allow me to fix this is if the salloc command
> supported --export like srun does.  is there a reason --export isn't
> permitted with salloc?
Salloc merely does a fork/exec on the caller's host. You can export before calling salloc (unless the env variable is overwritten).
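
In other words, something like this carries a variable into the allocation shell (a sketch; the variable name is arbitrary):

> export MYVAR=hello
> salloc -n1
> echo $MYVAR    # prints hello, inherited from the calling shell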

Can you please attach your Xorg log on the login node?
Comment 15 Michael DiDomenico 2019-02-27 11:41:48 MST
(In reply to Nate Rini from comment #14)
> 
> > i still can't run xterm, but that's an xauth error i'll have to figure out
> Can you share the xauth error?

the immediate undiagnosed error is

xterm: Xt error: Can't open display: 10.0.1.1:10.0
 
> > one option that would allow me to fix this is if the salloc command
> > supported --export like srun does.  is there a reason --export isn't
> > permitted with salloc?
> Salloc merely does a fork/exec on the caller's host. You can export before
> calling salloc (unless the env variable is overwritten).

i misspoke on this; i actually don't need salloc to honor it.  i'm using SallocDefaultCommand, so i can just stick the --export into the srun in there
 
> Can you please attach your Xorg log on the login node?

i can't paste whole logs; is there something specific you're looking for?  i don't see anything overly relevant in the Xorg logs on my workstation/login/compute nodes anyhow
Comment 16 Michael DiDomenico 2019-02-27 12:07:47 MST
so it looks like i might have a somewhat workable solution

$ ssh 10.0.0.1 (login node)
$ echo $DISPLAY
10.0.0.1:10.0

$ xterm (pops up on my wks)

$ srun -n 1 --pty --export DISPLAY=10.0.1.1:10.0 /bin/bash
$ echo $DISPLAY
10.0.1.1:10.0
$ xterm (no display)
xterm: Xt error: Can't open display: 10.0.1.1:10.0

$ xauth add 10.0.1.1:10 MIT-MAGIC-COOKIE-1 <a_big_hexstring>
$ xterm (pops up on my wks)

it looks like when i srun --pty over to the compute node, it's not automatically adding the auth entry for 10.0.1.1.  if i add it by hand, it works.  i'm not sure who's supposed to be adding that entry
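
a sketch of automating that manual step (assumes a shared home directory, so the xauth db written on the login node is visible on the compute node; run on the login node before the srun):

$ cookie=$(xauth list "$DISPLAY" | awk '{print $3}')   # grab the MIT-MAGIC-COOKIE-1 value for the current display
$ xauth add 10.0.1.1:10 MIT-MAGIC-COOKIE-1 "$cookie"   # re-add it under the b-side address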
Comment 17 Nate Rini 2019-02-27 12:22:32 MST
(In reply to Michael DiDomenico from comment #15)
> xterm: Xt error: Can't open display: 10.0.1.1:10.0

Setting this in the job will bypass the Slurm X11 forwarding:
>export DISPLAY=10.0.1.1:10.0
Setting DISPLAY makes libx11 try to connect directly to your Xorg server. Your firewall might be blocking these attempts, or your xhost ACL is blocking it, or xauth isn't set up for it.

> i can't paste whole logs, is there something specific you're looking for?
> don't see anything overly relevant in the Xorg logs on my
> workstation/login/compute nodes anyhow

I'm looking to see what host it expects from this line:
> Current Operating System: Linux spheron 4.18.0-15-generic #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 x86_64
Also if there are any errors getting dumped.

Can you also run a test job and provide this output:
>$ scontrol show job $JOBID |grep AllocNode
>    Partition=debug AllocNode:Sid=eternium:3604
Comment 18 Nate Rini 2019-03-01 13:08:41 MST
Created attachment 9381 [details]
test patch

Michael

I have attached a test patch to try the raw hostname for the ssh forwarding connection.

Can you please try it?

Thanks
--Nate
Comment 23 Nate Rini 2019-03-04 11:12:15 MST
Created attachment 9408 [details]
test patch

(In reply to Nate Rini from comment #18)
> Michael
> 
> I have attached a test patch to try the raw hostname for the ssh forwarding
> connection.
> 
> Can you please try it?
> 
> Thanks
> --Nate

Modified the patch to set x11_hostname using the raw caller's hostname. Please give this a try with srun.
Comment 24 Michael DiDomenico 2019-03-07 06:37:13 MST
(In reply to Nate Rini from comment #23)
> Created attachment 9408 [details]
> test patch
> 
> (In reply to Nate Rini from comment #18)
> > Michael
> > 
> > I have attached a test patch to try the raw hostname for the ssh forwarding
> > connection.
> > 
> > Can you please try it?
> > 
> > Thanks
> > --Nate
> 
> Modified the patch to set x11_hostname using the raw caller's hostname.
> Please give this a try with srun.

I haven't had a chance to test this, but i don't think it's likely to work anyway.  if i understand the chunk of code correctly, it's going to do a dns lookup for the address instead of pulling it from something internal.  that's not going to help my situation, because the dns entry points to the incorrect ip address.

to get around the issue, i set SallocDefaultCommand to a shell script; inside that script i mangle the DISPLAY variable and set the xauth security correctly before the srun is issued.  that seems to have corrected the display issue, but i ran into some other env variable issues.
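
for reference, a sketch of what that wrapper looks like (hypothetical; the b-side address is hardcoded, and the trailing srun mirrors the usual SallocDefaultCommand example from the slurm.conf man page):

#!/bin/bash
# rewrite DISPLAY to the cluster-facing address and clone the xauth cookie
if [ -n "$DISPLAY" ]; then
    screen=${DISPLAY#*:}                                # e.g. 10.0 from 10.0.0.1:10.0
    cookie=$(xauth list "$DISPLAY" | awk '{print $3}')  # current MIT-MAGIC-COOKIE-1 value
    export DISPLAY="10.0.1.1:${screen}"
    xauth add "$DISPLAY" MIT-MAGIC-COOKIE-1 "$cookie"
fi
exec srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none "$SHELL"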

i think the only fixes for this are one of three methods:

1. i change the dns name to be the cluster-side interface (10.0.1.1) instead of the workstation-side interface (10.0.0.1)

2. slurm adds a variable that lets me dictate the display var (but this still might leave the xauth issue)

3. slurm magically figures out what the gateway address to the compute node is and sets the display variable to that

given the above, you can move this to an enhancement request, if you like.
Comment 25 Nate Rini 2019-03-07 12:25:45 MST
Michael,

The new X11 forwarding code has been pushed to 19.05: https://github.com/SchedMD/slurm/commit/9c8be2689e078756d020d19d8fb9ab2c09a88be5

It will most likely not merge cleanly into 18.08. But could you please give it a try on a test system? It no longer uses libssh2 to handle X11 forwarding.

Thanks,
--Nate
Comment 26 Nate Rini 2019-03-08 14:33:10 MST
(In reply to Nate Rini from comment #25)
> It will most likely not merge cleanly into 18.08.
Just to be clear, I was suggesting to test 19.05 and not to try to backport the patchset.
Comment 27 Michael DiDomenico 2019-03-13 06:27:03 MDT
(In reply to Nate Rini from comment #26)
> (In reply to Nate Rini from comment #25)
> > It will most likely not merge cleanly into 18.08.
> Just to be clear, I was suggesting to test 19.05 and not to try to backport
> the patchset.

understood.  is there a git branch or commit i can reference to pull in the right version?  i have a complicated process for my enclave that only lets me pull in source from github.  i can test the version; i just need to do a checkout of the right branch or roll forward to some commit
Comment 28 Nate Rini 2019-03-13 16:31:42 MDT
(In reply to Michael DiDomenico from comment #27)
> is there a git branch or commit i can reference to pull in the
> right version?

Here are relevant code commits from Bug#3647:
9c8be2689e078756d020d19d8fb9ab2c09a88be5
91170a04641d28d8020d1e4708af080ceb1e3279
f2da4d7c174a0baf4e15301b947e5625fb747c56
c97284691b6a0df57493a13132787a1a908a749f
2a58e3e228c4b0b589e2d6456159fe725e21d32d
3b7d1625c470d479d1c5d8cb492ae8918d551d7f
6985ccbac42a442c73fe91d5ee6146fe901058f1

> i have a complicated process for my enclave that only lets
> me pull in source from github.

All of the commits are here: https://github.com/SchedMD/slurm/releases/tag/slurm-19-05-0-0pre3

> i can test the version, i need to do a
> checkout on the right branch or roll forward to some commit

I suggest just pulling the tag slurm-19-05-0-0pre3 and then doing a parallel setup of Slurm for testing and verification.
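
Something like this should work for a github-only workflow (a sketch; build and install steps omitted):

> git clone https://github.com/SchedMD/slurm.git
> cd slurm
> git checkout slurm-19-05-0-0pre3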

(In reply to Michael DiDomenico from comment #24)
> I haven't had a chance to test this, but i don't think it's likely to work
> anyway.

As long as interactive jobs work from your login nodes, then X11 should work with the new commits. It would be great to get feedback on whether this works with your configuration.
Comment 29 Michael DiDomenico 2019-03-28 08:17:04 MDT
this test is taking longer than i had anticipated.  it took a little longer than expected to pull in the git repo, and now i'm having compile/runtime issues with ucx, ompi, and pmix with slurm v19, a good portion of which is likely our environment.  if you want to shift the ticket to a lower priority or close it for now to get it off your plate, that's fine with me
Comment 30 Nate Rini 2019-04-01 13:15:23 MDT
(In reply to Michael DiDomenico from comment #29)
> this test is taking longer than i had anticipated.  it took a little longer
> than expected to pull in the git repo, and now i'm having compile/runtime
> issues with ucx, ompi, and pmix with slurm v19, a good portion of which is
> likely our environment.  if you want to shift the ticket to a lower priority
> or close it for now to get it off your plate, that's fine with me

I'm going to close this ticket. Please reply to reopen it.

--Nate