Bug 3647

Summary: Add native X11 tunneling capability
Product: Slurm Reporter: Tim Wickberg <tim>
Component: User Commands    Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex.mamach, alex, babbott, charles.wright, chris, matthews, mrg, samuel
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5510
https://bugs.schedmd.com/show_bug.cgi?id=5349
Site: SchedMD
Version Fixed: 19.05.0pre2 Target Release: 17.11
DevPrio: 3 - High

Description Tim Wickberg 2017-03-31 09:51:18 MDT
Add native support to handle X11 tunneling, similar to the widely used SPANK plugin.
Comment 10 Tim Wickberg 2017-10-15 22:58:08 MDT
Chris -

Just wanted to note that it looks like RSA hostkey based authentication is possible with our X11 implementation.

I'm assuming you're using RSA... unfortunately it looks like the libssh2 implementation this is based around is only able to support RSA or DSA, and DSA is not recommended these days. If that would not work in your environment I'd like to know that just as a point of reference...

cheers,
- Tim
Comment 11 Chris Samuel 2017-10-15 23:17:00 MDT
On 16/10/17 15:58, bugs@schedmd.com wrote:

> Chris -

Hi Tim,

> Just wanted to note that it looks like RSA hostkey based
> authentication is possible with our X11 implementation.

Excellent!

> I'm assuming you're using RSA... unfortunately it looks like the
> libssh2 implementation this is based around is only able to support
> RSA or DSA, and DSA is not recommended these days. If that would not
> work in your environment I'd like to know that just as a point of
> reference...

Yeah, I reckon we should be good on our 3 clusters:

root@merri-m:~# fgrep -c ssh-rsa /cfmroot/netboot/etc/ssh/ssh_known_hosts
91
root@merri-m:~# fgrep -c ssh-dsa /cfmroot/netboot/etc/ssh/ssh_known_hosts
0

root@barcoo-m:~# fgrep -c ssh-rsa /cfmroot/netboot/etc/ssh/ssh_known_hosts
72
root@barcoo-m:~# fgrep -c ssh-dsa /cfmroot/netboot/etc/ssh/ssh_known_hosts
0


root@snowy-m:~# fgrep -c ssh-rsa /cfmroot/netboot/etc/ssh/ssh_known_hosts
46
root@snowy-m:~# fgrep -c ssh-dsa /cfmroot/netboot/etc/ssh/ssh_known_hosts
0

OpenSSH v7.0 and later disabled DSA keys at runtime.

http://www.openssh.com/legacy.html

So I've no issue with it not being supported.

Thanks for checking Tim!

All the best,
Chris
Comment 28 Tim Wickberg 2017-10-18 22:08:02 MDT
The X11 branch has been merged in, and will be available in 17.11.0rc1 and up.

I am holding this bug open until documentation has been prepared and committed.

If anyone reading this is interested in testing it out, you'll need PrologFlags=x11 set, libssh2 installed on the compute nodes, and the libssh2 development package installed wherever you compile.

When working, commands such as 'srun --x11 xclock' should just work. The logs from the slurmstepd should help isolate any issues; the most common of these are SSH key problems.

RSA keys must be used - either host-based authentication will work (and is a lot easier to manage at the cluster level), or each individual user will need their own keys set up ahead of time.
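
For reference, a minimal sketch of the setup described above - the slurm.conf path and package names here are assumptions, not taken from this ticket:

    # /etc/slurm/slurm.conf (path assumed) - enable the X11 forwarding hook
    PrologFlags=X11

    # Compute nodes need the libssh2 runtime library installed; the build
    # host needs the libssh2 development headers before Slurm is compiled.

    # Quick smoke test from a login node with a working X session:
    srun --x11 xclock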
Comment 32 Tim Wickberg 2017-12-19 11:07:58 MST
*** Bug 4532 has been marked as a duplicate of this bug. ***
Comment 33 Christopher Samuel 2018-02-14 16:20:29 MST
I'm now in a position to test this, and the behaviour seems a little odd.

srun --x11 --pty -u /bin/bash -i -l
srun: job 21707 queued and waiting for resources
srun: job 21707 has been allocated resources
[csamuel@john72 splash]$ xeyes
Error: Can't open display: localhost:10.0

However, if I separately SSH into the node with X11 forwarding enabled, it suddenly starts to work.

strace reveals the failure is because the application is trying to connect to localhost:6010 (IPv6 first and then IPv4), and neither works unless I am SSH'd into the node.
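
(Side note for anyone debugging something similar: a quick way to see
whether anything is actually listening on the forwarded port on the node -
this check is a suggestion, not a step from the original report:

    ss -ltn | grep 6010

If nothing is listening there, the slurmstepd tunnel was never established.)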

Checking the slurmd logs I see:

Feb 15 10:13:33 john72 slurmstepd[330233]: error: hostkey authentication failed: Invalid signature for supplied public key, or bad username/public key combination
Feb 15 10:13:33 john72 slurmstepd[330233]: error: ssh public key authentication failure: Unable to open public key file
Feb 15 10:13:33 john72 slurmstepd[330233]: error: x11 port forwarding setup failed
Feb 15 10:13:33 john72 slurmstepd[330226]: error: _spawn_job_container: failed retrieving x11 display value: No such file or directory
[...]
Feb 15 10:13:33 john72 slurmd[349089]: error: could not get x11 forwarding display for job 21707 step 0, x11 forwarding disabled

But there's no indication given to the user about this, and $DISPLAY is set anyway.

cheers,
Chris
Comment 34 Tim Wickberg 2018-02-14 16:36:46 MST
> However, if I separately ssh into the node with X11 forwarding enabled that
> suddenly starts to work.

You've made an X11 tunnel to :10, so the job starts using that instead.

> strace reveals the failure is because the application is trying to connect
> to localhost:6010 (IPv6 first and then IPv4) and neither work unless I am
> SSH'd into the node.
> 
> Checking the slurmd logs I see:
> 
> Feb 15 10:13:33 john72 slurmstepd[330233]: error: hostkey authentication
> failed: Invalid signature for supplied public key, or bad username/public
> key combination
> Feb 15 10:13:33 john72 slurmstepd[330233]: error: ssh public key
> authentication failure: Unable to open public key file
> Feb 15 10:13:33 john72 slurmstepd[330233]: error: x11 port forwarding setup
> failed
> Feb 15 10:13:33 john72 slurmstepd[330226]: error: _spawn_job_container:
> failed retrieving x11 display value: No such file or directory
> [...]
> Feb 15 10:13:33 john72 slurmd[349089]: error: could not get x11 forwarding
> display for job 21707 step 0, x11 forwarding disabled
> 
> But there's no indication given to the user about this and $DISPLAY is set.

That DISPLAY is the value restored from your login node environment; the Slurm X11 plugin isn't overwriting it. But I can see how that may be confusing.

I could possibly unset the DISPLAY environment variable, if you think that'd at least make it a bit less confusing. And/or add another environment variable warning that the tunnel setup failed.

There's not a great way to communicate a failure to set up the tunnel back to the user at the moment, unfortunately - the only real option available would be to force-terminate the job, which seemed a little drastic.

For 18.08 I'll have an X11Parameters option to work with, which I unfortunately didn't think of adding in 17.11. With that in there, I can give you an option to terminate a job if the tunnel setup fails, if you think that'd be of use.
Comment 35 Christopher Samuel 2018-02-14 16:42:28 MST
On 15/02/18 10:36, bugs@schedmd.com wrote:

> I could possibly unset the DISPLAY environment variable, if you think that'd at
> least make it a bit less confusing. And/or add another environment variable
> warning that the tunnel setup failed.

I reckon unsetting DISPLAY is good enough; that will tell the user that
something is wrong and forwarding isn't expected to work.

> There's not a great way to communicate a failure to setup the tunnel back to
> the user at the moment unfortunately - the only real option available would be
> to force-terminate the job which seemed a little bit drastic.

Agreed.

> For 18.08 I'll have an X11Parameters option to work with, which I unfortunately
> didn't think of adding in 17.11. With that in there, I can give you an option
> to terminate a job if the tunnel setup fails, if you think that'd be of use.

Personally I think just removing $DISPLAY from the environment is likely
enough, but perhaps other sites would prefer it to fail hard.

Any idea what might be failing for us?  I can't remember enough of our
discussion at SLUG. Is it trying to SSH back to the submission node?
That doesn't work here.

cheers,
Chris
Comment 36 Tim Wickberg 2018-02-14 16:51:35 MST
> Personally I think just removing $DISPLAY from the environment is likely
> enough, but perhaps other sites would prefer it to fail hard.

I'll see if I can work up a patch to do that.

> Any idea what might be failing for us?  I can't remember enough of our
> discussion at SLUG. Is it trying to SSH back to the submission node?
> That doesn't work here.

Yes. It uses libssh2 (which only supports RSA, annoyingly) to build a tunnel back to the login node you submitted the job from.

That login node must either accept RSA hostkeys, or the user must have an RSA pre-shared key set up already.
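
A minimal sketch of that second path, the per-user key setup - this assumes a shared home directory and that the default ~/.ssh/id_rsa key pair is what gets picked up, which is not spelled out in this ticket:

    # As the user, on the login node:
    ssh-keygen -t rsa -b 4096          # accept the default file locations
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys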

The SPANK X11 plugin has similar requirements, although it does use the ssh command directly and thus can work with other types of SSH key. But it does need to SSH back to the login node, to then SSH back to the compute node.

I'm adding Ben Matthews here, who has similarly complained about SSH being a problem on their systems.

One idea that's been discussed would be to invent a new sproxyd (name decidedly undecided for now) daemon that could run on your login nodes, and would accept MUNGE-signed connections from the compute nodes to establish arbitrary network traffic proxying within the cluster. That proxy could handle relaying the traffic appropriately, rather than relying on SSH to handle it.

That could also presumably fix issues around connecting to X displays on unix sockets, which my implementation cannot currently do, and which has been problematic for at least one center thus far. (The SSH protocol has no way to establish a channel that connects to a unix socket on the remote side, so I can't do that currently. The SPANK plugin can, by virtue of the 'ssh-within-ssh' approach it uses.)
Comment 37 Christopher Samuel 2018-02-14 17:11:10 MST
On 15/02/18 10:51, bugs@schedmd.com wrote:

> That login node must either accept RSA hostkeys, or the user must have an RSA
> pre-shared key setup already.

Cool - I thought I'd set that up correctly but ssh-copy-id copied the 
wrong key. ;-)

X11 forwarding now works for me.

[...]
> I'm adding Ben Matthews here who has similarly complained about SSH being a
> problem for their systems.

I think the idea of a proxy would be nice.

cheers,
Chris
Comment 38 Tim Wickberg 2018-06-27 11:39:33 MDT
X11Parameters is in 18.08 now, although it does not have any settings defined yet.

I do expect to add options for:

- Adjusting the xauth timeout.

- Changing the hostkey types. The development libssh2 releases have support for ecdsa, so the current hard-coded use of RSA needs to have some alternative options.

- Possibly some functionality for automatically setting a different XAUTHORITY environment variable pointing into /tmp/ somewhere on the node, rather than having all nodes fight for locks on ~/.Xauthority. (This will need some adjustment to pass around that correct environment variable, which needs to be handled before we freeze RPCs in a few weeks.)
Comment 39 Tim Wickberg 2018-06-27 11:47:06 MDT
*** Bug 4418 has been marked as a duplicate of this bug. ***
Comment 40 Tim Wickberg 2018-07-21 02:18:57 MDT
Just an update with some recently landed commits that will be in 18.08 when released. I think this bypasses a lot of the issues people have seen around xauth not responding quickly enough.

I will try to sneak a few more options in here as well, and get DISPLAY stripped from the environment if X11 forwarding was requested but failed to be set up at launch.

commit 2a58e3e228c4b0b589e2d6456159fe725e21d32d
Author: Tim Wickberg <tim@schedmd.com>
Date:   Sat Jul 21 02:12:17 2018 -0600

    Add X11Parameters=local_xauthority option.
    
    Creates a local XAUTHORITY file in TmpFS on the node, and deletes
    it upon job termination. This avoids file locking contention on
    ~/.Xauthority in the users home directory.
    
    Bug 3647.

commit 70e893b8e3f2fb122d5045e0aadd3aaeff0e116d
Author: Tim Wickberg <tim@schedmd.com>
Date:   Sat Jul 21 02:06:41 2018 -0600

    Send tmpfs to slurmstepd as part of pack_slurmd_conf_lite().

commit 3b7d1625c470d479d1c5d8cb492ae8918d551d7f
Author: Tim Wickberg <tim@schedmd.com>
Date:   Sat Jul 21 01:35:04 2018 -0600

    X11 forwarding subsystem - add plumbing to permit a temporary XAUTHORITY file
    
    Build out sufficient plumbing such that a temporary XAUTHORITY file
    can be used that is local to the compute node, thus avoiding lock
    contention on ~/.Xauthority on parallel filesystems.
    
    This commit only includes the requisite plumbing to pass this around.
    
    If this is not used, a null string results, and the XAUTHORITY env var
    will not be forced into the user environment.
    
    Add support and fix the modified API call in pam_slurm_adopt while here.
    
    Bug 3647.
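
Once running 18.08, enabling this should presumably look like the following; the exact XAUTHORITY path is chosen by slurmstepd at runtime and is only illustrative here:

    # slurm.conf
    PrologFlags=X11
    X11Parameters=local_xauthority

    # Inside a job step, XAUTHORITY should then point at a node-local file
    # under TmpFS rather than ~/.Xauthority:
    #   srun --x11 --pty bash
    #   echo $XAUTHORITY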
Comment 41 Tim Wickberg 2018-07-21 11:07:43 MDT
This isn't ideal, but there are some more complicated issues with how an abnormal exit of the external step is handled that I can't resolve before the 18.08 release:

commit a51d160090b6739cb133012fd61bea21493b3578
Author: Tim Wickberg <tim@schedmd.com>
Date:   Sat Jul 21 10:49:15 2018 -0600

    x11 - cleanup error path and overwrite DISPLAY if X11 forwarding failed.
    
    Set DISPLAY to SLURM_X11_SETUP_FAILED to make it clear that the
    tunnel setup has failed. This at least gives the user a hint as to
    why their X11 apps aren't working, although further refinement should be
    done later:
    
    tim@zoidberg:~$ srun --x11 xclock
    Error: Can't open display: SLURM_X11_SETUP_FAILED
    srun: error: node001: task 0: Exited with exit code 1
Comment 42 Tim Wickberg 2018-10-17 16:35:46 MDT
*** Bug 5868 has been marked as a duplicate of this bug. ***
Comment 44 Tim Wickberg 2018-12-05 16:02:04 MST
*** Bug 5510 has been marked as a duplicate of this bug. ***
Comment 48 Charles Wright 2019-02-06 15:14:47 MST
I just upgraded to 18.08 and can't seem to get it working.  Do you have any advice?   Thanks.

[cw464@grace1 ~]$ srun --pty --x11 -p interactive xeyes
X11 connection rejected because of wrong authentication.
Error: Can't open display: localhost:61.0
srun: error: c01n01: task 0: Exited with exit code 1

[cw464@grace1 ~]$ rpm -qa | grep slurm
slurm-18.08.5-1.el7.x86_64
slurm-devel-18.08.5-1.el7.x86_64
slurm-libpmi-18.08.5-1.el7.x86_64
slurm-perlapi-18.08.5-1.el7.x86_64
slurm-contribs-18.08.5-1.el7.x86_64

[cw464@grace1 ~]$ cat /etc/slurm/slurm.conf  | grep -i X11
PrologFlags=x11
X11Parameters=local_xauthority
[cw464@grace1 ~]$
Comment 49 Tim Wickberg 2019-02-06 16:01:11 MST
Charles -

Please open a separate ticket to troubleshoot that, and attach the logs from the slurmd side as well. Those should usually have some better indication of where the problem is, whether it be SSH forwarding or something else.
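
For reference when gathering those logs, something along these lines should pull out the relevant slurmd/slurmstepd messages on the compute node - the log path is an assumption and depends on SlurmdLogFile:

    grep -i x11 /var/log/slurm/slurmd.log
    # or, where slurmd runs under systemd:
    journalctl -u slurmd | grep -i x11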
Comment 50 Jason Booth 2019-02-19 09:12:33 MST
*** Bug 6533 has been marked as a duplicate of this bug. ***
Comment 54 Tim Wickberg 2019-02-28 01:42:09 MST
As a general update, the next 18.08 maintenance release (18.08.6) will add a new option of use_raw_hostname to X11Parameters (see bug 6532 for further details).

That option should address issues for sites where the hostname of the system includes the domain part. (Usually Slurm strips any domain portion off, but that appears to cause issues with some X11 authentication handling.)
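
In slurm.conf that would presumably look like the line below; combining it with other options such as local_xauthority via a comma-separated list is assumed to work as with the other *Parameters settings:

    X11Parameters=use_raw_hostname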

That behavior will be the default in 19.05 when released, alongside a complete revamp of the forwarding code. That forwarding code will no longer use libssh2, which should help resolve a number of other edge cases that have been reported to date.

- Tim
Comment 55 Tim Wickberg 2019-03-07 03:11:30 MST
The commits to overhaul X11 forwarding have finally landed on master ahead of the 19.05.0pre2 preview release.

With those, I am finally closing this issue out.

If you're still running into issues with xauth after 18.08.6 is released, please file a new ticket to discuss this.

If you're having issues with libssh2, SSH keys, or the like, I would encourage you to look forward to the overhauled forwarding code in the next 19.05 release, as that removes our dependency on libssh2, and instead uses MUNGE-authenticated connections through Slurm's existing RPC layer.

- Tim
Comment 56 Charles Wright 2019-03-07 10:50:01 MST
Hi Tim,
Do you have a rough idea when 18.08.6 will be released?   Days? Weeks? Months?
Thanks.
Comment 57 Tim Wickberg 2019-03-07 11:10:43 MST
Barring any last minute complications, I expect both 18.08.6 and 19.05.0pre2 to ship this afternoon.

- Tim