Bug 6533

Summary: X11 Forwarding Fails with FastX, succeeds with ssh -X
Product: Slurm Reporter: Alex Mamach <alex.mamach>
Component: Configuration Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.5   
Hardware: Linux   
OS: Linux   
Site: Northwestern Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: RHEL
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf file

Description Alex Mamach 2019-02-15 12:51:59 MST
Created attachment 9194
slurm.conf file

Hi,

We've been working on setting up X11 forwarding for our user applications. Our users typically connect to our cluster using either ssh -X, or, more frequently, a remote display application called FastX.

When we use X11 forwarding in a job after connecting with ssh -X (for example, srun -A myaccount --time 4:00 --partition=mypartition --x11 xclock), things work as expected. However, doing the same from a FastX session generates an error upon job submission: srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays.

After looking into some previous bug reports regarding X11 forwarding, I saw it mentioned that Slurm looks at both the DISPLAY and HOSTNAME variables for the --x11 option. When connecting with ssh -X, DISPLAY has a value such as localhost:11.0. When using FastX, however, DISPLAY has a value like :103, with no hostname in front of the colon.
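To make the difference between the two DISPLAY values concrete, here is a minimal C sketch that classifies a DISPLAY string by whether it has a host part. It is only an illustration and is not taken from Slurm: the function names display_is_local and display_number are hypothetical, and the test deliberately looks only at the leading colon, so forms such as unix:0 are not treated as local.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not Slurm code).
 * Returns 1 if DISPLAY has an empty host part (e.g. ":103"),
 *         0 if it names a host (e.g. "localhost:11.0"),
 *        -1 if the string is NULL, empty, or contains no ':'.
 */
int display_is_local(const char *display)
{
	if (display == NULL || display[0] == '\0')
		return -1;
	if (strchr(display, ':') == NULL)
		return -1;
	return (display[0] == ':') ? 1 : 0;
}

/*
 * Hypothetical helper (not Slurm code).
 * Returns the display number that follows the last ':' in DISPLAY,
 * ignoring an optional ".screen" suffix, or -1 if it cannot be parsed.
 */
int display_number(const char *display)
{
	const char *colon;
	char *end;
	long n;

	if (display == NULL)
		return -1;
	colon = strrchr(display, ':');
	if (colon == NULL || colon[1] == '\0')
		return -1;
	n = strtol(colon + 1, &end, 10);
	if (end == colon + 1 || n < 0)
		return -1;
	return (int) n;
}
```

Calling display_is_local(getenv("DISPLAY")) in an ssh -X session with DISPLAY=localhost:11.0 would return 0, while in a FastX session with DISPLAY=:103 it would return 1.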

Note that when connected with FastX I can still run a GUI application on the login nodes, as well as on any compute node I connect to directly with ssh -X, so the issue seems to be particular to Slurm's --x11 flag.

Do you have any thoughts as to what might be going on here? I've attached our slurm.conf file in the event it proves helpful.

Thank you!

Alex
Comment 1 Alex Mamach 2019-02-15 13:25:23 MST
Upon further investigation, I believe this may be due to lines 92-96 in x11_util.c:

	if (display[0] == ':') {
		error("Cannot forward to local display. "
		      "Can only use X11 forwarding with network displays.");
		exit(-1);
	}

If this is in fact the reason, is there any danger in us removing this check? I'm not entirely clear what it's attempting to protect against or prevent, and maybe there's a better way for us to handle this than modifying the code.

Thanks again!
Comment 3 Jason Booth 2019-02-15 15:14:57 MST
Hi Alex,

This is a copy and paste from https://bugs.schedmd.com/show_bug.cgi?id=6233

Our X11 forwarding implementation cannot connect to unix sockets at this time. This is something we may look at in a future release.

Two options:

- Use "ssh -X localhost", then run "srun --x11" within that SSH session. SSH itself will handle translation between a TCP socket that Slurm's implementation can use and the local unix socket.

- Disable our built-in integration, and use the SPANK X11 plugin instead. Due to differences in how it forwards traffic, it can accommodate use of a unix socket instead of a network socket.

We hope to address these limitations soon. We are actively looking into a possible solution for 19.05, and that work is being tracked through https://bugs.schedmd.com/show_bug.cgi?id=3647.

-Jason
Comment 4 Jason Booth 2019-02-19 09:12:33 MST
Hi Alex,
 
I am resolving this issue for now. The work that we are doing for X11 is targeted for 19.05 and is tracked in the following issue:

https://bugs.schedmd.com/show_bug.cgi?id=3647

Please consult the release notes for 19.05 for the details once it has been officially released.

*** This bug has been marked as a duplicate of bug 3647 ***