Bug 4721 - How to build Slurm with internal X11 support
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 17.11.2
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2018-02-01 10:07 MST by Jeff White
Modified: 2019-09-17 17:10 MDT
Site: Washington State University


Attachments
slurm-17.11.2_config.log (219.02 KB, text/x-log)
2018-02-01 10:08 MST, Jeff White
slurm.conf parms and slurmd.log (deleted)
2019-09-17 16:49 MDT, Wei Feinstein

Description Jeff White 2018-02-01 10:07:45 MST
I am trying to build Slurm using rpmbuild.  I am doing so with the Slurm tarball and default configure and compile options.  I noticed X11 forwarding within a job was not working.  Looks like Slurm couldn't build it for some reason:

configure:22006: checking whether Slurm internal X11 support is enabled
configure:22021: result: yes
configure:22043: checking for libssh2 installation
configure:22090: result: 
configure:22094: WARNING: unable to locate libssh2 installation
configure:22096: WARNING: Slurm internal X11 support disabled

How can I get Slurm's internal X11 support to compile?
Comment 1 Jeff White 2018-02-01 10:08:45 MST
Created attachment 6052 [details]
slurm-17.11.2_config.log
Comment 2 Tim Wickberg 2018-02-01 10:15:27 MST
You'll need to have the libssh2 package installed throughout the cluster, as well as the libssh2-devel package installed on whatever machine you're compiling on.
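After installing the packages and rebuilding, one way to confirm the result is to look for the same warnings in the new config.log. This is only a sketch: the helper name x11_configure_warnings is hypothetical, the strings it matches are taken from the config.log excerpt in the description, and the package manager and file locations in the comments are assumptions about a RHEL-style build host.

```shell
# Hypothetical helper: print the configure warnings that indicate internal
# X11 support was disabled. Exits non-zero when no such warning is found.
# Usage: x11_configure_warnings path/to/config.log
x11_configure_warnings() {
    grep -E 'WARNING: (unable to locate libssh2 installation|Slurm internal X11 support disabled)' "$1"
}

# Assumed workflow on the build host (package manager and paths are assumptions):
#   yum install libssh2 libssh2-devel
#   rpmbuild -ta slurm-17.11.2.tar.bz2
#   x11_configure_warnings config.log || echo "no X11/libssh2 warnings found"
```

An empty result only shows that configure did not emit these two warnings; it does not prove that X11 forwarding works at run time.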
Comment 3 Jeff White 2018-02-01 13:22:29 MST
Thanks for that information.  I was able to get X11 support to compile but it doesn't seem to work.  Is there something else needed to enable it?

Here's what I see:


laptop$ ssh -Y login-d1n01
login-d1n01$ xeyes # works
login-d1n01$ ssh -Y sn1
sn1$ xeyes # works


laptop$ ssh -Y login-d1n01
login-d1n01$ xeyes # works
login-d1n01$ srun --pty /bin/bash
sn1$ xeyes # fails
Error: Can't open display: 10.110.19.121:11.0


laptop$ ssh -Y login-d1n01
login-d1n01$ xeyes # works
login-d1n01$ srun --x11 --pty /bin/bash
srun: error: run_command: xauth poll timeout @ 100 msec
srun: error: Problem running xauth command. Cannot use X11 forwarding.
Comment 4 Jeff White 2018-02-01 13:23:30 MST
Also you may want to trash this FAQ entry or at least say it is only needed for <17.11

https://slurm.schedmd.com/faq.html#x11
Comment 5 Tim Wickberg 2018-02-01 13:37:19 MST
You do need to set PrologFlags=x11 in slurm.conf, and use 'scontrol restart' afterwards. You also need to either have password-less RSA SSH keys set up for all your users, or (recommended) RSA SSH hostkey authentication inside the cluster for this to work.
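A quick check that the flag is actually present in the configuration file might look like the sketch below. The helper name has_x11_prolog_flag and the /etc/slurm/slurm.conf path are assumptions, and the pattern assumes PrologFlags is written on a single uncommented line with comma-separated values.

```shell
# Hypothetical helper: succeed if the given slurm.conf has an uncommented
# PrologFlags line whose comma-separated value list includes x11.
# Usage: has_x11_prolog_flag /etc/slurm/slurm.conf   (path is an assumption)
has_x11_prolog_flag() {
    grep -Eiq '^[[:space:]]*PrologFlags[[:space:]]*=([^#]*,)?[[:space:]]*x11([[:space:]]*,|[[:space:]]*(#|$))' "$1"
}

# Assumed usage:
#   has_x11_prolog_flag /etc/slurm/slurm.conf || echo "PrologFlags=x11 is not set"
# The value the running daemons see can be inspected with:
#   scontrol show config | grep -i PrologFlags
```

This only inspects the file on one host; the same slurm.conf still has to be in place across the cluster.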

I'm guessing your home directories are coming from an NFS / GPFS / Lustre mount?

> laptop$ ssh -Y login-d1n01
> login-d1n01$ xeyes # works
> login-d1n01$ srun --x11 --pty /bin/bash
> srun: error: run_command: xauth poll timeout @ 100 msec
> srun: error: Problem running xauth command. Cannot use X11 forwarding.

It's hitting a 100ms timeout trying to retrieve the xauth cookie (done by running 'xauth l $DISPLAY') from ~/.Xauthority. This usually indicates that xauth is having a hard time acquiring locks on that file.
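To see whether a given login node is likely to trip that limit, the xauth call can be timed from the shell. The sketch below is an approximation only: the helper names elapsed_ms and within_ms are hypothetical, the measurement includes process start-up cost, it relies on GNU date's %N format, and Slurm's own timing of the command may differ.

```shell
# Hypothetical helper: run a command and print its wall-clock time in ms.
# Relies on GNU date's %N (nanoseconds), so a Linux host is assumed.
elapsed_ms() {
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Hypothetical helper: succeed if the command finishes within the limit (ms).
# Usage: within_ms LIMIT_MS command [args...]
within_ms() {
    limit="$1"; shift
    [ "$(elapsed_ms "$@")" -le "$limit" ]
}

# Assumed usage inside an X11-forwarded login session:
#   elapsed_ms xauth list "$DISPLAY"
#   within_ms 100 xauth list "$DISPLAY" || echo "xauth is slower than 100 ms here"
```

Running it several times gives a better picture than a single sample, since lock contention on a shared home directory tends to vary.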

Unfortunately that timer is hard-coded; this will change in the future, but if you want to increase it, it's the fourth argument to the two run_command() calls in src/common/x11_util.c.

I can't say whether that will resolve this in your environment though - we've seen a lot of variance in how SSH and X11 work between systems and various distros, and are working to accommodate that as best we can. You can always fall back to using the SPANK plugin if that proves more convenient.

- Tim
Comment 6 Jeff White 2018-02-01 13:39:13 MST
So if whatever I/O Slurm is doing with X11 takes longer than just 100ms it gives up?
Comment 7 Tim Wickberg 2018-02-01 13:41:48 MST
(In reply to Jeff White from comment #6)
> So if whatever I/O Slurm is doing with X11 takes longer than just 100ms it
> gives up?

This is specific to fetching the current xauth cookie, and setting it on the client nodes.

But yes, it was an oversight on my part. There will be a tunable setting for it in a future release.
Comment 8 Jeff White 2018-02-01 13:44:03 MST
ok, I believe that would mean the feature is useless to us as it is now.

Should probably add --x11 to srun's man page.  Assuming it is required like it is with the plugin.
Comment 9 Tim Wickberg 2018-02-01 13:53:29 MST
Out of curiosity, if you run:

time xauth p $DISPLAY

what do you get back?
Comment 10 Jeff White 2018-02-01 14:02:05 MST
$ time xauth p $DISPLAY
xauth: (argv):1:  unknown command "p"

real    0m1.158s
user    0m0.000s
sys     0m0.003s


Did you mean list?

$ time xauth list $DISPLAY
10.110.6.21:10  MIT-MAGIC-COOKIE-1  fe0e244746e1b29dd09c8c814b32abc5

real    0m0.953s
user    0m0.001s
sys     0m0.004s


In my case the problem is that my cluster's storage is complete trash.  $HOME is on that storage, presented via NFS.  Sometimes things are just slow.  So anything that fails because of that is unusable (hence my complaining in another bug about Slurm's "kill everything" behavior when a ping times out).
Comment 11 Tim Wickberg 2018-02-01 14:07:10 MST
Sorry, should have been 'l' not 'p'.

We generally avoid allowing processes to have unlimited time to complete their tasks; but I do expect to add a way to readily adjust that timeout.

Right now, if you set it to 10 seconds, that'd get you past that hurdle.

And yes, I'm working on the documentation right now, based on feedback and configuration quirks exposed from early adopters.

If you'd like to keep troubleshooting this, I'm happy to help. Or, I can also just close this out if you'd rather skip this feature for this release.

- Tim
Comment 12 Jeff White 2018-02-01 14:08:15 MST
I'll be skipping the feature.
Comment 15 Jason Booth 2019-09-17 17:10:49 MDT
The content of attachment 11613 [details] has been deleted for the following reason:

attached to wrong bug