I am trying to build Slurm using rpmbuild, from the Slurm tarball with the default configure and compile options. I noticed X11 forwarding within a job was not working. It looks like Slurm couldn't build it for some reason:

configure:22006: checking whether Slurm internal X11 support is enabled
configure:22021: result: yes
configure:22043: checking for libssh2 installation
configure:22090: result:
configure:22094: WARNING: unable to locate libssh2 installation
configure:22096: WARNING: Slurm internal X11 support disabled

How can I get Slurm's internal X11 support to compile?
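For reference, I'm building straight from the tarball, roughly like this (the exact version string is whatever you downloaded; mine matches the attached config.log):

$ rpmbuild -ta slurm-17.11.2.tar.bz2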
Created attachment 6052 [details] slurm-17.11.2_config.log
You'll need to have the libssh2 package installed throughout the cluster, as well as the libssh2-devel package installed on whatever machine you're compiling on.
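For example, on an RPM-based distro (which your use of rpmbuild suggests; package names here assume the stock RHEL/CentOS/EPEL repositories):

build-host$ yum install libssh2 libssh2-devel   # headers needed at configure time
every-node$ yum install libssh2                 # runtime library, needed cluster-wide

Then rerun rpmbuild and check config.log to confirm the libssh2 check no longer emits the warning.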
Thanks for that information. I was able to get X11 support to compile, but it doesn't seem to work. Is there something else needed to enable it? Here's what I see:

laptop$ ssh -Y login-d1n01
login-d1n01$ xeyes   # works
login-d1n01$ ssh -Y sn1
sn1$ xeyes           # works

laptop$ ssh -Y login-d1n01
login-d1n01$ xeyes   # works
login-d1n01$ srun --pty /bin/bash
sn1$ xeyes           # fails
Error: Can't open display: 10.110.19.121:11.0

laptop$ ssh -Y login-d1n01
login-d1n01$ xeyes   # works
login-d1n01$ srun --x11 --pty /bin/bash
srun: error: run_command: xauth poll timeout @ 100 msec
srun: error: Problem running xauth command. Cannot use X11 forwarding.
Also, you may want to trash this FAQ entry, or at least note that it only applies to versions older than 17.11: https://slurm.schedmd.com/faq.html#x11
You do need to set PrologFlags=x11 in slurm.conf, and restart the Slurm daemons afterwards. You also need either password-less RSA SSH keys set up for all your users, or (recommended) RSA SSH hostkey authentication inside the cluster, for this to work. I'm guessing your home directories are coming from an NFS / GPFS / Lustre mount?

> laptop$ ssh -Y login-d1n01
> login-d1n01$ xeyes # works
> login-d1n01$ srun --x11 --pty /bin/bash
> srun: error: run_command: xauth poll timeout @ 100 msec
> srun: error: Problem running xauth command. Cannot use X11 forwarding.

It's hitting a 100 msec timeout trying to retrieve the xauth cookie from ~/.Xauthority (done by running 'xauth l $DISPLAY'). This usually indicates trouble grabbing locks on that file. Unfortunately that timer is hard-coded; this will change in the future, but if you want to increase it, it's the fourth argument to the two run_command() calls in src/common/x11_util.c. I can't say whether that will resolve this in your environment, though - we've seen a lot of variance in how SSH and X11 behave between systems and various distros, and are working to accommodate that as best we can. You can always fall back to using the SPANK plugin if that proves more convenient.

- Tim
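To spell out the config piece, the change is minimal (a sketch, assuming the default /etc/slurm/slurm.conf path; if you already set PrologFlags, append x11 to the existing comma-separated list):

# /etc/slurm/slurm.conf
PrologFlags=x11

followed by a daemon restart, e.g. with the usual systemd unit names:

controller# systemctl restart slurmctld
each-node# systemctl restart slurmd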
So if whatever I/O Slurm is doing with X11 takes longer than just 100ms it gives up?
(In reply to Jeff White from comment #6)
> So if whatever I/O Slurm is doing with X11 takes longer than just 100ms it
> gives up?

This is specific to fetching the current xauth cookie, and setting it on the client nodes. But yes, it was an oversight on my part. There will be a tunable setting for it in a future release.
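To illustrate, what it does is roughly equivalent to running this by hand (a sketch; the display name and cookie below are placeholders, not what Slurm actually passes):

login-d1n01$ xauth list $DISPLAY                            # fetch the cookie on the submission host
sn1$ xauth add <display> MIT-MAGIC-COOKIE-1 <cookie>        # install it on the compute node

The 100 msec timeout applies to each of those xauth invocations.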
OK, I believe that means the feature is unusable for us as it stands. You should probably add --x11 to srun's man page, assuming it is required like it is with the plugin.
Out of curiosity, if you run:

$ time xauth p $DISPLAY

what do you get back?
$ time xauth p $DISPLAY
xauth: (argv):1: unknown command "p"

real    0m1.158s
user    0m0.000s
sys     0m0.003s

Did you mean list?

$ time xauth list $DISPLAY
10.110.6.21:10  MIT-MAGIC-COOKIE-1  fe0e244746e1b29dd09c8c814b32abc5

real    0m0.953s
user    0m0.001s
sys     0m0.004s

In my case the problem is that my cluster's storage is complete trash. $HOME is on that storage, presented via NFS. Sometimes things are just slow, so anything that fails because of that is unusable (hence my complaining in another bug about Slurm's "kill everything" behavior when a ping times out).
Sorry, that should have been 'l', not 'p'.

We generally avoid allowing processes unlimited time to complete their tasks, but I do expect to add a way to readily adjust that timeout. Right now, if you set it to 10 seconds, that'd get you past that hurdle.

And yes, I'm working on the documentation right now, based on feedback and configuration quirks exposed by early adopters.

If you'd like to keep troubleshooting this, I'm happy to help. Or I can just close this out if you'd rather skip this feature for this release.

- Tim
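If you want to try bumping the timeout in the meantime, a rough recipe (a sketch against a 17.11 source tree; the exact call sites may differ by release):

$ tar xjf slurm-17.11.2.tar.bz2
$ grep -n 'run_command(' slurm-17.11.2/src/common/x11_util.c
# edit the fourth argument of each of the two calls from 100 to 10000 (msec), then re-tar and rebuild:
$ tar cjf slurm-17.11.2.tar.bz2 slurm-17.11.2
$ rpmbuild -ta slurm-17.11.2.tar.bz2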
I'll be skipping the feature.
The content of attachment 11613 [details] has been deleted for the following reason: attached to wrong bug