Bug 4418 - X11 Forwarding: xauth timeout
Summary: X11 Forwarding: xauth timeout
Status: RESOLVED DUPLICATE of bug 3647
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.x
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2017-11-22 15:07 MST by Ben Matthews
Modified: 2018-06-27 12:09 MDT
Site: UCAR
DevPrio: 3 - High

Description Ben Matthews 2017-11-22 15:07:05 MST
Locking on GPFS can be slow - I sometimes get this:

-bash-4.1$ salloc --x11 -N1 -n 20 -t 30 --exclusive -p dav --account=sssg0001 srun -N1 --pty /bin/bash -l
salloc: error: run_command: xauth poll timeout @ 100 msec
salloc: error: x11_get_xauth: Could not retrieve magic cookie. Cannot use X11 forwarding.


Shouldn't this timeout be configurable, or at least be on the order of tens of seconds?
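
To illustrate the failure mode outside Slurm, here is a minimal sketch in Python. It is not Slurm's run_command; `run_with_timeout` and `slow_cmd` are hypothetical names, and the half-second sleep is only a stand-in for an xauth call that is stalled on a slow filesystem lock. It shows that the same command fails with a 100 msec budget and succeeds with a longer one.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_sec):
    """Run argv and return its stdout, or None if it exceeds timeout_sec."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_sec, check=False)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return None
    return done.stdout


# Assumption: a child that sleeps 0.5 s stands in for a slow xauth call.
slow_cmd = [sys.executable, "-c",
            "import time; time.sleep(0.5); print('cookie')"]

if __name__ == "__main__":
    print("100 msec budget:", run_with_timeout(slow_cmd, 0.1))
    print("10 sec budget:  ", run_with_timeout(slow_cmd, 10.0))
```

With a configurable budget, a site with slow locking could raise the value without patching the source.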
Comment 1 Tim Wickberg 2017-12-05 22:46:46 MST
I believe you know where to patch the timeout if required?

I'm moving this into an enhancement request. I should probably have added an X11Parameters configuration option to give us a place to change these default values, but that will need to wait until 18.08 at this point.
Comment 2 Felip Moll 2017-12-07 09:16:30 MST
I found that this error is also raised when there are stale locks on the .Xauthority* files in your home directory.

The workaround is to break the stale locks with the xauth '-b' option, or to remove these files directly.

[slurm@moll0 ~]$ ls .Xauthority* -lah
-rw------- 1 slurm slurm 0  7 des 17:08 .Xauthority
-rw------- 1 slurm slurm 0  7 des 17:10 .Xauthority-c
-rw------- 1 slurm slurm 0  7 des 17:01 .Xauthority-l

To check whether this is the problem, run 'strace xauth'; it will show multiple EEXIST errors like this one:

open("/nfs/home/slurm/.Xauthority-c", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
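
The same check can be done without strace by looking at the age of the lock files. A minimal sketch, assuming Python 3; `find_stale_xauth_locks` is a hypothetical helper, and the 60 second threshold is an arbitrary assumption rather than a value taken from xauth or Slurm:

```python
import time
from pathlib import Path


def find_stale_xauth_locks(home, max_age_sec=60.0, now=None):
    """Return .Xauthority-c / .Xauthority-l files older than max_age_sec."""
    now = time.time() if now is None else now
    stale = []
    for suffix in ("-c", "-l"):
        lock = Path(home) / f".Xauthority{suffix}"
        try:
            age = now - lock.stat().st_mtime
        except FileNotFoundError:
            continue  # no lock file of this kind, nothing to report
        if age > max_age_sec:
            stale.append(lock)
    return stale


if __name__ == "__main__":
    for lock in find_stale_xauth_locks(Path.home()):
        print("possibly stale lock:", lock)
```

A lock file that is reported here while no xauth process is running is a candidate for 'xauth -b' or manual removal.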
Comment 3 John Hanks 2018-01-19 21:38:04 MST
I'm seeing this today, although a week ago everything was working fine. 

[griznog@smsx10srw-srcf-d15-37 ~]$ srun --pty --x11 --time=1:00:00 xterm
srun: error: run_command: xauth poll timeout @ 100 msec
srun: error: x11_get_xauth: Could not retrieve magic cookie. Cannot use X11 forwarding.

We have $HOME on GPFS so I tried increasing the timeout, but even at 10 seconds I still get the same error. There doesn't seem to be an issue with .Xauthority and I can 'ssh -Y' to a node and X forwarding back works normally. Any other suggestions on how to get this to work again? I'm on 17.11.02.
Comment 5 Tim Wickberg 2018-06-27 11:47:06 MDT
Hey folks -

I'm tagging this as a duplicate of the X11 catch-all bug 3647. As mentioned on there, 18.08 will have an X11Parameters option that gives us a place to add settings to change these timers.

Also mentioned on there, I may add support for creating separate XAUTHORITY environment variables/files on the compute nodes, which should reduce contention on various filesystems for locking around ~/.Xauthority.
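
As a rough illustration of the separate-XAUTHORITY idea (not the planned Slurm implementation), the sketch below creates a private authority file and returns an environment that points XAUTHORITY at it. It assumes Python 3; `private_xauthority_env` is a hypothetical helper, and placing the file in a node-local temporary directory is an assumption.

```python
import os
import tempfile


def private_xauthority_env(base_env=None, directory=None):
    """Create an empty private authority file and return an env using it."""
    env = dict(os.environ if base_env is None else base_env)
    # mkstemp creates the file with mode 0600, readable by the owner only.
    fd, path = tempfile.mkstemp(prefix="xauth-", dir=directory)
    os.close(fd)
    env["XAUTHORITY"] = path
    return env


if __name__ == "__main__":
    env = private_xauthority_env()
    print("XAUTHORITY =", env["XAUTHORITY"])
```

An environment like this could be passed to an 'xauth add' invocation and to the X client, so that neither touches the shared ~/.Xauthority or its lock files.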

*** This bug has been marked as a duplicate of bug 3647 ***
Comment 6 Ben Matthews 2018-06-27 12:09:36 MDT
This would be very helpful. 

> 
> Also mentioned on there, I may add support for creating separate XAUTHORITY
> environment variables/files on the compute nodes, which should reduce
> contention on various filesystems for locking around ~/.Xauthority.
> 
> *** This bug has been marked as a duplicate of bug 3647 ***