Bug 4418

Summary: X11 Forwarding: xauth timeout
Product: Slurm Reporter: Ben Matthews <matthews>
Component: slurmstepd Assignee: Tim Wickberg <tim>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex, felip.moll, griznog, kaizaad
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: UCAR
DevPrio: 3 - High

Description Ben Matthews 2017-11-22 15:07:05 MST
Locking on GPFS can be slow - I sometimes get this:

-bash-4.1$ salloc --x11 -N1 -n 20 -t 30 --exclusive -p dav --account=sssg0001 srun -N1 --pty /bin/bash -l
salloc: error: run_command: xauth poll timeout @ 100 msec
salloc: error: x11_get_xauth: Could not retrieve magic cookie. Cannot use X11 forwarding.


Shouldn't this timeout be configurable, or at least on the order of tens of seconds?
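
To judge how far off the 100 msec limit is on a given filesystem, one can time an xauth call by hand. The following is a minimal sketch, not part of Slurm: `elapsed_ms` is a hypothetical helper name, it relies on GNU `date` for nanosecond resolution (`%N`), and it assumes the command being timed out is roughly `xauth list "$DISPLAY"`, which this report does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: print how long a command takes, in milliseconds.
# Requires GNU date, since %N (nanoseconds) is not POSIX.
elapsed_ms() {
    start=$(date +%s%N)
    # Discard the command's own output so only the timing is printed.
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example (assumed invocation, run where $HOME is on the slow filesystem):
#   elapsed_ms xauth list "$DISPLAY"
```

If the printed value is regularly above 100, the poll timeout shown in the error message would be expected to trigger.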
Comment 1 Tim Wickberg 2017-12-05 22:46:46 MST
I believe you know where to patch the timeout if required?

I'm moving this into an enhancement request. I should probably have added an X11Parameters configuration option to give us a place to change these default values, but that will need to wait until 18.08 at this point.
Comment 2 Felip Moll 2017-12-07 09:16:30 MST
I found that this error is also raised when there are stale locks on the .Xauthority* files in your home directory.

The workaround is to break the stale locks with xauth's '-b' option, or to remove these files directly.

[slurm@moll0 ~]$ ls .Xauthority* -lah
-rw------- 1 slurm slurm 0  7 des 17:08 .Xauthority
-rw------- 1 slurm slurm 0  7 des 17:10 .Xauthority-c
-rw------- 1 slurm slurm 0  7 des 17:01 .Xauthority-l

To check whether this is the problem, run 'strace xauth'; it will show multiple EEXIST errors like this one:

open("/nfs/home/slurm/.Xauthority-c", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
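
The check described above can also be done without strace. This is a rough sketch under stated assumptions: `find_stale_xauth_locks` is a hypothetical helper name, the 5-minute default threshold is arbitrary, and `find -mmin` is available in GNU and BSD find but is not POSIX. It only lists lock files (`-c` and `-l` suffixes, as in the listing above) that have not been modified recently; it does not remove anything.

```shell
#!/bin/sh
# Hypothetical helper: list xauth lock files older than a threshold.
# Usage: find_stale_xauth_locks [authfile] [min_age_minutes]
find_stale_xauth_locks() {
    authfile="${1:-${XAUTHORITY:-$HOME/.Xauthority}}"
    min_age="${2:-5}"   # arbitrary default, in minutes
    for lock in "$authfile-c" "$authfile-l"; do
        # find -mmin +N matches files last modified more than N minutes ago.
        if [ -e "$lock" ] && [ -n "$(find "$lock" -mmin +"$min_age" 2>/dev/null)" ]; then
            printf '%s\n' "$lock"
        fi
    done
}

# Example:
#   find_stale_xauth_locks            # checks $XAUTHORITY or ~/.Xauthority
#   xauth -b list                     # break the locks, then list entries
```

If the function prints any paths while no xauth process is running, the locks are probably stale, and the '-b' option mentioned above should clear them.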
Comment 3 John Hanks 2018-01-19 21:38:04 MST
I'm seeing this today, although a week ago everything was working fine. 

[griznog@smsx10srw-srcf-d15-37 ~]$ srun --pty --x11 --time=1:00:00 xterm
srun: error: run_command: xauth poll timeout @ 100 msec
srun: error: x11_get_xauth: Could not retrieve magic cookie. Cannot use X11 forwarding.

We have $HOME on GPFS so I tried increasing the timeout, but even at 10 seconds I still get the same error. There doesn't seem to be an issue with .Xauthority and I can 'ssh -Y' to a node and X forwarding back works normally. Any other suggestions on how to get this to work again? I'm on 17.11.02.
Comment 5 Tim Wickberg 2018-06-27 11:47:06 MDT
Hey folks -

I'm tagging this as a duplicate of the X11 catch-all bug 3647. As mentioned on there, 18.08 will have an X11Parameters option that gives us a place to add settings to change these timers.

Also mentioned on there, I may add support for creating separate XAUTHORITY environment variables/files on the compute nodes, which should reduce contention on various filesystems for locking around ~/.Xauthority.

*** This bug has been marked as a duplicate of bug 3647 ***
Comment 6 Ben Matthews 2018-06-27 12:09:36 MDT
This would be very helpful. 

> 
> Also mentioned on there, I may add support for creating separate XAUTHORITY
> environment variables/files on the compute nodes, which should reduce
> contention on various filesystem for locking around ~/.Xauthority.
> 
> *** This bug has been marked as a duplicate of bug 3647 ***