Bug 5692 - X11 forwarding error: _get_home: getpwuid_r(33560764):No error
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.9
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2018-09-09 15:31 MDT by Jack Duvall
Modified: 2018-09-09 16:46 MDT (History)
Site: -Other-

Attachments
Verbose error log for issue (11.62 KB, text/plain)
2018-09-09 15:31 MDT, Jack Duvall
slurm.conf (4.00 KB, text/plain)
2018-09-09 15:32 MDT, Jack Duvall
Verbose slurmd.log, 18.08 version (12.05 KB, text/plain)
2018-09-09 16:46 MDT, Jack Duvall

Description Jack Duvall 2018-09-09 15:31:08 MDT
Created attachment 7790
Verbose error log for issue

Hello,

I'm having an issue setting up X11 forwarding. Slurm was compiled against libssh2 (with libssh2-devel installed), X11 forwarding is enabled with PrologFlags=x11, and srun itself doesn't complain, yet every graphical program fails with an error about not being able to open the display.

Digging further into the slurmd log on the node, I found the following (the full debug2-level log is attached):

> [2018-09-09T16:52:20.042] _run_prolog: run job script took usec=9
> [2018-09-09T16:52:20.042] _run_prolog: prolog with lock for job 2064 ran for 0 seconds
> [2018-09-09T16:52:20.063] [2064.extern] error: _get_home: getpwuid_r(33560764):No error
> [2018-09-09T16:52:20.063] [2064.extern] error: could not find HOME in environment
> [2018-09-09T16:52:20.064] [2064.extern] error: x11 port forwarding setup failed
> [2018-09-09T16:52:20.065] [2064.extern] error: _spawn_job_container: failed retrieving x11 display value: No error
> [2018-09-09T16:52:20.095] [2064.extern] done with job
> [2018-09-09T16:52:20.157] launch task 2064.0 request from 33560764.2019@198.38.16.106 (port 181)
> [2018-09-09T16:52:20.159] error: could not get x11 forwarding display for job 2064 step 0, x11 forwarding disabled
> [2018-09-09T16:52:20.216] [2064.0] done with job

The part about not being able to find $HOME is strange, because `srun -n1 -p compute2 -w borgw201 --x11 /usr/bin/env | grep HOME`, the same command that produced the errors above, shows that $HOME is defined in the job's environment.
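
To check whether the UID resolves at all on the compute node, independent of Slurm, something like the following could be run there. This is only a minimal sketch, not Slurm's actual lookup code: `resolve_home` is a hypothetical helper, and the order it tries (passwd database first, then a HOME variable) is an assumption based on the order of the two error messages above. For what it's worth, POSIX `getpwuid_r` returns 0 when no entry matches the UID, which might explain the "No error" text in the log.

```python
import os
import pwd


def resolve_home(uid, environ=None):
    """Return (home_dir, source) for uid, or (None, "unresolved").

    Hypothetical helper: tries the passwd database first, then HOME.
    """
    env = os.environ if environ is None else environ
    try:
        # pwd.getpwuid goes through the C library's passwd lookup,
        # so it sees whatever NSS sources this host is configured with.
        return pwd.getpwuid(uid).pw_dir, "passwd"
    except KeyError:
        # No passwd entry is visible for this UID on this host.
        pass
    home = env.get("HOME")
    if home:
        return home, "environment"
    return None, "unresolved"


if __name__ == "__main__":
    # Run as the affected user on the compute node.
    print(resolve_home(os.getuid()))
```

If this reports "passwd" in an interactive shell on the node but the extern step still fails, the difference is presumably in the environment or name service setup that the step daemon sees, rather than in the user's account itself.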

One thing to note about my setup: unlike in the earlier reports of X11 forwarding errors, users here are not allowed to ssh to compute nodes. All authentication is done with Kerberos, so ssh keys don't exist. User home directories are on a shared NFS mount. If this error is just a symptom of not having ssh key-based authentication, and X11 forwarding isn't supported under any other setup, that would be good to know for certain.

I haven't had time yet, but I will try the latest version of Slurm (18.08.0) and close the issue if updating fixes it.

If not, I will attach my slurm.conf next, and can provide any other files requested.

Thanks,
-Jack Duvall
Comment 1 Jack Duvall 2018-09-09 15:32:45 MDT
Created attachment 7791
slurm.conf
Comment 2 Jack Duvall 2018-09-09 16:46:04 MDT
Update: 18.08.0 does not fix this issue for me. Nothing seemed to change drastically in the logs either.

New abbreviated log:

> [2018-09-09T18:40:31.983] _run_prolog: run job script took usec=169
> [2018-09-09T18:40:31.984] _run_prolog: prolog with lock for job 2067 ran for 0 seconds
> [2018-09-09T18:40:32.046] [2067.extern] error: _get_home: getpwuid_r(33560764):No error
> [2018-09-09T18:40:32.047] [2067.extern] error: could not find HOME in environment
> [2018-09-09T18:40:32.047] [2067.extern] error: x11 port forwarding setup failed
> [2018-09-09T18:40:32.065] [2067.extern] error: _spawn_job_container: failed retrieving x11 display value: No error
> [2018-09-09T18:40:32.065] [2067.extern] error: _spawn_job_container: failed retrieving x11 authority value: No error
> [2018-09-09T18:40:32.085] [2067.extern] done with job
> [2018-09-09T18:40:32.102] launch task 2067.0 request from UID:33560764 GID:2019 HOST:198.38.16.106 PORT:17035
> [2018-09-09T18:40:32.104] error: could not get x11 forwarding display for job 2067 step 0, x11 forwarding disabled
> [2018-09-09T18:40:33.843] [2067.0] done with job

debug2 version of log attached.
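
For picking the relevant lines out of the full debug2 log, a small filter like the one below may help. It is a sketch under assumptions: `step_errors` is a hypothetical helper, and the regular expression is based only on the line layout visible in the excerpts above (`[timestamp] [job.step] error: message`), so other slurmd log formats may need a different pattern.

```python
import re

# Matches lines such as:
# [2018-09-09T18:40:32.046] [2067.extern] error: _get_home: ...
STEP_ERROR = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<job>\d+)\.(?P<step>\w+)\] error: (?P<msg>.*)$"
)


def step_errors(lines, job_id):
    """Return (timestamp, step, message) for each step-level error of job_id."""
    found = []
    for line in lines:
        m = STEP_ERROR.match(line.strip())
        if m and int(m.group("job")) == job_id:
            found.append((m.group("ts"), m.group("step"), m.group("msg")))
    return found


# Lines copied from the abbreviated log above.
sample = [
    "[2018-09-09T18:40:31.984] _run_prolog: prolog with lock for job 2067 ran for 0 seconds",
    "[2018-09-09T18:40:32.046] [2067.extern] error: _get_home: getpwuid_r(33560764):No error",
    "[2018-09-09T18:40:32.047] [2067.extern] error: could not find HOME in environment",
    "[2018-09-09T18:40:32.104] error: could not get x11 forwarding display for job 2067 step 0, x11 forwarding disabled",
    "[2018-09-09T18:40:33.843] [2067.0] done with job",
]

if __name__ == "__main__":
    for ts, step, msg in step_errors(sample, 2067):
        print(ts, step, msg)
```

Only lines carrying a `[job.step]` tag are matched, so the slurmd-level error about the forwarding display is deliberately left out; the pattern would need loosening to include it.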
Comment 3 Jack Duvall 2018-09-09 16:46:36 MDT
Created attachment 7792
Verbose slurmd.log, 18.08 version