Bug 5692

Summary: X11 forwarding error: _get_home: getpwuid_r(33560764):No error
Product: Slurm
Reporter: Jack Duvall <theduvallj>
Component: Scheduling
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 17.11.9   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: Verbose error log for issue
slurm.conf
Verbose slurmd.log, 18.08 version

Description Jack Duvall 2018-09-09 15:31:08 MDT
Created attachment 7790 [details]
Verbose error log for issue

Hello,

I'm having an issue setting up X11 forwarding. Slurm was compiled with the necessary libssh2 and libssh2-devel libraries, X11 forwarding is enabled with the PrologFlags=x11 setting, and srun itself doesn't complain, yet every graphical program fails with an error about not being able to open the display.

Digging further into the slurmd log on the node, I found the following (the full debug2-level log is attached):

> [2018-09-09T16:52:20.042] _run_prolog: run job script took usec=9
> [2018-09-09T16:52:20.042] _run_prolog: prolog with lock for job 2064 ran for 0 seconds
> [2018-09-09T16:52:20.063] [2064.extern] error: _get_home: getpwuid_r(33560764):No error
> [2018-09-09T16:52:20.063] [2064.extern] error: could not find HOME in environment
> [2018-09-09T16:52:20.064] [2064.extern] error: x11 port forwarding setup failed
> [2018-09-09T16:52:20.065] [2064.extern] error: _spawn_job_container: failed retrieving x11 display value: No error
> [2018-09-09T16:52:20.095] [2064.extern] done with job
> [2018-09-09T16:52:20.157] launch task 2064.0 request from 33560764.2019@198.38.16.106 (port 181)
> [2018-09-09T16:52:20.159] error: could not get x11 forwarding display for job 2064 step 0, x11 forwarding disabled
> [2018-09-09T16:52:20.216] [2064.0] done with job
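
The excerpt above can be narrowed to the errors logged by a job's own steps. The following is a minimal sketch, assuming the line layout seen in this report (`[timestamp] [jobid.step] message`); the names `LINE_RE`, `SAMPLE`, and `step_errors` are hypothetical helpers, not part of Slurm.

```python
import re

# Assumed layout, taken from the excerpt in this report:
#   [timestamp] [jobid.step] message   (the step tag is optional)
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+"
    r"(?:\[(?P<step>\d+\.\w+)\]\s+)?"
    r"(?P<msg>.*)$"
)

# Lines copied from the slurmd log quoted above.
SAMPLE = [
    "[2018-09-09T16:52:20.042] _run_prolog: prolog with lock for job 2064 ran for 0 seconds",
    "[2018-09-09T16:52:20.063] [2064.extern] error: _get_home: getpwuid_r(33560764):No error",
    "[2018-09-09T16:52:20.063] [2064.extern] error: could not find HOME in environment",
    "[2018-09-09T16:52:20.064] [2064.extern] error: x11 port forwarding setup failed",
    "[2018-09-09T16:52:20.065] [2064.extern] error: _spawn_job_container: failed retrieving x11 display value: No error",
    "[2018-09-09T16:52:20.095] [2064.extern] done with job",
    "[2018-09-09T16:52:20.159] error: could not get x11 forwarding display for job 2064 step 0, x11 forwarding disabled",
    "[2018-09-09T16:52:20.216] [2064.0] done with job",
]


def step_errors(lines, job_id):
    """Return (timestamp, step, message) for error lines tagged with a step of job_id."""
    prefix = f"{job_id}."
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        step, msg = m.group("step"), m.group("msg")
        # Untagged slurmd lines (step is None) are skipped on purpose.
        if step and step.startswith(prefix) and msg.startswith("error:"):
            found.append((m.group("ts"), step, msg[len("error:"):].strip()))
    return found


if __name__ == "__main__":
    for ts, step, msg in step_errors(SAMPLE, 2064):
        print(ts, step, msg)
```

On this sample, every error tagged with a step comes from the `2064.extern` step, and the first of them is the `_get_home` failure.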

The part about not being able to find $HOME is strange, because the output of `srun -n1 -p compute2 -w borgw201 --x11 /usr/bin/env | grep HOME`, the same command that produced the error above, says $HOME is defined.
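
One way to narrow this down is to check whether the UID resolves to a passwd entry on the compute node itself, independently of the job environment. POSIX `getpwuid_r` returns 0 and leaves the result NULL when no entry is found, which may be why the log prints "No error". The sketch below assumes it is run on the node (borgw201 in this report) as root or as the affected user; `lookup_home` is a hypothetical helper, not a Slurm function.

```python
import pwd


def lookup_home(uid, getpw=pwd.getpwuid):
    """Return the home directory for uid, or None if no passwd entry resolves.

    getpw is injectable so the function can be exercised without a real entry.
    """
    try:
        return getpw(uid).pw_dir
    except KeyError:
        # pwd.getpwuid raises KeyError when the lookup finds no entry.
        return None


if __name__ == "__main__":
    uid = 33560764  # UID taken from the log lines in this report
    home = lookup_home(uid)
    if home is None:
        print(f"uid {uid}: no passwd entry visible on this node")
    else:
        print(f"uid {uid}: home directory is {home}")
```

If the lookup finds no entry on the node while `srun ... /usr/bin/env` still shows HOME, the two values probably come from different sources: the variable in the job environment and the passwd lookup on the node.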

One thing of note about my setup: unlike in past reports of similar errors, users are not allowed to ssh to compute nodes. All authentication is done with Kerberos, so ssh keys don't exist. User home directories are also on a shared NFS drive. If this error is just a symptom of not having ssh key-based authentication, and X11 forwarding isn't supported under any other setup, that would be nice to know for certain.

I haven't had time yet, but I will try the latest version of Slurm (18.08.0) and close the issue if updating fixes it.

If not, I will attach my slurm.conf next, and can provide any other files requested.

Thanks,
-Jack Duvall
Comment 1 Jack Duvall 2018-09-09 15:32:45 MDT
Created attachment 7791 [details]
slurm.conf
Comment 2 Jack Duvall 2018-09-09 16:46:04 MDT
Update: 18.08.0 does not fix this issue for me. The logs are essentially unchanged; the only new error is one about failing to retrieve the x11 authority value.

New abbreviated log:

> [2018-09-09T18:40:31.983] _run_prolog: run job script took usec=169
> [2018-09-09T18:40:31.984] _run_prolog: prolog with lock for job 2067 ran for 0 seconds
> [2018-09-09T18:40:32.046] [2067.extern] error: _get_home: getpwuid_r(33560764):No error
> [2018-09-09T18:40:32.047] [2067.extern] error: could not find HOME in environment
> [2018-09-09T18:40:32.047] [2067.extern] error: x11 port forwarding setup failed
> [2018-09-09T18:40:32.065] [2067.extern] error: _spawn_job_container: failed retrieving x11 display value: No error
> [2018-09-09T18:40:32.065] [2067.extern] error: _spawn_job_container: failed retrieving x11 authority value: No error
> [2018-09-09T18:40:32.085] [2067.extern] done with job
> [2018-09-09T18:40:32.102] launch task 2067.0 request from UID:33560764 GID:2019 HOST:198.38.16.106 PORT:17035
> [2018-09-09T18:40:32.104] error: could not get x11 forwarding display for job 2067 step 0, x11 forwarding disabled
> [2018-09-09T18:40:33.843] [2067.0] done with job

debug2 version of log attached.
Comment 3 Jack Duvall 2018-09-09 16:46:36 MDT
Created attachment 7792 [details]
Verbose slurmd.log, 18.08 version