Bug 14134 - salloc --x11 followed by executing, say, xeyes, fails if job is submitted from the control node
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 21.08.5
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
Reported: 2022-05-21 23:20 MDT by selva.nair
Modified: 2022-05-22 09:06 MDT


Description selva.nair 2022-05-21 23:20:54 MDT
Note: this looks suspiciously similar to bug 13342, but I'm not sure of the rules here, so I'm posting it as a new bug. Please merge the two if desired.

slurm.conf has:
LaunchParameters=use_interactive_step
PrologFlags=X11
InteractiveStepOptions="--interactive --pty --preserve-env --mpi=none $SHELL"

"salloc --x11" runs successfully and connects the user to the allocated node, but running any X program on that node fails with
"Error: Can't open display: localhost:xx.0".
DISPLAY is set correctly and xauth has the right cookie copied from the source host, so nothing looks obviously wrong. The Slurm logs show this error:

[2022-05-22T00:42:29.439] [91.extern] error: _x11_socket_read: slurm_open_msg_conn: Connection refused
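
As a diagnostic aid, the DISPLAY value can be translated into the TCP port an X client will try: by X11 convention, display N on a TCP host maps to port 6000 + N. The following Python sketch does that translation and attempts a plain TCP connect. The helper names are hypothetical and not part of Slurm, and a successful connect only shows that something is listening on the node, not that the path back to the submission host works.

```python
import socket
from typing import Tuple

X11_BASE_PORT = 6000  # X11 convention: display N listens on TCP port 6000 + N


def display_to_tcp(display: str) -> Tuple[str, int]:
    """Split a DISPLAY value like 'localhost:10.0' into (host, tcp_port)."""
    host, sep, rest = display.rpartition(":")
    if not sep or not rest:
        raise ValueError("not a DISPLAY value: %r" % display)
    # Drop the optional '.screen' suffix and keep the display number.
    display_number = int(rest.split(".", 1)[0])
    return host, X11_BASE_PORT + display_number


def listener_reachable(display: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connect to the display's port succeeds."""
    host, port = display_to_tcp(display)
    if not host:
        # An empty host means a local Unix socket, which this sketch skips.
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On the compute node, calling `listener_reachable(os.environ["DISPLAY"])` (after `import os`) would show whether the local listener accepts connections at all.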

Interestingly, everything works fine if the job is submitted from a host other than the Slurm control host. This was tested while ensuring that the job lands on the same compute node.

Mine is a smallish cluster, and the login node is also the Slurm control node.

Before anyone asks: without --x11, running "salloc /usr/bin/bash" followed by "ssh -X $SLURM_NODELIST xeyes" does work, irrespective of the submission node.
Comment 1 selva.nair 2022-05-22 09:06:12 MDT
Solved: all nodes had a line in /etc/hosts of the form

127.0.1.1  <nodename>

Ubuntu adds this when the hostname is set. Removing the line on the controller fixes the issue.

I guess this entry makes the controller provide 127.0.1.1 as the IP address of the starting host to slurmstepd. I'm not sure why only X11 forwarding is affected.
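
For anyone checking their own nodes, here is a minimal Python sketch that scans hosts-file text for lines mapping a given hostname to a loopback address (anything in 127.0.0.0/8, or ::1). The function name and the sample hostname `ctl01` are made up for illustration. It only inspects the file contents and does not reproduce how Slurm itself resolves addresses.

```python
import ipaddress
from typing import List


def loopback_entries(hosts_text: str, hostname: str) -> List[str]:
    """Return loopback addresses that hosts_text maps to hostname."""
    hits = []
    for raw in hosts_text.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # not a parseable address, skip the line
        if ip.is_loopback and hostname in names:
            hits.append(addr)
    return hits


# Hypothetical sample resembling the layout described above.
sample = """\
127.0.0.1   localhost
127.0.1.1   ctl01
10.0.0.5    ctl01   # LAN address
"""
print(loopback_entries(sample, "ctl01"))
```

On a real node, the same check would be `loopback_entries(open("/etc/hosts").read(), socket.gethostname())` after `import socket`; a non-empty result points at the kind of entry that was removed here.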