Bug 8729

Summary: Cannot run interactive job
Product: Slurm Reporter: ssingh
Component: User Commands Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: ssingh
Version: 20.02.0   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Bull/Atos Sites: --- Confidential Site: ---
Cray Sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
SFW Sites: --- SNIC sites: ---
Linux Distro: CentOS Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---

Description ssingh 2020-03-25 07:19:29 MDT
CentOS 7.7.1908
Slurm 18.08.8
 
When trying to run an interactive job I am getting the following error:
 
srun: error: task 0 launch failed: Slurmd could not connect IO
 
Checking the log file on the compute node I see the following error:
 
[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 HOST:192.168.229.254 PORT:14980
[2020-03-25T01:42:08.262] lllp_distribution jobid [13] implicit auto binding: cores,one_thread, dist 8192
[2020-03-25T01:42:08.262] _task_layout_lllp_cyclic
[2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid [13]: mask_cpu,one_thread, 0x0000000000000001
[2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5
[2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 seconds
[2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket
[2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 4021
[2020-03-25T01:42:08.315] [13.0] done with job
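
To isolate the failing step in a longer slurmd log, the error lines can be filtered out with a short script. This is only a sketch: it assumes the line layout seen in the excerpt above (`[timestamp] [job.step] error: message`), and the names `LINE_RE` and `step_errors` are made up for illustration, not part of Slurm.

```python
import re

# Assumed layout, taken from the excerpt above:
#   [timestamp] [job.step] error: message
# The "[job.step]" part is optional because some slurmd lines lack it.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?:\[(?P<step>[^\]]+)\] )?error: (?P<msg>.*)$"
)


def step_errors(lines, step=None):
    """Return (timestamp, step, message) for each error line.

    If `step` is given (for example "13.0"), only errors logged
    for that job step are returned.
    """
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and (step is None or m.group("step") == step):
            found.append((m.group("ts"), m.group("step"), m.group("msg")))
    return found
```

Feeding it the lines of the slurmd log file with `step="13.0"` would return the three error entries shown above, starting with the `stdin openpty` failure.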
 
On a cluster running CentOS 7.3 and Slurm 18.08.4, the same interactive job runs as expected.
 
Any advice on how to remedy this would be appreciated.
 
-Sajesh-
Comment 1 ssingh 2020-03-25 07:25:59 MDT
I have also looked at past reports of this issue. The firewall between the two hosts is not the cause: the interface the compute nodes use to communicate with the head node accepts all traffic, and stopping the firewall entirely for a test gave the same result.

-Sajesh-
Comment 2 ssingh 2020-03-25 07:57:13 MDT
Running strace against slurmd on the compute host shows that this might be the problem:

chown("/dev/pts/1", 1326, 7)      = -1 EPERM (Operation not permitted)

Not sure how to fix this, as I can ssh to the compute node as the user with UID 1326.
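
One way to check whether the failure can be reproduced outside of Slurm is to allocate a pseudo-terminal and attempt the same ownership change by hand. The sketch below assumes that the failing `chown` targets the slave side of a freshly opened pty, which the strace line suggests but does not prove. The helper name `check_pty_chown` is hypothetical, and this is not how slurmd itself is implemented.

```python
import errno
import os


def check_pty_chown(uid, gid):
    """Open a pty pair and try to chown the slave device to uid:gid.

    Returns (slave_path, None) if chown succeeds, or
    (slave_path, errno_name) if it fails, e.g. "EPERM".
    """
    master_fd, slave_fd = os.openpty()
    try:
        slave_path = os.ttyname(slave_fd)
        try:
            os.chown(slave_path, uid, gid)
        except OSError as exc:
            # Translate the numeric errno into its symbolic name.
            return slave_path, errno.errorcode.get(exc.errno, str(exc.errno))
        return slave_path, None
    finally:
        # Always release both ends of the pty.
        os.close(slave_fd)
        os.close(master_fd)
```

Calling `check_pty_chown(1326, 7)` on the compute node, as the same user that slurmd runs as, mirrors the UID and GID from the strace output. If it also reports `EPERM`, the restriction would appear to lie with the node's pty or permission setup rather than with Slurm's network I/O.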

-Sajesh-
Comment 3 ssingh 2020-03-25 10:17:32 MDT
Updated to Slurm 20.02.0, but the issue persists.