Bug 8729 - Cannot run interactive job
Summary: Cannot run interactive job
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.02.0
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-03-25 07:19 MDT by ssingh
Modified: 2020-03-25 10:17 MDT

See Also:
Site: -Other-
Linux Distro: CentOS


Description ssingh 2020-03-25 07:19:29 MDT
CentOS 7.7.1908
Slurm 18.08.8
 
When trying to run an interactive job I am getting the following error:
 
srun: error: task 0 launch failed: Slurmd could not connect IO
 
Checking the log file on the compute node I see the following error:
 
[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 HOST:192.168.229.254 PORT:14980
[2020-03-25T01:42:08.262] lllp_distribution jobid [13] implicit auto binding: cores,one_thread, dist 8192
[2020-03-25T01:42:08.262] _task_layout_lllp_cyclic
[2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid [13]: mask_cpu,one_thread, 0x0000000000000001
[2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5
[2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 seconds
[2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket
[2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 4021
[2020-03-25T01:42:08.315] [13.0] done with job
 
On a cluster running CentOS 7.3 and Slurm 18.08.4, the same interactive job runs as expected.
 
Any advice on how to remedy this would be appreciated.
 
-Sajesh-
Comment 1 ssingh 2020-03-25 07:25:59 MDT
I have also looked at past reports of this issue. The firewall between the two hosts does not appear to be the cause: the interface the compute nodes use to communicate with the head node accepts all traffic, and I get the same result with the firewall stopped.

-Sajesh-
Comment 2 ssingh 2020-03-25 07:57:13 MDT
Running strace against slurmd on the compute host shows that this might be the problem:

chown("/dev/pts/1", 1326, 7)      = -1 EPERM (Operation not permitted)

I am not sure how to fix this, as I can ssh to the compute node as the user with UID 1326.
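
To check whether the failure can be reproduced outside of slurmd, here is a minimal Python sketch that opens a pty pair and attempts the same chown on the slave side. The function name try_pty_chown is made up for illustration. The UID 1326 and GID 7 are taken from the strace output above. The sketch assumes it is run on the compute node as the same account slurmd runs as (usually root); run as an ordinary user, a chown to another UID is expected to fail regardless, so that result would not be meaningful.

```python
import errno
import os


def try_pty_chown(uid, gid):
    """Open a pty pair and chown the slave device to uid:gid.

    Returns 0 on success, or the errno value of the first call that fails.
    This only approximates what the strace suggests slurmd is doing.
    """
    try:
        master_fd, slave_fd = os.openpty()
    except OSError as exc:
        return exc.errno
    try:
        slave_path = os.ttyname(slave_fd)  # for example /dev/pts/1
        os.chown(slave_path, uid, gid)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(slave_fd)
        os.close(master_fd)


if __name__ == "__main__":
    # UID and GID as seen in the failing chown() call in the strace output.
    rc = try_pty_chown(1326, 7)
    if rc == 0:
        print("chown on pty slave succeeded")
    else:
        print(f"failed: {errno.errorcode.get(rc, rc)} ({os.strerror(rc)})")
```

If this also reports EPERM when run as root, the restriction is probably coming from the node's environment (for example how /dev/pts is mounted, or a security policy) rather than from Slurm itself, but that is a guess and not something confirmed in this report.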

-Sajesh-
Comment 3 ssingh 2020-03-25 10:17:32 MDT
Updated to Slurm 20.02.0, but the issue still persists.