When trying to run an interactive job I am getting the following error:
srun: error: task 0 launch failed: Slurmd could not connect IO
Checking the log file on the compute node I see the following error:
[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 HOST:192.168.229.254 PORT:14980
[2020-03-25T01:42:08.262] lllp_distribution jobid  implicit auto binding: cores,one_thread, dist 8192
[2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid : mask_cpu,one_thread, 0x0000000000000001
[2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5
[2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 seconds
[2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket
[2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 4021
[2020-03-25T01:42:08.315] [13.0] done with job
The same interactive job runs as expected on a cluster with CentOS 7.3 and Slurm 18.08.4.
Any advice on how to remedy this would be appreciated.
I have also looked at past reports of this issue. The firewall between the two hosts is not the cause: the interface the compute nodes use to communicate with the head node accepts all traffic, and stopping the firewall entirely to test gave the same result.
Running strace against slurmd on the compute host shows that this might be the problem:
chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)
I am not sure how to fix this, since I can ssh to the compute node as the user with UID 1326.
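To check whether the failing chown can be reproduced outside of Slurm, a small script along these lines might help. This is only a minimal sketch, not how slurmd itself is implemented: it assumes it is run as root on the compute node (as slurmd would be), the helper name try_pty_chown is made up for illustration, and the UID 1326 and GID 7 are simply copied from the strace line above.

```python
import os


def try_pty_chown(uid, gid):
    """Open a pty pair and try to chown the slave device to uid:gid.

    Hypothetical helper for illustration. Returns a dict with the slave
    device path (or None if no pty could be opened) and an error string
    (or None if every step succeeded).
    """
    master_fd = slave_fd = None
    path = None
    try:
        master_fd, slave_fd = os.openpty()   # allocate a new pty pair
        path = os.ttyname(slave_fd)          # e.g. /dev/pts/1
        os.chown(path, uid, gid)             # the call that returns EPERM in the strace
        return {"path": path, "error": None}
    except OSError as exc:
        return {"path": path, "error": f"{exc.strerror} (errno {exc.errno})"}
    finally:
        # Always release the descriptors so no pty is leaked.
        for fd in (master_fd, slave_fd):
            if fd is not None:
                os.close(fd)


if __name__ == "__main__":
    # UID and GID taken from the strace output; adjust as needed.
    print(try_pty_chown(1326, 7))
```

If this also reports "Operation not permitted" when run as root, the restriction is probably coming from the node itself (for example how /dev/pts is mounted, or a security policy) rather than from Slurm. If it succeeds, the difference is more likely in the environment slurmd runs in.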
I updated to Slurm 20.02.0, but the issue persists.