Ticket 4107 - jobs > 100 nodes, slurmstepd timeout, _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.02.3
Hardware: Linux
Severity: 1 - System not usable
Assignee: Brian Christiansen
Reported: 2017-08-24 10:11 MDT by S Senator
Modified: 2019-02-06 19:08 MST
CC List: 5 users

Site: LANL
Machine Name: Ice


Attachments
Failure log for "srun -vvvv" as requested (443.74 KB, text/plain)
2017-08-24 15:33 MDT, Michael Jennings

Description S Senator 2017-08-24 10:11:12 MDT
After a normal maintenance window, with no changes to the Slurm configuration or the network, we are seeing the following error message when initiating a command such as:
  srun -N 200 -n 200 --reservation=PreventMaint

  slurmstepd timeout, _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out
Jobs of >200 nodes reliably reproduce the condition; it is intermittent at job sizes of 100 nodes or lower.

Additionally, we occasionally observe another 'Connection timed out' when a job is completing. The job then waits in the cleanup phase since it never receives the task-complete message. This occurs at a similar frequency and scale.

These job launches are done on a quiet system; top shows no observable load on the master node (which hosts slurmctld, slurmdbd, and the database). Connection timeouts are reported in slurmctld.log as well as in slurmd.log on the offending compute node.

The sysctl.conf contains:
 net.core.netdev_max_backlog=5000
 net.core.rmem_max=2147483647
 net.core.wmem_max=2147483647
 net.ipv4.tcp_rmem=4096 65536 2147483647
 net.ipv4.tcp_wmem=4096 65536 2147483647
 net.ipv4.tcp_mtu_probing=1
 net.ipv4.max_syn_backlog=65536
 net.core.somaxconn=65535
 net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=1

We increased these values with no effect.

Jobs of >200 nodes generate this very quickly.
Related error messages:
  srun: error: task <x> launch failed: Slurmd could not connect IO

No other network errors are being reported; specifically, netstat -in shows no TX/RX errors. Our network debugging procedures are not finding any other network problems.

Setting DebugFlags=Steps and the log level to debug3 doesn't reveal any new errors, just dispatch and scheduling messages. (Note: this system's logs are not directly available.)

However, running salloc or sbatch does *not* reproduce this condition. Only sruns launched directly, outside an allocation, do.
Comment 1 Brian Christiansen 2017-08-24 11:11:27 MDT
Do sruns inside a bash script produce the same issue?
Comment 2 Brian Christiansen 2017-08-24 11:15:32 MDT
Sorry if my question is redundant since you said that:

"running salloc or sbatch does *not* trigger this reproduce this condition. Only unallocated initial srun's do."

We are looking into it.
Comment 3 Moe Jette 2017-08-24 11:16:38 MDT
Have you tried restarting the ncmd daemon (likely on sdb)?
Comment 4 Joseph 'Joshi' Fullop 2017-08-24 11:18:27 MDT
This is not a Cray system.  Standard architecture with the master hosting slurmctld, slurmdbd, and the database.
Comment 5 Moe Jette 2017-08-24 11:23:43 MDT
Did anything change in your maintenance window?
Any changes to hardware/software/configuration?
Comment 6 Moe Jette 2017-08-24 11:26:30 MDT
What is your Slurm message timeout (see "scontrol show config | grep MessageTimeout")?

It would be possible to increase that and probably get jobs running again, but it would likely mask some other issue.
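
As a minimal sketch only (the 100-second value below is purely illustrative, and any such change is a temporary workaround that must be propagated to slurm.conf on every node before reconfiguring):

  scontrol show config | grep MessageTimeout
  # in slurm.conf, for example:
  MessageTimeout=100
  scontrol reconfigure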
Comment 7 Joseph 'Joshi' Fullop 2017-08-24 11:31:25 MDT
MessageTimeout is set to 60 seconds.  We see the messages show up within a few seconds during a srun.  Usually around 7 or 8 seconds into the run.
Comment 8 Moe Jette 2017-08-24 11:35:22 MDT
(In reply to Joseph 'Joshi' Fullop from comment #7)
> MessageTimeout is set to 60 seconds.  We see the messages show up within a
> few seconds during a srun.  Usually around 7 or 8 seconds into the run.

That is a great clue. Each of Slurm's compute node daemons reads its local configuration file for parameters like the message timeout.

Is there any chance of different slurm.conf files on some of the compute nodes?
Perhaps the daemon was started before some file system was mounted and read an old/vestigial configuration file?
Comment 9 Joseph 'Joshi' Fullop 2017-08-24 11:40:43 MDT
Another thing that does not appear in the original post: when srun-ing a hostname job, the task on the node actually completes, and the output for the node reporting the error shows up in stdout. It appears that just the reporting of the launch is failing. For example, 'srun -N100 -n100 -l --reservation=x hostname' will output all 100 hosts' hostnames but then sometimes throws one or more 'task#: slurmstepd: error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out' errors.

Similarly, the reporting of task completion sometimes also fails. Both cases cause the job to not complete correctly.


The slurm.conf files are all the same.  We did a fresh reboot and have no reports of differing configs.
Comment 10 Moe Jette 2017-08-24 11:45:47 MDT
Would it be possible to get your Slurm configuration files?

Did you look at the slurmctld and slurmd log files?
Did anything there stand out?

If the configuration files differ between nodes, you should see messages like this in the slurmctld log file:
error: Node nid00001 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
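
If it is easier than eyeballing the files, one quick way to verify they really match (a sketch only, assuming a parallel shell such as pdsh is available and that slurm.conf lives at /etc/slurm/slurm.conf on your nodes; adjust the nodelist and path for your site) is to compare checksums:

  pdsh -w <nodelist> md5sum /etc/slurm/slurm.conf | dshbak -c
  md5sum /etc/slurm/slurm.conf   # on the slurmctld host, for comparison

Every node should report the same checksum as the controller.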
Comment 11 Moe Jette 2017-08-24 12:02:54 MDT
I don't know if this helps, but here is more information about what causes the error:

The srun command is sending a launch RPC to Slurm's slurmd daemons on the compute nodes.

When the launch is completed by a slurmstepd process (spawned to manage the local application I/O, accounting, etc.), the slurmstepd initiates an RPC to the srun. That message from slurmstepd to the srun is what is timing out.

Perhaps starting the srun with more verbose logging (add "-vvvv" to the srun command line) would provide some more information.
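
For example, reusing the reproducer from the description and capturing stderr to a file (the file name is only a suggestion):

  srun -vvvv -N 200 -n 200 --reservation=PreventMaint hostname 2> srun-debug.log

The messages logged around task launch and I/O setup are the interesting part.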
Comment 12 Brian Christiansen 2017-08-24 12:09:15 MDT
Will you also try running:
sbatch -N100 -n100 --reservation=x --wrap="srun -l hostname"

I'm wondering if it has something to do with the communications between the compute nodes and the login nodes (unless you are submitting your sruns from a compute node).
Comment 13 Tim Wickberg 2017-08-24 12:10:13 MDT
(In reply to Moe Jette from comment #11)
> I don't know if this helps, but here is more information about what causes
> the error:
> 
> The srun command is sending a launch RPC to Slurm's slurmd daemons on the
> compute nodes.
> 
> When the launch is completed by a slurmstepd process (spawned to manage the
> local application I/O, accounting, etc.), the slurmstepd initiates an RPC to
> the srun. That message from slurmstepd to the srun is what is timing out.
> 
> Perhaps starting the srun with more verbose logging (add "-vvvv" to the srun
> command line) would provide some more information.

To add to this -

srun needs to open ephemeral TCP ports for the slurmstepd process to connect back to. I'm guessing there's some sort of firewall between the login nodes and the compute nodes in this cluster?

If you're running srun outside an allocation, this implies that the login node will need to allow connections back from the compute nodes on any arbitrary TCP port. Larger jobs open more ports, so there's a higher chance that some of them fall outside a "normal" range that you may have permitted through a firewall.

Setting SrunPortRange in slurm.conf will at least restrict the range that srun listens on, and would make it easier to configure the appropriate firewall settings.
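
A minimal sketch of that combination (the port range and subnet below are placeholders, not recommendations; each srun needs a few ports in the range, so size it for the number of concurrent sruns you expect per login node):

  # slurm.conf (same on all nodes)
  SrunPortRange=60001-63000

  # matching firewall rule on the login nodes, e.g. with iptables:
  iptables -A INPUT -p tcp -s <compute-node-subnet> --dport 60001:63000 -j ACCEPT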
Comment 14 Michael Jennings 2017-08-24 12:18:49 MDT
(In reply to Brian Christiansen from comment #12)
> Will you also try running:
> sbatch -N100 -n100 --reservation=x --wrap="srun -l hostname"

I ran this with -N200 -n200 since that's what reliably reproduces the error on srun.  The job completed with no errors.  Removing the --wrap= part and changing sbatch back to srun causes the errors to appear.
Comment 15 Moe Jette 2017-08-24 12:47:27 MDT
(In reply to Michael Jennings from comment #14)
> (In reply to Brian Christiansen from comment #12)
> > Will you also try running:
> > sbatch -N100 -n100 --reservation=x --wrap="srun -l hostname"
> 
> I ran this with -N200 -n200 since that's what reliably reproduces the error
> on srun.  The job completed with no errors.  Removing the --wrap= part and
> changing sbatch back to srun causes the errors to appear.

The slurmstepd process needs to talk to the srun command. In the sbatch case, the srun runs on the first compute node of the job's allocation. When executing srun directly, the communications are going back to the login node.

Can you think of any hardware, software or configuration differences between the nodes that might account for the failure when executing srun directly?
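
One hypothetical way to isolate the login-node path, assuming srun and the reservation are usable from a compute node, is to launch the same srun from a compute node instead of a front end; if the errors disappear there, that points at the front-end network/firewall configuration rather than Slurm itself:

  ssh <compute-node> 'srun -N200 -n200 -l --reservation=x hostname'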
Comment 16 Joseph 'Joshi' Fullop 2017-08-24 13:18:54 MDT
We are testing the network between the computes and the front ends now.  Will update with results soon.
Comment 17 Joseph 'Joshi' Fullop 2017-08-24 13:45:52 MDT
We have confirmed packets being dropped on the front end/login nodes and are investigating further.
Comment 18 Joseph 'Joshi' Fullop 2017-08-24 14:19:49 MDT
We have found the problem.  There were flood-protection firewall rules on the FE nodes that were causing the drops when sruns were executed directly from the front ends, which is also why the problem only appeared at a certain scale.

Additionally, likely due to the complexity of the firewall rules, both when the firewall was dropped and after it was corrected we saw 300-node hostname runs go from 20+ seconds to under 1 second.
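
For anyone hitting something similar: a generic way to spot rate-limiting/flood rules and confirm drops with iptables (illustrative commands only; the exact rules involved here are site-specific) is:

  iptables -S | grep -Ei 'limit|recent|hashlimit'
  iptables -L INPUT -v -n    # watch packet counters on DROP/REJECT rules while reproducing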

Thank you for your assistance in narrowing our root cause.  I think we understand the mechanics a bit better now too.
Comment 19 Michael Jennings 2017-08-24 15:33:55 MDT
Created attachment 5144 [details]
Failure log for "srun -vvvv" as requested

I went through the process of having someone DC the "srun -vvvv" output that Moe asked for, so I'm going to go ahead and attach it here anyway in case it reveals something else that's going on, or just for future reference.  Hope that's okay.  The errors we were seeing about "IO" and RESPONSE_LAUNCH_TASKS are in there, so if they could potentially indicate anything apart from packet loss or network throttling, please let us know.

If nothing else, at least I'm set up for going through that process a bit faster now!  :-)