Bug 5349 - libssh2 based X11 forwarding might not work with OpenSSH ClientAliveInterval keepalives
Summary: libssh2 based X11 forwarding might not work with OpenSSH ClientAliveInterval ...
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands (show other bugs)
Version: 17.11.7
Hardware: Linux Linux
: --- 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-25 06:01 MDT by Christopher Samuel
Modified: 2019-01-14 21:50 MST (History)
0 users

See Also:
Site: Swinburne
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Toy ssh client (connects and shows the banner, thats it) (7.73 KB, text/x-csrc)
2018-07-01 23:49 MDT, Christopher Samuel
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christopher Samuel 2018-06-25 06:01:43 MDT
Hi there,

We are seeing an odd case where X11 forwarding stops after around 1 minute.
Digging into the logs on the login node I see this pattern:

Jun 25 18:08:59 farnarkle1 sshd[2243]: Accepted hostbased for csamuel from 192.168.44.103 port 59812 ssh2: RSA SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4, client user "csamuel", client host "john3"
Jun 25 18:08:59 farnarkle1 sshd[2243]: pam_unix(sshd:session): session opened for user csamuel by (uid=0)
[...]
Jun 25 18:09:54 farnarkle1 sshd[2262]: Timeout, client not responding.
Jun 25 18:09:54 farnarkle1 sshd[2243]: pam_unix(sshd:session): session closed for user csamuel

The "Timeout, client not responding" is the diagnostic message when an ssh client has not replied to the keepalive messages that are activated by ClientAliveInterval & ClientAliveCountMax.

Our sshd_config has this set to keep idle connections alive otherwise various universities firewalls kill users SSH sessions:

ClientAliveInterval 5
ClientAliveCountMax 10

So after 50 seconds of no response the SSH server will terminate the session because the client has obviously not said anything.

The problem is here that whilst libssh2 does support these keepalive messages it seems that Slurm's use of it isn't enabling them.  This network traffic is travelling over IPoIB so there are no firewalls in between the compute node and the login node.

This issue is stopping some people working at present (the software they use depends on this working), hence setting this to "medium" impact.

All the best,
Chris
Comment 1 Tim Wickberg 2018-06-25 10:13:21 MDT
Looking around, it does seem like this is a limitation of libssh2 - it won't respond to want_reply pings. It looks like sending keepalives from the client side periodically may be sufficient to keep that connection alive, even if we don't respond to their ping correctly.

I do have our X11 forwarding connection sending keepalives, albeit at a 60 second internal. If you lower this number:

libssh2_keepalive_config(session, 1, 60);

to ~ 30 seconds, it'd be interesting to know if that resolves the issue. If it doesn't, I don't have a way to workaround this. If it does, I'll look at adding an option to X11Parameters in the 18.08 release to let you tune this value. (That option does not exist in 17.11, so you'd need to run something like this as a local patch in the meantime.)

- Tim
Comment 2 Christopher Samuel 2018-06-26 22:49:32 MDT
Hi Tim,

On 26/06/18 02:13, bugs@schedmd.com wrote:

> Looking around, it does seem like this is a limitation of libssh2 -
> it won't respond to want_reply pings. It looks like sending
> keepalives from the client side periodically may be sufficient to
> keep that connection alive, even if we don't respond to their ping
> correctly.

OK, interesting. Looking around it seems that others think it should work:

http://narkive.com/JzSp4ChN:1.417.99

Which seems to be the case looking at this code starting at line 743:

https://github.com/libssh2/libssh2/blob/libssh2-1.4.3/src/packet.c


> I do have our X11 forwarding connection sending keepalives, albeit at
> a 60 second internal. If you lower this number:
> 
> libssh2_keepalive_config(session, 1, 60);
> 
> to ~ 30 seconds, it'd be interesting to know if that resolves the
> issue.

OK, will see what I can do about that, currently working from the
foyer of the ASA2018 conference where it's rather chilly.

> If it doesn't, I don't have a way to workaround this. If it
> does, I'll look at adding an option to X11Parameters in the 18.08
> release to let you tune this value. (That option does not exist in
> 17.11, so you'd need to run something like this as a local patch in
> the meantime.)

Understood.

All the best,
Chris
Comment 3 Christopher Samuel 2018-06-26 23:16:47 MDT
On 26/06/18 02:13, bugs@schedmd.com wrote:

> I do have our X11 forwarding connection sending keepalives, albeit at
> a 60 second internal. If you lower this number:
> 
> libssh2_keepalive_config(session, 1, 60);
> 
> to ~ 30 seconds, it'd be interesting to know if that resolves the
> issue.

Sadly not, that made no difference I'm afraid. :-(
Comment 4 Tim Wickberg 2018-06-27 14:06:16 MDT
(In reply to Christopher Samuel from comment #2)
> Hi Tim,
> 
> On 26/06/18 02:13, bugs@schedmd.com wrote:
> 
> > Looking around, it does seem like this is a limitation of libssh2 -
> > it won't respond to want_reply pings. It looks like sending
> > keepalives from the client side periodically may be sufficient to
> > keep that connection alive, even if we don't respond to their ping
> > correctly.
> 
> OK, interesting. Looking around it seems that others think it should work:
> 
> http://narkive.com/JzSp4ChN:1.417.99

I'd read that and come to the opposite conclusion. I think it works alright when triggered from our end, but the SSH-server requested responses aren't processed right.

> Which seems to be the case looking at this code starting at line 743:
> 
> https://github.com/libssh2/libssh2/blob/libssh2-1.4.3/src/packet.c

I'm not sure which part you're referring to - that's the middle of a comment block.

(In reply to Christopher Samuel from comment #3)
> On 26/06/18 02:13, bugs@schedmd.com wrote:
> 
> > I do have our X11 forwarding connection sending keepalives, albeit at
> > a 60 second internal. If you lower this number:
> > 
> > libssh2_keepalive_config(session, 1, 60);
> > 
> > to ~ 30 seconds, it'd be interesting to know if that resolves the
> > issue.
> 
> Sadly not, that made no difference I'm afraid. :-(

Any chance you can just tweak the SSH server settings to disable this?

I haven't looked, but you should be able to vary some of these settings based on the remote host, and I'm assuming that you could disable some of this for cluster-internal connections.

Longer-term I do have some ideas of how to get SSH out of the forwarding path entirely; while nearly universal in deployment, there are too many odd issues with how it gets setup in each environment making it far from a universal solution.

Unfortunately, I won't have anything in place for 18.08, the code freeze is coming up soon and there are some other higher priority projects to wrap up.

- Tim
Comment 5 Christopher Samuel 2018-06-27 19:52:17 MDT
Hi Tim,

On 28/06/18 06:06, bugs@schedmd.com wrote:

> I'd read that and come to the opposite conclusion. I think it works
> alright when triggered from our end, but the SSH-server requested
> responses aren't processed right.

I don't think that's what it's saying, the nub of the issue seems to be:

# it apparently does *not* act on the responses from the server even
# when want_reply is set while invoking libssh2_keepalive_send().

To me this is saying that if libssh2 sends a message with want_reply
set it doesn't do anything with the response to that request from the
server.

I don't think this is talking about the case where it needs to respond
to a message from the server with want_reply set.

> I'm not sure which part you're referring to - that's the middle of a
> comment block.

That's the start of the code that handles SSH_MSG_CHANNEL_REQUEST
packets which is how the server side keep alive requests are being
sent.  Line 817 is where those seem to be picked up.

> Any chance you can just tweak the SSH server settings to disable
> this?

Unfortunately not, we had to enable it recently (which caused this
breakage) because we had users complaining of idle sessions getting
dropped because their institutional firewalls were being over-zealous
about dropping established connections from their state table. :-(

> I haven't looked, but you should be able to vary some of these
> settings based on the remote host, and I'm assuming that you could
> disable some of this for cluster-internal connections.

Again sadly not, it's just for the server. :-(

What we have done is to adjust the settings so that connections won't
get killed off if there's no response for 7 days, which matches our
longest allowed walltime.  It means that external connections that do
disappear and don't elicit ICMP or TCP RST's for the established stream
won't get disconnected for 7 days, but prior to us enabling this that
wouldn't have happened at all. :-)

> Longer-term I do have some ideas of how to get SSH out of the
> forwarding path entirely; while nearly universal in deployment, there
> are too many odd issues with how it gets setup in each environment
> making it far from a universal solution.

Understood.

> Unfortunately, I won't have anything in place for 18.08, the code
> freeze is coming up soon and there are some other higher priority
> projects to wrap up.

Also understood!

Best of luck mate..

All the best,
Chris
Comment 6 Christopher Samuel 2018-07-01 23:47:42 MDT
Hi Tim,

I've just done some quick testing and can show that a toy ssh client using libssh2 connects and does the ClientAlive* keep alive stuff correctly.

# /usr/sbin/sshd -d -p 2222 -D -o ClientAliveCountMax=3 -o ClientAliveInterval=5
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: private host key #0: ssh-rsa SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4
debug1: private host key #1: ecdsa-sha2-nistp256 SHA256:KIWqV4Jdcgl4zE5l4eNX2/3SqCrsfhY9G0mVLXkK6hI
debug1: private host key #2: ssh-ed25519 SHA256:7j6gu2lAR623BGXVfB0qbybddylktV2qMvAt2oOaCRI
debug1: rexec_argv[0]='/usr/sbin/sshd'
debug1: rexec_argv[1]='-d'
debug1: rexec_argv[2]='-p'
debug1: rexec_argv[3]='2222'
debug1: rexec_argv[4]='-D'
debug1: rexec_argv[5]='-o'
debug1: rexec_argv[6]='ClientAliveCountMax=3'
debug1: rexec_argv[7]='-o'
debug1: rexec_argv[8]='ClientAliveInterval=5'
debug1: Set /proc/self/oom_score_adj from 0 to -1000
debug1: Bind to port 2222 on 0.0.0.0.
Server listening on 0.0.0.0 port 2222.



debug1: Server will not fork when running in debugging mode.
debug1: rexec start in 4 out 4 newsock 4 pipe -1 sock 7
debug1: inetd sockets after dupping: 3, 3
Connection from 127.0.0.1 port 47680 on 127.0.0.1 port 2222
debug1: Client protocol version 2.0; client software version libssh2_1.4.3
debug1: no match: libssh2_1.4.3
debug1: Local version string SSH-2.0-OpenSSH_7.4
debug1: Enabling compatibility mode for protocol 2.0
debug1: SELinux support disabled [preauth]
debug1: permanently_set_uid: 74/74 [preauth]
debug1: list_hostkey_types: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519 [preauth]
debug1: SSH2_MSG_KEXINIT sent [preauth]
debug1: SSH2_MSG_KEXINIT received [preauth]
debug1: kex: algorithm: diffie-hellman-group14-sha1 [preauth]
debug1: kex: host key algorithm: ssh-rsa [preauth]
debug1: kex: client->server cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth]
debug1: kex: server->client cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth]
debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth]
debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth]
debug1: expecting SSH2_MSG_KEXDH_INIT [preauth]
debug1: rekey after 4294967296 blocks [preauth]
debug1: SSH2_MSG_NEWKEYS sent [preauth]
debug1: expecting SSH2_MSG_NEWKEYS [preauth]
debug1: SSH2_MSG_NEWKEYS received [preauth]
debug1: rekey after 4294967296 blocks [preauth]
debug1: KEX done [preauth]
debug1: userauth-request for user csamuel service ssh-connection method none [preauth]
debug1: attempt 0 failures 0 [preauth]
debug1: PAM: initializing for "csamuel"
debug1: PAM: setting PAM_RHOST to "localhost"
debug1: PAM: setting PAM_TTY to "ssh"
debug1: userauth-request for user csamuel service ssh-connection method publickey [preauth]
debug1: attempt 1 failures 0 [preauth]
debug1: userauth_pubkey: test whether pkalg/pkblob are acceptable for RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo [preauth]
debug1: temporarily_use_uid: 10631/10190 (e=0/0)
debug1: trying public key file /home/csamuel/.ssh/authorized_keys
debug1: fd 10 clearing O_NONBLOCK
debug1: matching key found: file /home/csamuel/.ssh/authorized_keys, line 1 RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo
debug1: restore_uid: 0/0
Postponed publickey for csamuel from 127.0.0.1 port 47680 ssh2 [preauth]
debug1: userauth-request for user csamuel service ssh-connection method publickey [preauth]
debug1: attempt 2 failures 0 [preauth]
debug1: temporarily_use_uid: 10631/10190 (e=0/0)
debug1: trying public key file /home/csamuel/.ssh/authorized_keys
debug1: fd 10 clearing O_NONBLOCK
debug1: matching key found: file /home/csamuel/.ssh/authorized_keys, line 1 RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo
debug1: restore_uid: 0/0
debug1: do_pam_account: called
Accepted publickey for csamuel from 127.0.0.1 port 47680 ssh2: RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo
debug1: monitor_child_preauth: csamuel has been authenticated by privileged process
debug1: monitor_read_log: child log fd closed
debug1: SELinux support disabled
debug1: PAM: establishing credentials
User child is on pid 39033
debug1: PAM: establishing credentials
debug1: permanently_set_uid: 10631/10190
debug1: rekey after 4294967296 blocks
debug1: rekey after 4294967296 blocks
debug1: ssh_packet_set_postauth: called
debug1: Entering interactive session for SSH2.
debug1: server_init_dispatch
debug1: server_input_channel_open: ctype session rchan 0 win 2097152 max 32768
debug1: input_session_request
debug1: channel 0: new [server-session]
debug1: session_new: session 0
debug1: session_open: channel 0
debug1: session_open: session 0: link with channel 0
debug1: server_input_channel_open: confirm session
debug1: server_input_channel_req: channel 0 request env reply 1
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req env
debug1: server_input_channel_req: channel 0 request pty-req reply 1
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req pty-req
debug1: Allocating pty.
debug1: session_new: session 0
debug1: SELinux support disabled
debug1: lastlog_openseek: Couldn't open /var/log/lastlog: Operation not supported
debug1: session_pty_req: session 0 alloc /dev/pts/49
debug1: server_input_channel_req: channel 0 request shell reply 1
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req shell
Starting session: shell on pts/49 for csamuel from 127.0.0.1 port 47680 id 0
debug1: Setting controlling tty using TIOCSCTTY.


debug1: Got 100/11 for keepalive
debug1: Got 100/12 for keepalive
debug1: Got 100/13 for keepalive
debug1: Got 100/14 for keepalive
debug1: Got 100/15 for keepalive
debug1: Got 100/16 for keepalive


Whereas Slurm's use doesn't.

# systemctl stop sshd && /usr/sbin/sshd -d -p 22 -D -o ClientAliveCountMax=3 -o ClientAliveInterval=5
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: private host key #0: ssh-rsa SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4
debug1: private host key #1: ecdsa-sha2-nistp256 SHA256:KIWqV4Jdcgl4zE5l4eNX2/3SqCrsfhY9G0mVLXkK6hI
debug1: private host key #2: ssh-ed25519 SHA256:7j6gu2lAR623BGXVfB0qbybddylktV2qMvAt2oOaCRI
debug1: rexec_argv[0]='/usr/sbin/sshd'
debug1: rexec_argv[1]='-d'
debug1: rexec_argv[2]='-p'
debug1: rexec_argv[3]='22'
debug1: rexec_argv[4]='-D'
debug1: rexec_argv[5]='-o'
debug1: rexec_argv[6]='ClientAliveCountMax=3'
debug1: rexec_argv[7]='-o'
debug1: rexec_argv[8]='ClientAliveInterval=5'
debug1: Set /proc/self/oom_score_adj from 0 to -1000
debug1: Bind to port 22 on 0.0.0.0.
Server listening on 0.0.0.0 port 22.
debug1: Server will not fork when running in debugging mode.
debug1: rexec start in 4 out 4 newsock 4 pipe -1 sock 7
debug1: inetd sockets after dupping: 3, 3
Connection from 192.168.44.201 port 48700 on 192.168.44.14 port 22
debug1: Client protocol version 2.0; client software version libssh2_1.4.3
debug1: no match: libssh2_1.4.3
debug1: Local version string SSH-2.0-OpenSSH_7.4
debug1: Enabling compatibility mode for protocol 2.0
debug1: SELinux support disabled [preauth]
debug1: permanently_set_uid: 74/74 [preauth]
debug1: list_hostkey_types: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519 [preauth]
debug1: SSH2_MSG_KEXINIT sent [preauth]
debug1: SSH2_MSG_KEXINIT received [preauth]
debug1: kex: algorithm: diffie-hellman-group14-sha1 [preauth]
debug1: kex: host key algorithm: ssh-rsa [preauth]
debug1: kex: client->server cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth]
debug1: kex: server->client cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth]
debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth]
debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth]
debug1: expecting SSH2_MSG_KEXDH_INIT [preauth]
debug1: rekey after 4294967296 blocks [preauth]
debug1: SSH2_MSG_NEWKEYS sent [preauth]
debug1: expecting SSH2_MSG_NEWKEYS [preauth]
debug1: SSH2_MSG_NEWKEYS received [preauth]
debug1: rekey after 4294967296 blocks [preauth]
debug1: KEX done [preauth]
debug1: userauth-request for user csamuel service ssh-connection method none [preauth]
debug1: attempt 0 failures 0 [preauth]
debug1: PAM: initializing for "csamuel"
debug1: PAM: setting PAM_RHOST to "john101"
debug1: PAM: setting PAM_TTY to "ssh"
debug1: userauth-request for user csamuel service ssh-connection method hostbased [preauth]
debug1: attempt 1 failures 0 [preauth]
debug1: userauth_hostbased: cuser csamuel chost john101 pkalg ssh-rsa slen 271 [preauth]
debug1: temporarily_use_uid: 10631/10190 (e=0/0)
debug1: restore_uid: 0/0
debug1: fd 10 clearing O_NONBLOCK
debug1: check_key_in_hostfiles: key for john101 found at /etc/ssh/ssh_known_hosts:4
Accepted RSA public key SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4 from csamuel@john101
debug1: do_pam_account: called
Accepted hostbased for csamuel from 192.168.44.201 port 48700 ssh2: RSA SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4, client user "csamuel", client host "john101"
debug1: monitor_child_preauth: csamuel has been authenticated by privileged process
debug1: monitor_read_log: child log fd closed
debug1: SELinux support disabled
debug1: PAM: establishing credentials
User child is on pid 43668
debug1: PAM: establishing credentials
debug1: permanently_set_uid: 10631/10190
debug1: rekey after 4294967296 blocks
debug1: rekey after 4294967296 blocks
debug1: ssh_packet_set_postauth: called
debug1: Entering interactive session for SSH2.
debug1: server_init_dispatch
debug1: server_input_global_request: rtype keepalive@libssh2.org want_reply 1


Timeout, client not responding.
debug1: do_cleanup
debug1: do_cleanup
debug1: PAM: cleanup
debug1: PAM: closing session
debug1: PAM: deleting credentials


The one obvious difference is the fact that you're using libssh2_channel_direct_tcpip() whereas this toy client uses libssh2_session_init().

I'll attach the source so you can take a look yourself.  Basically it's the client from their Github modified to be stand alone and changes the port to connect to to be 2222.  https://github.com/libssh2/libssh2/blob/master/example/ssh2.c

With key auth I run it as:

./ssh 127.0.0.1 csamuel foo -k

(it doesn't use the "foo" password field in this case).

All the best,
Chris
Comment 7 Christopher Samuel 2018-07-01 23:49:02 MDT
Created attachment 7202 [details]
Toy ssh client (connects and shows the banner, thats it)
Comment 8 Tim Wickberg 2019-01-14 13:54:49 MST
Chris - 

I'm closing this out as resolved/wontfix.

I am completely overhauling the X11 forwarding code in 19.05, and it will no longer use SSH in any fashion, thus making this issue moot.

- Tim
Comment 9 Christopher Samuel 2019-01-14 21:50:25 MST
On 14/1/19 12:54 pm, bugs@schedmd.com wrote:

> I am completely overhauling the X11 forwarding code in 19.05, and it will no
> longer use SSH in any fashion, thus making this issue moot.

Yay!  Sounds good - thanks for the update Tim!