Hi there, We are seeing an odd case where X11 forwarding stops after around 1 minute. Digging into the logs on the login node I see this pattern: Jun 25 18:08:59 farnarkle1 sshd[2243]: Accepted hostbased for csamuel from 192.168.44.103 port 59812 ssh2: RSA SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4, client user "csamuel", client host "john3" Jun 25 18:08:59 farnarkle1 sshd[2243]: pam_unix(sshd:session): session opened for user csamuel by (uid=0) [...] Jun 25 18:09:54 farnarkle1 sshd[2262]: Timeout, client not responding. Jun 25 18:09:54 farnarkle1 sshd[2243]: pam_unix(sshd:session): session closed for user csamuel The "Timeout, client not responding" is the diagnostic message when an ssh client has not replied to the keepalive messages that are activated by ClientAliveInterval & ClientAliveCountMax. Our sshd_config has this set to keep idle connections alive otherwise various universities firewalls kill users SSH sessions: ClientAliveInterval 5 ClientAliveCountMax 10 So after 50 seconds of no response the SSH server will terminate the session because the client has obviously not said anything. The problem is here that whilst libssh2 does support these keepalive messages it seems that Slurm's use of it isn't enabling them. This network traffic is travelling over IPoIB so there are no firewalls in between the compute node and the login node. This issue is stopping some people working at present (the software they use depends on this working), hence setting this to "medium" impact. All the best, Chris
Looking around, it does seem like this is a limitation of libssh2 - it won't respond to want_reply pings. It looks like sending keepalives from the client side periodically may be sufficient to keep that connection alive, even if we don't respond to their ping correctly. I do have our X11 forwarding connection sending keepalives, albeit at a 60 second internal. If you lower this number: libssh2_keepalive_config(session, 1, 60); to ~ 30 seconds, it'd be interesting to know if that resolves the issue. If it doesn't, I don't have a way to workaround this. If it does, I'll look at adding an option to X11Parameters in the 18.08 release to let you tune this value. (That option does not exist in 17.11, so you'd need to run something like this as a local patch in the meantime.) - Tim
Hi Tim, On 26/06/18 02:13, bugs@schedmd.com wrote: > Looking around, it does seem like this is a limitation of libssh2 - > it won't respond to want_reply pings. It looks like sending > keepalives from the client side periodically may be sufficient to > keep that connection alive, even if we don't respond to their ping > correctly. OK, interesting. Looking around it seems that others think it should work: http://narkive.com/JzSp4ChN:1.417.99 Which seems to be the case looking at this code starting at line 743: https://github.com/libssh2/libssh2/blob/libssh2-1.4.3/src/packet.c > I do have our X11 forwarding connection sending keepalives, albeit at > a 60 second internal. If you lower this number: > > libssh2_keepalive_config(session, 1, 60); > > to ~ 30 seconds, it'd be interesting to know if that resolves the > issue. OK, will see what I can do about that, currently working from the foyer of the ASA2018 conference where it's rather chilly. > If it doesn't, I don't have a way to workaround this. If it > does, I'll look at adding an option to X11Parameters in the 18.08 > release to let you tune this value. (That option does not exist in > 17.11, so you'd need to run something like this as a local patch in > the meantime.) Understood. All the best, Chris
On 26/06/18 02:13, bugs@schedmd.com wrote: > I do have our X11 forwarding connection sending keepalives, albeit at > a 60 second internal. If you lower this number: > > libssh2_keepalive_config(session, 1, 60); > > to ~ 30 seconds, it'd be interesting to know if that resolves the > issue. Sadly not, that made no difference I'm afraid. :-(
(In reply to Christopher Samuel from comment #2) > Hi Tim, > > On 26/06/18 02:13, bugs@schedmd.com wrote: > > > Looking around, it does seem like this is a limitation of libssh2 - > > it won't respond to want_reply pings. It looks like sending > > keepalives from the client side periodically may be sufficient to > > keep that connection alive, even if we don't respond to their ping > > correctly. > > OK, interesting. Looking around it seems that others think it should work: > > http://narkive.com/JzSp4ChN:1.417.99 I'd read that and come to the opposite conclusion. I think it works alright when triggered from our end, but the SSH-server requested responses aren't processed right. > Which seems to be the case looking at this code starting at line 743: > > https://github.com/libssh2/libssh2/blob/libssh2-1.4.3/src/packet.c I'm not sure which part you're referring to - that's the middle of a comment block. (In reply to Christopher Samuel from comment #3) > On 26/06/18 02:13, bugs@schedmd.com wrote: > > > I do have our X11 forwarding connection sending keepalives, albeit at > > a 60 second internal. If you lower this number: > > > > libssh2_keepalive_config(session, 1, 60); > > > > to ~ 30 seconds, it'd be interesting to know if that resolves the > > issue. > > Sadly not, that made no difference I'm afraid. :-( Any chance you can just tweak the SSH server settings to disable this? I haven't looked, but you should be able to vary some of these settings based on the remote host, and I'm assuming that you could disable some of this for cluster-internal connections. Longer-term I do have some ideas of how to get SSH out of the forwarding path entirely; while nearly universal in deployment, there are too many odd issues with how it gets setup in each environment making it far from a universal solution. Unfortunately, I won't have anything in place for 18.08, the code freeze is coming up soon and there are some other higher priority projects to wrap up. - Tim
Hi Tim, On 28/06/18 06:06, bugs@schedmd.com wrote: > I'd read that and come to the opposite conclusion. I think it works > alright when triggered from our end, but the SSH-server requested > responses aren't processed right. I don't think that's what it's saying, the nub of the issue seems to be: # it apparently does *not* act on the responses from the server even # when want_reply is set while invoking libssh2_keepalive_send(). To me this is saying that if libssh2 sends a message with want_reply set it doesn't do anything with the response to that request from the server. I don't think this is talking about the case where it needs to respond to a message from the server with want_reply set. > I'm not sure which part you're referring to - that's the middle of a > comment block. That's the start of the code that handles SSH_MSG_CHANNEL_REQUEST packets which is how the server side keep alive requests are being sent. Line 817 is where those seem to be picked up. > Any chance you can just tweak the SSH server settings to disable > this? Unfortunately not, we had to enable it recently (which caused this breakage) because we had users complaining of idle sessions getting dropped because their institutional firewalls were being over-zealous about dropping established connections from their state table. :-( > I haven't looked, but you should be able to vary some of these > settings based on the remote host, and I'm assuming that you could > disable some of this for cluster-internal connections. Again sadly not, it's just for the server. :-( What we have done is to adjust the settings so that connections won't get killed off if there's no response for 7 days, which matches our longest allowed walltime. It means that external connections that do disappear and don't elicit ICMP or TCP RST's for the established stream won't get disconnected for 7 days, but prior to us enabling this that wouldn't have happened at all. :-) > Longer-term I do have some ideas of how to get SSH out of the > forwarding path entirely; while nearly universal in deployment, there > are too many odd issues with how it gets setup in each environment > making it far from a universal solution. Understood. > Unfortunately, I won't have anything in place for 18.08, the code > freeze is coming up soon and there are some other higher priority > projects to wrap up. Also understood! Best of luck mate.. All the best, Chris
Hi Tim, I've just done some quick testing and can show that a toy ssh client using libssh2 connects and does the ClientAlive* keep alive stuff correctly. # /usr/sbin/sshd -d -p 2222 -D -o ClientAliveCountMax=3 -o ClientAliveInterval=5 debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2k-fips 26 Jan 2017 debug1: private host key #0: ssh-rsa SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4 debug1: private host key #1: ecdsa-sha2-nistp256 SHA256:KIWqV4Jdcgl4zE5l4eNX2/3SqCrsfhY9G0mVLXkK6hI debug1: private host key #2: ssh-ed25519 SHA256:7j6gu2lAR623BGXVfB0qbybddylktV2qMvAt2oOaCRI debug1: rexec_argv[0]='/usr/sbin/sshd' debug1: rexec_argv[1]='-d' debug1: rexec_argv[2]='-p' debug1: rexec_argv[3]='2222' debug1: rexec_argv[4]='-D' debug1: rexec_argv[5]='-o' debug1: rexec_argv[6]='ClientAliveCountMax=3' debug1: rexec_argv[7]='-o' debug1: rexec_argv[8]='ClientAliveInterval=5' debug1: Set /proc/self/oom_score_adj from 0 to -1000 debug1: Bind to port 2222 on 0.0.0.0. Server listening on 0.0.0.0 port 2222. debug1: Server will not fork when running in debugging mode. debug1: rexec start in 4 out 4 newsock 4 pipe -1 sock 7 debug1: inetd sockets after dupping: 3, 3 Connection from 127.0.0.1 port 47680 on 127.0.0.1 port 2222 debug1: Client protocol version 2.0; client software version libssh2_1.4.3 debug1: no match: libssh2_1.4.3 debug1: Local version string SSH-2.0-OpenSSH_7.4 debug1: Enabling compatibility mode for protocol 2.0 debug1: SELinux support disabled [preauth] debug1: permanently_set_uid: 74/74 [preauth] debug1: list_hostkey_types: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519 [preauth] debug1: SSH2_MSG_KEXINIT sent [preauth] debug1: SSH2_MSG_KEXINIT received [preauth] debug1: kex: algorithm: diffie-hellman-group14-sha1 [preauth] debug1: kex: host key algorithm: ssh-rsa [preauth] debug1: kex: client->server cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth] debug1: kex: server->client cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth] debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth] debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth] debug1: expecting SSH2_MSG_KEXDH_INIT [preauth] debug1: rekey after 4294967296 blocks [preauth] debug1: SSH2_MSG_NEWKEYS sent [preauth] debug1: expecting SSH2_MSG_NEWKEYS [preauth] debug1: SSH2_MSG_NEWKEYS received [preauth] debug1: rekey after 4294967296 blocks [preauth] debug1: KEX done [preauth] debug1: userauth-request for user csamuel service ssh-connection method none [preauth] debug1: attempt 0 failures 0 [preauth] debug1: PAM: initializing for "csamuel" debug1: PAM: setting PAM_RHOST to "localhost" debug1: PAM: setting PAM_TTY to "ssh" debug1: userauth-request for user csamuel service ssh-connection method publickey [preauth] debug1: attempt 1 failures 0 [preauth] debug1: userauth_pubkey: test whether pkalg/pkblob are acceptable for RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo [preauth] debug1: temporarily_use_uid: 10631/10190 (e=0/0) debug1: trying public key file /home/csamuel/.ssh/authorized_keys debug1: fd 10 clearing O_NONBLOCK debug1: matching key found: file /home/csamuel/.ssh/authorized_keys, line 1 RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo debug1: restore_uid: 0/0 Postponed publickey for csamuel from 127.0.0.1 port 47680 ssh2 [preauth] debug1: userauth-request for user csamuel service ssh-connection method publickey [preauth] debug1: attempt 2 failures 0 [preauth] debug1: temporarily_use_uid: 10631/10190 (e=0/0) debug1: trying public key file /home/csamuel/.ssh/authorized_keys debug1: fd 10 clearing O_NONBLOCK debug1: matching key found: file /home/csamuel/.ssh/authorized_keys, line 1 RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo debug1: restore_uid: 0/0 debug1: do_pam_account: called Accepted publickey for csamuel from 127.0.0.1 port 47680 ssh2: RSA SHA256:h0OXqmBqge7H79NNRgQHA7nns7kkdmot5fKlOFkApOo debug1: monitor_child_preauth: csamuel has been authenticated by privileged process debug1: monitor_read_log: child log fd closed debug1: SELinux support disabled debug1: PAM: establishing credentials User child is on pid 39033 debug1: PAM: establishing credentials debug1: permanently_set_uid: 10631/10190 debug1: rekey after 4294967296 blocks debug1: rekey after 4294967296 blocks debug1: ssh_packet_set_postauth: called debug1: Entering interactive session for SSH2. debug1: server_init_dispatch debug1: server_input_channel_open: ctype session rchan 0 win 2097152 max 32768 debug1: input_session_request debug1: channel 0: new [server-session] debug1: session_new: session 0 debug1: session_open: channel 0 debug1: session_open: session 0: link with channel 0 debug1: server_input_channel_open: confirm session debug1: server_input_channel_req: channel 0 request env reply 1 debug1: session_by_channel: session 0 channel 0 debug1: session_input_channel_req: session 0 req env debug1: server_input_channel_req: channel 0 request pty-req reply 1 debug1: session_by_channel: session 0 channel 0 debug1: session_input_channel_req: session 0 req pty-req debug1: Allocating pty. debug1: session_new: session 0 debug1: SELinux support disabled debug1: lastlog_openseek: Couldn't open /var/log/lastlog: Operation not supported debug1: session_pty_req: session 0 alloc /dev/pts/49 debug1: server_input_channel_req: channel 0 request shell reply 1 debug1: session_by_channel: session 0 channel 0 debug1: session_input_channel_req: session 0 req shell Starting session: shell on pts/49 for csamuel from 127.0.0.1 port 47680 id 0 debug1: Setting controlling tty using TIOCSCTTY. debug1: Got 100/11 for keepalive debug1: Got 100/12 for keepalive debug1: Got 100/13 for keepalive debug1: Got 100/14 for keepalive debug1: Got 100/15 for keepalive debug1: Got 100/16 for keepalive Whereas Slurm's use doesn't. # systemctl stop sshd && /usr/sbin/sshd -d -p 22 -D -o ClientAliveCountMax=3 -o ClientAliveInterval=5 debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2k-fips 26 Jan 2017 debug1: private host key #0: ssh-rsa SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4 debug1: private host key #1: ecdsa-sha2-nistp256 SHA256:KIWqV4Jdcgl4zE5l4eNX2/3SqCrsfhY9G0mVLXkK6hI debug1: private host key #2: ssh-ed25519 SHA256:7j6gu2lAR623BGXVfB0qbybddylktV2qMvAt2oOaCRI debug1: rexec_argv[0]='/usr/sbin/sshd' debug1: rexec_argv[1]='-d' debug1: rexec_argv[2]='-p' debug1: rexec_argv[3]='22' debug1: rexec_argv[4]='-D' debug1: rexec_argv[5]='-o' debug1: rexec_argv[6]='ClientAliveCountMax=3' debug1: rexec_argv[7]='-o' debug1: rexec_argv[8]='ClientAliveInterval=5' debug1: Set /proc/self/oom_score_adj from 0 to -1000 debug1: Bind to port 22 on 0.0.0.0. Server listening on 0.0.0.0 port 22. debug1: Server will not fork when running in debugging mode. debug1: rexec start in 4 out 4 newsock 4 pipe -1 sock 7 debug1: inetd sockets after dupping: 3, 3 Connection from 192.168.44.201 port 48700 on 192.168.44.14 port 22 debug1: Client protocol version 2.0; client software version libssh2_1.4.3 debug1: no match: libssh2_1.4.3 debug1: Local version string SSH-2.0-OpenSSH_7.4 debug1: Enabling compatibility mode for protocol 2.0 debug1: SELinux support disabled [preauth] debug1: permanently_set_uid: 74/74 [preauth] debug1: list_hostkey_types: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519 [preauth] debug1: SSH2_MSG_KEXINIT sent [preauth] debug1: SSH2_MSG_KEXINIT received [preauth] debug1: kex: algorithm: diffie-hellman-group14-sha1 [preauth] debug1: kex: host key algorithm: ssh-rsa [preauth] debug1: kex: client->server cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth] debug1: kex: server->client cipher: aes128-ctr MAC: hmac-sha1 compression: none [preauth] debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth] debug1: kex: diffie-hellman-group14-sha1 need=20 dh_need=20 [preauth] debug1: expecting SSH2_MSG_KEXDH_INIT [preauth] debug1: rekey after 4294967296 blocks [preauth] debug1: SSH2_MSG_NEWKEYS sent [preauth] debug1: expecting SSH2_MSG_NEWKEYS [preauth] debug1: SSH2_MSG_NEWKEYS received [preauth] debug1: rekey after 4294967296 blocks [preauth] debug1: KEX done [preauth] debug1: userauth-request for user csamuel service ssh-connection method none [preauth] debug1: attempt 0 failures 0 [preauth] debug1: PAM: initializing for "csamuel" debug1: PAM: setting PAM_RHOST to "john101" debug1: PAM: setting PAM_TTY to "ssh" debug1: userauth-request for user csamuel service ssh-connection method hostbased [preauth] debug1: attempt 1 failures 0 [preauth] debug1: userauth_hostbased: cuser csamuel chost john101 pkalg ssh-rsa slen 271 [preauth] debug1: temporarily_use_uid: 10631/10190 (e=0/0) debug1: restore_uid: 0/0 debug1: fd 10 clearing O_NONBLOCK debug1: check_key_in_hostfiles: key for john101 found at /etc/ssh/ssh_known_hosts:4 Accepted RSA public key SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4 from csamuel@john101 debug1: do_pam_account: called Accepted hostbased for csamuel from 192.168.44.201 port 48700 ssh2: RSA SHA256:N98+WGJSpocW3Juz2HrAN8TXsQfH0f2Mk5L2eSJZKS4, client user "csamuel", client host "john101" debug1: monitor_child_preauth: csamuel has been authenticated by privileged process debug1: monitor_read_log: child log fd closed debug1: SELinux support disabled debug1: PAM: establishing credentials User child is on pid 43668 debug1: PAM: establishing credentials debug1: permanently_set_uid: 10631/10190 debug1: rekey after 4294967296 blocks debug1: rekey after 4294967296 blocks debug1: ssh_packet_set_postauth: called debug1: Entering interactive session for SSH2. debug1: server_init_dispatch debug1: server_input_global_request: rtype keepalive@libssh2.org want_reply 1 Timeout, client not responding. debug1: do_cleanup debug1: do_cleanup debug1: PAM: cleanup debug1: PAM: closing session debug1: PAM: deleting credentials The one obvious difference is the fact that you're using libssh2_channel_direct_tcpip() whereas this toy client uses libssh2_session_init(). I'll attach the source so you can take a look yourself. Basically it's the client from their Github modified to be stand alone and changes the port to connect to to be 2222. https://github.com/libssh2/libssh2/blob/master/example/ssh2.c With key auth I run it as: ./ssh 127.0.0.1 csamuel foo -k (it doesn't use the "foo" password field in this case). All the best, Chris
Created attachment 7202 [details] Toy ssh client (connects and shows the banner, thats it)
Chris - I'm closing this out as resolved/wontfix. I am completely overhauling the X11 forwarding code in 19.05, and it will no longer use SSH in any fashion, thus making this issue moot. - Tim
On 14/1/19 12:54 pm, bugs@schedmd.com wrote: > I am completely overhauling the X11 forwarding code in 19.05, and it will no > longer use SSH in any fashion, thus making this issue moot. Yay! Sounds good - thanks for the update Tim!