[JBoss JIRA] (JGRP-2504) Poor throughput over high latency TCP connection when recv_buf_size is configured
by Andrew Skalski (Jira)
[ https://issues.redhat.com/browse/JGRP-2504?page=com.atlassian.jira.plugin... ]
Andrew Skalski commented on JGRP-2504:
--------------------------------------
Here's what I found based on further research.
The Linux kernel manages the TCP receive window in two different ways (this is a simplification) depending on whether or not the SO_RCVBUF socket option was explicitly set by the application. By default, the kernel automatically manages the receive buffer and window clamp (an internal per-socket variable that acts as a ceiling on window size) based on the min/default/max sizes defined in the net.ipv4.tcp_rmem sysctl.
If the application explicitly configures SO_RCVBUF, this automatic management is disabled. The window clamp is initialized once at handshake time, and does not change, even if SO_RCVBUF is subsequently updated.
In the case where the application configures SO_RCVBUF "too late" (after the handshake), the socket ends up in a state where:
* Window clamp was initialized based on a small default value
* Automatic growth of window clamp is disabled
Consequently, throughput is actually worse than if SO_RCVBUF had been left unconfigured. (This also explains why recent versions of JGroups performed better for me _by default_ than older versions: the built-in default value for recv_buf_size was removed.)
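To put rough numbers on the effect of a stuck window clamp: a TCP sender can have at most one receive window of data in flight per round trip, so throughput is bounded by window / RTT. The sketch below works through that arithmetic. The ~45 ms RTT comes from the issue description; the 64 KiB window and the 1 Gbit/s target rate are assumed values for illustration only, not measurements from this issue, and the class and method names are hypothetical.

```java
public class WindowBound {
    /** Upper bound on throughput in bytes/second: at most one window per round trip. */
    static double maxBytesPerSecond(long windowBytes, double rttSeconds) {
        return windowBytes / rttSeconds;
    }

    /** Window needed to sustain a target rate (the bandwidth-delay product), in bytes. */
    static long requiredWindowBytes(double bytesPerSecond, double rttSeconds) {
        return Math.round(bytesPerSecond * rttSeconds);
    }

    public static void main(String[] args) {
        double rtt = 0.045;          // ~45 ms round trip, from the issue description
        long window = 64 * 1024;     // ASSUMED window clamp, for illustration only
        double linkRate = 125e6;     // ASSUMED target: 1 Gbit/s expressed in bytes/second

        System.out.printf("max throughput with %d-byte window: %.2f MB/s%n",
                window, maxBytesPerSecond(window, rtt) / 1e6);
        System.out.printf("window needed for target rate: %d bytes%n",
                requiredWindowBytes(linkRate, rtt));
    }
}
```

Under these assumptions the bound is roughly 1.46 MB/s, while sustaining 1 Gbit/s at the same RTT would need a window of about 5.6 MB. Whatever the actual clamp is, the bound scales inversely with latency, which is why the problem only shows up on the high-latency link.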
Regarding whether or not the SO_RCVBUF setting is inherited from the listening socket, I found this in chapter 7 of _UNIX Network Programming, Volume 1: The Sockets Networking API, Third Edition_:
{quote}The following socket options are inherited by a connected TCP socket from the listening socket (pp. 462–463 of TCPv2): {{SO_DEBUG}}, {{SO_DONTROUTE}}, {{SO_KEEPALIVE}}, {{SO_LINGER}}, {{SO_OOBINLINE}}, {{SO_RCVBUF}}, {{SO_RCVLOWAT}}, {{SO_SNDBUF}}, {{SO_SNDLOWAT}}, {{TCP_MAXSEG}}, and {{TCP_NODELAY}}. This is important with TCP because the connected socket is not returned to a server by {{accept}} until the three-way handshake is completed by the TCP layer. To ensure that one of these socket options is set for the connected socket when the three-way handshake completes, we must set that option for the listening socket.
{quote}
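In plain java.net terms, the ordering described above could look like the following minimal sketch. It is not the JGroups code or the eventual fix; the class and method names are hypothetical. It relies only on standard calls: the no-argument ServerSocket and Socket constructors create unbound/unconnected sockets, so setReceiveBufferSize() can be applied before bind() and connect().

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RecvBufOrder {
    /** Listening side: set SO_RCVBUF on the still-unbound ServerSocket, then bind. */
    static ServerSocket listen(int port, int recvBufSize) throws IOException {
        ServerSocket srv = new ServerSocket();    // no-arg constructor: created unbound
        srv.setReceiveBufferSize(recvBufSize);    // applied before bind(), so before any handshake
        srv.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        return srv;
    }

    /** Connecting side: set SO_RCVBUF on the still-unconnected Socket, then connect. */
    static Socket connect(InetSocketAddress addr, int recvBufSize) throws IOException {
        Socket sock = new Socket();               // no-arg constructor: created unconnected
        sock.setReceiveBufferSize(recvBufSize);   // applied before connect()
        sock.connect(addr);
        return sock;
    }

    public static void main(String[] args) throws IOException {
        int size = 1 << 20;                       // ASSUMED 1 MiB request, for illustration only
        try (ServerSocket srv = listen(0, size);  // port 0: let the OS pick a free port
             Socket client = connect(new InetSocketAddress(
                     InetAddress.getLoopbackAddress(), srv.getLocalPort()), size);
             Socket accepted = srv.accept()) {
            // The reported values are whatever the OS actually granted, not necessarily 'size'.
            System.out.println("listening SO_RCVBUF: " + srv.getReceiveBufferSize());
            System.out.println("accepted  SO_RCVBUF: " + accepted.getReceiveBufferSize());
            System.out.println("client    SO_RCVBUF: " + client.getReceiveBufferSize());
        }
    }
}
```

The requested size is only a hint. On Linux the granted value is typically capped by net.core.rmem_max and reported back doubled, so the printed numbers will usually differ from the request.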
> Poor throughput over high latency TCP connection when recv_buf_size is configured
> ---------------------------------------------------------------------------------
>
> Key: JGRP-2504
> URL: https://issues.redhat.com/browse/JGRP-2504
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 5.0.0.Final
> Reporter: Andrew Skalski
> Assignee: Bela Ban
> Priority: Minor
> Fix For: 5.1
>
> Attachments: SpeedTest.java, bla5.java, bla6.java, bla7.java, delay-ip.sh
>
>
> I recently finished troubleshooting a unidirectional throughput bottleneck involving a JGroups application (Infinispan) communicating over a high-latency (~45 milliseconds) TCP connection.
> The root cause was JGroups improperly configuring the receive/send buffers on the listening socket. According to the tcp(7) man page:
> {code:java}
> On individual connections, the socket buffer size must be set prior to
> the listen(2) or connect(2) calls in order to have it take effect.
> {code}
> However, JGroups does not set the buffer size on the listening side until after accept().
> The result is poor throughput when sending data from client (connecting side) to server (listening side). Because the issue is a too-small TCP receive window, throughput is ultimately latency-bound.
--
This message was sent by Atlassian Jira
(v7.13.8#713008)