Andrew Skalski commented on JGRP-2504:
--------------------------------------
Here's what I found based on further research.
The Linux kernel manages the TCP receive window in two different ways (this is a
simplification) depending on whether or not the SO_RCVBUF socket option was explicitly set
by the application. By default, the kernel automatically manages the receive buffer and
window clamp (an internal per-socket variable that acts as a ceiling on window size) based
on the min/default/max sizes defined in the net.ipv4.tcp_rmem sysctl.
If the application explicitly configures SO_RCVBUF, this automatic management is
disabled. The window clamp is initialized once at handshake time, and does not change,
even if SO_RCVBUF is subsequently updated.
In the case where the application configures SO_RCVBUF "too late" (after the
handshake), the socket ends up in a state where:
* Window clamp was initialized based on a small default value
* Automatic growth of window clamp is disabled
Consequently, throughput is actually worse than if SO_RCVBUF had been left unconfigured. (This
also explains why recent versions of JGroups performed better for me _by default_ than
older versions: the built-in default value for recv_buf_size was removed.)
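The "too late" sequence can be sketched as follows. This is an illustrative, minimal reproduction of the ordering only (class name and buffer size are mine); the throughput effect itself only shows up on a real high-latency path, e.g. one set up with the attached delay-ip.sh:
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RcvbufTooLate {
    // Returns the SO_RCVBUF the kernel reports after a late resize.
    static int lateResize() throws IOException {
        try (ServerSocket server = new ServerSocket()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket accepted = server.accept()) {
                // Too late: the three-way handshake already completed, so the
                // kernel fixed the window clamp at the (small) default and
                // disabled receive-buffer autotuning for this connection.
                accepted.setReceiveBufferSize(4 * 1024 * 1024);
                return accepted.getReceiveBufferSize();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // getReceiveBufferSize() may report the new size, but on a
        // high-latency path throughput stays bound by the original clamp.
        System.out.println("SO_RCVBUF after late resize = " + lateResize());
    }
}
{code}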
Regarding whether or not the SO_RCVBUF setting is inherited from the listening socket, I
found this in chapter 7 of _UNIX Network Programming, Volume 1: The Sockets Networking
API, Third Edition_:
{quote}The following socket options are inherited by a connected TCP socket from the
listening socket (pp. 462–463 of TCPv2): {{SO_DEBUG}}, {{SO_DONTROUTE}}, {{SO_KEEPALIVE}},
{{SO_LINGER}}, {{SO_OOBINLINE}}, {{SO_RCVBUF}}, {{SO_RCVLOWAT}}, {{SO_SNDBUF}},
{{SO_SNDLOWAT}}, {{TCP_MAXSEG}}, and {{TCP_NODELAY}}. This is important with TCP because
the connected socket is not returned to a server by {{accept}} until the three-way
handshake is completed by the TCP layer. To ensure that one of these socket options is set
for the connected socket when the three-way handshake completes, we must set that option
for the listening socket.
{quote}
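The inheritance rule above suggests the fix: configure SO_RCVBUF on the listening socket before bind()/listen(), so every accepted connection completes its handshake with the larger window clamp already in place. A minimal sketch of that ordering (class name and the 4 MB value are illustrative):
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RcvbufBeforeBind {
    // Configure SO_RCVBUF on the listener *before* bind(), then report what
    // an accepted connection inherited.
    static int acceptedRcvbuf() throws IOException {
        // Create the ServerSocket unbound so the option can be set first.
        try (ServerSocket server = new ServerSocket()) {
            // Must precede bind(): the window clamp (and window scale) are
            // fixed when the three-way handshake completes.
            server.setReceiveBufferSize(4 * 1024 * 1024); // 4 MB, illustrative
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket accepted = server.accept()) {
                // Accepted sockets inherit SO_RCVBUF from the listener; the
                // kernel may cap the value at net.core.rmem_max.
                return accepted.getReceiveBufferSize();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("inherited SO_RCVBUF = " + acceptedRcvbuf());
    }
}
{code}
Note that java.net.ServerSocket exposes setReceiveBufferSize() precisely so the option can be applied before bind(); this is also required for the kernel to advertise a window scale large enough for buffers over 64 KB.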
Poor throughput over high latency TCP connection when recv_buf_size
is configured
---------------------------------------------------------------------------------
Key: JGRP-2504
URL: https://issues.redhat.com/browse/JGRP-2504
Project: JGroups
Issue Type: Bug
Affects Versions: 5.0.0.Final
Reporter: Andrew Skalski
Assignee: Bela Ban
Priority: Minor
Fix For: 5.1
Attachments: SpeedTest.java, bla5.java, bla6.java, bla7.java, delay-ip.sh
I recently finished troubleshooting a unidirectional throughput bottleneck involving a
JGroups application (Infinispan) communicating over a high-latency (~45 milliseconds) TCP
connection.
The root cause was JGroups improperly configuring the receive/send buffers on the
listening socket. According to the tcp(7) man page:
{code:java}
On individual connections, the socket buffer size must be set prior to
the listen(2) or connect(2) calls in order to have it take effect.
{code}
However, JGroups does not set the buffer size on the listening side until after
accept().
The result is poor throughput when sending data from the client (connecting side) to the server
(listening side). Because the issue is a too-small TCP receive window, throughput is
ultimately latency-bound.
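To see why the small window is latency-bound, note that a sender can have at most one receive window of data in flight per round trip, so throughput is capped at window / RTT. A quick worked calculation (the 64 KiB window is an assumed default; the ~45 ms RTT is from the report above):
{code:java}
public class WindowBound {
    // Latency-bound throughput in bytes/second: window / RTT.
    static double throughputBytesPerSec(int windowBytes, double rttSeconds) {
        return windowBytes / rttSeconds;
    }

    public static void main(String[] args) {
        double bps = throughputBytesPerSec(64 * 1024, 0.045);
        // Roughly 1.46 MB/s regardless of link capacity: the sender stalls
        // waiting for ACKs once a full window is in flight.
        System.out.printf("max throughput = %.2f MB/s%n", bps / 1e6);
    }
}
{code}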
--
This message was sent by Atlassian Jira
(v7.13.8#713008)