Bela Ban resolved JGRP-1873.
----------------------------
Fix Version/s: (was: 3.5.1)
Resolution: Done
UNICAST2: unilateral connection close of receiver can lead to missing
seqnos in sender
--------------------------------------------------------------------------------------
Key: JGRP-1873
URL:
https://issues.jboss.org/browse/JGRP-1873
Project: JGroups
Issue Type: Bug
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.6
In {{UNICAST2}}, if we have a connection between sender A and receiver B, and B closes
the connection (but not A), then A can end up with missing messages in its send table.
Example:
* A sends messages to B
* A has an entry for B in its send-table: {{B: 10|20}} (lowest sent=10, highest sent=20)
* B has an entry for A in its recv-table: {{A: 10|20}} (lowest received=10, highest
received=20)
* B now gets a view that doesn't contain A and closes its connection to A
** This results in B's connection to A getting removed
* A now sends message {{A::21}}
* B doesn't find an entry in its recv-table for A and sends {{GET-FIRST-SEQNO}} to A
* A receives the request and sends messages {{A::11 first}} - {{A::21}} to B. These
messages are sent unreliably, so they can get dropped. Let's assume (for this example)
that some of them are dropped.
* B does receive {{A::11 first}} and creates an entry for A in its recv-table:
{{A: 11|21}} (next to be received is {{A::12}})
* Now a spurious {{STABLE(A::15)}} message by B is received by A
** This can happen when B sent the {{STABLE}} message *before* its connection to A was
removed, but the message was delayed, e.g. by garbage collection
** Note that the connection ID ({{conn-id}}) is the same, so A will _not_ reject the
{{STABLE}} message by B
* A receives the {{STABLE}} message and purges elements up to 15, so its new entry for B
is: {{B: 15|21}}
* When B asks A for retransmission of messages {{A::12}} - {{A::21}}, A can only
retransmit messages 16-21, but *not* {{A::12}} - {{A::15}}!
Depending on which messages from A (which it sent unreliably on reception of
{{GET-FIRST-SEQNO}}) were received by B, there would be never-ending retransmission
requests from B to A for all or some messages in {{A[12..15]}}, e.g.
{noformat}
WARN [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in
retransmission table of B: [15 | 15 | 22] (X elements, Y missing)
{noformat}
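The scenario above can be sketched in a few lines of Java. This is a simplified model only: {{SendTableSketch}} and its methods are hypothetical names, not the real {{UNICAST2}} retransmission table, and it assumes A still holds {{A::11}} - {{A::21}} for B when the delayed {{STABLE}} message arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of A's send table for B; not the real UNICAST2 data structure.
class SendTableSketch {
    // seqno -> message payload (a label is enough for this sketch)
    private final TreeMap<Long, String> msgs = new TreeMap<>();

    void add(long seqno) {
        msgs.put(seqno, "A::" + seqno);
    }

    // Models STABLE(seqno): purge everything up to and including seqno.
    void stable(long seqno) {
        msgs.headMap(seqno, true).clear();
    }

    // Returns the seqnos in [from..to] that A can no longer retransmit.
    List<Long> missing(long from, long to) {
        List<Long> result = new ArrayList<>();
        for (long s = from; s <= to; s++) {
            if (!msgs.containsKey(s)) {
                result.add(s);
            }
        }
        return result;
    }

    // Replays the example: delayed STABLE(A::15), then B requests A::12 - A::21.
    static List<Long> scenario() {
        SendTableSketch a = new SendTableSketch();
        for (long s = 11; s <= 21; s++) {
            a.add(s);               // assumed contents of A's table for B
        }
        a.stable(15);               // spurious STABLE from B's old connection
        return a.missing(12, 21);   // retransmission request from B
    }
}
```

In this model, {{SendTableSketch.scenario()}} returns the seqnos 12 to 15, which are exactly the messages B keeps requesting and A can never deliver.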
h5. Reordering of STABLE messages
In the worst case, as {{STABLE}} messages are not sent reliably and can therefore get
dropped or reordered, if A gets another {{STABLE(10)}} message after the {{STABLE(15)}}
message, the error message above would look like this:
{noformat}
WARN [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in
retransmission table of B: [10 | 10 | 22] (X elements, Y missing)
{noformat}
Note that, with
https://issues.jboss.org/browse/JGRP-1872 fixed, this cannot occur
anymore.
h5. Solution
There's no real solution but to upgrade to {{UNICAST3}}: when {{UNICAST3}} receives a
view, it doesn't _remove_ receive (and send) connections immediately, but merely marks
them as _closed_. The connection will only be removed after {{conn_close_timeout}} ms. If
B therefore gets further messages from A, it will simply mark the receive connection as
_open_ and doesn't need to send a {{GET-FIRST-SEQNO}} message to A as it still has all
of A's messages.
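A minimal sketch of the "mark as closed, remove later" idea follows. It is not the {{UNICAST3}} implementation: {{RecvConnTableSketch}}, {{Entry}} and all method names are hypothetical, and time is passed in explicitly to keep the example deterministic. Only the {{conn_close_timeout}} concept is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical receive-connection table; illustrates the idea only.
class RecvConnTableSketch {
    static final class Entry {
        long highestReceived;
        long closedAtMs = -1;       // -1 means the connection is open

        Entry(long highestReceived) {
            this.highestReceived = highestReceived;
        }

        boolean isClosed() {
            return closedAtMs >= 0;
        }
    }

    private final Map<String, Entry> conns = new HashMap<>();
    private final long connCloseTimeoutMs;

    RecvConnTableSketch(long connCloseTimeoutMs) {
        this.connCloseTimeoutMs = connCloseTimeoutMs;
    }

    // Called when a view excludes the sender: mark closed, do not remove.
    void markClosed(String sender, long nowMs) {
        Entry e = conns.get(sender);
        if (e != null && !e.isClosed()) {
            e.closedAtMs = nowMs;
        }
    }

    // Periodic reaper: remove connections closed for at least the timeout.
    void reap(long nowMs) {
        conns.values().removeIf(
            e -> e.isClosed() && nowMs - e.closedAtMs >= connCloseTimeoutMs);
    }

    // Returns true if state for the sender still existed (a closed entry is
    // simply reopened); false if a new entry had to be created, which is the
    // case where the receiver would have to ask for the first seqno.
    boolean receive(String sender, long seqno) {
        Entry e = conns.get(sender);
        if (e == null) {
            conns.put(sender, new Entry(seqno));
            return false;
        }
        e.closedAtMs = -1;
        e.highestReceived = Math.max(e.highestReceived, seqno);
        return true;
    }
}
```

As long as A's next message arrives before the reaper runs past the timeout, B reopens the existing entry and no {{GET-FIRST-SEQNO}} round trip is needed.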
We could think of a connection establishment and teardown protocol used by all of the
unicast protocols, which establishes connections similar to TCP. Senders would block until
a connection is established, etc., and new conn-ids would be created, plus the current send
and receive seqnos would be exchanged. This could also be used as a second line of
defense, to re-establish the connection when a sender doesn't find messages requested
for retransmission by a receiver. As an alternative, we could create a new protocol which
syncs a receive table with a sender, e.g.
https://issues.jboss.org/browse/JGRP-1875.
To mitigate the above issue, {{FD_ALL}} rather than {{FD}} should be used, so that
members suspect each other more or less at the same time. This is not the case with {{FD}},
where suspecting multiple hung (or GC'ing) members takes N * timeout. With
{{FD_ALL}}, chances are that A and B suspect each other and later, both establish a new
connection.
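The timing difference can be illustrated with a deliberately simplified model. The class and method names are hypothetical, and the model ignores heartbeat intervals and retries: it only encodes the statement above that {{FD}} needs N * timeout for N hung members, and assumes {{FD_ALL}} suspects all of them after roughly one timeout.

```java
// Simplified timing model; not derived from the FD or FD_ALL source code.
class SuspicionTimingSketch {
    // FD: hung members are suspected one after another.
    static long fdWorstCaseMs(int hungMembers, long timeoutMs) {
        return hungMembers * timeoutMs;
    }

    // FD_ALL (assumption): all hung members are suspected after about one timeout.
    static long fdAllApproxMs(int hungMembers, long timeoutMs) {
        return hungMembers > 0 ? timeoutMs : 0;
    }
}
```

With 3 hung members and a 3000 ms timeout, this model gives 9000 ms for {{FD}} versus about 3000 ms for {{FD_ALL}}, which is why A and B are more likely to drop each other at around the same time.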