[jboss-jira] [JBoss JIRA] (JGRP-1873) UNICAST2: unilateral connection close of receiver can lead to missing seqnos in sender

Thu Aug 28 08:14:01 EDT 2014

     [ https://issues.jboss.org/browse/JGRP-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban updated JGRP-1873:
---------------------------
    Description: 
In {{UNICAST2}}, if we have a connection between sender A and receiver B, and B closes the connection (but not A), then A can end up with missing messages in its send table.
Example:
* A sends messages to B
* A has an entry for B in its send-table: {{B: 10|20}} (lowest sent=10, highest sent=20)
* B has an entry for A in its recv-table: {{A: 10|20}} (lowest received=10, highest received=20)
* B now gets a view that doesn't contain A and closes its connection to A
** This results in B's connection to A getting removed
* A now sends message {{A::21}}
* B doesn't find an entry in its recv-table for A and sends {{GET-FIRST-SEQNO}} to A
* A receives the request and sends messages {{A::11 first}} - {{A::21}} to B. These messages are sent unreliably, so they can get dropped. Let's assume (for this example) that some of them are dropped.
* B does receive {{A::11 first}} and creates an entry for A in its recv-table: {{A: 11|21}} (next to be received is {{A::12}})
* Now a spurious {{STABLE(A::15)}} message by B is received by A
** This can happen when B sent the {{STABLE}} message *before* its connection to A was removed, but the message was delayed, e.g. by garbage collection
** Note that the connection ID ({{conn-id}}) is the same, so A will _not_ reject the {{STABLE}} message by B
* A receives the {{STABLE}} message and purges elements up to 15, so its new entry for B is: {{B: 15|21}}
* When B asks A for retransmission of messages {{A::12}} - {{A::21}}, A can only retransmit messages 16-21, but *not* {{A::12}} - {{A::15}}!
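
The purge step in the example above can be reproduced with a small model of A's send-table entry for B. This is only a sketch: {{SendTableSketch}} and its methods are hypothetical names made up for illustration and are not part of JGroups. It assumes that a {{STABLE(n)}} purges all seqnos up to and including n, as described in the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified model of A's send-table entry for B (not JGroups code).
final class SendTableSketch {
    private final TreeMap<Long, String> msgs = new TreeMap<>();

    // Records a sent message under its seqno.
    void send(long seqno) {
        msgs.put(seqno, "A::" + seqno);
    }

    // STABLE(seqno): purges every message with a seqno <= the given one.
    void stable(long seqno) {
        msgs.headMap(seqno, true).clear();
    }

    // Returns the seqnos in [from..to] that can no longer be retransmitted.
    List<Long> missing(long from, long to) {
        List<Long> result = new ArrayList<>();
        for (long s = from; s <= to; s++) {
            if (!msgs.containsKey(s))
                result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        SendTableSketch a = new SendTableSketch();
        for (long s = 11; s <= 21; s++)
            a.send(s);                          // A::11 .. A::21 are in the table
        a.stable(15);                           // delayed STABLE(A::15) from B arrives
        System.out.println(a.missing(12, 21));  // prints [12, 13, 14, 15]
    }
}
```

B's retransmission request for 12-21 can therefore only be satisfied for 16-21, which matches the last step of the example.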

Depending on which of the messages that A sent unreliably in response to {{GET-FIRST-SEQNO}} were received by B, B would send never-ending retransmission requests to A for some or all of the messages in {{A[12..15]}}, e.g.
{noformat}
WARN  [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in 
retransmission table of B: [15 | 15 | 22] (X elements, Y missing)
{noformat}

h5. Reordering of STABLE messages
{{STABLE}} messages are not sent reliably and can therefore get dropped or reordered. In the worst case, if A gets another {{STABLE(10)}} message after the {{STABLE(15)}} message, the error message above would look like this:
{noformat}
WARN  [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in
retransmission table of B: [10 | 10 | 22] (X elements, Y missing)
{noformat}
Note that, with https://issues.jboss.org/browse/JGRP-1872 fixed, this cannot occur anymore.
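
One way to make a reordered {{STABLE}} harmless is for the sender to ignore any {{STABLE}} whose seqno is not higher than the highest one already applied. The sketch below illustrates that idea only; it is an assumption, not a description of the actual JGRP-1872 fix, and {{StableGuardSketch}} is a hypothetical name.

```java
// Hypothetical guard against stale or reordered STABLE messages (not JGroups code).
final class StableGuardSketch {
    private long low = 0; // highest seqno purged so far

    // Returns true if the STABLE was applied, false if it was ignored as stale.
    boolean onStable(long seqno) {
        if (seqno <= low)
            return false; // older than (or equal to) what was already purged
        low = seqno;
        return true;
    }

    long low() {
        return low;
    }
}
```

With such a guard, a {{STABLE(10)}} arriving after {{STABLE(15)}} would leave the low mark at 15. It does not bring back messages that were already purged.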

h5. Solution
There's no real solution other than upgrading to {{UNICAST3}}: when {{UNICAST3}} receives a view, it doesn't _remove_ receive (and send) connections immediately, but merely marks them as _closed_. A connection is only removed after {{conn_close_timeout}} ms. If B gets further messages from A before that, it simply marks the receive connection as _open_ again and doesn't need to send a {{GET-FIRST-SEQNO}} message to A, as it still has all of A's messages.
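
The mark-as-closed behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the {{UNICAST3}} implementation: the class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and only the receive side is modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a receive table that closes connections lazily (not JGroups code).
final class ClosableConnectionsSketch {
    private static final class Entry {
        boolean open = true;
        long closedAtMs;
    }

    private final Map<String, Entry> recvTable = new HashMap<>();
    private final long connCloseTimeoutMs; // plays the role of conn_close_timeout

    ClosableConnectionsSketch(long connCloseTimeoutMs) {
        this.connCloseTimeoutMs = connCloseTimeoutMs;
    }

    // A message from 'sender' arrived. Returns true if an existing entry was reused
    // (no GET-FIRST-SEQNO needed), false if a new entry had to be created.
    boolean onMessage(String sender) {
        Entry e = recvTable.get(sender);
        if (e == null) {
            recvTable.put(sender, new Entry());
            return false;
        }
        e.open = true; // reopen a connection that was only marked as closed
        return true;
    }

    // A view without 'sender' was received: mark the connection as closed, don't remove it.
    void onMemberLeft(String sender, long nowMs) {
        Entry e = recvTable.get(sender);
        if (e != null && e.open) {
            e.open = false;
            e.closedAtMs = nowMs;
        }
    }

    // Removes connections that have been closed for at least connCloseTimeoutMs.
    void reap(long nowMs) {
        recvTable.values().removeIf(e -> !e.open && nowMs - e.closedAtMs >= connCloseTimeoutMs);
    }

    boolean hasEntry(String sender) {
        return recvTable.containsKey(sender);
    }
}
```

In this model, a message from A that arrives within the timeout finds the old entry and reopens it, so B keeps its seqno state for A and the scenario from the example does not start.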

We could think of a connection establishment and teardown protocol used by all of the unicast protocols, which establishes connections similar to TCP. Senders would block until a connection is established, new conn-ids would be created, and the current send and receive seqnos would be exchanged. This could also be used as a second line of defense, to re-establish the connection when a sender doesn't find messages requested for retransmission by a receiver. As an alternative, we could create a new protocol which syncs a receive table with a sender, e.g. https://issues.jboss.org/browse/JGRP-1875.

To mitigate the above issue, {{FD_ALL}} rather than {{FD}} should be used, so that members suspect each other more or less at the same time. This is not the case with {{FD}}, where multiple hung (or GC'ing) members take N * timeout to be suspected. With {{FD_ALL}}, chances are that A and B suspect each other and later both establish a new connection.

> UNICAST2: unilateral connection close of receiver can lead to missing seqnos in sender
> --------------------------------------------------------------------------------------
>
>                 Key: JGRP-1873
>                 URL: https://issues.jboss.org/browse/JGRP-1873
>             Project: JGroups
>          Issue Type: Bug
>      Security Level: Public(Everyone can see) 
>            Reporter: Bela Ban
>            Assignee: Bela Ban
>             Fix For: 3.5
>
>
