[jboss-jira] [JBoss JIRA] Commented: (JGRP-348) UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
Bela Ban (JIRA)
jira-events at jboss.com
Fri Oct 27 08:40:42 EDT 2006
[ http://jira.jboss.com/jira/browse/JGRP-348?page=comments#action_12345910 ]
Bela Ban commented on JGRP-348:
-------------------------------
I think the issue is even more complex: what if b1 didn't suspect a1 and therefore didn't remove its connection to a1? Say the seqno for the a1 connection is 24. When b1 sends a message to a1 with seqno=25, a1 would create a new connection with seqno=24 for b1 (according to your suggested fix), which is okay. However, when a1 sends a message with seqno=1 to b1, b1 will simply discard it because the next seqno it expects from a1 is 25.
Another case:
- {a1,a2,a3, b1, a4,a5,a6}. All a's are on the first machine, b1 is on the 2nd machine. b1 pings a4 in this example
- Plug is pulled
- a3 suspects b1, new view on machine #1 is {a1,a2,a3,a4,a5,a6}. This happens after ca. 10 seconds
- b1 suspects a4, a5 and a6 after 30 seconds, so its view is {a1,a2,a3,b1}
- Then the plug is reinserted
- Issue #1: when a1-3 want to send a message to b1, the seqno is 1 because the connection is newly created: b1 will discard all seqnos < 25
- Issue #2: when b1 wants to send a message to a1-3, it will send it with an existing seqno, because the connections to a1-3 were not purged
SOLUTION to #1: when a member sends the *first* unicast to a previously excluded member, it includes in the UnicastHeader a flag that says the receiver should reset that connection, so b1 would reset its seqno to 1 and therefore accept a1's seqno=1
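The reset flag proposed for #1 can be sketched roughly as follows. This is a minimal illustration, not the UNICAST code: the Header and ReceiverEntry classes, their fields, and the choice to restart at the seqno carried in the header are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ResetFlagSketch {
    /** Hypothetical header: a seqno plus the proposed "reset" flag (names are illustrative). */
    static final class Header {
        final long seqno;
        final boolean reset;
        Header(long seqno, boolean reset) { this.seqno = seqno; this.reset = reset; }
    }

    /** Receiver-side state for one peer; only the next expected seqno is delivered. */
    static final class ReceiverEntry {
        long nextExpected;
        final List<Long> delivered = new ArrayList<>();

        ReceiverEntry(long nextExpected) { this.nextExpected = nextExpected; }

        /** Returns true if the message was delivered. */
        boolean receive(Header hdr) {
            if (hdr.reset) {
                nextExpected = hdr.seqno; // assumption: restart at the sender's seqno
            }
            if (hdr.seqno != nextExpected) {
                return false; // stale or out of order: not delivered in this sketch
            }
            delivered.add(hdr.seqno);
            nextExpected++;
            return true;
        }
    }

    /** b1 still expects seqno 25 from a1; a1 sends its first message with seqno 1. */
    static boolean deliverFirstMessage(boolean withResetFlag) {
        ReceiverEntry b1 = new ReceiverEntry(25);
        return b1.receive(new Header(1, withResetFlag));
    }

    public static void main(String[] args) {
        System.out.println("without reset flag: " + deliverFirstMessage(false)); // false
        System.out.println("with reset flag:    " + deliverFirstMessage(true));  // true
    }
}
```

Without the flag b1 discards a1's seqno=1 (Issue #1); with the flag b1 resets its expectation and accepts it.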
SOLUTION to #2: when a member sees the first message (e.g. b1's seqno=24) from a previously excluded member P, it sets its seqno for that member to the one received from P. PROBLEM: that seqno might *not* be the first one! E.g. if b1 sends seqnos 24 and 25, and 24 is dropped on the way to a1, then the first seqno a1 sees would be 25, and a1 would therefore drop a subsequently retransmitted 24!
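The problem with #2 can be reproduced in a few lines. The sketch below is only a model of the idea: ReceiverEntry, its nextExpected field and the string results are hypothetical and not taken from UNICAST.

```java
class AdoptFirstSeqnoSketch {
    /** Receiver-side state for one peer that adopts the first seqno it sees. */
    static final class ReceiverEntry {
        private long nextExpected = -1; // -1: nothing received from this peer yet

        String receive(long seqno) {
            if (nextExpected < 0) {
                nextExpected = seqno; // adopt the first seqno seen as the starting point
            }
            if (seqno < nextExpected) {
                return "dropped"; // looks like an old duplicate
            }
            if (seqno > nextExpected) {
                return "buffered"; // would wait for the gap to be filled
            }
            nextExpected++;
            return "delivered";
        }
    }

    /** b1 sends 24 and 25; 24 is lost, so a1 sees 25 first and the retransmitted 24 later. */
    static String[] run() {
        ReceiverEntry a1 = new ReceiverEntry();
        String first = a1.receive(25);
        String retransmitted = a1.receive(24);
        return new String[] { first, retransmitted };
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println("seqno 25: " + r[0]); // delivered
        System.out.println("seqno 24: " + r[1]); // dropped
    }
}
```

Because a1 adopts 25 as its starting point, the retransmitted 24 is treated as stale and is lost for good.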
> UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
> -------------------------------------------------------------------------------------------------
>
> Key: JGRP-348
> URL: http://jira.jboss.com/jira/browse/JGRP-348
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.4
> Reporter: Bela Ban
> Assigned To: Bela Ban
> Fix For: 2.5
>
>
> Mail from David Foregt:
> Hi Bela,
> We still have an issue with JGroups 2.4 and UNICAST after applying your
> recommended settings. We spent more time analyzing the issue and found the
> exact scenario that causes the problem:
> - We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and one
> node running on machine B (b1).
> - The b1 node is started first (coord), then all the a nodes are started.
> When all nodes were active in the group, we disconnected machine A from the
> network.
> - After ~10 sec all the a nodes see b1 as dead, a new view is propagated to
> all of them, and the connection table entry for b1 is cleared on each.
> - b1 starts seeing the a nodes as dead one by one, every ~10 sec (as defined
> by FD / VERIFY_SUSPECT). After 30 sec b1's view was (a4, a5...a15) and we
> reconnected the network cable on machine A. (b1's connection table was
> cleared for a1...a3 only.)
> - After A reconnected to the network, a merge was done and all nodes are back
> in the group and able to exchange multicast messages.
> - Because b1 did not detect a4...a15 as dead, the seqno has not been reset to
> 1 when it sends a unicast message to those nodes. When a4 receives the first
> unicast message from b1 (its connection table entry for b1 having been
> cleared), it creates a new AckReceiverWindow with initial_seqno = 1 at line
> 453 of UNICAST and adds the received message (which has a seqno > 1) to it.
> All subsequent unicast messages received from b1 are added to this new
> AckReceiverWindow, and when remove is called at line 470 of UNICAST it
> always returns null, because AckReceiverWindow::next_to_remove is equal to 1
> and the messages being added all have a seqno > 1.
> The result is that a4...a15 will never be able to receive any further unicast
> messages from b1. This is reproducible every time.
> Our quick fix, which seems to work fine, is to change UNICAST line 453 as
> follows (I am not sure about potential bugs introduced by this):
> entry.received_msgs=new AckReceiverWindow(seqno);
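The stall described in the quoted mail, and the effect of the quick fix, can be modelled with a small stand-in for the receiver window. This is a simplified sketch and not the JGroups AckReceiverWindow: the class name, the map-based storage and the drain helper are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ReceiverWindowSketch {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long nextToRemove;

    ReceiverWindowSketch(long initialSeqno) { this.nextToRemove = initialSeqno; }

    /** Stores a message unless it is older than the next seqno to remove. */
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, msg);
        }
    }

    /** Returns the message with seqno == nextToRemove, or null if it has not arrived. */
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }

    /** Adds count consecutive messages starting at firstSeqno, then removes all it can. */
    static List<String> drain(long initialSeqno, long firstSeqno, int count) {
        ReceiverWindowSketch win = new ReceiverWindowSketch(initialSeqno);
        for (int i = 0; i < count; i++) {
            win.add(firstSeqno + i, "m" + (firstSeqno + i));
        }
        List<String> out = new ArrayList<>();
        String m;
        while ((m = win.remove()) != null) {
            out.add(m);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(1, 25, 3));  // window starts at 1: []
        System.out.println(drain(25, 25, 3)); // window starts at first seqno: [m25, m26, m27]
    }
}
```

With the window starting at 1, remove() keeps returning null and nothing is ever delivered; starting it at the first received seqno lets the messages drain, subject to the caveats raised in the comment above.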