[jboss-jira] [JBoss JIRA] Commented: (JGRP-348) UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other

Bela Ban (JIRA) jira-events at lists.jboss.org
Mon Mar 12 05:19:47 EDT 2007


    [ http://jira.jboss.com/jira/browse/JGRP-348?page=comments#action_12355788 ] 
            
Bela Ban commented on JGRP-348:
-------------------------------

Maybe the simplest solution would be to remove the connections for all members in the MergeView, so that everyone in the new MergeView starts again with seqno=1.
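
A minimal sketch of that idea, assuming a simplified per-peer connection table. The class ConnectionTableSketch, its Entry type, and the onMergeView method are hypothetical names for illustration, not the actual UNICAST code, and members are identified by plain strings instead of JGroups addresses:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for UNICAST's per-peer connection table (not the JGroups class).
class ConnectionTableSketch {
    // Per-peer state: the next seqno to send and the next seqno to deliver.
    static final class Entry {
        long nextToSend = 1;
        long nextToDeliver = 1;
    }

    private final Map<String, Entry> connections = new HashMap<>();

    // Returns the entry for a peer, creating a fresh one (seqno=1) if none exists.
    Entry getOrCreate(String peer) {
        return connections.computeIfAbsent(peer, p -> new Entry());
    }

    boolean hasConnection(String peer) {
        return connections.containsKey(peer);
    }

    // Assumed hook: called with the members of the merged view. Dropping every
    // member's entry means both sides recreate their state starting at seqno=1.
    void onMergeView(List<String> mergedMembers) {
        for (String member : mergedMembers) {
            connections.remove(member);
        }
    }
}
```

Because every member of the merged view drops its entries, sender and receiver state are recreated together, regardless of which members had been suspected before the merge.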

> UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
> -------------------------------------------------------------------------------------------------
>
>                 Key: JGRP-348
>                 URL: http://jira.jboss.com/jira/browse/JGRP-348
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.4
>            Reporter: Bela Ban
>         Assigned To: Bela Ban
>             Fix For: 2.5
>
>
> Mail from David Foregt:
> Hi Bela,
> We still have an issue with JGroups 2.4 and UNICAST after applying your
> recommended settings. We spent more time analyzing the issue and found the
> exact scenario that causes the problem:
> - We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and
> one node running on machine B (b1).
> - The b1 node is started first (coord), then all of the a nodes are
> started. When all nodes were active in the group, we disconnected machine A
> from the network.
> - After ~10 sec, all a nodes saw b1 as dead, a new view was propagated to
> all a nodes, and the connection table entry for b1 was cleared on all a
> nodes.
> - b1 started seeing the a nodes as dead one by one, every ~10 sec (as
> defined by FD / VERIFY_SUSPECT). After 30 sec, b1's view was (a4, a5...a15)
> and we reconnected the network cable on machine A. (b1's connection table
> had been cleared only for a1...a3.)
> - After A reconnected to the network, a merge was done; all nodes were back
> in the group and able to exchange multicast messages.
> - Because b1 did not detect a4...a15 as dead, its seqno was not reset to 1
> when it sent a unicast message to those nodes. When a4 receives the first
> unicast message from b1, it creates (because its connection table was
> cleared for b1) a new AckReceiverWindow with initial_seqno = 1 at line 453
> of UNICAST and adds the received message (which has a seqno > 1) to it.
> All subsequent unicast messages received from b1 are added to this new
> AckReceiverWindow, and when remove is called at line 470 of UNICAST it
> always returns null, because AckReceiverWindow::next_to_remove is equal to
> 1 and the messages being added to the AckReceiverWindow have a seqno > 1.
> The result is that a4...a15 will never be able to receive any other unicast
> message from b1. This is reproducible every time.
> Our quick fix, which appears to work fine, is to change UNICAST line 453 as
> follows (I am not sure about potential bugs introduced by this):
> entry.received_msgs=new AckReceiverWindow(seqno); 
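
The stall described in the quoted mail can be reproduced with a simplified model of an in-order receiver window. This is a sketch only: ReceiverWindowSketch is a hypothetical class that mimics the behavior described above (messages are delivered strictly in seqno order, starting at the initial seqno); it is not the real AckReceiverWindow, and the seqno values 17 and 18 are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of an in-order receiver window.
class ReceiverWindowSketch {
    private final Map<Long, String> pending = new HashMap<>();
    private long nextToRemove;

    ReceiverWindowSketch(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffers a message unless its seqno has already been delivered.
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            pending.putIfAbsent(seqno, msg);
        }
    }

    // Returns the message with seqno == nextToRemove, or null if it has not arrived.
    String remove() {
        String msg = pending.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }
}

class ReceiverWindowDemo {
    public static void main(String[] args) {
        // Receiver state was cleared, sender state was not: the window waits for seqno 1.
        ReceiverWindowSketch stalled = new ReceiverWindowSketch(1);
        stalled.add(17, "m17");
        stalled.add(18, "m18");
        System.out.println(stalled.remove()); // null: seqno 1 never arrives

        // The reporter's quick fix: start the window at the first seqno received.
        ReceiverWindowSketch patched = new ReceiverWindowSketch(17);
        patched.add(17, "m17");
        patched.add(18, "m18");
        System.out.println(patched.remove()); // m17
        System.out.println(patched.remove()); // m18
    }
}
```

In this model the quick fix unblocks delivery, but the window then trusts whatever seqno happens to arrive first, which may be why the reporter was unsure about side effects.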
