[jboss-jira] [JBoss JIRA] Created: (JGRP-348) UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other

Bela Ban (JIRA) jira-events at jboss.com
Fri Oct 27 08:34:41 EDT 2006


UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
-------------------------------------------------------------------------------------------------

                 Key: JGRP-348
                 URL: http://jira.jboss.com/jira/browse/JGRP-348
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.4
            Reporter: Bela Ban
         Assigned To: Bela Ban
             Fix For: 2.5


Mail from David Foregt:
Hi Bela,
	We still have an issue with JGroups 2.4 and UNICAST after applying your
recommended settings. We spent more time analyzing the issue and found the
exact scenario that causes the problem:

- We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and one
node running on machine B (b1).

- Node b1 is started first (and becomes coordinator), then all the a nodes
are started. Once all nodes are active in the group, we disconnect machine A
from the network.

- After ~10 sec all the a nodes see b1 as dead, a new view is propagated to
all of them, and each a node clears its connection table entry for b1.

- b1 starts seeing the a nodes as dead one by one, every ~10 sec (as defined
by FD / VERIFY_SUSPECT). After 30 sec b1's view is (a4, a5...a15), and at
that point we reconnect the network cable on machine A. (b1's connection
table was cleared only for a1...a3.)

- After A reconnects to the network, a merge takes place; all nodes are back
in the group and are able to exchange multicast messages.

- Because b1 never detected a4...a15 as dead, its seqnos for those nodes were
not reset to 1, so the next unicast message it sends to them carries a
seqno > 1. When a4 receives the first unicast message from b1 after the
merge, it has no entry for b1 (its connection table was cleared), so at line
453 of UNICAST it creates a new AckReceiverWindow with initial_seqno = 1 and
adds the received message (which has a seqno > 1) to it. All subsequent
unicast messages from b1 are added to this same window. When remove is
called at line 470 of UNICAST, it always returns null, because
AckReceiverWindow::next_to_remove is still 1 and every message added to the
window has a seqno > 1.
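
The stall in the last step can be reproduced with a toy model of a receiver
window. This is only a sketch under stated assumptions: ToyReceiverWindow and
its methods are hypothetical names, not the JGroups AckReceiverWindow source,
and the starting seqno 42 is an arbitrary example value. The model only
mirrors the behaviour described above: messages are buffered by seqno, and
remove() hands out only the message whose seqno equals next_to_remove.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the receiver window described in the report.
// It is NOT the JGroups AckReceiverWindow; it only models in-order delivery.
class ToyReceiverWindow {
    private final Map<Long, String> msgs = new HashMap<>();
    private long nextToRemove;

    ToyReceiverWindow(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffer a message under its sequence number (old seqnos are ignored).
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, msg);
        }
    }

    // Return the next in-order message, or null if it has not arrived yet.
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }

    // Number of messages that are buffered but not yet delivered.
    int buffered() {
        return msgs.size();
    }

    public static void main(String[] args) {
        // a4 cleared its entry for b1, so it starts a fresh window at seqno 1.
        ToyReceiverWindow win = new ToyReceiverWindow(1);
        // b1 never reset its sender side and continues at an assumed seqno 42.
        for (long seqno = 42; seqno <= 44; seqno++) {
            win.add(seqno, "msg-" + seqno);
            System.out.println("remove() after adding " + seqno + ": " + win.remove());
        }
        System.out.println("buffered but undeliverable: " + win.buffered());
    }
}
```

In this model every remove() call returns null, since the window keeps
waiting for seqno 1, which b1 will never send, while the buffer keeps growing.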

The result is that a4...a15 will never be able to receive any further unicast
messages from b1. This is reproducible every time.

Our quick fix, which seems to work fine, is to change UNICAST line 453 as
follows (I am not sure whether this introduces other bugs):

entry.received_msgs=new AckReceiverWindow(seqno); 
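
To illustrate the effect of that one-line change, here is a minimal sketch
using a toy window. SeqnoWindow is a hypothetical class, not the JGroups
AckReceiverWindow, and seqno 42 is an assumed example value for b1's first
message after the merge. It shows delivery resuming once the window starts at
the first received seqno, and one possible downside in this toy model: a
message that was sent earlier but arrives later is rejected.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy window, not the JGroups AckReceiverWindow.
class SeqnoWindow {
    private final Map<Long, String> msgs = new HashMap<>();
    private long nextToRemove;

    SeqnoWindow(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Returns false if the seqno is older than what the window expects.
    boolean add(long seqno, String msg) {
        if (seqno < nextToRemove) {
            return false;
        }
        msgs.putIfAbsent(seqno, msg);
        return true;
    }

    // Return the next in-order message, or null if it has not arrived yet.
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }

    public static void main(String[] args) {
        long firstSeqno = 42; // assumed seqno of b1's first message after the merge
        // Mirrors the proposed change: start the window at the received seqno.
        SeqnoWindow win = new SeqnoWindow(firstSeqno);
        win.add(42, "msg-42");
        win.add(43, "msg-43");
        System.out.println(win.remove());
        System.out.println(win.remove());
        // Caveat in this toy model: an older message arriving late is rejected.
        System.out.println("accepted late msg-41: " + win.add(41, "msg-41"));
    }
}
```

Whether the real UNICAST code can see such a late, lower-numbered message
after a merge is not established by this report, so the caveat is only
something to check, not a confirmed defect of the quick fix.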


        


