[jboss-jira] [JBoss JIRA] Created: (JGRP-348) UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
Bela Ban (JIRA)
jira-events at jboss.com
Fri Oct 27 08:34:41 EDT 2006
UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
-------------------------------------------------------------------------------------------------
Key: JGRP-348
URL: http://jira.jboss.com/jira/browse/JGRP-348
Project: JGroups
Issue Type: Bug
Affects Versions: 2.4
Reporter: Bela Ban
Assigned To: Bela Ban
Fix For: 2.5
Mail from David Foregt:
Hi Bela,
We still have an issue with JGroups 2.4 and UNICAST after applying your
recommended settings. We spent more time analyzing the issue and found the
exact scenario that causes the problem:
- We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and one
node running on machine B (b1).
- Node b1 is started first (so it is the coordinator), then all the a nodes
are started.
When all nodes are active in the group, we disconnect machine A from the
network.
- After ~10 sec all a nodes see b1 as dead; a new view is propagated to all
a nodes, and the connection table entry for b1 is cleared on every a node.
- b1 starts seeing the a nodes as dead one by one, every ~10 sec (as defined
by FD / VERIFY_SUSPECT). After 30 sec b1's view is (a4, a5...a15), and we
reconnect the network cable on machine A. (b1's connection table was cleared
for a1...a3 only.)
- After A reconnects to the network, a merge is done; all nodes are back
in the group and are able to exchange multicast messages.
- Because b1 never detected a4...a15 as dead, its seqno for those nodes is
not reset to 1 when it sends them a unicast message. When a4 receives the
first unicast message from b1, it has no entry for b1 (its connection table
was cleared), so at line 453 of UNICAST it creates a new AckReceiverWindow
with initial_seqno = 1 and adds the received message (which has a seqno > 1)
to it. All subsequent unicast messages received from b1 are added to this
new AckReceiverWindow too, and when remove is called at line 470 of UNICAST
it always returns null, because AckReceiverWindow::next_to_remove is equal
to 1 while every message we add to the AckReceiverWindow has a seqno > 1.
The result is that a4...a15 are never again able to receive a unicast
message from b1. This is reproducible every time.
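
The stall described above can be reproduced with a small model of an in-order
receiver window. This is only a sketch: ModelReceiverWindow and
StalledWindowDemo are hypothetical names, and the class is a simplified
stand-in rather than the actual JGroups AckReceiverWindow. It assumes only
what the report states, namely that remove hands out the message whose seqno
equals next_to_remove and returns null otherwise.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a receiver window (hypothetical, not the JGroups class). */
class ModelReceiverWindow {
    private long nextToRemove;                        // seqno we are waiting for
    private final Map<Long, String> msgs = new HashMap<>();

    ModelReceiverWindow(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    /** Buffers a message unless its seqno is already behind the window. */
    boolean add(long seqno, String msg) {
        if (seqno < nextToRemove) {
            return false;
        }
        return msgs.putIfAbsent(seqno, msg) == null;
    }

    /** Returns the next in-order message, or null if it has not arrived yet. */
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }
}

class StalledWindowDemo {
    /** Counts how many of `count` messages, starting at firstSeqno, get delivered. */
    static int deliver(long initialSeqno, long firstSeqno, int count) {
        ModelReceiverWindow win = new ModelReceiverWindow(initialSeqno);
        int delivered = 0;
        for (long seqno = firstSeqno; seqno < firstSeqno + count; seqno++) {
            win.add(seqno, "msg-" + seqno);
            while (win.remove() != null) {
                delivered++;
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Receiver window starts at 1, but the sender is already at seqno 42.
        System.out.println("delivered: " + deliver(1, 42, 5)); // prints "delivered: 0"
    }
}
```

In this model the window keeps waiting for seqno 1, which the sender will
never send again, so every later message is buffered and none is delivered.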
Our quick fix, which looks to work fine, is to change UNICAST line 453 as
follows (I am not sure about potential bugs introduced by this):
entry.received_msgs=new AckReceiverWindow(seqno);
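
The effect of that one-line change can be illustrated with a small model.
This is a sketch under assumptions, not the JGroups code: SeededReceiverWindow
and QuickFixReceiver are hypothetical names, and the per-sender map stands in
for the connection table mentioned above. A window created for an unknown
sender starts at the seqno of the first message received from that sender.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified in-order window (hypothetical stand-in, not the JGroups class). */
class SeededReceiverWindow {
    private long nextToRemove;
    private final Map<Long, String> msgs = new HashMap<>();

    SeededReceiverWindow(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {          // older seqnos are silently dropped
            msgs.putIfAbsent(seqno, msg);
        }
    }

    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }
}

/** Models the quick fix: a new window starts at the first seqno seen. */
class QuickFixReceiver {
    private final Map<String, SeededReceiverWindow> windows = new HashMap<>();
    private final List<String> delivered = new ArrayList<>();

    void receive(String sender, long seqno, String msg) {
        // Assumption: no entry for the sender means the table was cleared,
        // so the window is seeded with the incoming seqno instead of 1.
        SeededReceiverWindow win =
            windows.computeIfAbsent(sender, s -> new SeededReceiverWindow(seqno));
        win.add(seqno, msg);
        String next;
        while ((next = win.remove()) != null) {
            delivered.add(next);
        }
    }

    List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        QuickFixReceiver a4 = new QuickFixReceiver();
        a4.receive("b1", 42, "m42");
        a4.receive("b1", 43, "m43");
        System.out.println(a4.delivered().size() + " messages delivered");
    }
}
```

One thing this model makes visible: if the first message to arrive is not the
oldest outstanding one (for example, seqno 42 arrives before 41), the older
message falls behind the window and is dropped. Whether that can happen in
UNICAST itself is not established here.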