[ http://jira.jboss.com/jira/browse/JGRP-348?page=comments#action_12345910 ]
Bela Ban commented on JGRP-348:
-------------------------------
I think the issue is even more complex: what if b1 didn't suspect a1 and therefore
didn't remove the connection to a1? Say the seqno for the a1 connection is 24. When
b1 sends a message to a1 with seqno=25, a1 would create a new connection with seqno=24 for
b1 (according to your suggested fix), which is okay. However, when a1 sends a message with
seqno=1 to b1, b1 will simply discard it because the next seqno to be expected from a1 is
25.
Another case:
- {a1,a2,a3, b1, a4,a5,a6}. All a's are on the first machine, b1 is on the 2nd
machine. b1 pings a4 in this example
- Plug is pulled
- a3 suspects b1, new view on machine #1 is {a1,a2,a3,a4,a5,a6}. This happens after ca. 10 seconds
- b1 suspects a4, a5 and a6 after 30 seconds, so its view is {a1,a2,a3,b1}
- Then the plug is reinserted
- Issue #1: when a1-3 want to send a message to b1, the seqno is 1 because the connection
is newly created: b1 will discard all seqnos < 25
- Issue #2: when b1 wants to send a message to a1-3, it will send it with an existing
seqno, because the connections to a1-3 were not purged
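The asymmetry behind issues #1 and #2 can be made concrete with a small sketch. It assumes (as the scenario above implies) that a member purges a connection only when the peer drops out of its own local view. The class and method names are hypothetical and are not part of JGroups.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not JGroups code: computes which peer connections a member
// still holds after a view change, assuming connections are purged only for
// members that left the local view.
class ConnectionPurgeSketch {

    static Set<String> view(String... members) {
        return new TreeSet<>(Arrays.asList(members));
    }

    // Peers present in both the old and the new view, minus the member itself.
    static Set<String> retained(String self, Set<String> oldView, Set<String> newView) {
        Set<String> kept = new TreeSet<>(oldView);
        kept.retainAll(newView);
        kept.remove(self);
        return kept;
    }

    public static void main(String[] args) {
        Set<String> full = view("a1", "a2", "a3", "b1", "a4", "a5", "a6");
        // b1 only excluded a4-a6, so its connections to a1-a3 keep their old seqnos (issue #2)
        System.out.println("b1 keeps: " + retained("b1", full, view("a1", "a2", "a3", "b1")));
        // a1 excluded b1, so its connection to b1 is gone and restarts at seqno 1 (issue #1)
        System.out.println("a1 keeps: " + retained("a1", full, view("a1", "a2", "a3", "a4", "a5", "a6")));
    }
}
```

After the merge, b1 still holds state for a1-a3 while none of the a's hold state for b1, which is why the two directions fail in different ways.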
SOLUTION to #1: when a member sends the *first* unicast to a previously excluded member,
it includes in the UnicastHeader a flag that says the receiver should reset that
connection, so b1 would reset its seqno to 1 and therefore accept a1's seqno=1
SOLUTION to #2: when a member sees the first message (e.g. b1's seqno=24) from a
previously excluded member P, it sets its seqno for that member to the one received from
P. PROBLEM: that seqno might *not* be the first one! E.g. if b1 sends seqnos 24 and 25,
and 24 is dropped on the way to a1, then the first seqno a1 sees would be 25, and a1 would
therefore drop a subsequently retransmitted 24!
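A minimal sketch of the two proposed solutions, and of the problem with #2, follows. It is a deliberately simplified model, not the UNICAST implementation: SeqnoReceiver, the reset parameter and adopt() are hypothetical names, and the receiver only tracks the next expected seqno and discards anything below it.

```java
// Hypothetical per-peer receiver state; a simplification, not the JGroups UNICAST code.
class SeqnoReceiver {
    private long nextExpected;

    SeqnoReceiver(long nextExpected) {
        this.nextExpected = nextExpected;
    }

    // Returns false when the message is discarded because its seqno is below the expected one.
    boolean receive(long seqno) {
        if (seqno < nextExpected) return false;
        if (seqno == nextExpected) nextExpected++;
        return true;
    }

    // Solution #1 (assumed flag semantics): a reset flag restarts the connection at seqno 1.
    boolean receive(long seqno, boolean reset) {
        if (reset) nextExpected = 1;
        return receive(seqno);
    }

    // Solution #2: adopt the first seqno seen from a previously excluded member.
    void adopt(long firstSeen) {
        nextExpected = firstSeen;
    }
}

class MergeSeqnoDemo {
    // Issue #1: b1 expects 25 from a1, but a1's new connection starts at 1.
    static boolean issue1WithoutFlag() {
        return new SeqnoReceiver(25).receive(1);
    }

    // Solution #1: the first unicast carries the reset flag, so seqno 1 is accepted.
    static boolean issue1WithFlag() {
        return new SeqnoReceiver(25).receive(1, true);
    }

    // Problem with solution #2: 24 is lost, 25 arrives first and is adopted,
    // so the retransmitted 24 is below the expected seqno and gets discarded.
    static boolean retransmitted24AfterAdopting25() {
        SeqnoReceiver a1 = new SeqnoReceiver(1);
        a1.adopt(25);
        a1.receive(25);
        return a1.receive(24);
    }

    public static void main(String[] args) {
        System.out.println("seqno=1 without reset flag accepted: " + issue1WithoutFlag());
        System.out.println("seqno=1 with reset flag accepted:    " + issue1WithFlag());
        System.out.println("retransmitted 24 accepted:           " + retransmitted24AfterAdopting25());
    }
}
```

In this model the reset flag handles the direction where the sender knows its connection is new, while adopting the first received seqno is only safe if that message really is the oldest one still outstanding.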
UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other
-------------------------------------------------------------------------------------------------
Key: JGRP-348
URL: http://jira.jboss.com/jira/browse/JGRP-348
Project: JGroups
Issue Type: Bug
Affects Versions: 2.4
Reporter: Bela Ban
Assigned To: Bela Ban
Fix For: 2.5
Mail from David Foregt:
Hi Bela,
We still have an issue with UNICAST in JGroups 2.4 after applying your
recommended settings. We spent more time analyzing the issue and found the
exact scenario that causes the problem:
- We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and one
node running on machine B (b1).
- The b1 node is started first (coord), then all the a nodes are started.
When all nodes are active in the group, we disconnect machine A from the
network.
- After ~10 sec all a's see b1 as dead, a new view is propagated to all
a nodes, and the connection table entry for b1 is cleared on all a nodes.
- b1 starts seeing the a nodes as dead one by one, every ~10 sec (as defined
by FD / VERIFY_SUSPECT). After 30 sec b1's view is (a4, a5...a15) and we
reconnect the network cable on machine A. (b1's connection table was cleared
only for a1...a3.)
- After A reconnects to the network, a merge is done and all nodes are back
in the group and able to exchange multicast messages.
- Because b1 did not detect a4...a15 as dead, the seqno is not reset to 1
when it sends a unicast message to those nodes. When a4 receives the first
unicast message from b1, it creates (because its connection table entry for
b1 was cleared) a new AckReceiverWindow with initial_seqno = 1 at line 453
of UNICAST and adds the received message (which has a seqno > 1) to it. All
subsequent unicast messages received from b1 are added to this new
AckReceiverWindow, and when remove is called at line 470 of UNICAST it
always returns null, because AckReceiverWindow::next_to_remove is equal to 1
and the messages we are adding to the AckReceiverWindow have a seqno > 1.
The result is that a4...a15 will never be able to receive any other unicast
msg from b1. This is reproducible all the time.
Our quick fix, which seems to work fine, is to change UNICAST line 453 as
follows (I am not sure about potential bugs introduced by this):
entry.received_msgs=new AckReceiverWindow(seqno);
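The stall described in the mail, and the effect of the quick fix, can be illustrated with a simplified stand-in for the receiver window. SimpleReceiverWindow below is a hypothetical class written for this sketch; it only mimics the add/remove behaviour described above and is not the actual AckReceiverWindow code.

```java
import java.util.TreeMap;

// Hypothetical, simplified stand-in for a receiver window; NOT the JGroups AckReceiverWindow.
class SimpleReceiverWindow {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long nextToRemove;

    SimpleReceiverWindow(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffers the message unless its seqno was already delivered.
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) msgs.putIfAbsent(seqno, msg);
    }

    // Returns the next in-order message, or null if it has not arrived yet.
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) nextToRemove++;
        return msg;
    }
}

class StalledWindowDemo {
    // Feeds 'count' consecutive messages starting at firstSeqno into a window created
    // with initialSeqno and returns how many of them could be delivered in order.
    static int deliver(long initialSeqno, long firstSeqno, int count) {
        SimpleReceiverWindow win = new SimpleReceiverWindow(initialSeqno);
        int delivered = 0;
        for (long s = firstSeqno; s < firstSeqno + count; s++) {
            win.add(s, "msg-" + s);
            while (win.remove() != null) delivered++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Window created with initial seqno 1, sender continues at 25: remove() keeps returning null
        System.out.println("window starts at 1:  delivered " + deliver(1, 25, 3));
        // Quick fix: window created with the first seqno received
        System.out.println("window starts at 25: delivered " + deliver(25, 25, 3));
    }
}
```

The sketch only covers the case where the first message received is also the oldest one outstanding; a lost earlier message would need separate handling.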