[ http://jira.jboss.com/jira/browse/JGRP-659?page=comments#action_12404533 ]
Troy Schulz commented on JGRP-659:
----------------------------------
The following is a description I posted on the newsgroup that may be helpful
in resolving the issue:
I am currently diagnosing a problem with our application where starting
multiple members concurrently fails to properly connect all of the
members into a single group. Since our stress test framework is fairly
involved and it is difficult to isolate issues there, I started by
testing the wrapper class used by our application; once I had replicated
the issue, I moved to an independent Eclipse project containing none of
our application code, so I could rule out our application logic. What I
ended up with is a test that starts a single member, waits a few
seconds, then spins up X additional members. The test then monitors each
of them until they all see each other and have gathered state, or until
10*X seconds pass. So, if there are 10 connections, the group has 100
seconds to become stable, and the members need to see 11 members (10
plus the coordinator). The test runs with X = 2, 5, and 10 concurrent
connections: 2 has a 0% failure rate, 5 fails about 40% of the time, and
10 fails about 60% of the time.
More details are below, and I can provide the project that I used to
replicate the issue, since it has none of our application code in it.
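For illustration, here is a minimal sketch of the shape of that test
against the JGroups 2.x JChannel API (class and cluster names are
invented, and the state-transfer callbacks on the coordinator are
omitted; the real test project is more involved):

import org.jgroups.JChannel;

public class ConcurrentJoinTest {
    public static void main(String[] args) throws Exception {
        final int X = 10;                          // number of concurrent joiners
        JChannel coord = new JChannel("udp.xml");  // coordinator comes up first
        coord.connect("test-cluster");
        Thread.sleep(3000);                        // "waits a few seconds"

        final JChannel[] members = new JChannel[X];
        for (int i = 0; i < X; i++) {
            final int idx = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        members[idx] = new JChannel("udp.xml");
                        members[idx].connect("test-cluster");
                        members[idx].getState(null, 10000); // gather state
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }

        // poll until every member sees X+1 members, or 10*X seconds pass
        long deadline = System.currentTimeMillis() + 10000L * X;
        boolean stable = false;
        while (!stable && System.currentTimeMillis() < deadline) {
            Thread.sleep(500);
            stable = true;
            for (JChannel ch : members)
                if (ch == null || ch.getView() == null || ch.getView().size() != X + 1)
                    stable = false;
        }
        System.out.println(stable ? "group stable" : "FAILED to stabilize");
    }
}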
The cause of the issue is that, in UNICAST, the sender window and
receiver window get out of sync. Sometimes it is from the member to the
coordinator, sometimes the other way around. When UNICAST installs a
new view, it resets the entry holding the sender and receiver windows.
The sequence is something like this:
MemberA gets new view and resets connections
MemberA sends message(seq=1) to MemberB
MemberB expects seq=4 (for instance), so it drops message(seq=1)
MemberB gets new view and resets connections
MemberB requests state from MemberA
MemberA sends state_message(seq=2) to MemberB
MemberB queues state_message(seq=2) and proceeds to wait for message(seq=1)
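To make that sequence concrete, here is a small self-contained
simulation of the window behavior (the classes below are simplified
stand-ins written for illustration, not the actual UNICAST code):

import java.util.SortedMap;
import java.util.TreeMap;

public class WindowDesync {
    static class SenderWindow {
        long next = 1;
        long nextSeqno() { return next++; }
    }

    static class ReceiverWindow {
        long expected = 1;
        final SortedMap<Long, String> queued = new TreeMap<Long, String>();
        void receive(long seq, String msg) {
            if (seq < expected) {                  // looks stale: silently dropped
                System.out.println("dropped " + msg + " (seq=" + seq
                        + ", expected=" + expected + ")");
                return;
            }
            queued.put(seq, msg);
            while (queued.containsKey(expected))   // deliver strictly in order
                System.out.println("delivered " + queued.remove(expected++));
        }
    }

    public static void main(String[] args) {
        SenderWindow a = new SenderWindow();       // MemberA, reset by the new view
        ReceiverWindow b = new ReceiverWindow();
        b.expected = 4;                            // MemberB has not reset yet

        b.receive(a.nextSeqno(), "message");       // seq=1 dropped, B expects 4
        b.expected = 1;                            // MemberB now installs the view
        b.receive(a.nextSeqno(), "state_message"); // seq=2 queued, waiting for seq=1
        System.out.println("B stuck waiting for seq=" + b.expected
                + ", queued=" + b.queued.keySet());
    }
}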
When testing against our application code, MemberA is usually (maybe
always, I am not sure) the coordinator of the largest subgroup and
MemberB is usually (maybe always as well) the new coordinator; this
happens when processing a MergeView. That is most likely just because
these are the only point-to-point messages being generated by the
application logic.
With the test in the self-contained project, the failures are more
spread out, and not always in response to a getState request. My most
recent udp.xml file has FLUSH enabled, and it seems to help fill the
receive window with unprocessed messages.
Is there something in the protocol stack we can add or remove to
alleviate this problem? Is there perhaps something we have done to
inflict this on ourselves with improper message handling or similar?
Any insight would be appreciated.
Tas
PS.
Environment:
I have tried JGroups versions 2.5.0, 2.5.2, and 2.6.2, using the
default udp.xml from the respective versions. All of these versions
exhibit the same behavior. With 2.6.2, I then added FLUSH, changed GMS
max_bundling_time to 250, and changed PING's num_initial_members to 2
(sketched below). Again, none of these changed the behavior.
Using either JDK 1.5 or 1.6.
NOTE: Not all permutations of the above were tested. However, since the
failures were similar, that is probably not an issue.
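For reference, the tweaks to the 2.6.2 udp.xml looked roughly like the
fragment below (only the touched protocols are shown; the surrounding
attributes and protocols stay as shipped, so treat this as a sketch
rather than a complete configuration):

<config>
    <!-- ... UDP and the other protocols as in the default udp.xml ... -->
    <PING timeout="2000" num_initial_members="2"/>
    <!-- ... -->
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                max_bundling_time="250"/>
    <pbcast.STATE_TRANSFER/>
    <pbcast.FLUSH/>
</config>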
Logic:
I can provide a project if you wish to run the test yourself, but the
gist of the logic is that each member caches the membership when it
receives a view change, and if it is not the coordinator it requests
state from the coordinator. With merge views, there is additional logic
for the member to request state from the coordinator of the largest
subgroup, not necessarily from the new view's coordinator. This is how
our original implementation is programmed, so I kept the behavior for
the test (sketched below).
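A minimal sketch of that view-handling logic against the JGroups 2.x
API (helper names are invented, and a real implementation would request
the state off the callback thread rather than block in viewAccepted()):

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.MergeView;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class StateOnViewChange extends ReceiverAdapter {
    private final JChannel channel;
    private volatile View lastView;              // cached membership

    public StateOnViewChange(JChannel channel) { this.channel = channel; }

    public void viewAccepted(View view) {
        lastView = view;
        Address stateProvider;
        if (view instanceof MergeView) {
            // pick the coordinator of the largest subgroup, not
            // necessarily the new view's coordinator
            View largest = null;
            for (Object o : ((MergeView) view).getSubgroups()) {
                View sub = (View) o;
                if (largest == null || sub.size() > largest.size())
                    largest = sub;
            }
            stateProvider = (Address) largest.getMembers().get(0);
        } else {
            // first member of the view is the coordinator
            stateProvider = (Address) view.getMembers().get(0);
        }
        if (!stateProvider.equals(channel.getLocalAddress())) {
            try {
                channel.getState(stateProvider, 10000); // pull state point-to-point
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}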
Merge and UNICAST sequencing problem
------------------------------------
Key: JGRP-659
URL: http://jira.jboss.com/jira/browse/JGRP-659
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6, 2.4, 2.5
Reporter: Vladimir Blagojevic
Assigned To: Bela Ban
Fix For: 2.7
The problem is related to the trashing of the connection table in UNICAST during a
merge. Consider the following scenario:
There are 4 nodes in a cluster: A, B, C, and D. After a network split we have two
islands, A,B and C,D. When the network heals, a MergeView eventually gets installed
in both islands. MergeView installation causes trashing of the UNICAST connection
table [1]. However, if the MergeView gets installed in island A,B at time T and in
island C,D at time T+N msec, and a node from island A,B sends a unicast message in
this N msec time window, then we run into problems with unicast sequencing at C and
D. Why? Because the next message coming from island A,B into C,D will arrive with a
sequence number > 1, while UNICAST sequencing in C,D, after the connection trashing
(from the merge), expects a starting sequence number of 1. This causes UNICAST in C
and/or D to wait forever for the missing messages. The final outcome is thus that no
more unicast messages coming from A and/or B will ever be delivered at C and/or D!
[1] http://jira.jboss.com/jira/browse/JGRP-348
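A compact sketch of that race for illustration (invented names, not the
actual JGroups code): A's sender window toward C resets at time T, while
C's receiver window for A resets only at T+N, so any unicast A sends
inside that window is dropped and everything after it queues forever:

public class MergeRace {
    static long nextSeqFromA = 42; // A -> C seqno before the merge
    static long expectedAtC  = 42; // C's expected seqno from A before the merge

    public static void main(String[] args) {
        nextSeqFromA = 1;          // time T: A,B install the MergeView, table trashed
        long seq = nextSeqFromA++; // A sends a unicast inside the N msec window
        if (seq < expectedAtC)
            System.out.println("C drops seq=" + seq
                    + " as stale (expects " + expectedAtC + ")");
        expectedAtC = 1;           // time T+N: C,D install the MergeView, table trashed
        seq = nextSeqFromA++;      // A's next unicast carries seq=2
        System.out.println("C queues seq=" + seq
                + " and waits forever for seq=1");
    }
}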