DistributedTree hangs after updating new joiner
-----------------------------------------------
Key: JGRP-782
URL: http://jira.jboss.com/jira/browse/JGRP-782
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.2
Environment: Windows XP, Java: Sun jdk1.5.0_13,
JGroups 2.6.2 (CVS: $Id: Version.java,v 1.59.2.2 2008/02/25 16:33:46 belaban Exp $)
Also affects 2.6.3CR1
Reporter: Tom Brophy
Assigned To: Bela Ban
Fix For: 2.6.3
We use a DistributedTree that typically holds about 10000 nodes per system. Nodes are
added and removed while the system is running: usually a large batch is added at
startup, and nodes are then added and removed intermittently thereafter.
Frequently we find that one of the systems locks up at
org.jgroups.blocks.GroupRequest.execute(boolean use_anycasting) at the line:
boolean retval=doExecute(use_anycasting, timeout);
This appears to be because the "done" flag is set, but not cleared, possibly by
different threads. The test program to be attached is run on two Windows machines, and
each adds up to 10000 nodes with a pause of about 2 ms between additions. The following is
the common scenario that plays out:
1 Start system 1, which adds nodes.
2 When system 1 has added two or three thousand nodes, start system 2.
3 System 2 is updated with the state of the tree and starts adding nodes.
4 System 1 hangs after sending state to system 2 when it tries to update the tree with a
new node.
5 Killing system 2 releases the hang on system 1 and system 1 resumes as if nothing
happened.
6 If system 1 did not hang, then instead of killing system 2, kill system 1 and restart
it.
Notes:
The hang seems to depend on timing, network state and possibly other factors, as it does
not always occur; sometimes it occurs on every test run, and at other times no hangs
occur for many runs.
Although there are many nodes, the tree is structured as /serverN/majorKey/minorKey/leaf
where the serialisable leaf is relatively small, consisting of a few strings and the
Address of the system creating the node.
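To illustrate the layout described above, here is a minimal stand-alone sketch. It does not use JGroups: LeafSketch, TreePathSketch and the field names are hypothetical, a plain String stands in for the org.jgroups.Address, and treating the last path component as the leaf's node name is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical leaf payload: a few strings plus the creator's address.
// The real leaf holds an org.jgroups.Address; a String stands in for it here.
class LeafSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    final String label;
    final String detail;
    final String creatorAddress;

    LeafSketch(String label, String detail, String creatorAddress) {
        this.label = label;
        this.detail = detail;
        this.creatorAddress = creatorAddress;
    }
}

class TreePathSketch {
    // Builds a path of the form /serverN/majorKey/minorKey/leaf from its parts.
    static String fqn(int serverNumber, String majorKey, String minorKey, String leafName) {
        return "/server" + serverNumber + "/" + majorKey + "/" + minorKey + "/" + leafName;
    }

    // Size in bytes of a value under standard Java serialization.
    static int serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        LeafSketch leaf = new LeafSketch("leafA", "example entry", "192.168.0.1:7800");
        System.out.println(fqn(1, "majorA", "minorB", "leafA"));
        System.out.println(serializedSize(leaf) + " bytes");
    }
}
```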
When I make a small change to GroupRequest.receiveResponse(Object response_value, Address
sender) and change the line reading
if(rsp_filter != null && !rsp_filter.needMoreResponses())
to
if(rsp_filter == null || !rsp_filter.needMoreResponses())
which causes done = true; to be executed when rsp_filter is null, the test no longer hangs.
However, I do not know whether this is a safe change, and even if it is, I would prefer
not to have to make changes to the JGroups code.
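To make the effect of the two conditions concrete, here is a simplified stand-alone model of just that check. It is not the JGroups source: RspFilterModel, ResponseCheckSketch and both method names are hypothetical, and any other code in GroupRequest that may also set done is not modelled.

```java
// Hypothetical stand-in for the response filter; only the one method used
// in the quoted condition is modelled.
interface RspFilterModel {
    boolean needMoreResponses();
}

class ResponseCheckSketch {

    // Condition as quoted from the original line: with a null filter
    // this check never reports "done".
    static boolean doneWithOriginalCheck(RspFilterModel rspFilter) {
        return rspFilter != null && !rspFilter.needMoreResponses();
    }

    // Condition after the change: with a null filter this check
    // reports "done" as soon as any response arrives.
    static boolean doneWithModifiedCheck(RspFilterModel rspFilter) {
        return rspFilter == null || !rspFilter.needMoreResponses();
    }

    public static void main(String[] args) {
        RspFilterModel wantsMore = () -> true;   // filter still waiting for responses
        RspFilterModel satisfied = () -> false;  // filter has all it needs

        System.out.println("null filter: original=" + doneWithOriginalCheck(null)
                + " modified=" + doneWithModifiedCheck(null));
        System.out.println("wants more:  original=" + doneWithOriginalCheck(wantsMore)
                + " modified=" + doneWithModifiedCheck(wantsMore));
        System.out.println("satisfied:   original=" + doneWithOriginalCheck(satisfied)
                + " modified=" + doneWithModifiedCheck(satisfied));
    }
}
```

In this model the two checks differ only when no filter is set. The modified check then reports done on the first response, regardless of how many responses were expected, which is why I am unsure the change is safe.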
The attached test code can be run on two systems with the command
java -cp .;jgroups-all.jar;commons_logging.jar TreeTest <n>
where <n> is 1 or 2, one for each of the two systems. It is a stripped-down equivalent of
the way we use the DistributedTree.
The test program has the protocol stack as a String within the code.
Finally, the DistributedTree itself issues a System.out.println call whenever a node is
added instead of using the logger. In our case this floods our environment with tens of
thousands of lines as the nodes are added, and it would be appreciated if this were
logged rather than sent to the console.
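As an illustration of the kind of change requested, here is a minimal sketch of a guarded, low-level log call in place of a console print. It uses java.util.logging only to stay self-contained; that is an assumption for the example and not a statement about which logging API DistributedTree uses. NodeAddLoggingSketch and logNodeAdded are hypothetical names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class NodeAddLoggingSketch {
    static final Logger LOG = Logger.getLogger(NodeAddLoggingSketch.class.getName());

    // Logs the addition at FINE level only if that level is enabled.
    // Returns true when a log record was actually submitted.
    static boolean logNodeAdded(String fqn) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("node added: " + fqn);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int submitted = 0;
        // Simulate a large batch of additions, as at startup.
        for (int i = 0; i < 10000; i++) {
            if (logNodeAdded("/server1/majorKey/minorKey/leaf" + i)) {
                submitted++;
            }
        }
        // How many records were submitted depends on the active logging configuration.
        System.out.println("log records submitted: " + submitted);
    }
}
```

With a guard like this, the per-node messages can be switched on or off through the logging configuration instead of always appearing on the console.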