[jboss-jira] [JBoss JIRA] Created: (JGRP-782) DistributedTree hangs after updating new joiner

Tom Brophy (JIRA) jira-events at lists.jboss.org
Tue Jun 10 05:47:19 EDT 2008


DistributedTree hangs after updating new joiner
-----------------------------------------------

                 Key: JGRP-782
                 URL: http://jira.jboss.com/jira/browse/JGRP-782
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.6.2
         Environment: Windows XP, Java: Sun jdk1.5.0_13, 
JGroups 2.6.2 (CVS:  $Id: Version.java,v 1.59.2.2 2008/02/25 16:33:46 belaban Exp $)
Also affects 2.6.3CR1

            Reporter: Tom Brophy
         Assigned To: Bela Ban
             Fix For: 2.6.3


We use a DistributedTree that typically holds 10000 nodes per system. Nodes are added and removed while the system is running: typically a large group is added at startup, and nodes are then added and removed intermittently.
Frequently we find that one of the systems locks up at org.jgroups.blocks.GroupRequest.execute(boolean use_anycasting) at the line:
boolean retval=doExecute(use_anycasting, timeout);

This appears to be because the "done" flag is set, but not cleared, possibly by different threads. The test program to be attached is run on two Windows machines, and each adds up to 10000 nodes with a pause of about 2 ms between additions. The following is the common scenario that plays out:
1. Start system 1, which adds nodes.
2. When system 1 has added two or three thousand nodes, start system 2.
3. System 2 is updated with the state of the tree and starts adding nodes.
4. System 1 hangs after sending state to system 2 when it tries to update the tree with a new node.
5. Killing system 2 releases the hang on system 1 and system 1 resumes as if nothing happened.
6. If system 1 did not hang, instead of killing system 2, kill system 1 and then restart it.
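
The hang in step 4 is consistent with a caller that waits on a completion flag which is never set. The following is a minimal, self-contained sketch of that pattern and is not the JGroups source: the class PendingRequest and its methods are hypothetical stand-ins, and only the flag name "done" is taken from this report.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a blocking request; NOT the JGroups GroupRequest class.
class PendingRequest {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition completed = lock.newCondition();
    private boolean done = false; // flag name borrowed from the report

    // Waits until done is set or timeoutMs elapses (timeoutMs <= 0 waits forever).
    // Returns the value of done at the time the wait ends.
    boolean execute(long timeoutMs) {
        lock.lock();
        try {
            long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (!done) {
                if (timeoutMs <= 0) {
                    completed.await();          // no timeout: blocks until signalled
                } else {
                    if (remaining <= 0) break;  // timed out without completion
                    remaining = completed.awaitNanos(remaining);
                }
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return done;
        } finally {
            lock.unlock();
        }
    }

    // A response arrives; the caller is only woken if it marks the request done.
    void receiveResponse(boolean marksDone) {
        lock.lock();
        try {
            if (marksDone) {
                done = true;
                completed.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        PendingRequest stuck = new PendingRequest();
        stuck.receiveResponse(false); // response seen, flag never set
        System.out.println("completed without done flag: " + stuck.execute(200));

        PendingRequest ok = new PendingRequest();
        ok.receiveResponse(true);
        System.out.println("completed with done flag: " + ok.execute(200));
    }
}
```

In this sketch a 200 ms timeout is used so the example terminates; with a timeout of 0 the first call would block indefinitely, which is the kind of behaviour observed on system 1.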

Notes:
The hang seems to depend on timing, network state and possibly other factors, as it does not always occur: occasionally it will occur at every test run, at other times no hangs occur for many runs.
Although there are many nodes, the tree is structured as /serverN/majorKey/minorKey/leaf where the serialisable leaf is relatively small, consisting of a few strings and the Address of the system creating the node.
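
As an illustration of the tree layout described above, the following sketch builds a path of that shape and a small serialisable leaf. All names here (TreePaths, Leaf, its fields, the sample keys) are hypothetical, and the creator's Address is represented by a plain String so that the example does not depend on JGroups.

```java
import java.io.Serializable;

// Hypothetical illustration of the /serverN/majorKey/minorKey/leaf layout.
class TreePaths {

    // Small serialisable payload: a few strings plus the creator's address.
    // The address is a String stand-in for org.jgroups.Address.
    static final class Leaf implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String description;
        final String creatorAddress;

        Leaf(String name, String description, String creatorAddress) {
            this.name = name;
            this.description = description;
            this.creatorAddress = creatorAddress;
        }
    }

    // Builds the fully qualified node name for one leaf.
    static String fqn(int server, String majorKey, String minorKey, String leafName) {
        return "/server" + server + "/" + majorKey + "/" + minorKey + "/" + leafName;
    }

    public static void main(String[] args) {
        Leaf leaf = new Leaf("leaf7", "sample entry", "192.168.0.10:7800");
        System.out.println(fqn(1, "major3", "minor12", leaf.name));
    }
}
```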

When I make a small change to GroupRequest.receiveResponse(Object response_value, Address sender) and change the line reading
                    if(rsp_filter != null && !rsp_filter.needMoreResponses())
to
                    if(rsp_filter == null || !rsp_filter.needMoreResponses())
which causes done = true; to be executed if rsp_filter is null, the test no longer hangs.
However, I do not know if this is a safe change, and even if it is, would prefer to not have to make changes to the jgroups code.
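
The difference between the two conditions can be checked in isolation. The sketch below is not the JGroups code: Filter is a hypothetical one-method interface that mirrors only the needMoreResponses() call quoted above, and the two static methods reproduce just the boolean expressions.

```java
// Compares the original and the modified condition for the three possible filter states.
class DoneCondition {

    // Hypothetical stand-in exposing only the method used in the quoted condition.
    interface Filter {
        boolean needMoreResponses();
    }

    // Condition as it appears in 2.6.2.
    static boolean original(Filter rspFilter) {
        return rspFilter != null && !rspFilter.needMoreResponses();
    }

    // Condition after the change described in this report.
    static boolean modified(Filter rspFilter) {
        return rspFilter == null || !rspFilter.needMoreResponses();
    }

    public static void main(String[] args) {
        Filter wantsMore = () -> true;
        Filter satisfied = () -> false;

        System.out.println("no filter:        original=" + original(null)
                + " modified=" + modified(null));           // false / true
        System.out.println("filter wants more: original=" + original(wantsMore)
                + " modified=" + modified(wantsMore));       // false / false
        System.out.println("filter satisfied:  original=" + original(satisfied)
                + " modified=" + modified(satisfied));       // true / true
    }
}
```

The two expressions differ only when no filter is installed. Whether setting done in that case is safe depends on the rest of receiveResponse, which this sketch does not model.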

The attached test code can be run on two systems with the command
java -cp .;jgroups-all.jar;commons_logging.jar TreeTest <n>
where <n> is 1 or 2 for each of the two systems. It is a stripped-down equivalent of the way we use the DistributedTree.

The test program has the protocol stack as a String within the code.

Finally, the DistributedTree itself issues a System.out.println call whenever a node is added instead of using the logger. In our case this floods our environment with tens of thousands of lines as the nodes are added, and we would appreciate it if this were logged rather than sent to the console.
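
A sketch of the kind of change requested here: the message is sent to a logger at a low level and guarded by a level check, so that nothing is produced unless that level is enabled. The class NodeAddLogging and its method are hypothetical, and java.util.logging is used only as a dependency-free stand-in; it is an assumption that the same idea would apply to whatever logging facade JGroups uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; not taken from DistributedTree.
class NodeAddLogging {
    private static final Logger LOG = Logger.getLogger(NodeAddLogging.class.getName());

    // Logs the addition at FINE level; returns true if the message was passed to the logger.
    static boolean logNodeAdded(String fqn) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("added node " + fqn);
            return true;
        }
        return false; // silent at the default level
    }

    public static void main(String[] args) {
        // With the default configuration FINE is disabled, so nothing is printed here.
        System.out.println("logged: " + logNodeAdded("/server1/major3/minor12/leaf7"));
    }
}
```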
