[jboss-jira] [JBoss JIRA] Updated: (JGRP-782) DistributedTree hangs after updating new joiner

Mon Sep 8 15:01:41 EDT 2008

     [ https://jira.jboss.org/jira/browse/JGRP-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Blagojevic updated JGRP-782:
-------------------------------------

    Fix Version/s: 2.7
                       (was: 2.6.4)


> DistributedTree hangs after updating new joiner
> -----------------------------------------------
>
>                 Key: JGRP-782
>                 URL: https://jira.jboss.org/jira/browse/JGRP-782
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>         Environment: Windows XP, Java: Sun jdk1.5.0_13, 
> JGroups 2.6.2 (CVS:  $Id: Version.java,v 1.59.2.2 2008/02/25 16:33:46 belaban Exp $)
> Also affects 2.6.3CR1
>            Reporter: Tom Brophy
>            Assignee: Bela Ban
>            Priority: Minor
>             Fix For: 2.7
>
>         Attachments: TreeTest.java
>
>
> We use a DistributedTree which can typically have 10000 nodes per system. The nodes are added and removed while the system is running: typically they are added in a large group at startup, and then added and removed intermittently thereafter.
> Frequently we find that one of the systems locks up at org.jgroups.blocks.GroupRequest.execute(boolean use_anycasting) at the line:
> boolean retval=doExecute(use_anycasting, timeout);
> This appears to be because the "done" flag is set, but not cleared, possibly by different threads. The attached test program is run on two Windows machines, and each adds up to 10000 nodes with a pause of about 2 ms between additions. The following is the common scenario that plays out:
> 1. Start system 1, which adds nodes.
> 2. When system 1 has added two or three thousand nodes, start system 2.
> 3. System 2 is updated with the state of the tree and starts adding nodes.
> 4. System 1 hangs after sending state to system 2 when it tries to update the tree with a new node.
> 5. Killing system 2 releases the hang on system 1 and system 1 resumes as if nothing happened.
> 6. If system 1 did not hang, instead of killing system 2, kill system 1 and then restart it.
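
Steps 4 and 5 are consistent with a caller blocked on a completion flag that nothing sets until something else changes. The report does not show the waiting code, so the following is only a minimal, generic Java sketch of that pattern and not the GroupRequest implementation; the names DoneFlagWaitSketch, markDone and awaitDone are hypothetical. It uses a bounded wait, so a missing completion signal shows up as a false return value instead of a hang.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a request that waits on a "done" flag.
// This is NOT JGroups code; it only illustrates the waiting pattern.
class DoneFlagWaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition completed = lock.newCondition();
    private boolean done; // guarded by lock

    // Called by whichever thread decides the request is complete.
    void markDone() {
        lock.lock();
        try {
            done = true;
            completed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits until the flag is set or the timeout elapses.
    // Returns true only if the flag was observed set.
    boolean awaitDone(long timeoutMillis) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (!done) {
                if (remainingNanos <= 0) {
                    return false; // nobody set the flag in time
                }
                try {
                    remainingNanos = completed.awaitNanos(remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

With this sketch, new DoneFlagWaitSketch().awaitDone(100) returns false because nothing ever calls markDone(); an unbounded wait in the same situation would block indefinitely, which is the kind of symptom described above.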
> Notes:
> The hang seems to depend on timing, network states and possibly other factors, as it does not always occur: sometimes it occurs on every test run, at other times no hangs occur for many runs.
> Although there are many nodes, the tree is structured as /serverN/majorKey/minorKey/leaf, where the serialisable leaf is relatively small, consisting of a few strings and the Address of the system creating the node.
> When I make a small change to GroupRequest.receiveResponse(Object response_value, Address sender) and change the line reading
>                     if(rsp_filter != null && !rsp_filter.needMoreResponses())
> to
>                     if(rsp_filter == null || !rsp_filter.needMoreResponses())
> which causes done = true; to be executed if rsp_filter is null, the test no longer hangs.
> However, I do not know if this is a safe change, and even if it is, would prefer to not have to make changes to the jgroups code.
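
To make the effect of that one-operator change concrete, here is a small Java sketch that isolates the two boolean conditions quoted above. It is an illustration only: FilterStandIn is a hypothetical stand-in that models just the needMoreResponses() call, and the sketch says nothing about whether other code paths in GroupRequest set done when no filter is present.

```java
// Hypothetical stand-in for the response filter; only the method used in
// the quoted condition is modeled. This is not the JGroups interface.
interface FilterStandIn {
    boolean needMoreResponses();
}

class DoneConditionSketch {
    // Condition as quoted from the unmodified code:
    // true only when a filter exists and it needs no more responses.
    static boolean originalMarksDone(FilterStandIn rspFilter) {
        return rspFilter != null && !rspFilter.needMoreResponses();
    }

    // Condition after the reporter's change:
    // additionally true when there is no filter at all.
    static boolean modifiedMarksDone(FilterStandIn rspFilter) {
        return rspFilter == null || !rspFilter.needMoreResponses();
    }
}
```

The two versions agree whenever a filter is present and differ only for a null filter, where the original condition is false and the modified one is true. Whether marking the request done in that case is safe in every response mode is exactly the open question raised above.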
> The attached test code can be run on two systems with the command
> java -cp .;jgroups-all.jar;commons_logging.jar TreeTest <n>
> where <n> is 1 or 2 for each of the two systems. It is a stripped-down equivalent of the way we use the DistributedTree.
> The test program has the protocol stack as a String within the code.
> Finally, the DistributedTree itself issues a System.out.println call whenever a node is added instead of using the logger. In our case this floods our environment with tens of thousands of lines as the nodes are added, and it would be appreciated if this were logged rather than sent to the console.
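
The logging request can be sketched as follows. This is a hedged illustration and not a patch to DistributedTree: it uses java.util.logging so that it stays self-contained (the logging facade JGroups itself uses is not shown in this report), and the class, logger name and nodeAdded method are hypothetical. The point is only that a per-node message emitted at a fine-grained level is suppressed unless that level is enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch: report node additions through a logger
// instead of printing every one to the console.
class NodeAddLoggingSketch {
    static final Logger LOG = Logger.getLogger("sketch.DistributedTree");

    // Stand-in for the place where a node-added message is produced.
    static void nodeAdded(String fqn) {
        // The level check avoids building the message when it would be dropped.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("added node " + fqn);
        }
    }

    // Small in-memory handler so the effect of the level can be observed.
    static class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            messages.add(record.getMessage());
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }
}
```

With the logger level at INFO, calls to nodeAdded produce no output at all; lowering the level to FINE turns the per-node messages back on for debugging.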
