[jboss-jira] [JBoss JIRA] Commented: (JGRP-1157) GossipRouter: JGroups threads get stuck and stop communicating

Fri Mar 12 15:10:38 EST 2010

    [ https://jira.jboss.org/jira/browse/JGRP-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12519689#action_12519689 ] 

vivek v commented on JGRP-1157:
-------------------------------

Note that this issue is related to TCP, not to the Gossip Router, so the fix for GR may not apply here. We still need to figure out where the multiple TCPConnections for the same destination are coming from. It sounds similar to the GR issue where RouterStub wasn't closing its connections properly.
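
To narrow down where the duplicates come from, it may help to group a snapshot of the open connections by destination and list every destination that has more than one. The sketch below is only an illustration: the Conn class, the helper names and the addresses are hypothetical, and how the snapshot is obtained from the TCP transport is left out because it depends on the JGroups version.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DuplicateConnectionCheck {

    /** Hypothetical snapshot of one open connection (not a JGroups class). */
    public static final class Conn {
        public final InetSocketAddress local;
        public final InetSocketAddress remote;

        public Conn(InetSocketAddress local, InetSocketAddress remote) {
            this.local = local;
            this.remote = remote;
        }
    }

    /** Builds an address without triggering a DNS lookup. */
    public static InetSocketAddress addr(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Returns only those destinations ("host:port") that have two or more connections. */
    public static Map<String, List<Conn>> duplicatesByDestination(List<Conn> conns) {
        Map<String, List<Conn>> byDest = new TreeMap<>();
        for (Conn c : conns) {
            String key = c.remote.getHostString() + ":" + c.remote.getPort();
            byDest.computeIfAbsent(key, k -> new ArrayList<>()).add(c);
        }
        // Drop destinations with a single connection; what remains are the duplicates.
        byDest.values().removeIf(list -> list.size() < 2);
        return byDest;
    }

    public static void main(String[] args) {
        // Made-up addresses, for illustration only.
        List<Conn> conns = new ArrayList<>();
        conns.add(new Conn(addr("10.0.0.1", 40001), addr("10.0.0.2", 7800)));
        conns.add(new Conn(addr("10.0.0.1", 40002), addr("10.0.0.2", 7800)));
        conns.add(new Conn(addr("10.0.0.1", 40003), addr("10.0.0.3", 7800)));

        duplicatesByDestination(conns).forEach((dest, list) ->
            System.out.println(dest + " has " + list.size() + " connections"));
    }
}
```

Taking such a snapshot before and after a simulated outage would show whether the extra connections appear on reconnect.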

> GossipRouter: JGroups threads get stuck and stop communicating
> --------------------------------------------------------------
>
>                 Key: JGRP-1157
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1157
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9
>         Environment: Linux
>            Reporter: vivek v
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.10
>
>         Attachments: jgroups_stack.txt, TCPConnection-1.png, TCPConnection-2.png, threaddump-jg.log
>
>
> We are having a problem where a node gets isolated after an intermittent network outage and is never able to rejoin. Bela suspected an issue with RouterStub and fixed a bug (JGRP-1151), but we were able to reproduce the problem even with JGroups 2.9 GA. It looks like the isolated node becomes unresponsive because all of its JGroups threads hang. Here is how we reproduced the error with 3 nodes (Node A, the coordinator, also running the Gossip Router; Node B; Node C):
> 1) We added WANem between A and B, so there are random disconnects, high packet loss and 200 msec of delay.
> 2) Due to our WANem settings, B intermittently loses connectivity with the GR.
> 3) We restart A and it becomes isolated. A becomes a singleton and never rejoins the group. We see NAKACK on node C, as A is still able to reach C, but not B. C keeps dropping messages from A because A is not in its transmission table.
> 4) We turned on tracing on A, but after a while (a couple of hours) we saw no JGroups trace on A, so we suspected that some of the JGroups threads might have got stuck. We took a thread dump of the Java process on A (attached). As you can see, quite a few JGroups threads are in the waiting state, all of them in TCP.send.
> We are not clear on how or why the JGroups threads would hang. Could outgoing messages be queued up and not moving for some reason?
> The only way to recover was to restart all the nodes, which is not desirable.
> Attached are the stack trace (thread dump) and our protocol stack.
> This JIRA originated from a discussion at http://sourceforge.net/mailarchive/forum.php?thread_name=4B7BC107.9060304@yahoo.com&forum_name=javagroups-users
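
The thread-dump check in step 4 could also be done from inside the JVM, which makes it easier to repeat while the outage is being simulated. The following is a minimal sketch, not part of JGroups: the class and method names of the helper are hypothetical, and matching on a class named TCP with a method named send is an assumption taken from the frames mentioned in the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StuckThreadScan {

    /** True if any frame belongs to a class with the given simple name and the given method. */
    static boolean hasFrame(StackTraceElement[] stack, String simpleClassName, String method) {
        for (StackTraceElement frame : stack) {
            String cls = frame.getClassName();
            boolean classMatches = cls.equals(simpleClassName) || cls.endsWith("." + simpleClassName);
            if (classMatches && frame.getMethodName().equals(method)) {
                return true;
            }
        }
        return false;
    }

    /** Names of all threads in the snapshot that have a matching frame on their stack. */
    static List<String> threadsIn(Map<Thread, StackTraceElement[]> stacks,
                                  String simpleClassName, String method) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : stacks.entrySet()) {
            if (hasFrame(e.getValue(), simpleClassName, method)) {
                names.add(e.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Snapshot of every live thread in this JVM; run inside the affected process.
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : stacks.entrySet()) {
            if (hasFrame(e.getValue(), "TCP", "send")) {
                Thread t = e.getKey();
                System.out.println(t.getName() + " state=" + t.getState());
            }
        }
    }
}
```

If the same thread names keep showing up in successive snapshots, that would support the suspicion that those threads are stuck rather than merely busy.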
