[jboss-jira] [JBoss JIRA] Commented: (JGRP-1157) TCP: JGroups threads get stuck and stop communicating

Tue Jul 6 03:48:46 EDT 2010

    [ https://jira.jboss.org/browse/JGRP-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12537968#action_12537968 ] 

Bela Ban commented on JGRP-1157:
--------------------------------

Vladimir made some changes to TCPConnectionMap which ensure that the receiver and sender threads always terminate together; i.e., it is no longer possible for a sender thread to terminate while the receiver thread continues. The change was to close the connection on sender thread termination (this was already in place for the receiver thread).
Once the connection is closed, the peer has to re-establish it the next time it tries to send data.
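
To illustrate the idea (this is only a minimal sketch, not the actual TCPConnectionMap code): the class PairedConnection, its methods, and the in-memory queues standing in for the socket are all hypothetical names invented for this example. Each thread calls close() in a finally block, and close() wakes up the other thread, so neither one can outlive the other. With a real java.net.Socket, closing the socket would play the role of the interrupt used here, since a read blocked on a closed socket fails with an exception.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch of a connection whose sender and receiver threads terminate together. */
class PairedConnection {
    private final BlockingQueue<String> outgoing = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> delivered = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> incoming;  // stands in for the socket's input side
    private final Consumer<String> wire;           // stands in for the socket's output side
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Thread sender = new Thread(this::sendLoop, "sender");
    private final Thread receiver = new Thread(this::receiveLoop, "receiver");

    PairedConnection(BlockingQueue<String> incoming, Consumer<String> wire) {
        this.incoming = incoming;
        this.wire = wire;
    }

    void start() {
        sender.start();
        receiver.start();
    }

    /** Queues a message; fails once the connection is closed so the caller can reconnect. */
    void send(String msg) {
        if (closed.get()) throw new IllegalStateException("connection closed");
        outgoing.add(msg);
    }

    boolean isOpen() {
        return !closed.get();
    }

    /** Idempotent: the first caller marks the connection closed and wakes both threads. */
    void close() {
        if (closed.compareAndSet(false, true)) {
            sender.interrupt();
            receiver.interrupt();
        }
    }

    private void sendLoop() {
        try {
            while (!closed.get()) {
                wire.accept(outgoing.take()); // may throw to simulate a write failure
            }
        } catch (InterruptedException | RuntimeException e) {
            // sender is terminating; cleanup happens in finally
        } finally {
            close(); // sender termination closes the connection, which stops the receiver
        }
    }

    private void receiveLoop() {
        try {
            while (!closed.get()) {
                delivered.add(incoming.take()); // hand the message to the upper layer
            }
        } catch (InterruptedException e) {
            // receiver is terminating; cleanup happens in finally
        } finally {
            close(); // receiver termination closes the connection, which stops the sender
        }
    }

    /** Waits up to the given time per thread; returns true if both threads have terminated. */
    boolean awaitTermination(long millis) {
        try {
            sender.join(millis);
            receiver.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !sender.isAlive() && !receiver.isAlive();
    }
}
```

In this sketch, a send() on a closed connection throws, which is the point at which a caller would create a fresh connection, analogous to the peer re-establishing the connection when it next tries to send data.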

> TCP: JGroups threads get stuck and stop communicating
> -----------------------------------------------------
>
>                 Key: JGRP-1157
>                 URL: https://jira.jboss.org/browse/JGRP-1157
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9
>         Environment: Linux
>            Reporter: vivek v
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.10
>
>         Attachments: jgroups_stack.txt, TCPConnection-1.png, TCPConnection-2.png, threaddump-jg.log
>
>
> We are having a problem where a node gets isolated after an intermittent network outage and is never able to rejoin. Bela suspected an issue with RouterStub and fixed a bug (JGRP-1151), but we were able to reproduce the problem even with JGroups 2.9 GA. It looks like the isolated node becomes unresponsive because all of its JGroups threads hang. Here is how we reproduced the error with 3 nodes (Node A - coordinator, also running the Gossip Router; Node B; Node C):
> 1) We added WANem between A and B, so there are random disconnects, high packet loss and 200 msec of delay
> 2) Due to our WANem settings, B intermittently loses connectivity with the GR
> 3) We restart A and it becomes isolated. A becomes a singleton and never rejoins the group. We see NAKACK on node C, as A is still able to reach C but not B. C keeps dropping messages from A because A is not in its transmission table.
> 4) We turned on tracing on A, but after a while (a couple of hours) we saw no JGroups trace on A, so we suspected that some of the JGroups threads might have got stuck. We took a thread dump of the Java process on A (attached). As you can see, quite a few JGroups threads are in the waiting state, all of them in TCP.send
> We are not clear on how or why the JGroups threads would hang. Could outgoing messages be queued up and not moving for some reason?
> The only way to fix this was to restart all the nodes, which is not desirable.
> Attached are the stack trace (thread dump) and our protocol stack.
> This JIRA originated from the discussion at http://sourceforge.net/mailarchive/forum.php?thread_name=4B7BC107.9060304@yahoo.com&forum_name=javagroups-users
