[jboss-jira] [JBoss JIRA] Updated: (JGRP-1157) JGroups threads get stuck and stop communicating
vivek v (JIRA)
jira-events at lists.jboss.org
Wed Feb 17 14:25:10 EST 2010
[ https://jira.jboss.org/jira/browse/JGRP-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
vivek v updated JGRP-1157:
--------------------------
Attachment: threaddump-jg.log, jgroups_stack.txt
Adding attachments
> JGroups threads get stuck and stop communicating
> ------------------------------------------------
>
> Key: JGRP-1157
> URL: https://jira.jboss.org/jira/browse/JGRP-1157
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.8, 2.9
> Environment: Linux
> Reporter: vivek v
> Assignee: Bela Ban
> Attachments: jgroups_stack.txt, threaddump-jg.log
>
>
> We are having a problem where a node gets isolated after an intermittent network outage and is never able to rejoin the group. Bela suspected an issue with RouterStub and fixed a bug (JGRP-1151), but we were able to reproduce the problem even with JGroups 2.9 GA. It looks like the isolated node becomes unresponsive because all of its JGroups threads hang. Here is how we reproduced the error with 3 nodes (Node A, the coordinator, which also runs the Gossip Router; Node B; Node C):
> 1) We added WANem between A and B, so that there are random disconnects, high packet loss, and 200 ms of delay.
> 2) Due to our WANem settings, B intermittently loses connectivity with the Gossip Router (GR).
> 3) We restart A and it becomes isolated. A becomes a singleton and never rejoins the group. We see NAKACK on node C, as A is still able to reach C but not B. C keeps dropping messages from A because A is not in its transmission table.
> 4) We turned on tracing on A, but after a while (a couple of hours) we saw no more JGroups trace output on A, so we suspected that some of the JGroups threads had got stuck. We took a thread dump of the Java process on A (attached). As you can see, quite a few JGroups threads are in the waiting state, and all of them are in TCP.send.
> We are not clear on how or why the JGroups threads would hang. Could outgoing messages be queued up and not moving for some reason?
> The only way to recover was to restart all the nodes, which is not desirable.
> Attached are the stack trace (thread-dump) and our protocol stack.
> This JIRA originated from a discussion at http://sourceforge.net/mailarchive/forum.php?thread_name=4B7BC107.9060304@yahoo.com&forum_name=javagroups-users
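
The check described in step 4 (looking for threads that are waiting inside TCP.send) can also be done from inside the JVM with the standard java.lang.management API. The following is only a minimal sketch, not part of JGroups: the class name StuckSendScan and the method findParkedIn are hypothetical, and the ".TCP" / "send" filter is an assumption based on the frames mentioned in the report. It only sees threads of the JVM it runs in, so it would have to be invoked from within the node's own process. Threads blocked in a native socket write are normally reported as RUNNABLE and would be missed by this filter.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class StuckSendScan {

    /**
     * Returns "name [STATE]" for every live thread that is not RUNNABLE and has
     * a stack frame whose class name ends with classSuffix and whose method
     * name equals methodName.
     */
    public static List<String> findParkedIn(String classSuffix, String methodName) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> hits = new ArrayList<>();
        // false, false: skip lock details; stack traces are still included.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() == Thread.State.RUNNABLE) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().endsWith(classSuffix)
                        && frame.getMethodName().equals(methodName)) {
                    hits.add(info.getThreadName() + " [" + info.getThreadState() + "]");
                    break; // one hit per thread is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed filter: frames of a class named TCP, method send.
        for (String hit : findParkedIn(".TCP", "send")) {
            System.out.println(hit);
        }
    }
}
```

Running such a scan periodically and logging the result would show whether the number of waiting senders grows over time, which is one way to tell a slow connection apart from one that has stopped moving altogether.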