[jboss-jira] [JBoss JIRA] Commented: (JGRP-1157) JGroups threads get stuck and stop communicating

vivek v (JIRA) jira-events at lists.jboss.org
Thu Feb 18 20:36:10 EST 2010


    [ https://jira.jboss.org/jira/browse/JGRP-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12515307#action_12515307 ] 

vivek v commented on JGRP-1157:
-------------------------------

We debugged this issue a little more by attaching a debugger to one of the instances that is having the problem. Here is what I found by stepping into some of the stuck threads:

1) There were two sender threads (Node A -> B and Node A -> C). Both of these sender threads were waiting on queue.take(); basically there was nothing to read from the sender_queue, i.e., no messages.

2) There were lots of other threads waiting on queue.put(). After tracing through those threads I found that the TCPConnection object owning this queue was different from the one used by the sender threads. This TCPConnection was also for A -> B, but it was a different object altogether.

3) I also found that there was no sender thread for this new TCPConnection object (#2). So, basically, all the messages were being written to its queue, but there was no sender thread to read them.
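
To illustrate the symptom in points 2 and 3, here is a minimal, self-contained Java sketch. It is not JGroups code: the class and method names (OrphanQueueDemo, producerBlocksWithoutConsumer) are made up for illustration, and it assumes the send queue is a bounded BlockingQueue, which is what threads parked in queue.put() would suggest. It only shows that once a bounded queue with no consumer fills up, every further put() blocks indefinitely.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class OrphanQueueDemo {

    // Fills a bounded queue that has no consumer and reports whether the
    // producer thread is still stuck in put() after a short wait.
    static boolean producerBlocksWithoutConsumer(int capacity) {
        BlockingQueue<byte[]> sendQueue = new LinkedBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                // capacity + 1 puts: the last one cannot complete because
                // nobody ever calls take() on this queue.
                for (int i = 0; i <= capacity; i++) {
                    sendQueue.put(new byte[0]);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "producer");
        producer.setDaemon(true);
        producer.start();

        try {
            producer.join(500); // give the producer time to fill the queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }

        boolean blocked = producer.isAlive() && sendQueue.remainingCapacity() == 0;
        producer.interrupt(); // release the stuck thread so the demo can exit
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("producer blocked: " + producerBlocksWithoutConsumer(2));
    }
}
```

In a thread dump, such producers show up as waiting threads parked inside put(), which matches what we see for the threads in TCP.send.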

In theory there should be only one TCPConnection (and one sender thread) for each node-to-node connection. In this case, we got two TCPConnection objects for the same node-to-node connection. I'm not sure how that is possible. It looks like one TCPConnection was created and its sender thread started, but later, due to some socket connect problem, it was removed from the TCPConnectionMap while its sender thread stayed alive. Later a new TCPConnection was created and added to the TCPConnectionMap, but for some reason its sender thread never got started. Is it possible that, since both sender threads have the same thread name, starting the second one failed (though that seems unlikely)?
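
As a sketch of the invariant described above (one connection object and one started sender per peer), the following Java example creates the connection and starts its sender inside a single atomic map operation. Everything here is hypothetical: Conn, ConnTable and getConnection are illustrative names, not the actual TCPConnectionMap API, and the code assumes Java 8 or later for lambdas and ConcurrentHashMap.computeIfAbsent. It is not a proposed fix for this issue, only a way to show what "exactly one connection with a running sender" looks like under concurrent callers.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleConnectionDemo {

    // Hypothetical stand-in for a per-peer connection object.
    static final class Conn {
        final String peer;
        Conn(String peer) { this.peer = peer; }
    }

    // Hypothetical connection table keyed by peer address.
    static final class ConnTable {
        private final ConcurrentMap<String, Conn> conns = new ConcurrentHashMap<>();
        final AtomicInteger senderStarts = new AtomicInteger();

        Conn getConnection(String peer) {
            // Creation and "sender start" happen atomically per key, so a
            // connection can never be published without its sender.
            return conns.computeIfAbsent(peer, p -> {
                Conn c = new Conn(p);
                senderStarts.incrementAndGet(); // stands in for starting the sender thread
                return c;
            });
        }
    }

    // Returns {number of distinct Conn objects seen, number of sender starts}.
    static int[] run(int threads) {
        ConnTable table = new ConnTable();
        Set<Conn> seen = ConcurrentHashMap.newKeySet(); // Conn uses identity equality
        CountDownLatch go = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    go.await(); // release all callers at once
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                seen.add(table.getConnection("B"));
            });
            workers[i].start();
        }
        go.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] { seen.size(), table.senderStarts.get() };
    }

    public static void main(String[] args) {
        int[] r = run(16);
        System.out.println("distinct connections: " + r[0] + ", sender starts: " + r[1]);
    }
}
```

By contrast, a check-then-act sequence (look up, create, put, then start the sender as separate steps) leaves windows in which two objects can exist for the same peer, or in which an object is visible in the map before its sender runs.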

I'm attaching the snapshots to prove that we got two TCPConnections for the same peer_addr.

Not sure if this is a race-condition, but this definitely seems like a problem. 

Is there a workaround for this problem, such as not using the sender thread? What are the implications of not using it (use_send_queues="false")?

> JGroups threads get stuck and stop communicating
> ------------------------------------------------
>
>                 Key: JGRP-1157
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1157
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9
>         Environment: Linux
>            Reporter: vivek v
>            Assignee: Bela Ban
>             Fix For: 2.10
>
>         Attachments: jgroups_stack.txt, threaddump-jg.log
>
>
> We are having a problem where a node gets isolated after an intermittent network outage and is never able to rejoin. Bela suspected an issue with RouterStub and fixed a bug, JGRP-1151. But we were able to reproduce this problem even with JGroups 2.9 GA. It looks like the node that gets isolated becomes unresponsive because all its JGroups threads hang. Here is how we reproduced the error with 3 nodes (Node A, the coordinator, also running the Gossip Router; Node B; Node C):
> 1) We added WANem between A and B, so there are random disconnects, high packet loss, and 200 msec of delay.
> 2) Due to our WANem settings, B's connectivity with the GR goes in and out.
> 3) We restart A and it becomes isolated. A becomes a singleton and never rejoins the group. We see NAKACK on node C, as A is still able to reach C, but not B. C keeps dropping messages from A as A is not in its transmission table.
> 4) We turned on tracing on A, but after a while (a couple of hours) we saw no JGroups trace output on A, so we suspected that some of the JGroups threads might have got stuck. We took a thread dump of the Java process on A (attached). As you can see, quite a few JGroups threads are in the waiting state, and all of them are in TCP.send.
> We are not clear on how or why the JGroups threads would hang. Could outgoing messages be queued up and not moving for some reason?
> The only way to fix this was to restart all the nodes, which is not desirable.
> Attached are the stack trace (thread dump) and our protocol stack.
> This JIRA originated from the discussion at http://sourceforge.net/mailarchive/forum.php?thread_name=4B7BC107.9060304@yahoo.com&forum_name=javagroups-users

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


