[jboss-jira] [JBoss JIRA] Commented: (JGRP-1162) TCPGOSSIP leaking RouterStubs causes GossipRouter failures

Mon Mar 1 05:36:10 EST 2010

    [ https://jira.jboss.org/jira/browse/JGRP-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12517227#action_12517227 ] 

Bela Ban commented on JGRP-1162:
--------------------------------

Thanks, Vladimir, for the speedy fix!

> TCPGOSSIP leaking RouterStubs causes GossipRouter failures
> ----------------------------------------------------------
>
>                 Key: JGRP-1162
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1162
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9
>         Environment: Linux, JGroups 2.9 GA
>            Reporter: vivek v
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.10
>
>         Attachments: jgroups_stack.txt
>
>
> We are using JGroups 2.9 GA with TCPGOSSIP and a GossipRouter (GR). On quite a few occasions we noticed node isolation, where one node becomes a singleton and is never able to rejoin. While debugging that problem we found that the GossipRouter sometimes starts publishing a wrong list of nodes to the coordinator. The coordinator needs to call the GR every few seconds to get the list of nodes (this is part of the MERGE2 protocol). TCPGOSSIP is supposed to create only one RouterStub per GR, but whenever there is an exception in the "getMembers" method of RouterStub, it calls disconnect on TCPGOSSIP, which starts the reconnector to create a new RouterStub. The bug is that the old RouterStub never gets cleaned up, neither on the TCPGOSSIP side nor on the GossipRouter.
> The problem we have seen is that an IOException in the "readLoop()" of GossipRouter causes the old socket to be closed and removes the RouterStub's address from the GossipRouter's map (by calling removeEntry()). So now you still have the new RouterStub, but no entry for it in the GR's list. Any time the coordinator asks for the list, it may not find itself in it.
> The problem becomes even more critical if a node goes down, comes up again, and asks the GR for the list: the returned list wouldn't have the coordinator in it, and thus the node may not get its logical address. It may get the view from another node in the list, but still may never be able to join without the right logical address. We have seen that happening, where the coordinator (or another node) keeps saying NAKACK - dropping message.
> Proposed Solution
> --------------------
> 1) When RouterStub signals a state change to "Disconnect" from either "getMembers" or "checkConnection" (usually when an exception is thrown), TCPGOSSIP's "connectionStatusChange()" should call destroy on the RouterStub if the new state is disconnect, so that the old RouterStub is cleaned up.
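
The failure sequence and the proposed cleanup described above can be sketched as a small, self-contained simulation. This is not JGroups code: the classes Router and Stub, the method run(), and the addresses "A" and "B" are hypothetical stand-ins, and the router is assumed to drop an address entry whenever any connection registered under that address closes, without checking which connection currently owns the entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the reported scenario; none of these names are the JGroups API.
class StubLeakSketch {

    /** Stand-in for the GossipRouter: maps a member address to the registering connection. */
    static class Router {
        private final Map<String, Integer> entries = new HashMap<>();

        void register(String address, int connectionId) {
            entries.put(address, connectionId);
        }

        /** Assumed behaviour: the entry is removed by address only, whichever connection closed. */
        void connectionClosed(String address) {
            entries.remove(address);
        }

        Set<String> members() {
            return new TreeSet<>(entries.keySet());
        }
    }

    /** Stand-in for a RouterStub: one connection that registers its address on creation. */
    static class Stub {
        private final String address;
        private final Router router;
        private boolean open = true;

        Stub(int connectionId, String address, Router router) {
            this.address = address;
            this.router = router;
            router.register(address, connectionId);
        }

        /** Idempotent close: the router notices the closed socket and drops the entry. */
        void close() {
            if (open) {
                open = false;
                router.connectionClosed(address);
            }
        }
    }

    /** Replays the sequence from the report and returns the router's final member list. */
    static Set<String> run(boolean destroyOnDisconnect) {
        Router router = new Router();
        router.register("B", 100);               // some other member

        Stub oldStub = new Stub(1, "A", router); // coordinator's first stub

        // An exception in the member lookup triggers the "disconnect" status change.
        if (destroyOnDisconnect) {
            oldStub.close();                     // proposed fix: clean up the old stub right away
        }

        new Stub(2, "A", router);                // reconnector creates and registers a new stub

        // Later, the router's read loop fails on the leaked connection, if it is still open.
        oldStub.close();

        return router.members();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + run(false)); // prints [B]
        System.out.println("with cleanup:    " + run(true));  // prints [A, B]
    }
}
```

In this model, destroying the old stub before the reconnector registers the new one means the late socket failure has nothing left to remove, so the coordinator's address stays in the list. Whether the real GossipRouter removes entries in exactly this way is an assumption of the sketch.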
