[jboss-jira] [JBoss JIRA] Commented: (JGRP-1162) TCPGOSSIP leaking RouterStubs causes GossipRouter failures
vivek v (JIRA)
jira-events at lists.jboss.org
Thu Feb 25 20:55:10 EST 2010
[ https://jira.jboss.org/jira/browse/JGRP-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12516876#action_12516876 ]
vivek v commented on JGRP-1162:
-------------------------------
Here is a stack trace.
This exception is thrown by the node when calling "getMembers()" on the GR. I am not sure whether it is caused by "suspect" messages in the pipe:
{noformat}
2010-02-24 13:21:34,379 ERROR [Timer-3,prem-main,manager_10.0.2.73:4576] RouterStub - Router stub RouterStub[localsocket=/10.0.2.73:46751,router_host=mgr-2-73::4575,connected=true] failed sending message to router
java.lang.RuntimeException: class for magic number 636 not found
at org.jgroups.util.Util.readOtherAddress(Util.java:908)
at org.jgroups.util.Util.readAddress(Util.java:879)
at org.jgroups.protocols.PingData.readFrom(PingData.java:131)
at org.jgroups.stack.RouterStub.getMembers(RouterStub.java:213)
at org.jgroups.protocols.TCPGOSSIP.sendGetMembersRequest(TCPGOSSIP.java:166)
at org.jgroups.protocols.Discovery$PingSenderTask$1.run(Discovery.java:487)
at org.jgroups.util.TimeScheduler$RobustRunnable.run(TimeScheduler.java:194)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{noformat}
After this, the reconnector creates a new RouterStub, but the old RouterStub never gets removed.
> TCPGOSSIP leaking RouterStubs causes GossipRouter failures
> ----------------------------------------------------------
>
> Key: JGRP-1162
> URL: https://jira.jboss.org/jira/browse/JGRP-1162
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.8, 2.9
> Environment: Linux, JGroups 2.9 GA
> Reporter: vivek v
> Assignee: Bela Ban
>
> We are using JGroups 2.9 GA with TCPGOSSIP and the Gossip Router. On quite a few occasions we noticed node isolation, where one node becomes a singleton and is never able to rejoin. While debugging that problem we found that the Gossip Router sometimes starts publishing a wrong list of nodes to the coordinator. The coordinator needs to call the GR every few seconds to get the list of nodes (this is part of the Merge2 protocol). TCPGOSSIP is supposed to create only one RouterStub per GR, but any time there is an exception in the "getMembers" method of RouterStub, it calls disconnect on TCPGOSSIP, which starts the reconnector to create a new RouterStub. The bug is that the old RouterStub never gets cleaned up, neither on the TCPGOSSIP side nor on the Gossip Router.
> The problem we have seen is that an IOException in the "readLoop()" of GossipRouter causes the old socket to be closed and the RouterStub's address to be removed from GossipRouter's map (by calling removeEntry()). You then still have the new RouterStub, but there is no entry for it in the GR's list. Any time the coordinator asks for the list, it may not find itself in it.
> The problem becomes even more critical if a node goes down, comes up again, and asks the GR for the list: the returned list would not contain the coordinator, and thus the node may not get its logical address. It may get the view from another node in the list, but it still may never be able to join without the right logical address. We have seen that happen, where the coordinator (or another node) keeps saying "NAKACK - dropping message".
> Proposed Solution
> --------------------
> 1) When RouterStub changes state to "Disconnect" from either "getMembers" or "checkConnection" (usually when an exception is thrown), TCPGOSSIP's "connectionStatusChange()" should call destroy on the RouterStub if the new state is disconnect, so that the old RouterStub is cleaned up.
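
To make the proposed fix concrete, here is a minimal sketch in Java. It is a simplified model and not the actual JGroups code: the classes Stub and Discovery, the destroyOnDisconnect flag and the openStubsAfter() helper are hypothetical stand-ins for RouterStub and TCPGOSSIP, and only the method name connectionStatusChange() is taken from the description above. It assumes that every disconnect makes the reconnector create exactly one replacement stub.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; these are NOT the real JGroups classes.
class StubLeakSketch {

    /** Stand-in for a router stub: it only tracks whether its connection is open. */
    static final class Stub {
        private boolean open = true;
        boolean isOpen() { return open; }
        void destroy() { open = false; }    // stands for closing the socket
    }

    /** Stand-in for the discovery protocol that owns the stubs. */
    static final class Discovery {
        private final List<Stub> stubs = new ArrayList<>();
        private final boolean destroyOnDisconnect;

        Discovery(boolean destroyOnDisconnect) {
            this.destroyOnDisconnect = destroyOnDisconnect;
            stubs.add(new Stub());          // one stub per router at startup
        }

        /** Called when a stub reports a disconnect, e.g. after a failed getMembers(). */
        void connectionStatusChange(Stub failed) {
            if (destroyOnDisconnect) {
                failed.destroy();           // proposed fix: clean up the old stub
                stubs.remove(failed);
            }
            stubs.add(new Stub());          // the reconnector creates a replacement
        }

        Stub current() { return stubs.get(stubs.size() - 1); }

        long openStubs() { return stubs.stream().filter(Stub::isOpen).count(); }
    }

    /** Returns how many stubs are still open after the given number of failures. */
    static long openStubsAfter(int failures, boolean destroyOnDisconnect) {
        Discovery d = new Discovery(destroyOnDisconnect);
        for (int i = 0; i < failures; i++) {
            d.connectionStatusChange(d.current());
        }
        return d.openStubs();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + openStubsAfter(3, false));
        System.out.println("with cleanup:    " + openStubsAfter(3, true));
    }
}
```

In this model, three failures without cleanup leave four open stubs (three stale ones plus the current one), while destroying the old stub on each disconnect keeps the count at one. The model does not cover the Gossip Router side, where the stale connection's entry also has to go away.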