[
https://jira.jboss.org/jira/browse/JGRP-1162?page=com.atlassian.jira.plug...
]
vivek v commented on JGRP-1162:
-------------------------------
Here is a stack trace.
The node throws this exception when calling "getMembers()" on the GR. I am not
sure whether it is caused by "suspect" messages in the pipe.
{noformat}
2010-02-24 13:21:34,379 ERROR [Timer-3,prem-main,manager_10.0.2.73:4576] RouterStub - Router stub RouterStub[localsocket=/10.0.2.73:46751,router_host=mgr-2-73::4575,connected=true] failed sending message to router
java.lang.RuntimeException: class for magic number 636 not found
at org.jgroups.util.Util.readOtherAddress(Util.java:908)
at org.jgroups.util.Util.readAddress(Util.java:879)
at org.jgroups.protocols.PingData.readFrom(PingData.java:131)
at org.jgroups.stack.RouterStub.getMembers(RouterStub.java:213)
at org.jgroups.protocols.TCPGOSSIP.sendGetMembersRequest(TCPGOSSIP.java:166)
at org.jgroups.protocols.Discovery$PingSenderTask$1.run(Discovery.java:487)
at org.jgroups.util.TimeScheduler$RobustRunnable.run(TimeScheduler.java:194)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{noformat}
After this, the reconnector creates a new RouterStub, but the old RouterStub is never
removed.
TCPGOSSIP leaking RouterStubs causes GossipRouter failures
----------------------------------------------------------
Key: JGRP-1162
URL: https://jira.jboss.org/jira/browse/JGRP-1162
Project: JGroups
Issue Type: Bug
Affects Versions: 2.8, 2.9
Environment: Linux, JGroups 2.9 GA
Reporter: vivek v
Assignee: Bela Ban
We are using JGroups 2.9 GA with TCPGOSSIP and a Gossip Router (GR). On quite a few
occasions we have seen node isolation, where one node becomes a singleton and is never
able to rejoin. While debugging that problem we found that the Gossip Router sometimes
starts publishing a wrong list of nodes to the coordinator. The coordinator calls the GR
every few seconds to get the list of nodes (this is part of the MERGE2 protocol).
TCPGOSSIP is supposed to create only one RouterStub per GR, but whenever there is an
exception in the "getMembers" method of RouterStub, the stub calls disconnect on
TCPGOSSIP, which starts the reconnector to create a new RouterStub. The bug is that the
old RouterStub never gets cleaned up, neither on the TCPGOSSIP side nor on the Gossip
Router.
The problem we have seen is that an IOException in the "readLoop()" of GossipRouter
causes the old socket to be closed and removes the RouterStub's address from
GossipRouter's map (by calling removeEntry()). You then still have the new RouterStub,
but no entry for it in the GR's list. Any time the coordinator asks for the list, it may
not find itself in it.
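
The failure mode described above can be illustrated with a toy model. This is only a
sketch: ToyRouter, register() and connectionFailed() are hypothetical names, not the
real GossipRouter code, and the model assumes the router keeps one map entry per address
and removes it by address when any connection for that address fails.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not the real GossipRouter implementation.
class ToyRouter {
    // Assumption: one entry per address, holding the id of the latest connection.
    private final Map<String, Integer> entries = new HashMap<>();

    // A stub registers its address; a later registration overwrites the earlier one.
    void register(String address, int connectionId) {
        entries.put(address, connectionId);
    }

    // Models a read-loop failure on one connection: the entry is removed by
    // address, without checking which connection currently owns it.
    void connectionFailed(String address, int connectionId) {
        entries.remove(address);
    }

    // Returns the registered addresses in sorted order.
    List<String> getMembers() {
        List<String> members = new ArrayList<>(entries.keySet());
        Collections.sort(members);
        return members;
    }
}

class StaleStubDemo {
    public static void main(String[] args) {
        ToyRouter router = new ToyRouter();
        router.register("coordinator", 1);          // original stub
        router.register("node-b", 2);
        router.register("coordinator", 3);          // new stub; stub 1 was never destroyed
        router.connectionFailed("coordinator", 1);  // IOException on the leaked socket
        System.out.println(router.getMembers());    // prints [node-b]
    }
}
```

In this model the coordinator still holds a live connection (id 3), yet it no longer
appears in the member list, which matches the symptom reported here.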
The problem becomes even more critical if a node goes down, comes up again, and asks
the GR for the list. The returned list would not have the coordinator in it, and thus
the node may not get its logical address. It may get the view from another node in the
list, but it still may never be able to join without the right logical address. We have
seen this happen, where the coordinator (or another node) keeps saying NAKACK - dropping
message.
Proposed Solution
--------------------
1) When RouterStub signals a state change to "Disconnect" from either
"getMembers" or "checkConnection" (usually when an exception is thrown),
TCPGOSSIP's "connectionStatusChange()" should call destroy on the RouterStub if the
new state is disconnect, so that the old RouterStub is cleaned up.
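
A minimal sketch of the proposed cleanup, under stated assumptions: ToyStub and
ToyGossipClient are hypothetical stand-ins, not the real org.jgroups.stack.RouterStub
or TCPGOSSIP classes, and the Status enum and method signatures are invented for
illustration. The only point is that the old stub is destroyed before a replacement
is created.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a router stub; not the real JGroups class.
class ToyStub {
    private boolean destroyed;

    // In a real stub this would close the socket and its streams.
    void destroy() { destroyed = true; }

    boolean isDestroyed() { return destroyed; }
}

// Hypothetical stand-in for the discovery protocol that owns the stubs.
class ToyGossipClient {
    enum Status { CONNECTED, DISCONNECTED }

    private final List<ToyStub> stubs = new ArrayList<>();

    // Creates and tracks a new stub (stands in for the reconnector).
    ToyStub connect() {
        ToyStub stub = new ToyStub();
        stubs.add(stub);
        return stub;
    }

    // On disconnect, destroy and forget the old stub before reconnecting,
    // so that at most one stub per router stays alive.
    ToyStub connectionStatusChange(ToyStub stub, Status status) {
        if (status != Status.DISCONNECTED) {
            return stub;
        }
        stub.destroy();
        stubs.remove(stub);
        return connect();
    }

    int liveStubs() { return stubs.size(); }
}

class CleanupDemo {
    public static void main(String[] args) {
        ToyGossipClient client = new ToyGossipClient();
        ToyStub stub = client.connect();
        // Simulate three failures in a row, e.g. exceptions in getMembers.
        for (int i = 0; i < 3; i++) {
            stub = client.connectionStatusChange(stub, ToyGossipClient.Status.DISCONNECTED);
        }
        System.out.println(client.liveStubs()); // prints 1
    }
}
```

Without the destroy and remove calls, the same loop would leave four tracked stubs in
this model, which is the leak the issue describes.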