[infinispan-issues] [JBoss JIRA] (ISPN-3791) Silence "Received invalid rebalance confirmation from NodeX" exceptions

Friday, 6 December 2013

Dan Berindei created ISPN-3791:
----------------------------------

             Summary: Silence "Received invalid rebalance confirmation from NodeX" exceptions
                 Key: ISPN-3791
                 URL: https://issues.jboss.org/browse/ISPN-3791
             Project: Infinispan
          Issue Type: Bug
          Components: State transfer
    Affects Versions: 6.0.0.Final
            Reporter: Dan Berindei
            Assignee: Dan Berindei
            Priority: Minor

When the coordinator shuts down, it tries to shut down each of its caches first. This
triggers a rebalance for the rest of the members, but the rebalance usually finishes only
after the coordinator's channel also shuts down. 

The nodes that finish their state transfer then send a REBALANCE_CONFIRM command to
the new coordinator, but the new coordinator doesn't know about that rebalance (it
will start the rebalance process from scratch). This results in exceptions like the
following in the new coordinator's log:

{noformat}
12:36:04,977 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (remote-thread-2,ISPN-Node-1) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=MyCoolCache, type=REBALANCE_CONFIRM, sender=ISPN-Node-3-54019, joinInfo=null, topologyId=8, currentCH=null, pendingCH=null, throwable=null, viewId=4}: org.infinispan.commons.CacheException: Received invalid rebalance confirmation from ISPN-Node-3-54019 for cache MyCoolCache, we don't have a rebalance in progress
	at org.infinispan.topology.ClusterTopologyManagerImpl.handleRebalanceCompleted(ClusterTopologyManagerImpl.java:190) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:147) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]
{noformat}

A simple way to avoid these warnings would be to keep track, on each node, of the
coordinator that initiated a particular rebalance, and only send the confirmation
message to that coordinator. The same warnings seem to appear on the old coordinator
when it receives a confirmation after its ClusterTopologyManager has started shutting
down, so we may need another check there.
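
As a rough illustration of the first idea, here is a minimal sketch of the node-side
bookkeeping. RebalanceOriginTracker and its methods are hypothetical names, not existing
Infinispan classes, and node addresses are plain strings for brevity. It also assumes
that a confirmation is simply dropped when the initiating coordinator is no longer the
current coordinator.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Infinispan: remembers, per cache, which
// coordinator started the rebalance the local node is working on.
final class RebalanceOriginTracker {
    // cache name -> address of the coordinator that initiated the rebalance
    private final Map<String, String> initiators = new ConcurrentHashMap<>();

    // Called when a rebalance start command arrives from a coordinator.
    void onRebalanceStart(String cacheName, String coordinator) {
        initiators.put(cacheName, coordinator);
    }

    // Called when local state transfer finishes. Returns the node to confirm to,
    // or empty if the initiator is unknown or is no longer the coordinator
    // (assumption: in that case the confirmation is skipped, not redirected).
    Optional<String> confirmationTarget(String cacheName, String currentCoordinator) {
        String initiator = initiators.remove(cacheName);
        if (initiator != null && initiator.equals(currentCoordinator)) {
            return Optional.of(initiator);
        }
        return Optional.empty();
    }
}
```

With this in place, a node whose rebalance was started by a coordinator that has since
left would get an empty result and send nothing, so the new coordinator would never see
the stale confirmation.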

A more ambitious approach would be to keep the old rebalance when the new coordinator
takes over, and have another round in the cluster state recovery asking if any members
have already sent REBALANCE_CONFIRM commands (after the new coordinator is ready to
process those commands). This should eliminate the duplicate state transfer that happens
now.
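
The second idea could look roughly like the sketch below on the new coordinator.
RecoveredRebalance and applyRecoveryReport are hypothetical names, and the shape of the
extra recovery round (each member reporting the topology id it has already confirmed) is
an assumption, not the existing recovery protocol.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical coordinator-side state: a rebalance inherited from the previous
// coordinator instead of being restarted from scratch.
final class RecoveredRebalance {
    private final int topologyId;
    // members whose confirmation for topologyId is still outstanding
    private final Set<String> awaiting;

    RecoveredRebalance(int topologyId, Collection<String> members) {
        this.topologyId = topologyId;
        this.awaiting = new HashSet<>(members);
    }

    // Assumed extra recovery round: a member reports the topology id it has
    // already confirmed. Reports for any other topology id are ignored.
    void applyRecoveryReport(String member, int confirmedTopologyId) {
        if (confirmedTopologyId == topologyId) {
            awaiting.remove(member);
        }
    }

    // True once every member has confirmed the inherited rebalance.
    boolean isComplete() {
        return awaiting.isEmpty();
    }
}
```

Once every member has reported a matching confirmation, the inherited rebalance can be
treated as finished without a second state transfer.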

