Dan Berindei commented on ISPN-8240:
------------------------------------
One side-effect is that nodes now confirm the rebalance after every leave, leading to lots
of error messages like this one:
{noformat}
09:50:05,569 WARN (remote-thread-test-NodeA-p2-t2:[dist]) [CacheTopologyControlCommand]
ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=dist,
type=REBALANCE_PHASE_CONFIRM, sender=test-NodeC-41478, joinInfo=null, topologyId=16,
rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null,
actualMembers=null, throwable=null, viewId=4}
org.infinispan.commons.CacheException: Received invalid rebalance confirmation from
test-NodeC-41478 for cache dist, expecting topology id 17 but got 16
	at org.infinispan.topology.RebalanceConfirmationCollector.confirmPhase(RebalanceConfirmationCollector.java:41) ~[classes/:?]
	at org.infinispan.topology.ClusterCacheStatus.confirmRebalancePhase(ClusterCacheStatus.java:337) ~[classes/:?]
	at org.infinispan.topology.ClusterTopologyManagerImpl.handleRebalancePhaseConfirm(ClusterTopologyManagerImpl.java:274) ~[classes/:?]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:189) ~[classes/:?]
	at org.infinispan.topology.CacheTopologyControlCommand.invokeAsync(CacheTopologyControlCommand.java:166) ~[classes/:?]
	at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.invokeReplicableCommand(GlobalInboundInvocationHandler.java:174) ~[classes/:?]
	at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.runReplicableCommand(GlobalInboundInvocationHandler.java:155) ~[classes/:?]
	at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.lambda$handleReplicableCommand$1(GlobalInboundInvocationHandler.java:149) ~[classes/:?]
	at org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:203) [classes/:?]
{noformat}
Coordinator sends REBALANCE_START command when there is already a rebalance in progress
---------------------------------------------------------------------------------------
Key: ISPN-8240
URL: https://issues.jboss.org/browse/ISPN-8240
Project: Infinispan
Issue Type: Bug
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Minor
Normally the {{REBALANCE_START}} command should only be sent at the start of a rebalance,
and any topology updates sent before all the nodes confirm the rebalance phase should use
the {{CH_UPDATE}} type instead.
Since the change to 4 phases, this is no longer true: {{ClusterCacheStatus.updateTopologyMembers}}
first clears the {{RebalanceConfirmationCollector}}, then broadcasts a {{CH_UPDATE}}.
{{queueRebalance}} then immediately creates a new {{RCC}} and broadcasts a
{{REBALANCE_START}}, instead of waiting for the current rebalance to finish.
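To make the race concrete, here is a minimal standalone model of it in plain Java. It is
not the actual Infinispan code; the {{ConfirmationCollector}} class and node names below
are illustrative. It shows why a confirmation that was already in flight when the
collector was replaced fails exactly like the WARN above:
{code:java}
import java.util.HashSet;
import java.util.Set;

public class RebalanceConfirmationRace {
   static class ConfirmationCollector {
      final int topologyId;
      final Set<String> pendingNodes;

      ConfirmationCollector(int topologyId, Set<String> members) {
         this.topologyId = topologyId;
         this.pendingNodes = new HashSet<>(members);
      }

      void confirmPhase(String node, int confirmedTopologyId) {
         if (confirmedTopologyId != topologyId) {
            throw new IllegalStateException("Received invalid rebalance confirmation from " + node +
                  ", expecting topology id " + topologyId + " but got " + confirmedTopologyId);
         }
         pendingNodes.remove(node);
      }
   }

   public static void main(String[] args) {
      // Rebalance in progress for topology 16, waiting for confirmations.
      ConfirmationCollector collector = new ConfirmationCollector(16, Set.of("NodeA", "NodeC"));

      // A node leaves: updateTopologyMembers clears the collector and
      // queueRebalance immediately creates a new one for topology 17.
      collector = new ConfirmationCollector(17, Set.of("NodeA", "NodeC"));

      // NodeC's confirmation for topology 16 was already on the wire: it now fails.
      collector.confirmPhase("NodeC", 16);
   }
}
{code}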
I propose we remove {{REBALANCE_START}}, as it was just a crude version of
{{CacheTopology.Phase}}. We should also remove the {{isRebalance}} parameter from
{{StateConsumerImpl.onTopologyUpdate()}}.
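A rough sketch of that direction, assuming the phase names from the 4-phase design; the
class and method below are illustrative, not the real {{StateConsumerImpl}} API:
{code:java}
// Dispatch on the topology phase instead of an isRebalance flag.
public class PhaseBasedConsumer {
   enum Phase { NO_REBALANCE, READ_OLD_WRITE_ALL, READ_ALL_WRITE_ALL, READ_NEW_WRITE_ALL }

   void onTopologyUpdate(Phase phase, int topologyId) {
      if (phase == Phase.READ_OLD_WRITE_ALL) {
         // First rebalance phase: start requesting segments from the old owners.
      } else {
         // Later phases and plain CH updates: just install the new read/write
         // consistent hashes; no separate isRebalance parameter needed.
      }
   }
}
{code}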
I'm still not sure if rebalancing the pending CH immediately is ok. On the one hand,
I would like the rebalance to finish with {{updateMembers(union(currentCH, pendingCH))}}
as the new pending CH, so that segments that were already transferred keep an extra copy.
On the other hand, that would only help for segments that have at least one owner in the
current CH: if the current CH has 0 owners and {{updateMembers}} allocates new ones, those
new owners won't request data from the pending CH owners anyway. Fixing that case
would require the coordinator to fetch the transfer status from all the nodes before
removing a node from the topology. But if the coordinator knew exactly which segments were
transferred, it could finish the rebalance immediately and start a new one -- so it would
be more similar to the current approach.
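To make the two cases concrete, here is a toy sketch with owners as plain maps rather
than real {{ConsistentHash}} objects; the segment numbers and node names are made up:
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UnionPendingChSketch {
   public static void main(String[] args) {
      // Owners per segment after node B left the cluster.
      Map<Integer, List<String>> currentCh = Map.of(
            0, List.of("A"),        // A still owns segment 0
            1, List.of());          // B was the only owner of segment 1
      Map<Integer, List<String>> pendingCh = Map.of(
            0, List.of("A", "C"),   // C already received segment 0
            1, List.of("C"));       // C was still receiving segment 1 from B

      for (int segment : List.of(0, 1)) {
         Set<String> union = new LinkedHashSet<>(currentCh.get(segment));
         union.addAll(pendingCh.get(segment));
         System.out.printf("segment %d -> union owners %s%n", segment, new ArrayList<>(union));
      }
      // segment 0 -> [A, C]: the already-transferred copy on C is preserved.
      // segment 1 -> [C]: C may hold only partial data, and any new owners that
      // updateMembers allocates would not fetch from C under the current logic.
   }
}
{code}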
Note: the {{SyncConsistentHashFactory}} allocation is not 100% stable, even when nodes
are not added, so A ∈ owners(segment) in topology ABCD does not guarantee that A ∈
owners(segment) in topology ABC. But it should be good enough to keep A an owner in 90% of
the cases.