Mircea Markus updated ISPN-3120:
--------------------------------
Labels: 5.2.x nbst (was: 5.2.x)
StateConsumerImpl can ignore state received during a rebalance
--------------------------------------------------------------
Key: ISPN-3120
URL: https://issues.jboss.org/browse/ISPN-3120
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 5.3.0.Beta1
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Critical
Labels: 5.2.x, nbst
Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
This causes random failures in ConcurrentOverlappingLeaveTest and
ConcurrentNonOverlappingLeaveTest.
1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8).
Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
3. E reports that it finished receiving state with a REBALANCE_CONFIRM command.
4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in
the new currentCH (otherwise they wouldn't have any owners).
5. The StateConsumerImpl on E requests state for the "lost" segments from G.
6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end
the rebalance (topology 10).
7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
8. While the StateConsumerImpl on E is starting the state transfer, it also receives a
StateResponseCommand for the lost segments from G.
9. Because the structures keeping track of the received state are not yet properly
initialized, E concludes that it has finished receiving state for topology 11.
10. E receives a StateResponseCommand from G with actual data, but it ignores it because
{{StateConsumerImpl.updatedKeys == null}}.
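The interleaving in steps 8-10 can be sketched with a small, self-contained model. This is only an illustration under stated assumptions, not Infinispan's actual StateConsumerImpl: the class {{StateConsumerSketch}}, its methods, and the {{dropStaleResponses}} guard are all hypothetical names, and the guard is one possible mitigation rather than the fix applied for this issue. Only the idea that a null {{updatedKeys}} causes incoming state to be ignored is taken from the description above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, heavily simplified model of the race; NOT the real StateConsumerImpl.
class StateConsumerSketch {
    // Illustrative guard (an assumption): drop responses tagged with an older topology id.
    private final boolean dropStaleResponses;
    private final Map<String, String> data = new HashMap<>();
    // Segments this node still expects to receive in the current rebalance.
    private final Set<Integer> pendingSegments = new HashSet<>();
    // null means "no state transfer in progress", mirroring the check in step 10.
    private Set<String> updatedKeys;
    private int topologyId;

    StateConsumerSketch(boolean dropStaleResponses) {
        this.dropStaleResponses = dropStaleResponses;
    }

    // First half of starting a rebalance: install the topology, enable key tracking.
    void startRebalance(int newTopologyId) {
        topologyId = newTopologyId;
        updatedKeys = new HashSet<>();
    }

    // Second half: register the inbound segments. A response can arrive before this runs.
    void addInboundSegments(Set<Integer> segments) {
        pendingSegments.addAll(segments);
    }

    void onStateResponse(int responseTopologyId, int segment, Map<String, String> entries) {
        if (dropStaleResponses && responseTopologyId < topologyId) {
            return; // guarded variant: a stale response cannot complete the new rebalance
        }
        if (updatedKeys == null) {
            return; // step 10: state is ignored because tracking was already torn down
        }
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            if (!updatedKeys.contains(entry.getKey())) {
                data.put(entry.getKey(), entry.getValue());
            }
        }
        pendingSegments.remove(segment);
        if (pendingSegments.isEmpty()) {
            updatedKeys = null; // step 9: "finished receiving" for the current topology
        }
    }

    String get(String key) {
        return data.get(key);
    }

    // Replays steps 8-10 on node E and returns what E holds for key "k3" afterwards.
    static String replay(boolean dropStaleResponses) {
        StateConsumerSketch e = new StateConsumerSketch(dropStaleResponses);
        e.startRebalance(11);
        // Step 8: a response requested in topology 9 arrives before segments are registered.
        e.onStateResponse(9, 38, Collections.<String, String>emptyMap());
        e.addInboundSegments(Collections.singleton(38));
        // Step 10: the response for topology 11 carrying the actual data.
        e.onStateResponse(11, 38, Collections.singletonMap("k3", "v3"));
        return e.get("k3");
    }

    public static void main(String[] args) {
        System.out.println("unguarded: k3 = " + replay(false));
        System.out.println("guarded:   k3 = " + replay(true));
    }
}
```

In this model the unguarded replay leaves {{k3}} unset on E, which corresponds to the {{dc.get(...) returned null}} assertion in the test, while the guarded replay keeps {{v3}}.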
{noformat}
11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating
local consistent hash(es) for cache dist: new topology = CacheTopology{id=7,
currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027,
NodeG-6339, NodeH-47370]}, pendingCH=null}
11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting
local rebalance for cache dist, topology = CacheTopology{id=8,
currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027,
NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2,
members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving
of segments for cache dist for topology 8.
11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating
local consistent hash(es) for cache dist: new topology = CacheTopology{id=9,
currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027,
NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2,
members=[NodeE-51027, NodeG-6339]}}
11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound
state transfer for segments [38, 36, 47, 44, 45] of cache dist
11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting
local rebalance for cache dist, topology = CacheTopology{id=11,
currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027,
NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2,
members=[NodeE-51027, NodeG-6339]}}
11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling
perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of
segments for cache dist for topology 11.
11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring
consistent hash update 10 for cache dist, we have already received a newer topology 11
11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling
perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State
transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was
already updated by somebody else
11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener]
Test
testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest)
failed.
java.lang.AssertionError: Fail on owner cache NodeE-51027:
dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
{noformat}