[infinispan-issues] [JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance
Mircea Markus (JIRA)
jira-events at lists.jboss.org
Thu May 23 10:43:06 EDT 2013
[ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mircea Markus updated ISPN-3120:
--------------------------------
Affects Version/s: (was: 5.2.6.Final)
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
> Key: ISPN-3120
> URL: https://issues.jboss.org/browse/ISPN-3120
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.3.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: 5.2.x
> Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBAlANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9 at NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9 at NodeG-6339}) returned null!
> {noformat}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
More information about the infinispan-issues
mailing list