[infinispan-issues] [JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance

Dan Berindei (JIRA) jira-events at lists.jboss.org
Mon May 27 18:44:06 EDT 2013


    [ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777231#comment-12777231 ] 

Dan Berindei commented on ISPN-3120:
------------------------------------

My fix was to only set the waitingForState flag in StateConsumerImpl after all the segments have been requested. But it turns out there's a way around that: if a node E requests some segments from another node H, which then dies, the StateConsumerImpl on node E will first remove the transfers from H, and then request the segments from another node F. During this time, the waitingForState flag will remain true, so it's possible to replicate the bug even with my fix.

1. Starting with a 4-node cluster: [E, F, G, H] (topology 6), with numOwners=3. 
2. G leaves, and E sends a REBALANCE_START command with nodes [E, F, H] (topology 8). Some segments are owned by [F, H] in the current CH and by [F, H, E] in the pending CH.
3. E requests some segments from F and some segments from H.
4. H leaves, and E sends a CH_UPDATE command with nodes [E, F] (topology 9).
5. The StateConsumerImpl on E removes the transfers from H.
6. E receives a StateResponseCommand from F. Because it just removed the transfers from H, transfersBySegment/transfersBySource maps will remain empty and E will consider it finished receiving state for topology 9.
7. The StateConsumerImpl on E requests the failed segments from F (via a separate StateRequestCommand).
10. E receives a StateResponseCommand from F with actual data, but it ignores it because StateConsumerImpl.updatedKeys == null.


{noformat}
2013-05-23 16:59:59,589 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Requesting transactions for segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] of cache dist from node NodeH-13469
2013-05-23 16:59:59,619 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Requesting transactions for segments [49, 55, 57, 56] of cache dist from node NodeF-2452
2013-05-23 16:59:59,623 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.InboundTransferTask] Requesting segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] of cache dist from node NodeH-13469
2013-05-23 16:59:59,617 TRACE (transport-thread-4,NodeE) [org.infinispan.topology.ClusterTopologyManagerImpl] Attempting to execute command on self: CacheTopologyControlCommand{cache=dist, type=CH_UPDATE, sender=NodeE-18419, joinInfo=null, topologyId=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=3, members=[NodeE-18419, NodeF-2452]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=3, members=[NodeE-18419, NodeF-2452]}, throwable=null, viewId=3}
2013-05-23 16:59:59,624 TRACE (transport-thread-4,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Removing inbound transfers for segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] from source NodeH-13469 for cache dist
2013-05-23 16:59:59,624 TRACE (transport-thread-4,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Requesting transactions for segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] of cache dist from node NodeF-2452
2013-05-23 16:59:59,624 WARN  (transport-thread-2,NodeE) [org.infinispan.statetransfer.InboundTransferTask] ISPN000210: Failed to request segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] of cache dist from node NodeH-13469 (node will not be retried)
2013-05-23 16:59:59,624 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.InboundTransferTask] Requesting segments [49, 55, 57, 56] of cache dist from node NodeF-2452
2013-05-23 16:59:59,632 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.InboundTransferTask] Successfully requested segments [49, 55, 57, 56] of cache dist from node NodeF-2452
2013-05-23 16:59:59,634 TRACE (remote-thread-1,NodeE) [org.infinispan.statetransfer.InboundTransferTask] Finished receiving state for segments [49, 55, 57, 56] of cache dist
2013-05-23 16:59:59,634 TRACE (remote-thread-1,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Completion of inbound transfer task: InboundTransferTask{segments=[49, 55, 57, 56], finishedSegments=[49, 55, 57, 56], unfinishedSegments=[], source=NodeF-2452, isCancelled=false, isStartedSuccessfully=true, isCompletedSuccessfully=true, topologyId=8, timeout=240000, cacheName=dist} 
2013-05-23 16:59:59,634 DEBUG (remote-thread-1,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Finished receiving of segments for cache dist for topology 9.
2013-05-23 16:59:59,634 TRACE (remote-thread-1,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Stop keeping track of changed keys for state transfer
2013-05-23 16:59:59,652 DEBUG (transport-thread-4,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Finished adding inbound state transfer for segments [35, 33, 39, 37, 43, 41, 51, 50, 54, 53, 52] of cache dist
2013-05-23 16:59:59,652 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.InboundTransferTask] Requesting segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] of cache dist from node NodeF-2452
2013-05-23 16:59:59,652 TRACE (transport-thread-4,NodeE) [org.infinispan.statetransfer.StateTransferLockImpl] Signalling transaction data received for topology 9
2013-05-23 16:59:59,661 TRACE (transport-thread-2,NodeE) [org.infinispan.statetransfer.InboundTransferTask] Successfully requested segments [51, 50, 35, 33, 54, 39, 53, 52, 37, 43, 41] of cache dist from node NodeF-2452
2013-05-23 16:59:59,767 DEBUG (remote-thread-1,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Applying new state for segment 43 of cache dist from node NodeF-2452: received 1 cache entries
2013-05-23 16:59:59,767 TRACE (remote-thread-1,NodeE) [org.infinispan.statetransfer.StateConsumerImpl] Received keys [MagicKey#k2{cd331f73 at NodeF-2452}] for segment 43 of cache dist from node NodeF-2452
2013-05-23 16:59:59,768 TRACE (remote-thread-1,NodeE) [org.infinispan.interceptors.EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k2{cd331f73 at NodeF-2452}/v2 because it was already updated by somebody else
{noformat}
                
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
>                 Key: ISPN-3120
>                 URL: https://issues.jboss.org/browse/ISPN-3120
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.3.0.Beta1
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>            Priority: Critical
>              Labels: 5.2.x
>             Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in  ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7). 
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBAlANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G. 
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9 at NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9 at NodeG-6339}) returned null!
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


More information about the infinispan-issues mailing list