[JBoss JIRA] (ISPN-4587) Re-add old owners in the pending CH when a node leaves during rebalance
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4587?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4587:
-------------------------------
Fix Version/s: 7.0.0.CR2
(was: 7.0.0.CR1)
> Re-add old owners in the pending CH when a node leaves during rebalance
> -----------------------------------------------------------------------
>
> Key: ISPN-4587
> URL: https://issues.jboss.org/browse/ISPN-4587
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core, State Transfer
> Affects Versions: 7.0.0.Alpha5
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Minor
> Fix For: 7.0.0.CR2
>
>
> Say we have a distributed cache \[A, B\] with {{numSegments = 1}} and {{numOwners = 2}}. The initial topology is _T_: currentCH = \{0: A B\}, pendingCH = null
> C joins, and A starts a rebalance. The topology is now _T + 1_: currentCH = \{0: A B\}, pendingCH = \{0: A C\}
> C now leaves, and A updates the consistent hashes to remove it, installing a new topology _T + 2_: currentCH = \{0: A B\}, pendingCH = \{0: A\}
> A doesn't need to receive any data, so the rebalance ends and the pending CH is installed as the current CH in topology _T + 3_: currentCH = \{0: A\}, pendingCH = null
> This algorithm is relatively easy to follow and implement, but it does result in reduced availability of the cache data. It would be better if topology _T + 2_ could re-add B as an owner in the pending CH.
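The difference between the two behaviours can be sketched with a toy model. This is only an illustration under stated assumptions: a consistent hash is modelled as a plain map from segment id to owner list, and the class and method names (`PendingChSketch`, `removeLeaver`, `removeLeaverAndReAddOldOwners`) are hypothetical, not Infinispan APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Infinispan code): a CH is a map segment -> owners.
public class PendingChSketch {

    // Behaviour described in the issue: drop the leaver from every segment.
    static Map<Integer, List<String>> removeLeaver(Map<Integer, List<String>> ch, String leaver) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> e : ch.entrySet()) {
            List<String> owners = new ArrayList<>(e.getValue());
            owners.remove(leaver);
            result.put(e.getKey(), owners);
        }
        return result;
    }

    // Proposed behaviour: after dropping the leaver, top each segment back up
    // to numOwners using owners from the current CH that are still members.
    static Map<Integer, List<String>> removeLeaverAndReAddOldOwners(
            Map<Integer, List<String>> currentCH,
            Map<Integer, List<String>> pendingCH,
            String leaver,
            int numOwners) {
        Map<Integer, List<String>> result = removeLeaver(pendingCH, leaver);
        for (Map.Entry<Integer, List<String>> e : result.entrySet()) {
            List<String> owners = e.getValue();
            for (String oldOwner : currentCH.getOrDefault(e.getKey(), List.of())) {
                if (owners.size() >= numOwners) {
                    break;
                }
                if (!oldOwner.equals(leaver) && !owners.contains(oldOwner)) {
                    owners.add(oldOwner);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Topology T + 1 from the description: C joined, rebalance in progress.
        Map<Integer, List<String>> currentCH = Map.of(0, List.of("A", "B"));
        Map<Integer, List<String>> pendingCH = Map.of(0, List.of("A", "C"));

        // C leaves: plain removal leaves a single owner in the pending CH.
        System.out.println(removeLeaver(pendingCH, "C"));                      // {0=[A]}
        // Re-adding old owners keeps B, so numOwners = 2 is preserved.
        System.out.println(removeLeaverAndReAddOldOwners(currentCH, pendingCH, "C", 2)); // {0=[A, B]}
    }
}
```

With the second variant, topology _T + 2_ would have pendingCH = \{0: A B\}, so installing it in _T + 3_ would not reduce the number of owners.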
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
11 years, 6 months
[JBoss JIRA] (ISPN-4575) Map/Reduce incorrect results with a non-shared non-tx intermediate cache
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4575?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4575:
-------------------------------
Fix Version/s: 7.0.0.CR2
(was: 7.0.0.CR1)
> Map/Reduce incorrect results with a non-shared non-tx intermediate cache
> ------------------------------------------------------------------------
>
> Key: ISPN-4575
> URL: https://issues.jboss.org/browse/ISPN-4575
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Distributed Execution and Map/Reduce
> Affects Versions: 7.0.0.Alpha5
> Reporter: Dan Berindei
> Assignee: Vladimir Blagojevic
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.CR2
>
>
> In a non-tx cache, if a command is started with topology id {{T}}, and when it is replicated on another node the distribution interceptor sees topology {{T+1}}, it throws an {{OutdatedTopologyException}}. The originator of the command will then retry the command, setting topology {{T+1}}.
> When this happens with a {{PutKeyValueCommand(k, MapReduceManagerImpl.DeltaAwareList)}}, it can lead to duplicate intermediate values.
> Say _A_ is the primary owner of {{k}} in {{T}}, _B_ is a backup owner both in {{T}} and {{T+1}}, and _C_ is the backup owner in {{T}} and the primary owner in {{T+1}} (i.e. _C_ just joined and a rebalance is in progress during {{T}} - see {{NonTxBackupOwnerBecomingPrimaryOwnerTest}}).
> _A_ starts the {{PutKeyValueCommand}} and replicates it to _B_ and _C_. _C_ applies the command, but _B_ already has topology {{T+1}} and throws an {{OutdatedTopologyException}}. _A_ installs topology {{T+1}}, sends the command to _C_ (as the new primary owner), which replicates it to _B_ and then applies it locally a second time.
> This scenario can happen during an M/R task even without nodes joining or leaving. That's because {{CreateCacheCommand}} only calls {{getCache()}} on each member; it doesn't wait for the cache to have a certain number of members or for state transfer to be complete for all the members. The last member to join the intermediate cache is guaranteed to have topology {{T+1}}, but the others may have topology {{T}} by the time the combine phase starts inserting values in the intermediate cache.
> I have seen the {{OutdatedTopologyException}} happen pretty often during the test suite, especially after I removed the duplicate {{invokeRemotely}} call in {{MapReduceTask.executeTaskInit()}}. Most of them were harmless, but there was one failure in CI: http://ci.infinispan.org/viewLog.html?buildId=9811&tab=buildResultsDiv&bu...
> A short-term fix would be to wait for all the members to finish joining in {{CreateCacheCommand}}. Long-term, M/R tasks should be resilient to topology changes, so we should investigate making {{PutKeyValue(k, DeltaAwareList)}} handle an {{OutdatedTopologyException}}.
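The duplication comes from an append-style write being applied twice on the same node. A minimal sketch of that effect follows, with a hypothetical `Owner` class standing in for node _C_ and its intermediate list for {{k}}. The command-id check in `applyDeltaOnce` is only one possible way to make a retry idempotent; it is an assumption for illustration, not something the issue proposes or Infinispan is known to implement.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a retried append; none of these names are Infinispan APIs.
public class RetryDuplicateSketch {

    static class Owner {
        final List<String> values = new ArrayList<>();
        final Set<Long> appliedCommandIds = new HashSet<>();

        // Naive apply: every delivery of the delta is appended.
        void applyDelta(List<String> delta) {
            values.addAll(delta);
        }

        // Assumed idempotent variant: a delta is appended only the first
        // time its command id is seen on this owner.
        void applyDeltaOnce(long commandId, List<String> delta) {
            if (appliedCommandIds.add(commandId)) {
                values.addAll(delta);
            }
        }
    }

    public static void main(String[] args) {
        List<String> delta = List.of("v1");

        Owner naive = new Owner();
        naive.applyDelta(delta); // first delivery: C is a backup owner in T
        naive.applyDelta(delta); // retry: C is the primary owner in T+1
        System.out.println(naive.values); // [v1, v1]

        Owner deduped = new Owner();
        deduped.applyDeltaOnce(42L, delta); // first delivery
        deduped.applyDeltaOnce(42L, delta); // retry with the same command id is ignored
        System.out.println(deduped.values); // [v1]
    }
}
```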
[JBoss JIRA] (ISPN-4572) StateTransferReplicationQueueTest.testStateTransferWithNodeRestartedAndBusyNonTx random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4572?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4572:
-------------------------------
Fix Version/s: 7.0.0.CR2
(was: 7.0.0.CR1)
> StateTransferReplicationQueueTest.testStateTransferWithNodeRestartedAndBusyNonTx random failures
> ------------------------------------------------------------------------------------------------
>
> Key: ISPN-4572
> URL: https://issues.jboss.org/browse/ISPN-4572
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer, Test Suite - Core
> Affects Versions: 7.0.0.Alpha5
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.CR2
>
>
> {noformat}
> java.lang.AssertionError:
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
> at org.testng.AssertJUnit.assertTrue(AssertJUnit.java:24)
> at org.testng.AssertJUnit.assertNull(AssertJUnit.java:282)
> at org.testng.AssertJUnit.assertNull(AssertJUnit.java:274)
> at org.infinispan.statetransfer.StateTransferReplicationQueueTest.doWritingCacheTest(StateTransferReplicationQueueTest.java:144)
> at org.infinispan.statetransfer.StateTransferReplicationQueueTest.testStateTransferWithNodeRestartedAndBusyNonTx(StateTransferReplicationQueueTest.java:88)
> {noformat}
> No trace log available for now.
[JBoss JIRA] (ISPN-4568) DistSyncL1RepeatableReadFuncTest.testNoEntryInL1MultipleConcurrentGetsWithInvalidation random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4568?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4568:
-------------------------------
Fix Version/s: 7.0.0.CR2
(was: 7.0.0.CR1)
> DistSyncL1RepeatableReadFuncTest.testNoEntryInL1MultipleConcurrentGetsWithInvalidation random failures
> ------------------------------------------------------------------------------------------------------
>
> Key: ISPN-4568
> URL: https://issues.jboss.org/browse/ISPN-4568
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 7.0.0.Alpha5
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.CR2
>
>
> Very likely related to ISPN-4564, as there seem to be two unexplained pauses of ~3s each, and some log messages also appear to be delayed:
> {noformat}
> 08:23:48,443 TRACE (transport-thread-DistSyncL1RepeatableReadFuncTest-NodeAN-p28720-t1:) [InvocationContextInterceptor] Invoked with command PutKeyValueCommand{key=key-to-the-cache, value=second-put, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true} and InvocationContext [org.infinispan.context.SingleKeyNonTxInvocationContext@e9a3538]
> 08:23:48,470 TRACE (transport-thread-DistSyncL1RepeatableReadFuncTest-NodeAN-p28720-t1:) [JGroupsTransport] dests=[DistSyncL1RepeatableReadFuncTest-NodeAN-7764, DistSyncL1RepeatableReadFuncTest-NodeAM-739], command=SingleRpcCommand{cacheName='dist', command=PutKeyValueCommand{key=key-to-the-cache, value=second-put, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}}, mode=SYNCHRONOUS, timeout=60000
> 08:23:50,953 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28701-t6:) [InvocationContextInterceptor] Invoked with command PutKeyValueCommand{key=key-to-the-cache, value=second-put, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true} and InvocationContext [org.infinispan.context.impl.NonTxInvocationContext@62801f8c]
> 08:23:50,953 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28701-t6:) [L1ManagerImpl] Invalidating keys [key-to-the-cache] on nodes [DistSyncL1RepeatableReadFuncTest-NodeAK-9309]. Use multicast? false
> 08:23:51,060 TRACE (transport-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28700-t2:) [JGroupsTransport] dests=[DistSyncL1RepeatableReadFuncTest-NodeAK-9309], command=SingleRpcCommand{cacheName='dist', command=InvalidateL1Command{num keys=1, origin=DistSyncL1RepeatableReadFuncTest-NodeAN-7764}}, mode=SYNCHRONOUS_IGNORE_LEAVERS, timeout=60000
> 08:23:51,062 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAK-p28661-t5:) [BaseRpcInvokingCommand] Invoking command InvalidateL1Command{num keys=1, origin=DistSyncL1RepeatableReadFuncTest-NodeAN-7764}, with originLocal flag set to false
> 08:23:50,972 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28701-t6:) [CallInterceptor] Executing command: PutKeyValueCommand{key=key-to-the-cache, value=second-put, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}.
> 08:23:51,786 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAK-p28661-t5:) [InboundInvocationHandlerImpl] About to send back response null for command SingleRpcCommand{cacheName='dist', command=InvalidateL1Command{num keys=1, origin=DistSyncL1RepeatableReadFuncTest-NodeAN-7764}}
> 08:23:51,796 TRACE (transport-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28700-t2:) [CommandAwareRpcDispatcher] Responses: [sender=DistSyncL1RepeatableReadFuncTest-NodeAK-9309, received=true, suspected=false]
> 08:23:54,561 TRACE (transport-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28700-t2:) [RpcManagerImpl] Response(s) to SingleRpcCommand{cacheName='dist', command=InvalidateL1Command{num keys=1, origin=DistSyncL1RepeatableReadFuncTest-NodeAN-7764}} is {}
> 08:23:56,955 ERROR (testng-DistSyncL1RepeatableReadFuncTest:) [UnitTestTestNGListener] Test testNoEntryInL1MultipleConcurrentGetsWithInvalidation(org.infinispan.distribution.DistSyncL1RepeatableReadFuncTest) failed.
> java.util.concurrent.TimeoutException
> at java.util.concurrent.FutureTask.get(FutureTask.java:201)
> at org.infinispan.commons.util.concurrent.NotifyingFutureImpl.get(NotifyingFutureImpl.java:84)
> at org.infinispan.distribution.BaseDistSyncL1Test.testNoEntryInL1MultipleConcurrentGetsWithInvalidation(BaseDistSyncL1Test.java:217)
> 08:23:54,578 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28701-t6:) [L1NonTxInterceptor] Allowing entry to commit as local node is owner
> 08:23:57,861 TRACE (remote-thread-DistSyncL1RepeatableReadFuncTest-NodeAM-p28701-t6:) [EntryWrappingInterceptor] About to commit entry RepeatableReadEntry(499752d9){key=key-to-the-cache, value=second-put, oldValue=first-put, isCreated=false, isChanged=true, isRemoved=false, isValid=true, skipRemoteGet=false, metadata=EmbeddedMetadata{version=null}}
> {noformat}
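Pauses like these are easier to spot per thread than in the interleaved log. The following is a rough sketch of a gap finder, assuming the line layout seen above (`HH:mm:ss,SSS LEVEL (thread-name:) ...`); the class name `LogGapSketch`, the `findGaps` helper and the 2s threshold are hypothetical, and the sample lines in `main` are abbreviated from the log with the thread name shortened to `t6:`.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports gaps between consecutive log lines of the same thread.
public class LogGapSketch {

    // Captures the timestamp and the thread name from a line such as
    // "08:23:48,443 TRACE (some-thread:) [Component] message".
    private static final Pattern LINE =
            Pattern.compile("(\\d{2}:\\d{2}:\\d{2},\\d{3}) \\w+ +\\(([^)]+)\\)");
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    static List<String> findGaps(List<String> lines, Duration threshold) {
        Map<String, LocalTime> lastSeen = new HashMap<>();
        List<String> gaps = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                continue; // e.g. stack trace lines without a timestamp
            }
            LocalTime ts = LocalTime.parse(m.group(1), TS);
            String thread = m.group(2);
            LocalTime prev = lastSeen.put(thread, ts);
            if (prev != null) {
                Duration gap = Duration.between(prev, ts);
                if (gap.compareTo(threshold) >= 0) {
                    gaps.add(thread + " " + gap.toMillis() + " ms");
                }
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Abbreviated lines for one remote thread, timestamps taken from the log above.
        List<String> lines = List.of(
                "08:23:50,972 TRACE (t6:) [CallInterceptor] Executing command",
                "08:23:54,578 TRACE (t6:) [L1NonTxInterceptor] Allowing entry to commit",
                "08:23:57,861 TRACE (t6:) [EntryWrappingInterceptor] About to commit entry");
        // Reports a 3606 ms and a 3283 ms gap for thread "t6:".
        findGaps(lines, Duration.ofSeconds(2)).forEach(System.out::println);
    }
}
```

Grouping by thread matters here because the interleaved log is not globally ordered by timestamp, so comparing adjacent lines would give misleading gaps.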