[JBoss JIRA] (ISPN-5454) XSite: RetryMechanismTest random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-5454?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-5454:
-------------------------------
Status: Open (was: New)
> XSite: RetryMechanismTest random failures
> -----------------------------------------
>
> Key: ISPN-5454
> URL: https://issues.jboss.org/browse/ISPN-5454
> Project: Infinispan
> Issue Type: Bug
> Components: Cross-Site Replication
> Affects Versions: 7.2.1.Final
> Reporter: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 8.0.0.Alpha1
>
>
> {{ClusteredCacheBackupReceiver.awaitRemoteTask()}} doesn't respect the state push command's timeout, at least when it's smaller than the sync replication timeout in the target cache. When that happens, the state provider will resend the state, and there will be 2 state push commands executing at the same time.
> RetryMechanismTest changes the state push timeout to 2 seconds, but the sync replication timeout stays at 15 seconds. This causes failures in {{testRetryLocally}} and {{testFailRetryLocally}}, if it takes more than 2 seconds to suspect the killed node.
> {noformat}
> 10:02:13,007 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
> 10:02:16,008 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
> 10:02:16,040 TRACE (asyncTransportThread-4,NodeP:) [RpcManagerImpl] replication exception:
> org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
> 10:02:16,040 TRACE (asyncTransportThread-0,NodeP:) [RpcManagerImpl] replication exception:
> org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
> 10:02:19,147 ERROR (testng-RetryMechanismTest:) [UnitTestTestNGListener] Test testFailRetryLocally(org.infinispan.xsite.statetransfer.failures.RetryMechanismTest) failed.
> java.lang.AssertionError: expected:<2> but was:<3>
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
> at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364)
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80)
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:245)
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:252)
> at org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.testFailRetryLocally(RetryMechanismTest.java:227)
> {noformat}
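For illustration only, a minimal sketch of the behaviour the description asks for, using plain java.util.concurrent rather than the actual {{ClusteredCacheBackupReceiver}} code: the receiver-side wait is bounded by the push command's own timeout (2 seconds in the test) instead of the cache's 15-second sync replication timeout, so a retried push can never overlap a still-running apply.
{code}
// Hedged sketch, not Infinispan code: bound the wait on the remote apply task
// by the push command's timeout and cancel it on expiry, so the provider's
// retry never runs concurrently with the first attempt.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStatePushWait {

   static <T> T awaitRemoteTask(Future<T> remoteTask, long pushTimeoutMillis)
         throws Exception {
      try {
         // Wait no longer than the timeout carried by the push command itself.
         return remoteTask.get(pushTimeoutMillis, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
         // Cancel the in-flight apply so a resent push never overlaps it.
         remoteTask.cancel(true);
         throw e;
      }
   }

   public static void main(String[] args) throws Exception {
      ExecutorService executor = Executors.newSingleThreadExecutor();
      Future<String> slowApply = executor.submit(() -> {
         Thread.sleep(5_000); // simulated slow state apply on the backup site
         return "state applied";
      });
      try {
         awaitRemoteTask(slowApply, 2_000); // push command timeout = 2s
      } catch (TimeoutException expected) {
         System.out.println("push timed out after 2s, apply task cancelled");
      } finally {
         executor.shutdownNow();
      }
   }
}
{code}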
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-5274) Inconsistent data after transaction rollback (with success on originator)
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-5274?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-5274:
-----------------------------------------------
Dan Berindei <dberinde(a)redhat.com> changed the Status of [bug 1199490|https://bugzilla.redhat.com/show_bug.cgi?id=1199490] from ASSIGNED to POST
> Inconsistent data after transaction rollback (with success on originator)
> -------------------------------------------------------------------------
>
> Key: ISPN-5274
> URL: https://issues.jboss.org/browse/ISPN-5274
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.1.1.Final
> Reporter: Matej Čimbora
> Assignee: Dan Berindei
> Fix For: 7.2.0.Final
>
>
> Scenario
> Nodes edg-perf[10-13], partition handling on
> 1.Transaction is started on edg-perf10
> {code}
> 10:24:48,228 TRACE [org.infinispan.transaction.xa.TransactionXaAdapter] (DefaultStressor-6) start called on tx GlobalTransaction:<edg-perf10-20667>:38910:local
> {code}
> 2. Value of key_000000000000065B is updated within the transaction
> {code}
> 10:24:48,405 TRACE [org.radargun.service.InfinispanOperations$Cache] (DefaultStressor-6) PUT cache=testCache key=key_000000000000065B value=[6 #15: 296, 501, 1109, 1119, 1459, 1544, 1999, 2083, 2130, 2257, 2298, 2784, 2941, 3009, 3147, ]
> {code}
> 3. Transaction is successfully prepared on edg-perf11 & edg-perf12
> {code}
> 10:24:48,559 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (DefaultStressor-6) Responses: [sender=edg-perf11-61837, received=true, suspected=false]
> [sender=edg-perf12-7305, received=true, suspected=false]
> {code}
> 4. Transaction commit is issued
> {code}
> 10:24:48,562 TRACE [org.infinispan.transaction.TransactionCoordinator] (DefaultStressor-6) Committing transaction GlobalTransaction:<edg-perf10-20667>:38910:local
> {code}
> 5. Other participating nodes (edg-perf11 & edg-perf12) are suspected...
> {code}
> 10:24:52,705 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (DefaultStressor-6) Target node edg-perf11-61837 left during remote call, ignoring
> 10:24:52,716 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (DefaultStressor-6) Target node edg-perf12-7305 left during remote call, ignoring
> {code}
> ... as they had received a new view (without edg-perf10) in the meantime.
> {code}
> 10:24:48,547 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,edg-perf12-7305) ISPN000093: Received new, MERGED cluster view: MergeView::[edg-perf12-7305|20] (3) [edg-perf12-7305, edg-perf11-61837, edg-perf13-25187], 2 subgroups: [edg-perf13-25187|8] (1) [edg-perf13-25187], [edg-perf13-25187|19] (1) [edg-perf13-25187]
> {code}
> Still, the transaction is committed on edg-perf10 & the updated entry is stored locally
> {code}
> 10:24:52,894 TRACE [org.infinispan.statetransfer.CommitManager] (DefaultStressor-6) Trying to commit. Key=key_000000000000065B. Operation Flag=null, L1 invalidation=false
> 10:24:52,896 TRACE [org.infinispan.statetransfer.CommitManager] (DefaultStressor-6) Committing key=key_000000000000065B. It is a L1 invalidation or a normal put and no tracking is enabled!
> 10:24:52,908 TRACE [org.infinispan.container.DefaultDataContainer] (DefaultStressor-6) Creating new ICE for writing. Existing=ImmortalCacheEntry{key=key_000000000000065B, value=[6 #14: 296, 501, 1109, 1119, 1459, 1544, 1999, 2083, 2130, 2257, 2298, 2784, 2941, 3009, ]}, metadata=EmbeddedMetadata{version=null}, new value=[6 #15: 296, 501, 1109, 1119, 1459, 1544, 1999, 2083, 2130, 2257, 2298, 2784, 2941, 3009, 3147, ]
> {code}
> 6. Other nodes rollback the transaction
> {code}
> 10:24:50,376 DEBUG [org.infinispan.transaction.TransactionTable] (transport-thread-10) Rolling back transaction GlobalTransaction:<edg-perf10-20667>:38910:remote because originator edg-perf10-20667 left the cluster
> {code}
> 7. edg-perf10 receives a new view, containing nodes edg-perf[10,11,13]. Incoming state transfer overwrites the updated value
> {code}
> 10:25:09,614 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf10-20667) ISPN000093: Received new, MERGED cluster view: MergeView::[edg-perf10-20667|22] (3) [edg-perf10-20667, edg-perf11-61837, edg-perf13-25187], 4 subgroups: [edg-perf10-20667|15] (2) [edg-perf10-20667, edg-perf11-61837], [edg-perf10-20667|19] (3) [edg-perf10-20667, edg-perf12-7305, edg-perf11-61837], [edg-perf10-20667|18] (1) [edg-perf10-20667], [edg-perf10-20667|21] (2) [edg-perf10-20667, edg-perf13-25187]
> {code}
> 8. get operation returns outdated value
> {code}
> 10:26:21,020 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (DefaultStressor-6) Response(s) to ClusteredGetCommand{key=key_000000000000065B, flags=null} is {edg-perf12-7305=SuccessfulResponse{responseValue=ImmortalCacheValue {value=[6 #14: 296, 501, 1109, 1119, 1459, 1544, 1999, 2083, 2130, 2257, 2298, 2784, 2941, 3009, ]}} }
> {code}
> From the client's perspective, this behavior is not transparent: since the transaction completed successfully, the client can reasonably assume the updated entry is present.
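To make the violated contract explicit, a minimal sketch with a hypothetical {{Cache}} interface (not the RadarGun or Infinispan API): once a transactional put has committed successfully on the originator, a later get of the same key must not return the pre-transaction value, yet step 8 above observes version #14 instead of #15.
{code}
// Hedged illustration of the client-side invariant only; the Cache interface
// and the in-memory implementation are stand-ins, not Infinispan classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CommittedReadInvariant {

   interface Cache<K, V> {
      void put(K key, V value);
      V get(K key);
   }

   public static void main(String[] args) {
      Map<String, String> store = new ConcurrentHashMap<>();
      Cache<String, String> cache = new Cache<String, String>() {
         public void put(String k, String v) { store.put(k, v); }
         public String get(String k) { return store.get(k); }
      };

      // Steps 2-4: the client updates the key and the commit reports success.
      cache.put("key_000000000000065B", "version #15");

      // Step 8: a later read must still see the committed value; in the
      // reported scenario it returned the older "version #14" instead.
      String observed = cache.get("key_000000000000065B");
      if (!"version #15".equals(observed)) {
         throw new AssertionError("lost update: observed " + observed);
      }
      System.out.println("invariant holds: " + observed);
   }
}
{code}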
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-4546) Possible stale lock when the primary owner leaves during rebalance
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4546?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4546:
-----------------------------------------------
Dan Berindei <dberinde(a)redhat.com> changed the Status of [bug 1163727|https://bugzilla.redhat.com/show_bug.cgi?id=1163727] from ASSIGNED to POST
> Possible stale lock when the primary owner leaves during rebalance
> ------------------------------------------------------------------
>
> Key: ISPN-4546
> URL: https://issues.jboss.org/browse/ISPN-4546
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer
> Affects Versions: 7.0.0.Alpha5, 7.1.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 7.2.0.Final
>
>
> Topology T: coordinator = A, owners(k) = [C, D], pending_owners(k) = null
> B sends prepareCommand(tx1, put(k, v)) to C, D
> D adds backup locks and replies
> C acquires lock, ready to send reply to B
> A starts installing topology T+1: owners(k) = [C, D], pending_owners(k) = [C, E]
> A, C and E install topology T+1, B and D do not
> E requests and receives tx data from C, including tx1
> C leaves
> B sees a SuspectException, sends rollbackCommand(tx1) to C, D
> D removes tx1
> C has left, but is ignored
> B reports to the user that the tx has been rolled back
> B and D install topology T+1 (optional)
> A starts installing topology T+2: owners(k) = [D], pending_owners(k) = [E]
> A, B, D, E all install topology T+2
> E requests and receives state from D, but it does not remove tx1
> A starts installing topology T+3: owners(k) = [E], pending_owners(k) = null
> E now has a stale backup lock on k
> It seems very hard to reproduce in production: C would have to leave soon enough so that B and D haven't received the T+1 topology yet, but late enough for it to send its transaction data to E.
> A possible solution would be to catch any SuspectException during prepare/commit/rollback (without ignoring leavers), wait for a new topology, and replicate the command again on the new owners. Obviously, this wouldn't work with asynchronous prepare/commit/rollback.
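As a rough illustration of that idea, a minimal sketch with hypothetical {{Transport}} and {{Topology}} types (not the Infinispan RpcManager API): a SuspectException during a synchronous prepare/commit/rollback is not swallowed for the leaver; instead the caller waits for a newer topology and re-replicates the command to the owners computed from it.
{code}
// Hedged sketch of the retry-on-suspect idea; all types are hypothetical.
import java.util.List;

public class RetryOnSuspectSketch {

   static class SuspectException extends RuntimeException {}

   interface Topology {
      int id();
      List<String> owners(String key);
   }

   interface Transport {
      void replicate(Object command, List<String> targets);
      Topology waitForTopologyNewerThan(int topologyId) throws InterruptedException;
   }

   static void replicateWithRetry(Transport transport, Object command,
                                  String key, Topology current)
         throws InterruptedException {
      Topology topology = current;
      while (true) {
         try {
            transport.replicate(command, topology.owners(key));
            return;
         } catch (SuspectException e) {
            // A target left: do not ignore the leaver; wait until the view
            // change shows up in a newer topology, then resend to its owners.
            topology = transport.waitForTopologyNewerThan(topology.id());
         }
      }
   }

   public static void main(String[] args) throws InterruptedException {
      // Toy transport: the first replicate() suspects a leaver, the retry
      // against the next topology's owners succeeds.
      Transport transport = new Transport() {
         boolean failedOnce;
         public void replicate(Object command, List<String> targets) {
            if (!failedOnce) { failedOnce = true; throw new SuspectException(); }
            System.out.println("replicated " + command + " to " + targets);
         }
         public Topology waitForTopologyNewerThan(int topologyId) {
            return topology(topologyId + 1, List.of("D", "E"));
         }
      };
      replicateWithRetry(transport, "rollback(tx1)", "k",
                         topology(7, List.of("C", "D")));
   }

   static Topology topology(int topologyId, List<String> owners) {
      return new Topology() {
         public int id() { return topologyId; }
         public List<String> owners(String key) { return owners; }
      };
   }
}
{code}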
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-4503) Commands with topology id 0 are not properly ignored on joiners
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4503?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4503:
-----------------------------------------------
Dan Berindei <dberinde(a)redhat.com> changed the Status of [bug 1220328|https://bugzilla.redhat.com/show_bug.cgi?id=1220328] from NEW to ASSIGNED
> Commands with topology id 0 are not properly ignored on joiners
> ---------------------------------------------------------------
>
> Key: ISPN-4503
> URL: https://issues.jboss.org/browse/ISPN-4503
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer
> Affects Versions: 7.0.0.Alpha4
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_stability
> Fix For: 7.0.0.Beta1
>
>
> InboundInvocationHandlerImpl is supposed to ignore commands sent with a topology id smaller than the first topology id in which the local node was a member. But there is a loophole when the command is sent with topology id 0.
> This is visible in StateTransferFunctionalTest, where the writing thread keeps the CPU busy and can delay the 2nd node's join for a long time (especially when run on a single core with {{taskset -c 0}}). For some reason, the PrepareCommands are sent only to the local node, while the TxCompletionNotificationCommands are sent to the entire cluster ({{null}}). When the 2nd node manages to join, it receives a lot of TxCompletionNotificationCommands, and processing them delays the processing of the rebalance commands. Since the writes eventually block waiting for the new topology to be installed on the joiner, the delayed rebalance commands cause the write to time out.
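For illustration only, a minimal sketch of the intended check with the topology-id-0 loophole closed (hypothetical names, not the actual {{InboundInvocationHandlerImpl}} code):
{code}
// Hedged sketch: a joiner processes a command only if the command's topology
// id is at least the first topology in which the local node was a member.
// No special case for 0, which is the loophole described above.
public class JoinerCommandFilter {

   /** First topology id in which the local node was a member; set at join. */
   private final int firstLocalTopologyId;

   JoinerCommandFilter(int firstLocalTopologyId) {
      this.firstLocalTopologyId = firstLocalTopologyId;
   }

   boolean shouldProcess(int commandTopologyId) {
      // A command stamped with a topology older than the one this node joined
      // in is stale for this node and must be ignored, even if the stamp is 0.
      return commandTopologyId >= firstLocalTopologyId;
   }

   public static void main(String[] args) {
      JoinerCommandFilter joiner = new JoinerCommandFilter(5);
      System.out.println(joiner.shouldProcess(0)); // false: stale, ignored
      System.out.println(joiner.shouldProcess(5)); // true
      System.out.println(joiner.shouldProcess(7)); // true
   }
}
{code}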
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-4503) Commands with topology id 0 are not properly ignored on joiners
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4503?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration updated ISPN-4503:
------------------------------------------
Bugzilla Update: Perform
Bugzilla References: https://bugzilla.redhat.com/show_bug.cgi?id=1220328
> Commands with topology id 0 are not properly ignored on joiners
> ---------------------------------------------------------------
>
> Key: ISPN-4503
> URL: https://issues.jboss.org/browse/ISPN-4503
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer
> Affects Versions: 7.0.0.Alpha4
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_stability
> Fix For: 7.0.0.Beta1
>
>
> InboundInvocationHandlerImpl is supposed to ignore commands sent with a topology id smaller than the first topology id in which the local node was a member. But there is a loophole when the command is sent with topology id 0.
> This is visible in StateTransferFunctionalTest, where the writing thread keeps the CPU busy and can delay the 2nd node's join for a long time (especially when run on a single core with {{taskset -c 0}}). For some reason, the PrepareCommands are sent only to the local node, while the TxCompletionNotificationCommands are sent to the entire cluster ({{null}}). When the 2nd node manages to join, it receives a lot of TxCompletionNotificationCommands, and processing them delays the processing of the rebalance commands. Since the writes eventually block waiting for the new topology to be installed on the joiner, the delayed rebalance commands cause the write to time out.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-4504) Topology id is not properly set on ClusteredGetCommands
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4504?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration updated ISPN-4504:
------------------------------------------
Bugzilla Update: Perform
Bugzilla References: https://bugzilla.redhat.com/show_bug.cgi?id=1220328
> Topology id is not properly set on ClusteredGetCommands
> -------------------------------------------------------
>
> Key: ISPN-4504
> URL: https://issues.jboss.org/browse/ISPN-4504
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.0.0.Alpha4
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 7.0.0.Beta1
>
>
> Because the topology id is not properly set on ClusteredGetCommands, they don't wait for the sender's topology to be installed on the target.
> I tried to fix this while implementing a fix for ISPN-4503. My solution caused 2 failures in RemoteGetDuringStateTransferTest, scenarios 5 and 7:
> | sc | currentTopologyId | currentTopologyId + 1 (rebalance) | currentTopologyId + 2 (finish) |
> | 5 | 0:remoteGet | 2:sendReply | 0:receiveReply, 1:sendReply |
> | 7 | | 0:remoteGet, 2:sendReply | 0:receiveReply, 1:sendReply |
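As an illustration of the behaviour the issue asks for, a minimal sketch with hypothetical types (not the actual {{ClusteredGetCommand}} API): the sender stamps the remote get with its current topology id, and the target delays execution until it has installed at least that topology.
{code}
// Hedged sketch only; RemoteGetCommand and Receiver are stand-in types.
public class StampedRemoteGetSketch {

   static final class RemoteGetCommand {
      final String key;
      final int topologyId; // stamped by the sender, never left at the default 0
      RemoteGetCommand(String key, int topologyId) {
         this.key = key;
         this.topologyId = topologyId;
      }
   }

   static final class Receiver {
      private int installedTopologyId;

      synchronized void installTopology(int id) {
         installedTopologyId = id;
         notifyAll();
      }

      synchronized String handle(RemoteGetCommand cmd, long timeoutMillis)
            throws InterruptedException {
         long deadline = System.currentTimeMillis() + timeoutMillis;
         // Wait until the sender's topology is installed locally before reading.
         while (installedTopologyId < cmd.topologyId) {
            long left = deadline - System.currentTimeMillis();
            if (left <= 0) {
               throw new IllegalStateException(
                     "timed out waiting for topology " + cmd.topologyId);
            }
            wait(left);
         }
         return "value-of-" + cmd.key; // placeholder read once topologies line up
      }
   }

   public static void main(String[] args) throws Exception {
      Receiver receiver = new Receiver();
      receiver.installTopology(1);
      // The sender is already on topology 2, so the command is stamped with 2.
      RemoteGetCommand cmd = new RemoteGetCommand("key_000000000000065B", 2);
      new Thread(() -> receiver.installTopology(2)).start(); // target catches up
      System.out.println(receiver.handle(cmd, 1_000));
   }
}
{code}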
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-5454) XSite: RetryMechanismTest random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-5454?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-5454:
-------------------------------
Description:
{{ClusteredCacheBackupReceiver.awaitRemoteTask()}} doesn't respect the state push command's timeout, at least when it's smaller than the sync replication timeout in the target cache. When that happens, the state provider will resend the state, and there will be 2 state push commands executing at the same time.
RetryMechanismTest changes the state push timeout to 2 seconds, but the sync replication timeout stays at 15 seconds. This causes failures in {{testRetryLocally}} and {{testFailRetryLocally}}, if it takes more than 2 seconds to suspect the killed node.
{noformat}
10:02:13,007 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
10:02:16,008 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
10:02:16,040 TRACE (asyncTransportThread-4,NodeP:) [RpcManagerImpl] replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
10:02:16,040 TRACE (asyncTransportThread-0,NodeP:) [RpcManagerImpl] replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
10:02:19,147 ERROR (testng-RetryMechanismTest:) [UnitTestTestNGListener] Test testFailRetryLocally(org.infinispan.xsite.statetransfer.failures.RetryMechanismTest) failed.
java.lang.AssertionError: expected:<2> but was:<3>
at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:245)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:252)
at org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.testFailRetryLocally(RetryMechanismTest.java:227)
{noformat}
was:
{{ClusteredCacheBackupReceiver.awaitRemoteTask()}} doesn't respect the state push command's timeout, at least when it's smaller than the sync replication timeout in the target cache. When that happens, the state provider will resend the state, and there will be 2 state push commands executing at the same time.
RetryMechanismTest changes the state push timeout to 2 seconds, but the sync replication timeout stays at 15 seconds. This causes failures in {{testRetryLocally}} and {{testFailRetryLocally}}, if it takes more than 2 seconds to suspect the killed node.
{noformat}
10:02:13,007 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
10:02:16,008 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
10:02:16,040 TRACE (asyncTransportThread-4,NodeP:) [RpcManagerImpl] replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
10:02:16,040 TRACE (asyncTransportThread-0,NodeP:) [RpcManagerImpl] replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
10:02:19,147 ERROR (testng-RetryMechanismTest:) [UnitTestTestNGListener] Test testFailRetryLocally(org.infinispan.xsite.statetransfer.failures.RetryMechanismTest) failed.
java.lang.AssertionError: expected:<2> but was:<3>
at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:245)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:252)
at org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.testFailRetryLocally(RetryMechanismTest.java:227)
{noformat}
> XSite: RetryMechanismTest random failures
> -----------------------------------------
>
> Key: ISPN-5454
> URL: https://issues.jboss.org/browse/ISPN-5454
> Project: Infinispan
> Issue Type: Bug
> Components: Cross-Site Replication
> Affects Versions: 7.2.1.Final
> Reporter: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 8.0.0.Alpha1
>
>
> {{ClusteredCacheBackupReceiver.awaitRemoteTask()}} doesn't respect the state push command's timeout, at least when it's smaller than the sync replication timeout in the target cache. When that happens, the state provider will resend the state, and there will be 2 state push commands executing at the same time.
> RetryMechanismTest changes the state push timeout to 2 seconds, but the sync replication timeout stays at 15 seconds. This causes failures in {{testRetryLocally}} and {{testFailRetryLocally}}, if it takes more than 2 seconds to suspect the killed node.
> {noformat}
> 10:02:13,007 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
> 10:02:16,008 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
> 10:02:16,040 TRACE (asyncTransportThread-4,NodeP:) [RpcManagerImpl] replication exception:
> org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
> 10:02:16,040 TRACE (asyncTransportThread-0,NodeP:) [RpcManagerImpl] replication exception:
> org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
> 10:02:19,147 ERROR (testng-RetryMechanismTest:) [UnitTestTestNGListener] Test testFailRetryLocally(org.infinispan.xsite.statetransfer.failures.RetryMechanismTest) failed.
> java.lang.AssertionError: expected:<2> but was:<3>
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
> at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364)
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80)
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:245)
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:252)
> at org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.testFailRetryLocally(RetryMechanismTest.java:227)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[JBoss JIRA] (ISPN-5454) XSite: RetryMechanismTest random failures
by Dan Berindei (JIRA)
Dan Berindei created ISPN-5454:
----------------------------------
Summary: XSite: RetryMechanismTest random failures
Key: ISPN-5454
URL: https://issues.jboss.org/browse/ISPN-5454
Project: Infinispan
Issue Type: Bug
Components: Cross-Site Replication
Affects Versions: 7.2.1.Final
Reporter: Dan Berindei
Priority: Blocker
Fix For: 8.0.0.Alpha1
{{ClusteredCacheBackupReceiver.awaitRemoteTask()}} doesn't respect the state push command's timeout, at least when it's smaller than the sync replication timeout in the target cache. When that happens, the state provider will resend the state, and there will be 2 state push commands executing at the same time.
RetryMechanismTest changes the state push timeout to 2 seconds, but the sync replication timeout stays at 15 seconds. This causes failures in {{testRetryLocally}} and {{testFailRetryLocally}}, if it takes more than 2 seconds to suspect the killed node.
{noformat}
10:02:13,007 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
10:02:16,008 TRACE (asyncTransportThread-8,NodeN:) [RetryOnFailureXSiteCommand] Sending XSiteStatePushCommand{cacheName=___defaultcache, timeout=2000 (1 keys)} to [NYC (sync, timeout=2000)]
10:02:16,040 TRACE (asyncTransportThread-4,NodeP:) [RpcManagerImpl] replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
10:02:16,040 TRACE (asyncTransportThread-0,NodeP:) [RpcManagerImpl] replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeQ-56809 was suspected
10:02:19,147 ERROR (testng-RetryMechanismTest:) [UnitTestTestNGListener] Test testFailRetryLocally(org.infinispan.xsite.statetransfer.failures.RetryMechanismTest) failed.
java.lang.AssertionError: expected:<2> but was:<3>
at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:245)
at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:252)
at org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.testFailRetryLocally(RetryMechanismTest.java:227)
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)