[infinispan-issues] [JBoss JIRA] (ISPN-6997) PessimisticTxPartitionAndMergeDuringRuntimeTest.testOriginatorIsolatedPartition random failures

Mon Jun 19 05:56:00 EDT 2017

    [ https://issues.jboss.org/browse/ISPN-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422859#comment-13422859 ] 

Dan Berindei edited comment on ISPN-6997 at 6/19/17 5:55 AM:
-------------------------------------------------------------

I started investigating this again when I stumbled across some new failures on the ISPN-6997 branch. My previous comment was only partially correct: the {{LockControlCommand}} is not delivered during the partition, but immediately before (or possibly in parallel with) the view message.

The reason is that {{UNICAST3}} connections are not closed immediately when the peer disappears from the view, so messages sent during the partition stay in the connection's send table and may be delivered after the merge. But the caller already got a {{CacheNotFoundResponse}} when the partition view was installed, so it committed the transaction locally with a {{PrepareCommand(1PC)}}. Crucially, the 1PC command is *not* sent to all the nodes that got the {{LockControlCommand}}, but only to those that are still in the cluster. So the {{LockControlCommand}} is retransmitted after the merge, but the {{PrepareCommand(1PC)}}/{{RollbackCommand}} isn't, and the remote transaction is only cleaned up after the remote transaction timeout expires (60 seconds by default).

Bela added a [method in {{UNICAST3}}|https://issues.jboss.org/browse/JGRP-2194] to manually close a connection to a peer, but I've decided instead to change our code so that the targets of {{RollbackCommand}} and {{PrepareCommand}} include the targets of any previous {{LockControlCommand}}.
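
To make the intended change concrete, here is a minimal sketch of the target computation, not the actual Infinispan code. The class name, the {{completionTargets}} helper and the node names A, B, C are all hypothetical; the only idea taken from the comment above is that the final command goes to the current members plus every node that was previously sent a {{LockControlCommand}}.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CompletionTargetsSketch {
    /**
     * Hypothetical helper: the nodes that should receive the final
     * PrepareCommand(1PC) or RollbackCommand. It is the union of the nodes
     * still in the current view and the nodes that were sent a lock request,
     * so a node that left the view after being locked is still told how the
     * transaction ended.
     */
    static Set<String> completionTargets(Collection<String> currentMembers,
                                         Collection<String> lockedNodes) {
        Set<String> targets = new LinkedHashSet<>(currentMembers);
        targets.addAll(lockedNodes);
        return targets;
    }

    public static void main(String[] args) {
        // Assumed scenario: C was sent the lock request, then dropped out of the view.
        List<String> partitionView = List.of("A", "B");
        List<String> lockTargets = List.of("B", "C");

        // Prints [A, B, C]: C is included even though it is not in the partition view.
        System.out.println(completionTargets(partitionView, lockTargets));
    }
}
```

With targets computed this way, a {{LockControlCommand}} that is retransmitted after the merge would be followed by a matching completion command, so the remote transaction should not have to wait for the timeout.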

> PessimisticTxPartitionAndMergeDuringRuntimeTest.testOriginatorIsolatedPartition random failures
> -----------------------------------------------------------------------------------------------
>
>                 Key: ISPN-6997
>                 URL: https://issues.jboss.org/browse/ISPN-6997
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core, Test Suite - Core
>    Affects Versions: 9.0.0.Alpha4
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>              Labels: testsuite_stability
>             Fix For: 9.1.0.Final
>
>
> The test starts with a cluster of 4 nodes, and splits it into 2 partitions while a transaction is trying to lock a key. After the transaction fails, the test checks that the transaction has been cleaned up properly.
> On one of the owners, {{transactionTable.cleanupLeaverTransactions}} is being called only before the split and after the merge, never with the list of members during the split. That means it never sees the transaction as an orphan, and doesn't remove it.
> {noformat}
> 15:16:18,893 TRACE (testng-PTPAMDRT:[]) [PTPAMDRT] Local tx=[], remote tx=[GlobalTx:PTPAMDRT-NodeI-3337:28616], for cache PTPAMDRT-NodeJ-27814 
> 15:16:18,893 ERROR (testng-PTPAMDRT:[]) [TestSuiteProgress] Test failed: org.infinispan.partitionhandling.PTPAMDRT.testOriginatorIsolatedPartition
> java.lang.AssertionError: There are pending transactions!
> 	at org.testng.AssertJUnit.fail(AssertJUnit.java:59) ~[testng-6.8.8.jar:?]
> 	at org.testng.AssertJUnit.assertTrue(AssertJUnit.java:24) ~[testng-6.8.8.jar:?]
> 	at org.infinispan.test.AbstractInfinispanTest.eventually(AbstractInfinispanTest.java:223) ~[test-classes/:?]
> 	at org.infinispan.test.AbstractInfinispanTest.eventually(AbstractInfinispanTest.java:519) ~[test-classes/:?]
> 	at org.infinispan.test.MultipleCacheManagersTest.assertNoTransactions(MultipleCacheManagersTest.java:794) ~[test-classes/:?]
> 	at org.infinispan.partitionhandling.BaseTxPartitionAndMergeTest.finalAsserts(BaseTxPartitionAndMergeTest.java:96) ~[test-classes/:?]
> 	at org.infinispan.partitionhandling.BasePessimisticTxPartitionAndMergeTest.doTest(BasePessimisticTxPartitionAndMergeTest.java:82) ~[test-classes/:?]
> 	at org.infinispan.partitionhandling.PessimisticTxPartitionAndMergeDuringRuntimeTest.testOriginatorIsolatedPartition(PessimisticTxPartitionAndMergeDuringRuntimeTest.java:33) ~[test-classes/:?]
> {noformat}
> {{OptimisticTxPartitionAndMergeDuringCommitTest.testPrimaryOwnerIsolatedPartition}} has similar random failures.
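
The orphan check described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions: the class, the {{removeOrphans}} method and the node names NodeK and NodeL are hypothetical, the membership of the split view is assumed, and the real {{transactionTable.cleanupLeaverTransactions}} may work differently. The sketch assumes a remote transaction is dropped only when its originator is missing from the member list passed to the cleanup.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrphanCleanupSketch {
    // Toy model: remote transaction id -> address of the node that originated it.
    private final Map<String, String> remoteTxOriginators = new HashMap<>();

    void registerRemoteTx(String txId, String originator) {
        remoteTxOriginators.put(txId, originator);
    }

    // Hypothetical stand-in for the leaver cleanup: drop every remote
    // transaction whose originator is missing from the given member list.
    void removeOrphans(Collection<String> members) {
        remoteTxOriginators.values().removeIf(originator -> !members.contains(originator));
    }

    int pendingRemoteTxCount() {
        return remoteTxOriginators.size();
    }

    public static void main(String[] args) {
        List<String> fullView = List.of("NodeI", "NodeJ", "NodeK", "NodeL");
        List<String> splitView = List.of("NodeJ", "NodeK"); // assumed partition without NodeI

        OrphanCleanupSketch owner = new OrphanCleanupSketch();
        owner.registerRemoteTx("tx-1", "NodeI");

        // Cleanup runs only with the view before the split and the view after the merge.
        owner.removeOrphans(fullView);
        owner.removeOrphans(fullView);
        System.out.println(owner.pendingRemoteTxCount()); // 1: the transaction is never seen as an orphan

        // Had the cleanup also run with the split view, the transaction would be removed.
        owner.removeOrphans(splitView);
        System.out.println(owner.pendingRemoteTxCount()); // 0
    }
}
```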
