[JBoss JIRA] (ISPN-9003) Clustered maxIdle expiration
by William Burns (JIRA)
[ https://issues.jboss.org/browse/ISPN-9003?page=com.atlassian.jira.plugin.... ]
William Burns edited comment on ISPN-9003 at 3/30/18 1:57 PM:
--------------------------------------------------------------
Talking with [~NadirX], we believe the best approach may be a consensus-based removal of maxIdle entries. That is, it would do the following:
1. A read occurs on a node where the entry's maxIdle has expired; that node sends a request to the primary owner (blocking the read)
2. Primary owner locks the key
3. Primary sends a message to all nodes asking if the entry is expired due to maxIdle -(also updates the updated time on all keys - just in case it isn't expired)-
4. Primary receives the responses and decides whether the entry is expired
5a. Primary responds to the reading node, telling it whether the entry was expired
5b. Primary removes the entry from all nodes if it was expired
6. Cluster-wide expiration occurs if necessary, allowing listeners to be invoked (including cluster listeners, which we didn't do before)
The above approach adds no overhead for reads when the entry is not present or not expired. The only overhead is when a node finds that an entry has expired via maxIdle and must then block to confirm whether it should be removed. However, the chance of this occurring seems fairly low, depending on what maxIdle was set to. If the expiration reaper finds the expired entry first, it will update its timestamps or remove it as needed.
Another side effect is that maxIdle may be refreshed for all entries transferred via state transfer, since I don't think the access time is currently replicated when state transfer occurs.
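For illustration, here is a minimal sketch of steps 2-6 as the primary owner might run them after locking the key. The {{ClusterView}} interface and every method on it are hypothetical, not real Infinispan APIs; they only model the proposed message flow.
{noformat}
// Hypothetical sketch of the consensus check run by the primary owner
// after it has locked the key (steps 2-6 above). None of these types
// exist in Infinispan; they only model the proposed message flow.
import java.util.List;

interface ClusterView<K> {
    /** Ask every owner of the key for its most recent access timestamp. */
    List<Long> lastAccessTimesFromAllOwners(K key);
    /** Push a refreshed access time to every owner ("touch"). */
    void touchOnAllOwners(K key, long accessTime);
    /** Remove the key on every owner and fire expiration listeners. */
    void removeExpiredOnAllOwners(K key);
}

final class MaxIdleConsensus<K> {
    private final ClusterView<K> cluster;
    private final long maxIdleMillis;

    MaxIdleConsensus(ClusterView<K> cluster, long maxIdleMillis) {
        this.cluster = cluster;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Returns true if the entry was idle-expired on every owner and has been removed. */
    boolean checkAndResolve(K key, long now) {
        // Step 3: ask all owners for their latest access time.
        List<Long> accessTimes = cluster.lastAccessTimesFromAllOwners(key);
        long newest = accessTimes.stream().mapToLong(Long::longValue).max().orElse(0L);

        if (now - newest < maxIdleMillis) {
            // Some owner saw a recent access: refresh everyone instead of expiring.
            cluster.touchOnAllOwners(key, newest);
            return false;
        }
        // Steps 5b/6: expired on every owner, so remove cluster wide.
        cluster.removeExpiredOnAllOwners(key);
        return true;
    }
}
{noformat}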
was (Author: william.burns):
Talking with [~NadirX] we believe the best approach may be a consensus based removal of maxIdle entries. That is it will do the following things
1. Read occurs on node where maxIdle is expired and sends request to primary owner (blocks the read)
2. Primary owner locks the key
3. Primary sends message to all nodes asking if entry is expired due to maxIdle (also updates the updated time on all keys - just in case it isn't expired)
4. Primary receives responses to decide if it is expired
5a. Primary responds to read node and tells them if it was expired
5b. Primary removes from all nodes if it was expired
6. Cluster wide expiration occurs if necessary allowing for listeners to be invoked (including cluster listeners - which we didn't do before)
The above approach gives no overhead for reads when the entry is not present and is not expired. The only overhead is if a node finds that the entry was expired via maxIdle and then must block to confirm if it should be removed. However the chance of this occurring seems a bit low depending on what maxIdle was set to. If the expiration reaper finds the expired entry first it will properly update its timestamps or remove as needed.
Another side effect of this is that maxIdle may be refreshed for all entries that are transferred via state transfer since I don't think the access time is currently replicated when state transfer occurs.
> Clustered maxIdle expiration
> ----------------------------
>
> Key: ISPN-9003
> URL: https://issues.jboss.org/browse/ISPN-9003
> Project: Infinispan
> Issue Type: Enhancement
> Reporter: Tristan Tarrant
> Assignee: William Burns
> Fix For: 9.3.0.Beta1, 9.3.0.Final
>
>
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8962) PreferAvailabilityStrategy: Rely less on the stable topology
by William Burns (JIRA)
[ https://issues.jboss.org/browse/ISPN-8962?page=com.atlassian.jira.plugin.... ]
William Burns updated ISPN-8962:
--------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> PreferAvailabilityStrategy: Rely less on the stable topology
> ------------------------------------------------------------
>
> Key: ISPN-8962
> URL: https://issues.jboss.org/browse/ISPN-8962
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 9.2.2.Final, 9.3.0.Final
>
>
> {{PreferAvailabilityStrategy}} checks the size of the stable topology, and only considers cache topologies that are derived from the biggest topology (in size) when picking a post-merge topology.
> Unfortunately, in some situations this algorithm fails pretty badly. If a node has a very long GC pause, when it comes back it will report the old topology *and* the old stable topology. If the rest of the cluster rebalanced, it now has both a smaller current topology and a smaller stable topology.
> Furthermore, the stable topology is updated asynchronously, independent from the current topology. So even if there's a split and the minority partition installs a current topology with fewer members, it may take some time for its stable topology to be updated with fewer members. In fact, it appears that when a rebalance is not needed (e.g. because the partition has a single node), the stable topology is never updated!
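A hedged illustration of the selection rule described above, using made-up types rather than the actual {{PreferAvailabilityStrategy}} code: picking the report with the largest stable topology means a node returning from a long GC pause, which still carries the old and larger stable topology, can win over the nodes that actually rebalanced.
{noformat}
// Made-up types; not the real PreferAvailabilityStrategy code. Picking the
// report with the largest stable topology lets a GC-paused node that still
// carries the old, larger stable topology win over freshly rebalanced nodes.
import java.util.Comparator;
import java.util.List;

record TopologyReport(List<String> currentMembers, List<String> stableMembers) {}

final class MergePolicySketch {
    static TopologyReport pickPostMergeTopology(List<TopologyReport> reports) {
        return reports.stream()
              .max(Comparator.comparingInt((TopologyReport r) -> r.stableMembers().size()))
              .orElseThrow();
    }
}
{noformat}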
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8993) LocalTopologyManager should not install a reset topology before conflict resolution
by William Burns (JIRA)
[ https://issues.jboss.org/browse/ISPN-8993?page=com.atlassian.jira.plugin.... ]
William Burns updated ISPN-8993:
--------------------------------
Status: Resolved (was: Pull Request Sent)
Fix Version/s: 9.3.0.Alpha1
Resolution: Done
> LocalTopologyManager should not install a reset topology before conflict resolution
> -----------------------------------------------------------------------------------
>
> Key: ISPN-8993
> URL: https://issues.jboss.org/browse/ISPN-8993
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 9.2.2.Final, 9.3.0.Alpha1
>
>
> When the coordinator changes, either because the old one left or because there was a merge, the new coordinator first resets any ongoing rebalance, and then starts a rebalance with the new members. But some nodes could see the post-merge rebalance before the post-merge reset, and they need to reset any old data or inbound transfers first.
> [LocalTopologyManagerImpl|https://github.com/infinispan/infinispan/blob/22...] is responsible for faking that reset topology update, in order to keep StateConsumerImpl simpler. The problem is that LocalTopologyManagerImpl also creates a reset topology update if the new coordinator started conflict resolution, and that topology update clears all the segments not owned by the local node in the preferred topology.
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8962) PreferAvailabilityStrategy: Rely less on the stable topology
by William Burns (JIRA)
[ https://issues.jboss.org/browse/ISPN-8962?page=com.atlassian.jira.plugin.... ]
William Burns updated ISPN-8962:
--------------------------------
Fix Version/s: 9.3.0.Alpha1
(was: 9.3.0.Final)
> PreferAvailabilityStrategy: Rely less on the stable topology
> ------------------------------------------------------------
>
> Key: ISPN-8962
> URL: https://issues.jboss.org/browse/ISPN-8962
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 9.2.2.Final, 9.3.0.Alpha1
>
>
> {{PreferAvailabilityStrategy}} checks the size of the stable topology, and only considers cache topologies that are derived from the biggest topology (in size) when picking a post-merge topology.
> Unfortunately, in some situations this algorithm fails pretty badly. If a node has a very long GC pause, when it comes back it will report the old topology *and* the old stable topology. If the rest of the cluster rebalanced, it now has both a smaller current topology and a smaller stable topology.
> Furthermore, the stable topology is updated asynchronously, independent from the current topology. So even if there's a split and the minority partition installs a current topology with fewer members, it may take some time for its stable topology to be updated with fewer members. In fact, it appears that when a rebalance is not needed (e.g. because the partition has a single node), the stable topology is never updated!
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8728) ExceptionEvictionTest.testSizeCorrectWithStateTransfer random failures
by William Burns (JIRA)
[ https://issues.jboss.org/browse/ISPN-8728?page=com.atlassian.jira.plugin.... ]
William Burns updated ISPN-8728:
--------------------------------
Status: Pull Request Sent (was: Open)
> ExceptionEvictionTest.testSizeCorrectWithStateTransfer random failures
> ----------------------------------------------------------------------
>
> Key: ISPN-8728
> URL: https://issues.jboss.org/browse/ISPN-8728
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.2.0.CR1
> Reporter: Dan Berindei
> Assignee: William Burns
> Labels: testsuite_stability
> Fix For: 9.2.2.Final, 9.3.0.Final
>
> Attachments: ExceptionEvictionTest_20180129.log.gz, ExceptionEvictionTest_ISPN-8962_preferavailabilitystrategy_20180328.log.gz
>
>
> {noformat}
> 15:10:01,610 ERROR (testng-Test:[]) [TestSuiteProgress] Test failed: org.infinispan.eviction.impl.ExceptionEvictionTest.testSizeCorrectWithStateTransfer[DIST_SYNC, nodeCount=3, storageType=BINARY, optimisticTransaction=true]
> java.lang.AssertionError: expected:<1920> but was:<1984>
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:170) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:177) ~[testng-6.8.8.jar:?]
> at org.infinispan.eviction.impl.ExceptionEvictionTest.assertInterceptorCount(ExceptionEvictionTest.java:252) ~[test-classes/:?]
> at org.infinispan.eviction.impl.ExceptionEvictionTest.testSizeCorrectWithStateTransfer(ExceptionEvictionTest.java:600) ~[test-classes/:?]
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8728) ExceptionEvictionTest.testSizeCorrectWithStateTransfer random failures
by William Burns (JIRA)
[ https://issues.jboss.org/browse/ISPN-8728?page=com.atlassian.jira.plugin.... ]
William Burns updated ISPN-8728:
--------------------------------
Fix Version/s: 9.2.2.Final
> ExceptionEvictionTest.testSizeCorrectWithStateTransfer random failures
> ----------------------------------------------------------------------
>
> Key: ISPN-8728
> URL: https://issues.jboss.org/browse/ISPN-8728
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.2.0.CR1
> Reporter: Dan Berindei
> Assignee: William Burns
> Labels: testsuite_stability
> Fix For: 9.2.2.Final, 9.3.0.Final
>
> Attachments: ExceptionEvictionTest_20180129.log.gz, ExceptionEvictionTest_ISPN-8962_preferavailabilitystrategy_20180328.log.gz
>
>
> {noformat}
> 15:10:01,610 ERROR (testng-Test:[]) [TestSuiteProgress] Test failed: org.infinispan.eviction.impl.ExceptionEvictionTest.testSizeCorrectWithStateTransfer[DIST_SYNC, nodeCount=3, storageType=BINARY, optimisticTransaction=true]
> java.lang.AssertionError: expected:<1920> but was:<1984>
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:170) ~[testng-6.8.8.jar:?]
> at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:177) ~[testng-6.8.8.jar:?]
> at org.infinispan.eviction.impl.ExceptionEvictionTest.assertInterceptorCount(ExceptionEvictionTest.java:252) ~[test-classes/:?]
> at org.infinispan.eviction.impl.ExceptionEvictionTest.testSizeCorrectWithStateTransfer(ExceptionEvictionTest.java:600) ~[test-classes/:?]
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-6827) ReplTotalOrderVersionedStateTransferTest.testStateTransfer random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-6827?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-6827:
-------------------------------
Fix Version/s: 9.2.2.Final
9.3.0.Alpha1
(was: 9.3.0.Final)
> ReplTotalOrderVersionedStateTransferTest.testStateTransfer random failures
> ---------------------------------------------------------------------------
>
> Key: ISPN-6827
> URL: https://issues.jboss.org/browse/ISPN-6827
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.0.0.Alpha2
> Reporter: Dan Berindei
> Assignee: Pedro Ruivo
> Priority: Critical
> Labels: testsuite_stability
> Fix For: 9.2.2.Final, 9.3.0.Alpha1
>
> Attachments: ReplTotalOrderVersionedStateTransferTest_pr_rvansa_ISPN-5989_20160314.log.zip
>
>
> {noformat}
> java.lang.RuntimeException: Timed out waiting for rebalancing to complete on node ReplTotalOrderVersionedStateTransferTest-NodeB-17608, expected member list is [ReplTotalOrderVersionedStateTransferTest-NodeB-17608, ReplTotalOrderVersionedStateTransferTest-NodeC-39826], current member list is [ReplTotalOrderVersionedStateTransferTest-NodeA-54647, ReplTotalOrderVersionedStateTransferTest-NodeB-17608, ReplTotalOrderVersionedStateTransferTest-NodeC-39826]!
> at org.infinispan.test.TestingUtil.waitForRehashToComplete(TestingUtil.java:267)
> at org.infinispan.test.TestingUtil.waitForRehashToComplete(TestingUtil.java:277)
> at org.infinispan.container.versioning.VersionedReplStateTransferTest.testStateTransfer(VersionedReplStateTransferTest.java:74)
> {noformat}
> http://ci.infinispan.org/project.html?tab=testDetails&testNameId=-7264982...
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-6827) ReplTotalOrderVersionedStateTransferTest.testStateTransfer random failures
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-6827?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-6827:
-------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> ReplTotalOrderVersionedStateTransferTest.testStateTransfer random failures
> ---------------------------------------------------------------------------
>
> Key: ISPN-6827
> URL: https://issues.jboss.org/browse/ISPN-6827
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.0.0.Alpha2
> Reporter: Dan Berindei
> Assignee: Pedro Ruivo
> Priority: Critical
> Labels: testsuite_stability
> Fix For: 9.3.0.Final
>
> Attachments: ReplTotalOrderVersionedStateTransferTest_pr_rvansa_ISPN-5989_20160314.log.zip
>
>
> {noformat}
> java.lang.RuntimeException: Timed out waiting for rebalancing to complete on node ReplTotalOrderVersionedStateTransferTest-NodeB-17608, expected member list is [ReplTotalOrderVersionedStateTransferTest-NodeB-17608, ReplTotalOrderVersionedStateTransferTest-NodeC-39826], current member list is [ReplTotalOrderVersionedStateTransferTest-NodeA-54647, ReplTotalOrderVersionedStateTransferTest-NodeB-17608, ReplTotalOrderVersionedStateTransferTest-NodeC-39826]!
> at org.infinispan.test.TestingUtil.waitForRehashToComplete(TestingUtil.java:267)
> at org.infinispan.test.TestingUtil.waitForRehashToComplete(TestingUtil.java:277)
> at org.infinispan.container.versioning.VersionedReplStateTransferTest.testStateTransfer(VersionedReplStateTransferTest.java:74)
> {noformat}
> http://ci.infinispan.org/project.html?tab=testDetails&testNameId=-7264982...
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-9028) Write-only segments should be invalidated during the READ_NEW phase
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-9028?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-9028:
-------------------------------
Status: Open (was: New)
> Write-only segments should be invalidated during the READ_NEW phase
> -------------------------------------------------------------------
>
> Key: ISPN-9028
> URL: https://issues.jboss.org/browse/ISPN-9028
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_stability
> Fix For: 9.3.0.Alpha1
>
>
> When a rebalance removes a segment X from node A, node A keeps updating entries in segment X until the rebalance finishes, and only deletes the entries of segment X after entering the NO_REBALANCE phase.
> This is problematic for tests that work with the data container directly, because {{waitForNoRebalance()}} doesn't wait for the removal of stale entries. The test will work without an explicit wait most of the time, so this is a recipe for random test failures (e.g. ISPN-8728).
> As described in ISPN-5021, we can prevent any writes to segment X at the start of the READ_NEW_WRITE_ALL phase, send the phase confirmation to the coordinator, and then remove the entries asynchronously. We just need to keep track of the removal task and only install/confirm the NO_REBALANCE phase once all the entries that we don't own have been removed.
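A rough sketch of the proposed sequencing, using invented interfaces rather than Infinispan's real state-transfer API; it only illustrates blocking writes to the lost segments, confirming READ_NEW_WRITE_ALL immediately, removing stale entries asynchronously, and gating the NO_REBALANCE confirmation on that removal task.
{noformat}
// Invented interfaces, not Infinispan's real state-transfer API; this only
// shows the ordering proposed above.
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class WriteOnlySegmentInvalidation {
    interface SegmentStore {
        void blockWrites(Set<Integer> segments);
        void removeEntries(Set<Integer> segments);
    }
    interface Coordinator {
        void confirmPhase(String phase);
    }

    private volatile CompletableFuture<Void> staleRemoval =
          CompletableFuture.completedFuture(null);

    void onReadNewWriteAll(Set<Integer> lostSegments, SegmentStore store,
                           Coordinator coordinator, Executor executor) {
        store.blockWrites(lostSegments);                 // no further writes to lost segments
        coordinator.confirmPhase("READ_NEW_WRITE_ALL");  // confirm without waiting for removal
        // Remove the stale entries in the background and remember the task.
        staleRemoval = CompletableFuture.runAsync(() -> store.removeEntries(lostSegments), executor);
    }

    void onNoRebalance(Coordinator coordinator) {
        // Only confirm the final phase once the stale entries are actually gone.
        staleRemoval.thenRun(() -> coordinator.confirmPhase("NO_REBALANCE"));
    }
}
{noformat}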
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-9028) Write-only segments should be invalidated during the READ_NEW phase
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-9028?page=com.atlassian.jira.plugin.... ]
Dan Berindei reassigned ISPN-9028:
----------------------------------
Assignee: Dan Berindei
> Write-only segments should be invalidated during the READ_NEW phase
> -------------------------------------------------------------------
>
> Key: ISPN-9028
> URL: https://issues.jboss.org/browse/ISPN-9028
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_stability
> Fix For: 9.3.0.Alpha1
>
>
> When a rebalance removes a segment X from node A, node A keeps updating entries in segment X until the rebalance finishes, and only deletes the entries of segment X after entering the NO_REBALANCE phase.
> This is problematic for tests that work with the data container directly, because {{waitForNoRebalance()}} doesn't wait for the removal of stale entries. The test will work without an explicit wait most of the time, so this is a recipe for random test failures (e.g. ISPN-8728).
> As described in ISPN-5021, we can prevent any writes to segment X at the start of the READ_NEW_WRITE_ALL phase, send the phase confirmation to the coordinator, and then remove the entries asynchronously. We just need to keep track of the removal task and only install/confirm the NO_REBALANCE phase once all the entries that we don't own have been removed.
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)