[JBoss JIRA] (ISPN-4777) Replace command not atomic in REPL_SYNC cache mode
by Anuj Shah (JIRA)
[ https://issues.jboss.org/browse/ISPN-4777?page=com.atlassian.jira.plugin.... ]
Anuj Shah updated ISPN-4777:
----------------------------
Affects Version/s: 5.2.6.Final
> Replace command not atomic in REPL_SYNC cache mode
> --------------------------------------------------
>
> Key: ISPN-4777
> URL: https://issues.jboss.org/browse/ISPN-4777
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 5.2.6.Final
> Reporter: Anuj Shah
> Assignee: Mircea Markus
> Attachments: ReaderLockerTest.java
>
>
> This problem was discovered using the Lucene InfinispanDirectory with DistributedSegmentReadLocker. We found after a while of production usage that some Lucene files were randomly removed from the caches, but remained in the file listing entry, which resulted in an unusable index.
> We managed to replicate the problem in a test that acquires and releases read lock concurrently and checks for file deletion. We found this fails quickly when using REPL_SYNC mode, but runs for a while with DIST_SYNC.
> Some extra logging indicated that the replace command used to increment the lock counter across multiple cluster members, results in an single increment when called concurrently, with both calls reporting success. This eventually causes the file deletion, as we have now mis-counted the number of readers. We also observed the opposite effect of the counter only decrementing by one when releasing.
> Our conclusion is that the replace command fails atomicity when in REPL_SYNC mode, but works in other modes, we tried DIST_SYNC, DIST_ASYNC and REPL_ASYNC.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4777) Replace command not atomic in REPL_SYNC cache mode
by Anuj Shah (JIRA)
[ https://issues.jboss.org/browse/ISPN-4777?page=com.atlassian.jira.plugin.... ]
Anuj Shah updated ISPN-4777:
----------------------------
Fix Version/s: (was: 5.2.6.Final)
> Replace command not atomic in REPL_SYNC cache mode
> --------------------------------------------------
>
> Key: ISPN-4777
> URL: https://issues.jboss.org/browse/ISPN-4777
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 5.2.6.Final
> Reporter: Anuj Shah
> Assignee: Mircea Markus
> Attachments: ReaderLockerTest.java
>
>
> This problem was discovered using the Lucene InfinispanDirectory with DistributedSegmentReadLocker. We found after a while of production usage that some Lucene files were randomly removed from the caches, but remained in the file listing entry, which resulted in an unusable index.
> We managed to replicate the problem in a test that acquires and releases read lock concurrently and checks for file deletion. We found this fails quickly when using REPL_SYNC mode, but runs for a while with DIST_SYNC.
> Some extra logging indicated that the replace command used to increment the lock counter across multiple cluster members, results in an single increment when called concurrently, with both calls reporting success. This eventually causes the file deletion, as we have now mis-counted the number of readers. We also observed the opposite effect of the counter only decrementing by one when releasing.
> Our conclusion is that the replace command fails atomicity when in REPL_SYNC mode, but works in other modes, we tried DIST_SYNC, DIST_ASYNC and REPL_ASYNC.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4777) Replace command not atomic in REPL_SYNC cache mode
by Anuj Shah (JIRA)
[ https://issues.jboss.org/browse/ISPN-4777?page=com.atlassian.jira.plugin.... ]
Anuj Shah updated ISPN-4777:
----------------------------
Attachment: ReaderLockerTest.java
Attached the test we used to determine the problem.
> Replace command not atomic in REPL_SYNC cache mode
> --------------------------------------------------
>
> Key: ISPN-4777
> URL: https://issues.jboss.org/browse/ISPN-4777
> Project: Infinispan
> Issue Type: Bug
> Reporter: Anuj Shah
> Assignee: Mircea Markus
> Fix For: 5.2.6.Final
>
> Attachments: ReaderLockerTest.java
>
>
> This problem was discovered using the Lucene InfinispanDirectory with DistributedSegmentReadLocker. We found after a while of production usage that some Lucene files were randomly removed from the caches, but remained in the file listing entry, which resulted in an unusable index.
> We managed to replicate the problem in a test that acquires and releases read lock concurrently and checks for file deletion. We found this fails quickly when using REPL_SYNC mode, but runs for a while with DIST_SYNC.
> Some extra logging indicated that the replace command used to increment the lock counter across multiple cluster members, results in an single increment when called concurrently, with both calls reporting success. This eventually causes the file deletion, as we have now mis-counted the number of readers. We also observed the opposite effect of the counter only decrementing by one when releasing.
> Our conclusion is that the replace command fails atomicity when in REPL_SYNC mode, but works in other modes, we tried DIST_SYNC, DIST_ASYNC and REPL_ASYNC.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4777) Replace command not atomic in REPL_SYNC cache mode
by Anuj Shah (JIRA)
Anuj Shah created ISPN-4777:
-------------------------------
Summary: Replace command not atomic in REPL_SYNC cache mode
Key: ISPN-4777
URL: https://issues.jboss.org/browse/ISPN-4777
Project: Infinispan
Issue Type: Bug
Reporter: Anuj Shah
Assignee: Mircea Markus
Fix For: 5.2.6.Final
This problem was discovered using the Lucene InfinispanDirectory with DistributedSegmentReadLocker. We found after a while of production usage that some Lucene files were randomly removed from the caches, but remained in the file listing entry, which resulted in an unusable index.
We managed to replicate the problem in a test that acquires and releases read lock concurrently and checks for file deletion. We found this fails quickly when using REPL_SYNC mode, but runs for a while with DIST_SYNC.
Some extra logging indicated that the replace command used to increment the lock counter across multiple cluster members, results in an single increment when called concurrently, with both calls reporting success. This eventually causes the file deletion, as we have now mis-counted the number of readers. We also observed the opposite effect of the counter only decrementing by one when releasing.
Our conclusion is that the replace command fails atomicity when in REPL_SYNC mode, but works in other modes, we tried DIST_SYNC, DIST_ASYNC and REPL_ASYNC.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4776) The topology id for the merged cache topology is not always bigger than all the partition topology ids
by Dan Berindei (JIRA)
Dan Berindei created ISPN-4776:
----------------------------------
Summary: The topology id for the merged cache topology is not always bigger than all the partition topology ids
Key: ISPN-4776
URL: https://issues.jboss.org/browse/ISPN-4776
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 7.0.0.Beta2
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Blocker
Fix For: 7.0.0.CR1
With the ISPN-4574 fix, I changed the merge algorithm to pick the partition with the most members (both in the _stable_ topology and in the _current_ topology) instead of the partition with the highest topology id.
However, the biggest topology is not necessarily the partition with the highest topology id, so it's possible that some nodes will ignore the merged topology because they already have a higher topology installed. This happened once in ClusterTopologyManagerTest.testClusterRecoveryAfterThreeWaySplit:
{noformat}
00:24:59,286 DEBUG (transport-thread-NodeL-p33097-t6:) [ClusterCacheStatus] Recovered 3 partition(s) for cache cache: [CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns = 60, owners = (1)[, NodeL-25322: 60+0]}, pendingCH=null, unionCH=null}, CacheTopology{id=6, rebalanceId=3, currentCH=DefaultConsistentHash{ns = 60, owners = (2)[, NodeL-25322: 30+10, NodeN-6727: 30+10]}, pendingCH=DefaultConsistentHash{ns = 60, owners = (2)[, NodeL-25322: 30+30, NodeN-6727: 30+30]}, unionCH=null}, CacheTopology{id=5, rebalanceId=2, currentCH=DefaultConsistentHash{ns = 60, owners = (1)[, NodeM-12972: 60+0]}, pendingCH=null, unionCH=null}]
00:24:59,287 DEBUG (transport-thread-NodeL-p33097-t6:) [ClusterCacheStatus] Updating topologies after merge for cache cache, current topology = CacheTopology{id=5, rebalanceId=2, currentCH=DefaultConsistentHash{ns = 60, owners = (1)[, NodeM-12972: 60+0]}, pendingCH=null, unionCH=null}, stable topology = CacheTopology{id=4, rebalanceId=2, currentCH=DefaultConsistentHash{ns = 60, owners = (3)[, NodeL-25322: 20+20, NodeM-12972: 20+20, NodeN-6727: 20+20]}, pendingCH=null, unionCH=null}, availability mode = null
00:24:59,287 DEBUG (transport-thread-NodeL-p33097-t6:) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache cache, topology = CacheTopology{id=5, rebalanceId=2, currentCH=DefaultConsistentHash{ns = 60, owners = (1)[, NodeM-12972: 60+0]}, pendingCH=null, unionCH=null}, availability mode = null
00:24:59,288 TRACE (transport-thread-NodeL-p33097-t3:) [LocalTopologyManagerImpl] Ignoring consistent hash update for cache cache, current topology is 8: CacheTopology{id=5, rebalanceId=2, currentCH=DefaultConsistentHash{ns = 60, owners = (1)[, NodeM-12972: 60+0]}, pendingCH=null, unionCH=null}
{noformat}
Failure logs here: http://ci.infinispan.org/viewLog.html?buildId=12364&buildTypeId=Infinispa...
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4773) Clean up distribution packaging
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-4773?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-4773:
----------------------------------
Status: Pull Request Sent (was: Open)
Git Pull Request: https://github.com/infinispan/infinispan/pull/2907
> Clean up distribution packaging
> -------------------------------
>
> Key: ISPN-4773
> URL: https://issues.jboss.org/browse/ISPN-4773
> Project: Infinispan
> Issue Type: Enhancement
> Components: Build process
> Affects Versions: 7.0.0.Beta2
> Reporter: Tristan Tarrant
> Assignee: Tristan Tarrant
> Fix For: 7.0.0.CR1
>
>
> The distribution build process is currently very messy:
> - it is based on overly complex assembly files
> - the layout of the distribution packages is not ideal
> - some of the documentation is not obviously accessible (jmx, xsd)
> - since the javadoc plugin is configured to use aggregate builds, it cannot be run without building the entire tree
> The solution:
> - move the distribution logic to its own module
> - use ant instead of assembly descriptors to build the distribution packages
> - improve the layout, include the missing docs and remove the aging cruft
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months
[JBoss JIRA] (ISPN-4773) Clean up distribution packaging
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-4773?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-4773:
----------------------------------
Status: Open (was: New)
> Clean up distribution packaging
> -------------------------------
>
> Key: ISPN-4773
> URL: https://issues.jboss.org/browse/ISPN-4773
> Project: Infinispan
> Issue Type: Enhancement
> Components: Build process
> Affects Versions: 7.0.0.Beta2
> Reporter: Tristan Tarrant
> Assignee: Tristan Tarrant
> Fix For: 7.0.0.CR1
>
>
> The distribution build process is currently very messy:
> - it is based on overly complex assembly files
> - the layout of the distribution packages is not ideal
> - some of the documentation is not obviously accessible (jmx, xsd)
> - since the javadoc plugin is configured to use aggregate builds, it cannot be run without building the entire tree
> The solution:
> - move the distribution logic to its own module
> - use ant instead of assembly descriptors to build the distribution packages
> - improve the layout, include the missing docs and remove the aging cruft
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
9 years, 7 months