[JBoss JIRA] (ISPN-8241) Refactor RocksDB clearThreshold
by Ryan Emerson (JIRA)
[ https://issues.jboss.org/browse/ISPN-8241?page=com.atlassian.jira.plugin.... ]
Ryan Emerson updated ISPN-8241:
-------------------------------
Affects Version/s: 9.1.0.Final
> Refactor RocksDB clearThreshold
> -------------------------------
>
> Key: ISPN-8241
> URL: https://issues.jboss.org/browse/ISPN-8241
> Project: Infinispan
> Issue Type: Sub-task
> Components: Loaders and Stores
> Affects Versions: 9.1.0.Final
> Reporter: Ryan Emerson
> Assignee: Ryan Emerson
> Fix For: 9.2.0.Final
>
>
> Currently the RocksDB store utilises a "clearThreshold" to try to delete entries individually before deleting and re-initialising the database. We should deprecate this threshold and always delete/re-initialise the database.
> Currently, when deleting the database, we utilise Util.recursiveFileRemove, which does not confirm that the files have actually been deleted. Instead, we should provide an NIO-based implementation, similar to the one described [here|https://stackoverflow.com/questions/779519/delete-directories-recurs...]. This has the advantage that java.nio.file.Files::delete throws an IOException if a file cannot be removed.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-8241) Refactor RocksDB clearThreshold
by Ryan Emerson (JIRA)
[ https://issues.jboss.org/browse/ISPN-8241?page=com.atlassian.jira.plugin.... ]
Ryan Emerson updated ISPN-8241:
-------------------------------
Component/s: Loaders and Stores
> Refactor RocksDB clearThreshold
> -------------------------------
>
> Key: ISPN-8241
> URL: https://issues.jboss.org/browse/ISPN-8241
> Project: Infinispan
> Issue Type: Sub-task
> Components: Loaders and Stores
> Affects Versions: 9.1.0.Final
> Reporter: Ryan Emerson
> Assignee: Ryan Emerson
> Fix For: 9.2.0.Final
>
>
> Currently the RocksDB store utilises a "clearThreshold" to try to delete entries individually before deleting and re-initialising the database. We should deprecate this threshold and always delete/re-initialise the database.
> Currently, when deleting the database, we utilise Util.recursiveFileRemove, which does not confirm that the files have actually been deleted. Instead, we should provide an NIO-based implementation, similar to the one described [here|https://stackoverflow.com/questions/779519/delete-directories-recurs...]. This has the advantage that java.nio.file.Files::delete throws an IOException if a file cannot be removed.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-8241) Refactor RocksDB clearThreshold
by Ryan Emerson (JIRA)
Ryan Emerson created ISPN-8241:
----------------------------------
Summary: Refactor RocksDB clearThreshold
Key: ISPN-8241
URL: https://issues.jboss.org/browse/ISPN-8241
Project: Infinispan
Issue Type: Sub-task
Reporter: Ryan Emerson
Assignee: Ryan Emerson
Currently the RocksDB store utilises a "clearThreshold" to try to delete entries individually before deleting and re-initialising the database. We should deprecate this threshold and always delete/re-initialise the database.
Currently, when deleting the database, we utilise Util.recursiveFileRemove, which does not confirm that the files have actually been deleted. Instead, we should provide an NIO-based implementation, similar to the one described [here|https://stackoverflow.com/questions/779519/delete-directories-recurs...]. This has the advantage that java.nio.file.Files::delete throws an IOException if a file cannot be removed.
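For illustration, a minimal NIO-based sketch of the kind of recursive delete described above, built only on JDK APIs (class and method names are illustrative, not the actual Infinispan implementation):
{code}
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Sketch only: deletes a directory tree and surfaces any failure as an
// IOException via java.nio.file.Files::delete, instead of silently ignoring it.
public final class RecursiveDelete {

   private RecursiveDelete() {
   }

   public static void deleteRecursively(Path root) throws IOException {
      if (!Files.exists(root)) {
         return;
      }
      Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
         @Override
         public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            // Throws IOException if the file cannot be deleted.
            Files.delete(file);
            return FileVisitResult.CONTINUE;
         }

         @Override
         public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
            if (exc != null) {
               throw exc;
            }
            // The directory is empty at this point, so it can be removed.
            Files.delete(dir);
            return FileVisitResult.CONTINUE;
         }
      });
   }
}
{code}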
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-8186) DistTopologyChangeUnderLoadTest.testPutsSucceedWhileTopologyChanges sometimes fails with magic error
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-8186?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-8186:
----------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> DistTopologyChangeUnderLoadTest.testPutsSucceedWhileTopologyChanges sometimes fails with magic error
> ----------------------------------------------------------------------------------------------------
>
> Key: ISPN-8186
> URL: https://issues.jboss.org/browse/ISPN-8186
> Project: Infinispan
> Issue Type: Bug
> Components: Remote Protocols
> Affects Versions: 9.1.0.Final
> Reporter: Tristan Tarrant
> Assignee: Galder Zamarreño
> Fix For: 9.1.1.Final
>
>
> Happens sometimes on CI:
> DistTopologyChangeUnderLoadTest.testPutsSucceedWhileTopologyChanges
> {code}
> org.infinispan.client.hotrod.exceptions.InvalidResponseException::
> Invalid magic number. Expected 0xa1 and received 0xdb at
> org.infinispan.client.hotrod.impl.protocol.Codec20.readMagic(Codec20.java:333) at
> org.infinispan.client.hotrod.impl.protocol.Codec20.readHeader(Codec20.java:135) at
> org.infinispan.client.hotrod.impl.operations.HotRodOperation.readHeaderAndValidate(HotRodOperation.java:60) at
> org.infinispan.client.hotrod.impl.operations.AbstractKeyValueOperation.sendPutOperation(AbstractKeyValueOperation.java:58) at
> org.infinispan.client.hotrod.impl.operations.PutOperation.executeOperation(PutOperation.java:34) at
> org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:56) at
> org.infinispan.client.hotrod.impl.RemoteCacheImpl.put(RemoteCacheImpl.java:268) at
> org.infinispan.client.hotrod.impl.RemoteCacheSupport.put(RemoteCacheSupport.java:77) at
> org.infinispan.client.hotrod.DistTopologyChangeUnderLoadTest.testPutsSucceedWhileTopologyChanges(DistTopologyChangeUnderLoadTest.java:57) at
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at
> java.lang.Thread.run(Thread.java:748) ... Removed 16 stack frames
> {code}
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-8240) Coordinator sends REBALANCE_START command when there is already a rebalance in progress
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-8240?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-8240:
------------------------------------
One side-effect is that nodes now confirm the rebalance after every leave, leading to lots of error messages like this one:
{noformat}
09:50:05,569 WARN (remote-thread-test-NodeA-p2-t2:[dist]) [CacheTopologyControlCommand] ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=dist, type=REBALANCE_PHASE_CONFIRM, sender=test-NodeC-41478, joinInfo=null, topologyId=16, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=4}
org.infinispan.commons.CacheException: Received invalid rebalance confirmation from test-NodeC-41478 for cache dist, expecting topology id 17 but got 16
at org.infinispan.topology.RebalanceConfirmationCollector.confirmPhase(RebalanceConfirmationCollector.java:41) ~[classes/:?]
at org.infinispan.topology.ClusterCacheStatus.confirmRebalancePhase(ClusterCacheStatus.java:337) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.handleRebalancePhaseConfirm(ClusterTopologyManagerImpl.java:274) ~[classes/:?]
at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:189) ~[classes/:?]
at org.infinispan.topology.CacheTopologyControlCommand.invokeAsync(CacheTopologyControlCommand.java:166) ~[classes/:?]
at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.invokeReplicableCommand(GlobalInboundInvocationHandler.java:174) ~[classes/:?]
at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.runReplicableCommand(GlobalInboundInvocationHandler.java:155) ~[classes/:?]
at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.lambda$handleReplicableCommand$1(GlobalInboundInvocationHandler.java:149) ~[classes/:?]
at org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:203) [classes/:?]
{noformat}
> Coordinator sends REBALANCE_START command when there is already a rebalance in progress
> ---------------------------------------------------------------------------------------
>
> Key: ISPN-8240
> URL: https://issues.jboss.org/browse/ISPN-8240
> Project: Infinispan
> Issue Type: Bug
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Minor
>
> Normally the {{REBALANCE_START}} command should only be sent at the start of a rebalance, and any topology updates sent before all the nodes confirm the rebalance phase should have {{CH_UPDATE}}.
> Since the change to 4 phases, this is no longer true: {{ClusterCacheStatus.updateTopologyMembers}} first clears the {{RebalanceConfirmationCollector}}, then broadcasts a {{CH_UPDATE}}. {{queueRebalance}} then immediately creates a new {{RCC}} and broadcasts a {{REBALANCE_START}}, instead of waiting for the current rebalance to finish.
> I propose we remove {{REBALANCE_START}}, as it was just a crude version of {{CacheTopology.Phase}}. We should also remove the {{isRebalance}} parameter from {{StateConsumerImpl.onTopologyUpdate()}}.
> I'm still not sure if rebalancing the pending CH immediately is ok. On the one hand, I would like the rebalance to finish with {{updateMembers(union(currentCH, pendingCH))}} as the new pending CH, so that segments that were already transferred keep an extra copy. On the other hand, that would only help for segments that have at least one owner in the current CH: if the current CH has 0 owners and {{updateMembers}} allocates new ones, those new owners won't request data from the pending CH owners anyway. Fixing that case would require the coordinator to fetch the transfer status from all the nodes before removing a node from the topology. But if the coordinator knew exactly which segments were transferred, it could finish the rebalance immediately and start a new one -- so it would be more similar to the current approach.
> Note: the {{SyncConsistentHashFactory}} allocation is not 100% stable, even when nodes are not added, so A ∈ owners(segment) in topology ABCD does not guarantee that A ∈ owners(segment) in topology ABC. But it should be good enough to keep A an owner in 90% of the cases.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-8240) Coordinator sends REBALANCE_START command when there is already a rebalance in progress
by Dan Berindei (JIRA)
Dan Berindei created ISPN-8240:
----------------------------------
Summary: Coordinator sends REBALANCE_START command when there is already a rebalance in progress
Key: ISPN-8240
URL: https://issues.jboss.org/browse/ISPN-8240
Project: Infinispan
Issue Type: Bug
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Minor
Normally the {{REBALANCE_START}} command should only be sent at the start of a rebalance, and any topology updates sent before all the nodes confirm the rebalance phase should have {{CH_UPDATE}}.
Since the change to 4 phases, this is no longer true: {{ClusterCacheStatus.updateTopologyMembers}} first clears the {{RebalanceConfirmationCollector}}, then broadcasts a {{CH_UPDATE}}. {{queueRebalance}} then immediately creates a new {{RCC}} and broadcasts a {{REBALANCE_START}}, instead of waiting for the current rebalance to finish.
I propose we remove {{REBALANCE_START}}, as it was just a crude version of {{CacheTopology.Phase}}. We should also remove the {{isRebalance}} parameter from {{StateConsumerImpl.onTopologyUpdate()}}.
I'm still not sure if rebalancing the pending CH immediately is ok. On the one hand, I would like the rebalance to finish with {{updateMembers(union(currentCH, pendingCH))}} as the new pending CH, so that segments that were already transferred keep an extra copy. On the other hand, that would only help for segments that have at least one owner in the current CH: if the current CH has 0 owners and {{updateMembers}} allocates new ones, those new owners won't request data from the pending CH owners anyway. Fixing that case would require the coordinator to fetch the transfer status from all the nodes before removing a node from the topology. But if the coordinator knew exactly which segments were transferred, it could finish the rebalance immediately and start a new one -- so it would be more similar to the current approach.
Note: the {{SyncConsistentHashFactory}} allocation is not 100% stable, even when nodes are not added, so A ∈ owners(segment) in topology ABCD does not guarantee that A ∈ owners(segment) in topology ABC. But it should be good enough to keep A an owner in 90% of the cases.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-7358) Hot Rod server not dealing with pipe requests properly
by Karl von Randow (JIRA)
[ https://issues.jboss.org/browse/ISPN-7358?page=com.atlassian.jira.plugin.... ]
Karl von Randow commented on ISPN-7358:
---------------------------------------
We are seeing something that might be related to this while using Infinispan 8.2.6 as a remote cache. We're unable to upgrade to Infinispan 9.1 because we also use Infinispan as the Hibernate 2LC, and it appears the Hot Rod connector cannot be upgraded independently (class conflicts at runtime).
We are receiving exceptions like:
{code}
org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=457992 returned server error (status=0x81): org.infinispan.server.hotrod.InvalidMagicIdException: Error reading magic byte or message id: 3
org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=3858 returned server error (status=0x81): org.infinispan.server.hotrod.InvalidMagicIdException: Error reading magic byte or message id: 180
{code}
We're receiving this _really frequently_ when using the {{replace}} method. Here is a representative stack trace:
{code}
org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=110117 returned server error (status=0x81): org.infinispan.server.hotrod.InvalidMagicIdException: Error reading magic byte or message id: 142
at org.infinispan.client.hotrod.impl.protocol.Codec20.checkForErrorsInResponseStatus(Codec20.java:350)
at org.infinispan.client.hotrod.impl.protocol.Codec20.readPartialHeader(Codec20.java:139)
at org.infinispan.client.hotrod.impl.protocol.Codec20.readHeader(Codec20.java:125)
at org.infinispan.client.hotrod.impl.operations.HotRodOperation.readHeaderAndValidate(HotRodOperation.java:56)
at org.infinispan.client.hotrod.impl.operations.AbstractKeyOperation.returnVersionedOperationResponse(AbstractKeyOperation.java:63)
at org.infinispan.client.hotrod.impl.operations.ReplaceIfUnmodifiedOperation.executeOperation(ReplaceIfUnmodifiedOperation.java:41)
at org.infinispan.client.hotrod.impl.operations.ReplaceIfUnmodifiedOperation.executeOperation(ReplaceIfUnmodifiedOperation.java:19)
at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:54)
at org.infinispan.client.hotrod.impl.RemoteCacheImpl.replaceWithVersion(RemoteCacheImpl.java:153)
at org.infinispan.client.hotrod.impl.RemoteCacheImpl.replaceWithVersion(RemoteCacheImpl.java:145)
{code}
Should we attempt to backport these fixes to 8.2.6 or does this appear to be a separate issue that we should raise?
> Hot Rod server not dealing with pipe requests properly
> ------------------------------------------------------
>
> Key: ISPN-7358
> URL: https://issues.jboss.org/browse/ISPN-7358
> Project: Infinispan
> Issue Type: Bug
> Components: Remote Protocols
> Affects Versions: 8.2.5.Final, 9.0.0.Beta1
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Priority: Blocker
> Fix For: 9.0.0.Beta2, 9.0.0.Final
>
>
> This might not become so apparent with the current synchronous Java client, but with fully asynchronous clients such as the Javascript one, multiple requests can be pipelined one after the other.
> The Hot Rod server often does not deal with these well, showing exceptions such as:
> {code}
> org.infinispan.server.hotrod.InvalidMagicIdException:
> Error reading magic byte or message id: 119
> {code}
> {code}
> org.infinispan.server.hotrod.UnknownVersionException:
> Unknown version:-96
> {code}
> These exceptions appear when applying considerable load with the Javascript client (see HRJS-24), but the same effect can be replicated with a Netty-based, fully asynchronous client, such as the simplified version used [here|https://gist.github.com/galderz/94705dd73d5339b1ab5aa0a5157a9803].
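For context, a protocol-agnostic sketch of what pipelining looks like from the client side: two requests are written back-to-back before any response is read, so the server may receive both in a single network read and must frame them itself. This uses plain sockets against a hypothetical line-oriented server on localhost:12345, not the Hot Rod wire protocol:
{code}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class PipeliningSketch {
   public static void main(String[] args) throws Exception {
      // Hypothetical server address; any line-oriented echo-style server will do.
      try (Socket socket = new Socket("localhost", 12345)) {
         OutputStream out = socket.getOutputStream();
         InputStream in = socket.getInputStream();

         // Pipeline: send both requests without waiting for the first reply.
         out.write("request-1\n".getBytes(StandardCharsets.UTF_8));
         out.write("request-2\n".getBytes(StandardCharsets.UTF_8));
         out.flush();

         // Only now read whatever response bytes have arrived, in order.
         byte[] buffer = new byte[1024];
         int read = in.read(buffer);
         if (read > 0) {
            System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
         }
      }
   }
}
{code}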
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
[JBoss JIRA] (ISPN-8232) Transaction inconsistency during network partitions
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-8232?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-8232:
------------------------------------
True, the stable topology doesn't fix the problem by itself. But having a stable topology does allow us to do some extra processing before updating the stable topology in the majority partition, like I suggested in ISPN-3421/ISPN-5046.
OTOH I don't see how an "expected members" list different from the list of members in the stable topology could be maintained without the administrator having to manually intervene every time a node crashes.
> Transaction inconsistency during network partitions
> ---------------------------------------------------
>
> Key: ISPN-8232
> URL: https://issues.jboss.org/browse/ISPN-8232
> Project: Infinispan
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 9.1.0.Final
> Reporter: Pedro Ruivo
> Assignee: Pedro Ruivo
> Priority: Critical
>
> In a scenario where the originator stays in the minor partition (in our test suite, the originator-isolated tests), it is possible for a transaction to be both committed and rolled back in the majority partition.
> With {{Pessimistic Locking}}, the transaction is committed in one phase using the {{PrepareCommand}}. If the partition happens while the originator is sending the {{PrepareCommand}}, the nodes in the majority partition may or may not receive it. We can end up in a state where some nodes receive and apply the {{PrepareCommand}} while others don't receive it at all.
> When the topology is updated in the majority partition, the {{TransactionTable}} rolls back all transactions whose originator isn't present. So, on the nodes where the {{PrepareCommand}} wasn't received, the transaction is rolled back.
> The originator in the minority partition detects the partition and marks the transaction as partially completed. When the merge occurs, it tries to commit the transaction again. On the nodes where the transaction was rolled back, the transaction is marked as completed, and when the {{PrepareCommand}} is received, an {{IllegalStateException}} is thrown ({{TransactionTable:386, getOrCreateRemoteTransaction()}}). In this case, the transaction isn't removed from the {{PartitionHandlingManager}} and our test suite fails with {{"there are pending tx"}}.
> Another theoretical scenario is for the {{PrepareCommand}} to be executed when no locks are acquired.
> The same issue can happen with {{Optimistic Locking}} for the {{CommitCommand}}.
> The problem is that the transaction table can't tell whether the node left gracefully or not. A solution would be to have an {{"expected members"}} list, ideally separate from the {{CacheTopology}} to avoid sending it every time. It would also need some sysadmin tooling for the case where a node crashes and won't be back online for a while (or, for some reason, doesn't need to come back online at all).
> A sysadmin could remove the node from this list (the {{CacheTopology}} is updated and there is no need to increase it) and decide what to do with the pending transactions (or an automatic mechanism could auto-commit/rollback them).
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)