[infinispan-issues] [JBoss JIRA] Updated: (ISPN-902) Data consistency across rehashing
Manik Surtani (JIRA)
jira-events at lists.jboss.org
Mon Jan 31 17:04:39 EST 2011
[ https://issues.jboss.org/browse/ISPN-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manik Surtani updated ISPN-902:
-------------------------------
Fix Version/s: 4.2.1.Final
Affects Version/s: 4.2.0.Final
Priority: Blocker (was: Critical)
Description:
After much testing and analysis (and reopening and fixing ISPN-865), the final issue here is that certain transactions throw an IllegalStateException in commit(), which then cascades into a series of problems.
See http://lists.jboss.org/pipermail/infinispan-dev/2011-January/007320.html for a more detailed discussion.
Original request:
{quote}
There are two scenarios we're seeing on rehashing, both of which are critical.
1. When a node leaves a running cluster, we're seeing an inordinate number of timeout errors, such as the one below. The end result is that the cluster loses data.
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:417)
    at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
    at org.infinispan.distribution.DistributionManagerImpl.retrieveFromRemoteSource(DistributionManagerImpl.java:341)
    at org.infinispan.interceptors.DistributionInterceptor.realRemoteGet(DistributionInterceptor.java:143)
    at org.infinispan.interceptors.DistributionInterceptor.remoteGetAndStoreInL1(DistributionInterceptor.java:131)
06:07:44,097 WARN [GMS] cms-node-20192: merge leader did not get data from all partition coordinators [cms-node-20192, mydht1-18445], merge is cancelled
    at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:59)
2. Joining a node into a running cluster causes transactional failures on the other nodes. Most of the time, depending on the load, a node can take upwards of 8 minutes to join.
I've attached a unit test that can reproduce these issues.
{quote}
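For readers less familiar with why a node leaving or joining triggers a rehash at all, here is a minimal, illustrative sketch of key ownership on a consistent-hash ring. It is a toy model under stated assumptions, not Infinispan's actual distribution code: the class ToyHashRing, its methods, the CRC32 hash, and the node and key names are all hypothetical. It only shows that when one node leaves, the keys it owned must be reassigned to the surviving nodes, which is the window in which the failures described above occur.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Toy consistent-hash ring (hypothetical; not Infinispan's implementation). */
public class ToyHashRing {
    // Ring position -> node name, kept sorted so we can walk "clockwise".
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // CRC32 is an arbitrary choice for this sketch.
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public void addNode(String node) {
        // Reject the (unlikely) case of two nodes landing on the same position.
        if (ring.putIfAbsent(hash(node), node) != null) {
            throw new IllegalArgumentException("ring position already taken: " + node);
        }
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    /** The owner is the first node at or after the key's position, wrapping around. */
    public String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Snapshot of key -> owner for the given keys. */
    public Map<String, String> owners(Collection<String> keys) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String k : keys) {
            result.put(k, ownerOf(k));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyHashRing ring = new ToyHashRing();
        ring.addNode("nodeA");
        ring.addNode("nodeB");
        ring.addNode("nodeC");

        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            keys.add("key-" + i);
        }

        Map<String, String> before = ring.owners(keys);
        ring.removeNode("nodeC"); // simulate a node leaving the cluster
        Map<String, String> after = ring.owners(keys);

        // Count the keys that changed owner and therefore need to be rehashed.
        int moved = 0;
        for (String k : keys) {
            if (!before.get(k).equals(after.get(k))) {
                moved++;
            }
        }
        System.out.println(moved + " of " + keys.size() + " keys changed owner");
    }
}
```

In this toy model only the keys previously owned by the departing node move. A real distributed cache additionally has to transfer state and keep in-flight transactions consistent while ownership changes, which the sketch does not attempt to model.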
was:
There are two scenarios we're seeing on rehashing, both of which are critical.
1. When a node leaves a running cluster, we're seeing an inordinate number of timeout errors, such as the one below. The end result is that the cluster loses data.
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:417)
    at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
    at org.infinispan.distribution.DistributionManagerImpl.retrieveFromRemoteSource(DistributionManagerImpl.java:341)
    at org.infinispan.interceptors.DistributionInterceptor.realRemoteGet(DistributionInterceptor.java:143)
    at org.infinispan.interceptors.DistributionInterceptor.remoteGetAndStoreInL1(DistributionInterceptor.java:131)
06:07:44,097 WARN [GMS] cms-node-20192: merge leader did not get data from all partition coordinators [cms-node-20192, mydht1-18445], merge is cancelled
    at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:59)
2. Joining a node into a running cluster causes transactional failures on the other nodes. Most of the time, depending on the load, a node can take upwards of 8 minutes to join.
I've attached a unit test that can reproduce these issues.
Complexity: High
Component/s: Distributed Cache, Transactions
> Data consistency across rehashing
> ---------------------------------
>
> Key: ISPN-902
> URL: https://issues.jboss.org/browse/ISPN-902
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Cache, Transactions
> Affects Versions: 4.2.0.Final
> Reporter: Erik Salter
> Assignee: Manik Surtani
> Priority: Blocker
> Fix For: 4.2.1.Final
>
> Attachments: cacheTest.zip
>