[ https://issues.jboss.org/browse/ISPN-902?page=com.atlassian.jira.plugin.s... ]
Manik Surtani updated ISPN-902:
-------------------------------
Fix Version/s: 4.2.1.Final
Affects Version/s: 4.2.0.Final
Priority: Blocker (was: Critical)
Description:
After much testing and analysis (and reopening and fixing ISPN-865), the final issue here is that certain transactions throw an IllegalStateException in commit(), and this cascades into a series of problems.
See http://lists.jboss.org/pipermail/infinispan-dev/2011-January/007320.html for a more detailed discussion.
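For context, the JTA contract itself makes this failure mode easy to reason about: Transaction.commit() throws IllegalStateException when the transaction is no longer in a committable state (for example, it was already rolled back). The following is a minimal toy model of that state rule only; TxStateDemo is an illustrative class, not Infinispan's transaction code:

```java
// Toy model of the rule that commit() throws IllegalStateException once a
// transaction has left the ACTIVE state. Illustrative only; this is not
// Infinispan (or JBoss TM) code.
public class TxStateDemo {
    enum Status { ACTIVE, ROLLED_BACK, COMMITTED }

    static class Tx {
        Status status = Status.ACTIVE;

        void rollback() {
            status = Status.ROLLED_BACK;
        }

        void commit() {
            // A transaction that was already rolled back (e.g. by a
            // rehash-induced failure on another node) cannot be committed.
            if (status != Status.ACTIVE) {
                throw new IllegalStateException("cannot commit, status=" + status);
            }
            status = Status.COMMITTED;
        }
    }

    // Returns true iff commit() after rollback() throws IllegalStateException.
    static boolean commitAfterRollbackFails() {
        Tx tx = new Tx();
        tx.rollback();
        try {
            tx.commit();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("commit after rollback throws: " + commitAfterRollbackFails());
    }
}
```

The point of the model is that the commit-time IllegalStateException is a symptom: something earlier (here, the rehash) already moved the transaction out of a committable state.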
Original request:
{quote}
There are two scenarios we're seeing on rehashing, both of which are critical.
1. When a node leaves a running cluster, we see an inordinate number of timeout errors, such as the one below. The end result is that the cluster loses data.
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:417)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
        at org.infinispan.distribution.DistributionManagerImpl.retrieveFromRemoteSource(DistributionManagerImpl.java:341)
        at org.infinispan.interceptors.DistributionInterceptor.realRemoteGet(DistributionInterceptor.java:143)
        at org.infinispan.interceptors.DistributionInterceptor.remoteGetAndStoreInL1(DistributionInterceptor.java:131)
        at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:59)

06:07:44,097 WARN [GMS] cms-node-20192: merge leader did not get data from all partition coordinators [cms-node-20192, mydht1-18445], merge is cancelled
2. Joining a node into a running cluster causes transactional failures on the other nodes. Most of the time, depending on the load, a node can take upwards of 8 minutes to join.
I've attached a unit test that can reproduce these issues.
{quote}
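Until the underlying rehash race is fixed, one possible client-side mitigation for the timeouts in scenario 1 is to bound-retry remote reads that fail while key ownership is in flux. This is a sketch of the general pattern only; RetryingReader and the simulated flaky read are hypothetical, not Infinispan APIs:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical bounded-retry wrapper for reads that can time out while the
// cluster is rehashing. Sketch only; this is not an Infinispan API.
public class RetryingReader {
    // Retries op up to maxAttempts times on TimeoutException (assumes
    // maxAttempts >= 1); any other exception propagates immediately.
    static <T> T readWithRetry(Callable<T> op, int maxAttempts) throws Exception {
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (TimeoutException e) {
                // Ownership may settle between attempts; try again.
                last = e;
            }
        }
        throw last; // retries exhausted: surface the original timeout
    }

    public static void main(String[] args) throws Exception {
        // Simulated remote get that times out twice before succeeding.
        final int[] calls = {0};
        String v = readWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TimeoutException("Timed out waiting for valid responses!");
            }
            return "value";
        }, 5);
        System.out.println("read succeeded after " + calls[0] + " attempts: " + v);
    }
}
```

Note this only papers over transient timeouts; it does nothing for the data loss or the commit-time IllegalStateException described above.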
Complexity: High
Component/s: Distributed Cache, Transactions
Data consistency across rehashing
---------------------------------
Key: ISPN-902
URL: https://issues.jboss.org/browse/ISPN-902
Project: Infinispan
Issue Type: Bug
Components: Distributed Cache, Transactions
Affects Versions: 4.2.0.Final
Reporter: Erik Salter
Assignee: Manik Surtani
Priority: Blocker
Fix For: 4.2.1.Final
Attachments: cacheTest.zip
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira