[infinispan-issues] [JBoss JIRA] Commented: (ISPN-902) Data consistency across rehashing

Fri Jan 28 08:18:39 EST 2011

    [ https://issues.jboss.org/browse/ISPN-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578689#comment-12578689 ] 

Manik Surtani commented on ISPN-902:
------------------------------------

Hmm - running testRehashJoin(), I actually see the test fail due to unexpected data:

{code}

java.lang.AssertionError: expected:<200000> but was:<204506>

{code}

You may want to double-check your test, in particular the way it gathers stats.
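
Purely as an illustration of how a stats check can overshoot (this is an assumption, not a diagnosis of the attached test): if per-node entry counts are summed while a key is temporarily held by more nodes than expected, that key is counted more than once. Counting distinct keys avoids this. The class and method names below are hypothetical, and the sketch models each node as a plain Java Set rather than using the Infinispan API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: each Set holds the keys that one node currently stores.
public class EntryCountSketch {

    // Naive total: adds up per-node sizes, so a key held by two nodes counts twice.
    static int sumOfNodeSizes(List<Set<String>> nodes) {
        int total = 0;
        for (Set<String> keys : nodes) {
            total += keys.size();
        }
        return total;
    }

    // De-duplicated total: unions the key sets, so each key counts exactly once.
    static int distinctKeys(List<Set<String>> nodes) {
        Set<String> all = new HashSet<>();
        for (Set<String> keys : nodes) {
            all.addAll(keys);
        }
        return all.size();
    }

    public static void main(String[] args) {
        // "k2" is present on both nodes, as it might be mid-rehash.
        Set<String> nodeA = new HashSet<>(Arrays.asList("k1", "k2"));
        Set<String> nodeB = new HashSet<>(Arrays.asList("k2", "k3"));
        List<Set<String>> nodes = Arrays.asList(nodeA, nodeB);

        System.out.println("sum of sizes:  " + sumOfNodeSizes(nodes));
        System.out.println("distinct keys: " + distinctKeys(nodes));
    }
}
```

If the two totals differ in the real test, the difference is the number of extra copies, which would be worth comparing against the surplus in the assertion above.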

Trying testRehashLeave() now. 

> Data consistency across rehashing
> ---------------------------------
>
>                 Key: ISPN-902
>                 URL: https://issues.jboss.org/browse/ISPN-902
>             Project: Infinispan
>          Issue Type: Bug
>            Reporter: Erik Salter
>            Assignee: Manik Surtani
>            Priority: Critical
>         Attachments: cacheTest.zip
>
>
> There are two scenarios we're seeing on rehashing, both of which are critical.
> 1.  When a node leaves a running cluster, we're seeing an inordinate number of timeout errors, such as the one below.  The end result is that the cluster loses data.
> org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses! 
>         at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:417) 
>         at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101) 
>         at org.infinispan.distribution.DistributionManagerImpl.retrieveFromRemoteSource(DistributionManagerImpl.java:341) 
>         at org.infinispan.interceptors.DistributionInterceptor.realRemoteGet(DistributionInterceptor.java:143) 
>         at org.infinispan.interceptors.DistributionInterceptor.remoteGetAndStoreInL1(DistributionInterceptor.java:131) 
>         at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:59) 
> 06:07:44,097 WARN [GMS] cms-node-20192: merge leader did not get data from all partition coordinators [cms-node-20192, mydht1-18445], merge is cancelled
> 2.  Joining a node into a running cluster causes transactional failures on the other nodes.  Most of the time, depending on the load, a node can take upwards of 8 minutes to join.
> I've attached a unit test that can reproduce these issues.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira