[infinispan-issues] [JBoss JIRA] Commented: (ISPN-860) Rehashing into a running cluster causes lock timeouts and lock cleanup errors

Mon Jan 10 12:03:49 EST 2011

    [ https://issues.jboss.org/browse/ISPN-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574431#comment-12574431 ] 

Manik Surtani commented on ISPN-860:
------------------------------------

I think this may be the culprit - https://github.com/infinispan/infinispan/blob/master/core/src/main/java/org/infinispan/interceptors/DistributionInterceptor.java#L247

This happens when a prepare() occurs on, say, nodes {A, B, C} but the commit is sent to {A, B, D}, because D joins between the prepare and the commit and takes ownership of the key.

Do you see this exception as a signature of this failure, prior to seeing timeout exceptions?

java.lang.IllegalStateException: Can not commit since DldGlobalTransaction{coinToss=NNNNN, isMarkedForRollback=false, lockIntention=null, affectedKeys=[], locksAtOrigin=[K]} GlobalTransaction:<address>:port:local was prepared on [C1, C2, C3] nodes while it is being committed to [C1, C2, C4]
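
To make the scenario concrete, here is a minimal sketch of the mismatch, assuming node addresses can be modelled as plain strings. The class and method names are hypothetical and this is not the actual DistributionInterceptor code; it only shows how a commit target set that differs from the prepare target set leaves a node (D) receiving a commit for a transaction it never prepared.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the Infinispan code base.
final class PrepareCommitMismatch {

    // Nodes that would receive the commit without ever having seen the prepare.
    static Set<String> missedPrepare(Set<String> preparedOn, Set<String> commitTo) {
        Set<String> missed = new TreeSet<>(commitTo);
        missed.removeAll(preparedOn);
        return missed;
    }

    // Assumed sanity check: refuse to commit if the owner set changed
    // between the prepare and the commit.
    static void verifyCommitTargets(Set<String> preparedOn, Set<String> commitTo) {
        if (!preparedOn.equals(commitTo)) {
            throw new IllegalStateException("Can not commit since transaction was prepared on "
                    + new TreeSet<>(preparedOn) + " nodes while it is being committed to "
                    + new TreeSet<>(commitTo));
        }
    }

    public static void main(String[] args) {
        Set<String> preparedOn = Set.of("A", "B", "C"); // owners at prepare time
        Set<String> commitTo = Set.of("A", "B", "D");   // owners after D joins

        System.out.println("Missed prepare: " + missedPrepare(preparedOn, commitTo));
        try {
            verifyCommitTargets(preparedOn, commitTo);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch C still holds the locks taken at prepare time but is never told to commit, which would be consistent with the lock timeouts reported in the issue.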

> Rehashing into a running cluster causes lock timeouts and lock cleanup errors
> -----------------------------------------------------------------------------
>
>                 Key: ISPN-860
>                 URL: https://issues.jboss.org/browse/ISPN-860
>             Project: Infinispan
>          Issue Type: Bug
>    Affects Versions: 4.2.0.Final
>            Reporter: Erik Salter
>            Assignee: Manik Surtani
>             Fix For: 4.2.1.Final
>
>         Attachments: multinode-rehash.zip
>
>
> We are seeing some severe issues with a new node joining a cluster running transactions.  Specifically, when a new node is added to the system, some transactions running against the previous nodes will fail due to locks never being released.  There will be a lot of lock timeouts as well.
> All of our caches are in DIST mode.  The number of owners is 3.  We are also making liberal use of the new "eagerLockSingleNode" flag.
> The attached test case illustrates the lock timeout problem.  
