[infinispan-issues] [JBoss JIRA] (ISPN-1581) Improve resiliency of retrying commits on state transfer

Fri Dec 9 05:11:40 EST 2011

    [ https://issues.jboss.org/browse/ISPN-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649556#comment-12649556 ] 

Dan Berindei commented on ISPN-1581:
------------------------------------

Before we released 5.0 I was talking with Manik about skipping the state transfer lock (transaction lock at that time) for commit commands and queuing them instead, to be sent to the new owners after the state transfer is finished. That would have solved the commit resiliency problem, but I felt the cost in complexity was too great. It would have required something similar to the 4.2 non-blocking state transfer, but with only a small part of the benefits (prepare commands would still be blocked). The approach in ISPN-1424 seems much more promising in the long term.

For the short term I was thinking of retrying the commit command with a timeout. That way the user could set the timeout to a very high value to avoid stale locks or to something lower to enforce an upper bound on the execution time of a transaction.

However, because we don't have any automatic way of recovering after a stale lock like this, the only feasible option is setting a very high timeout, so we might as well retry the commit command forever, without giving the user any choice.
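
To make the two options concrete, here is a minimal sketch in plain Java of retrying a commit either with a user-set timeout or forever. Nothing in it is Infinispan API: the class name, the retryCommit method, its parameters, and the convention that a negative timeout means "retry forever" are all assumptions made for illustration, and the attempt callback merely stands in for sending the commit command to the current owners.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Infinispan.
final class CommitRetrySketch {

    /**
     * Calls attempt repeatedly until it returns true or the timeout elapses.
     *
     * @param attempt       stand-in for "send the commit command"; returns true on success
     * @param timeoutMillis upper bound on total retry time; negative means retry forever (assumed convention)
     * @param pauseMillis   pause between attempts
     * @return true if an attempt succeeded, false on timeout or interruption
     */
    static boolean retryCommit(BooleanSupplier attempt, long timeoutMillis, long pauseMillis) {
        boolean bounded = timeoutMillis >= 0;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(Math.max(timeoutMillis, 0));
        while (true) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            // Compare via subtraction so the check stays correct if nanoTime wraps around.
            if (bounded && System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With this shape, a very high timeoutMillis approximates the "avoid stale locks" setting, a lower value enforces an upper bound on transaction time, and a negative value corresponds to retrying forever with no user choice.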

> Improve resiliency of retrying commits on state transfer 
> ---------------------------------------------------------
>
>                 Key: ISPN-1581
>                 URL: https://issues.jboss.org/browse/ISPN-1581
>             Project: Infinispan
>          Issue Type: Enhancement
>    Affects Versions: 5.1.0.BETA5
>            Reporter: Erik Salter
>            Assignee: Dan Berindei
>             Fix For: 5.1.0.CR2
>
>
> The current implementation of ISPN-1484 retries a commit on a remote node up to 3 times.  This is resilient to 3 state transfers happening in rapid succession.  However, if the cluster loses more than 3 nodes, there can still be stale locks.
> This is evident when testing with the TopologyAwareConsistentHash.  I lost a "site" consisting of 4 nodes, and I was able to get stale locks.
