[infinispan-issues] [JBoss JIRA] (ISPN-1602) Single view change causes stale locks

Fri Dec 9 09:17:41 EST 2011

    [ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649582#comment-12649582 ] 

Dan Berindei commented on ISPN-1602:
------------------------------------

Erik, what's your JGroups configuration? I have seen an error after a merge: the new coordinator (dht10) was unable to get the list of running caches from dht11 (which started up as a separate partition).

It seems that dht11 received the RECOVER_VIEWS command before the merged view was installed, and dht10 did not attempt to retransmit the message for over 1 minute (until the command timed out on the coordinator). Can you reduce {{STABLE.desired_avg_gossip}} in your JGroups configuration to 30000 (the value is in milliseconds, so 30 seconds) and see if you still get the stale lock?
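
A minimal sketch of that change in a JGroups 3.x XML protocol stack, for reference. Only the desired_avg_gossip value comes from the suggestion above; the other attribute values are placeholders (assumptions, not taken from this issue) and should stay at whatever your existing configuration uses.

```xml
<!-- Fragment of a JGroups 3.x protocol stack; only the STABLE element is shown. -->
<!-- desired_avg_gossip is in milliseconds: 30000 means a stability message -->
<!-- is sent roughly every 30 seconds on average. -->
<!-- stability_delay and max_bytes are placeholder values (assumptions). -->
<pbcast.STABLE desired_avg_gossip="30000"
               stability_delay="1000"
               max_bytes="400000"/>
```

The element replaces the existing pbcast.STABLE entry in the stack; its position relative to the other protocols should not change.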

We had a similar problem in our test environment, but our retransmission delays were ~4 seconds, so the merge never failed; it just took longer than usual. I started testing unicast messages for the RECOVER_VIEWS command to ensure that it arrives after the view installation messages, but I didn't reach any conclusion at the time. I have a hunch that the {{Message.OOB}} flag may also make it more likely for the message to be dropped, but I need to run more tests.

> Single view change causes stale locks
> -------------------------------------
>
>                 Key: ISPN-1602
>                 URL: https://issues.jboss.org/browse/ISPN-1602
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core API
>    Affects Versions: 5.1.0.CR1
>            Reporter: Erik Salter
>            Assignee: Dan Berindei
>            Priority: Critical
>             Fix For: 5.1.0.CR2
>
>
> During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views.  We know from ISPN-1581 that if the number of view changes exceeds 3, there could be a stale lock on a failed commit.  However, we're seeing stale locks occur on a single view change.
> In the following logs, the affected cluster is erm-cluster-xxxx.
> (We also don't know why JGroups 3.x is unstable.  We suspected FLUSH and incorrect FD settings, but we removed them, and we're still getting dropped messages.)
> The trace logs (it isn't long at all before the issue occurs) are at:
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.log.gz
