[infinispan-issues] [JBoss JIRA] (ISPN-1602) Single view change causes stale locks
Dan Berindei (Commented) (JIRA)
jira-events at lists.jboss.org
Fri Dec 9 09:17:41 EST 2011
[ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649582#comment-12649582 ]
Dan Berindei commented on ISPN-1602:
------------------------------------
Erik, what's your JGroups configuration? I have seen an error after a merge: the new coordinator (dht10) was unable to get the list of running caches from dht11 (which started up as a separate partition).
It seems that dht11 received the RECOVER_VIEWS command before the merged view was installed, and dht10 did not attempt to retransmit the message for over 1 minute (until the command timed out on the coordinator). Can you reduce {{STABLE.desired_avg_gossip}} in your JGroups configuration to 30000 and see if you still get the stale lock?
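For reference, a minimal sketch of what that change might look like in the protocol stack XML. Only {{desired_avg_gossip}} (in milliseconds) comes from the suggestion above; the {{stability_delay}} and {{max_bytes}} values are placeholders and should stay at whatever your stack already uses.

```xml
<!-- Fragment of a JGroups 3.x protocol stack; only pbcast.STABLE is shown. -->
<!-- stability_delay and max_bytes are assumed placeholder values, not a recommendation. -->
<pbcast.STABLE stability_delay="1000"
               desired_avg_gossip="30000"
               max_bytes="4M"/>
```

The element replaces the existing {{pbcast.STABLE}} entry in place, so its position relative to the other protocols in the stack does not change.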
We had a similar problem in our test environment, but our retransmission delays were ~4 seconds, so the merge never failed; it just took longer than usual. I started testing unicast messages for the RECOVER_VIEWS command to ensure that it arrives after the view installation messages, but I didn't reach any conclusion at the time. I have a hunch that the {{Message.OOB}} flag may also make it more likely for the message to be dropped, but I need to run more tests.
> Single view change causes stale locks
> -------------------------------------
>
> Key: ISPN-1602
> URL: https://issues.jboss.org/browse/ISPN-1602
> Project: Infinispan
> Issue Type: Bug
> Components: Core API
> Affects Versions: 5.1.0.CR1
> Reporter: Erik Salter
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 5.1.0.CR2
>
>
> During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views. We know from ISPN-1581 that if there are more than 3 view changes, a failed commit can leave a stale lock. However, we're seeing stale locks occur on a single view change.
> In the following logs, the affected cluster is erm-cluster-xxxx.
> (We also don't know why JGroups 3.x is unstable. We suspected FLUSH and incorrect FD settings, but we removed them, and we're still getting dropped messages.)
> The trace logs (it isn't long at all before the issue occurs) are at:
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.log.gz
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira