[ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-1602:
------------------------------------
Erik, what's your JGroups configuration? I have seen an error after a merge: the new
coordinator (dht10) was unable to get the list of running caches from dht11 (which had
started up as a separate partition).
It seems that dht11 received the RECOVER_VIEWS command before the merged view was
installed, and dht10 did not attempt to retransmit the message for over 1 minute (until
the command timed out on the coordinator). Can you reduce {{STABLE.desired_avg_gossip}} in
your JGroups configuration to 30000 and see if you still get the stale lock?
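For reference, a minimal sketch of that change in the JGroups stack file. This assumes the stock {{pbcast.STABLE}} element; the value is in milliseconds, and any other attributes already set on the element in your file would stay as they are:
{code:xml}
<!-- Inside the <config> element of the JGroups protocol stack. -->
<!-- Only desired_avg_gossip is shown here; keep your existing STABLE attributes. -->
<pbcast.STABLE desired_avg_gossip="30000"/>
{code}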
We had a similar problem in our test environment, but our retransmission delays were ~4
seconds, so the merge never failed; it just took longer than usual. I started testing
unicast messages for the RECOVER_VIEWS command to ensure that it arrives after the view
installation messages, but I didn't reach any conclusion at the time. I have a hunch
that the {{Message.OOB}} flag may also make it more likely for the message to be dropped,
but I need to run more tests.
Single view change causes stale locks
-------------------------------------
Key: ISPN-1602
URL: https://issues.jboss.org/browse/ISPN-1602
Project: Infinispan
Issue Type: Bug
Components: Core API
Affects Versions: 5.1.0.CR1
Reporter: Erik Salter
Assignee: Dan Berindei
Priority: Critical
Fix For: 5.1.0.CR2
During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views. We
know from ISPN-1581 that if there are more than 3 view changes, there could be a stale
lock on a failed commit. However, we're seeing stale locks occur on a single view change.
In the following logs, the affected cluster is erm-cluster-xxxx.
(We also don't know why JGroups 3.x is unstable. We suspected FLUSH and incorrect FD
settings, but we removed them, and we're still getting dropped messages.)
The trace logs (the issue occurs very early in them) are at:
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.l...