[infinispan-issues] [JBoss JIRA] (ISPN-1602) Single view change causes stale locks
Bela Ban (Commented) (JIRA)
jira-events at lists.jboss.org
Fri Dec 9 10:40:41 EST 2011
[ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649607#comment-12649607 ]
Bela Ban commented on ISPN-1602:
--------------------------------
Erik, I see addresses which include a siteId. Can I assume you're using RELAY then? If you are, the weird thing is that I see more than 2 siteIds, which the current RELAY doesn't support. How come I'm seeing siteIds?
If you run
grep "protocols.FD" *.log | grep suspect
you'll see that FD did indeed suspect a few members:
- dht12 suspected dht11 multiple times
- dht12 suspected dht11 because it didn't receive a heartbeat for 24s
- dht11 also suspects dht12...
So dht11 and dht12 suspected each other, and those were the only members ever suspected by FD. (Note that FD_SOCK did *not* suspect anyone.)
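For anyone who prefers to do the same filtering in code, here is a minimal Java sketch of what the two chained greps select: lines that contain both "protocols.FD" and "suspect". The class name, method name and sample log lines are made up for illustration; the real log line layout is not shown in this thread, so the sketch only relies on the two substrings from the grep command. Note that, like the grep, the substring "protocols.FD" also matches FD_SOCK lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of JGroups or Infinispan.
class FdSuspectFilter {

    // Keeps the lines that `grep "protocols.FD" | grep suspect` would keep:
    // both substrings must occur somewhere in the line.
    static List<String> suspectLines(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("protocols.FD"))
                .filter(line -> line.contains("suspect"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented sample lines; the actual server.log format may differ.
        List<String> sample = Arrays.asList(
                "WARN [org.jgroups.protocols.FD] suspecting dht11",
                "DEBUG [org.jgroups.protocols.FD] sending heartbeat",
                "WARN [org.jgroups.protocols.pbcast.GMS] new view installed");
        for (String line : suspectLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Running this per node's server.log and comparing the counts gives the same picture as the grep output summarized above.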
Again, possible causes are:
- GC. Are you sure you really only observed GCs under 500ms ? Kind of hard to believe, as GC can take ~ 1s / GB, and you mentioned the other day that you had 6GB of heap...
- High traffic can cause a loss of heartbeat
- Busy CPU: the CPU is busy processing and this can slow the sending of heartbeat-acks down, too
- Thread pools sized too small
- Thread pools with a rejection policy of "run": this is dangerous because, *if* the thread which received a message has to process it (and blocks in app code), it will not be able to process other messages !
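To illustrate the last point, here is a small sketch using the plain java.util.concurrent API. It assumes that the "run" rejection policy behaves like ThreadPoolExecutor.CallerRunsPolicy, i.e. a rejected task is executed by the thread that submitted it. The class and method names are invented for the example, and the pool is deliberately tiny (one worker, no queue capacity) so that saturation happens immediately.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not JGroups code.
class CallerRunsDemo {

    // Waits on a latch, converting the checked exception into an unchecked one.
    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Returns the name of the thread that executed the task submitted
    // while the pool was saturated.
    static String threadRunningOverflowTask() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),                    // no buffering
                new ThreadPoolExecutor.CallerRunsPolicy());  // assumed equivalent of "run"
        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        String[] ranOn = new String[1];
        try {
            // The first task occupies the only worker, standing in for
            // application code that blocks.
            pool.execute(() -> {
                workerBusy.countDown();
                awaitQuietly(release);
            });
            awaitQuietly(workerBusy);
            // No idle worker and no queue space: the task is rejected and,
            // under CallerRunsPolicy, runs on the submitting thread.
            pool.execute(() -> ranOn[0] = Thread.currentThread().getName());
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return ranOn[0];
    }

    public static void main(String[] args) {
        System.out.println("submitted from: " + Thread.currentThread().getName());
        System.out.println("executed on:    " + threadRunningOverflowTask());
    }
}
```

While the submitting thread is busy running the rejected task, it cannot hand off anything else. If that thread is the one receiving messages, heartbeats and their acks can be delayed along with everything else, which is the danger described above.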
This brings me back to what I said before:
#1 Please use one of the default configurations shipped with JGroups 3.0.1 (e.g. udp-largecluster.xml), and make only minor modifications (e.g. thread pool sizes) !
#2 Increase the timeouts in FD
#3 If you use RELAY, then be aware that we haven't tested it yet in the context of Infinispan 5.x and JGroups 3.x ! This task will be started soon, but for now the behavior of rebalancing / state transfer in Infinispan is completely undefined/untested for 5.x/3.x !
> Single view change causes stale locks
> -------------------------------------
>
> Key: ISPN-1602
> URL: https://issues.jboss.org/browse/ISPN-1602
> Project: Infinispan
> Issue Type: Bug
> Components: Core API
> Affects Versions: 5.1.0.CR1
> Reporter: Erik Salter
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 5.1.0.CR2
>
>
> During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views. We know from ISPN-1581 that if the number of view changes is > 3, there could be a stale lock on a failed commit. However, we're seeing stale locks occur on a single view change.
> In the following logs, the affected cluster is the erm-cluster-xxxx
> (We also don't know why JGroups 3.x is unstable. We suspected FLUSH and incorrect FD settings, but we removed them, and we're still getting dropped messages)
> The trace logs (It isn't long at all before the issue occurs) are at:
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.log.gz