[infinispan-issues] [JBoss JIRA] (ISPN-1602) Single view change causes stale locks

Fri Dec 9 10:40:41 EST 2011

    [ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649607#comment-12649607 ] 

Bela Ban commented on ISPN-1602:
--------------------------------

Erik, I see addresses which include a siteId. Can I assume you're using RELAY then? If you are, the weird thing is that I see more than 2 siteIds, which the current RELAY doesn't support. How come I'm seeing siteIds?

If you run

grep "protocols.FD" *.log | grep suspect

you'll see that FD did indeed suspect a few members:
- dht12 suspected dht11 multiple times
- dht12 suspected dht11 because it didn't receive a heartbeat for 24s
- dht11 also suspects dht12...

So dht11 and dht12 suspected each other, and those were the only members ever suspected by FD. (Note that FD_SOCK did *not* suspect anyone.)
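To see how the suspicions are distributed across nodes, the grep above can be turned into a per-file count. This is only a sketch: the function name count_fd_suspects is made up for illustration, and it assumes nothing about the log layout beyond what the grep above already relies on. It uses a fixed-string match because the plain pattern "protocols.FD" also matches FD_SOCK lines, which are filtered out here.

```shell
# Hypothetical helper: count FD "suspect" lines per log file.
count_fd_suspects() {
    for f in "$@"; do
        # -F matches "protocols.FD" literally (the dot is not a regex wildcard).
        # The second grep drops FD_SOCK lines, which the first pattern also matches.
        # grep -c exits non-zero when the count is 0, hence the "|| true".
        n=$(grep -F "protocols.FD" "$f" \
            | grep -v -F "protocols.FD_SOCK" \
            | grep -c "suspect" || true)
        printf '%s %s\n' "$f" "$n"
    done
}
```

Running count_fd_suspects *.log in the directory with the server logs prints one line per file with the number of matching lines, so the nodes doing the suspecting stand out.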

Again, possible causes are:
- GC. Are you sure you really only observed GCs under 500ms? Kind of hard to believe, as GC can take ~1s per GB of heap, and you mentioned the other day that you had 6GB of heap...
- High traffic can cause heartbeats to be lost
- Busy CPU: the CPU is busy processing, and this can slow down the sending of heartbeat-acks, too
- Thread pools sized too small
- Thread pools with a rejection policy of "run": this is dangerous because, *if* the thread which received a message has to process it itself (and blocks in app code), it will not be able to process other messages!

This brings me back to what I said before: 
#1 Please use one of the default configurations shipped with JGroups 3.0.1 (e.g. udp-largecluster.xml), and make only minor modifications (e.g. thread pool sizes)!

#2 Increase the timeouts in FD

#3 If you use RELAY, be aware that we haven't tested it yet with Infinispan 5.x and JGroups 3.x! This task will be started soon, but for now the behavior of rebalancing / state transfer in Infinispan with RELAY is completely undefined/untested for 5.x/3.x!

> Single view change causes stale locks
> -------------------------------------
>
>                 Key: ISPN-1602
>                 URL: https://issues.jboss.org/browse/ISPN-1602
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core API
>    Affects Versions: 5.1.0.CR1
>            Reporter: Erik Salter
>            Assignee: Dan Berindei
>            Priority: Critical
>             Fix For: 5.1.0.CR2
>
>
> During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views.  We know from ISPN-1581 that if the number of view changes is > 3, there could be a stale lock on a failed commit.  However, we're seeing stale locks occur on a single view change.
> In the following logs, the affected cluster is the erm-cluster-xxxx.
> (We also don't know why JGroups 3.x is unstable.  We suspected FLUSH and incorrect FD settings, but we removed them, and we're still getting dropped messages.)
> The trace logs (it isn't long at all before the issue occurs) are at:
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.log.gz
