[ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-1602:
------------------------------------
Erik, I've investigated the {{ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]}}
logs and the lock is never "leaked". I did find some transactions that kept it
locked for ~1 minute, but that's because a remote get and a TxCompletionNotification each
took ~30 seconds to start executing on the remote node:
{noformat}
2011-12-16 17:19:38,935 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-136,erm-cluster-0a0a0e3c,dht14-65317(site02)) Replication task sending
ClusteredGetCommand{key=ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144], flags=null}
to addresses [dht12-15582(site03), dht11-11001(site02), dht10-34698(site01)]
2011-12-16 17:20:04,634 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(Incoming-7,erm-cluster-0a0a0e3c,dht12-15582(site03)) Attempting to execute command:
TxCompletionNotificationCommand{ xid=null, internalId=0,
gtx=GlobalTransaction:<dht14-65317(site02)>:5524:local, cacheName=eigAllocation}
[sender=dht14-65317(site02)]
2011-12-16 17:20:04,744 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-136,erm-cluster-0a0a0e3c,dht14-65317(site02)) Replication task sending
TxCompletionNotificationCommand{ xid=null, internalId=0,
gtx=GlobalTransaction:<dht14-65317(site02)>:5065:local, cacheName=serviceGroup} to
addresses [dht11-11001(site02), dht10-34698(site01), dht12-15582(site03)]
2011-12-16 17:18:46,653 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(Incoming-7,erm-cluster-0a0a0e3c,dht12-15582(site03)) Attempting to execute command:
TxCompletionNotificationCommand{ xid=null, internalId=0,
gtx=GlobalTransaction:<dht14-65317(site02)>:160:local, cacheName=qamAllocation}
[sender=dht14-65317(site02)]
{noformat}
One possible reason is that the non-OOB thread pool has queueing enabled. However, with a
queue size of 100 and only ClusteredGetCommands/TxCompletionNotifications executing on the
non-OOB thread pool, I don't see how one element could take more than 20 seconds to reach
the front of the queue.
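A quick way to double-check those queue settings in the attached configs (assuming the JGroups 3.x transport attribute names, e.g. thread_pool.queue_enabled / thread_pool.queue_max_size):
{noformat}
# list the regular and OOB thread pool queue attributes in the attached JGroups configs
grep -o '[a-z_]*thread_pool\.queue_[a-z_]*="[^"]*"' erm_tcp.xml session_udp.xml
{noformat}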
I suggest doing a shorter test run with TRACE enabled for JGroups as well as for
Infinispan and taking a thread dump every 5 seconds in order to investigate this better.
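For the periodic thread dumps, something along these lines should do (just a sketch; the pgrep pattern for the server JVM is a guess, adjust it to whatever your startup script actually runs):
{noformat}
# dump the server JVM's threads every 5 seconds while it is alive
PID=$(pgrep -f org.jboss.Main | head -1)
while kill -0 "$PID" 2>/dev/null ; do
  jstack "$PID" >> "threaddumps-$(hostname).log"
  sleep 5
done
{noformat}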
PS: This is how I checked for stale locks:
{noformat}
# collect every key that ever failed to acquire a lock
zcat 200threads-withoutrehash/*/server.log*.gz | grep "Failed to acquire lock" |
  sed -n -e 's/.*Failed to acquire lock \(.*\), owner is.*/\1/p' | sort | uniq > keys.txt

# for a given key, dump its lock history only if the last recorded event is a failed acquisition
for k in "ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]" ; do
  zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" |
    egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > tmp.txt
  tail -1 tmp.txt | grep -q "Failed to acquire" && (echo --- "$k" --- ; cat tmp.txt)
done
{noformat}
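Since keys.txt already contains every key that ever failed to acquire a lock, the same check can be run over all of them (a sketch, assuming the keys contain no whitespace, which holds for these logs):
{noformat}
while read -r k ; do
  zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" |
    egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > tmp.txt
  # report the key only if its last recorded event is a failed acquisition
  tail -1 tmp.txt | grep -q "Failed to acquire" && (echo --- "$k" --- ; cat tmp.txt)
done < keys.txt
{noformat}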
To see the lock acquisitions/releases for
ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]:
{noformat}
for k in "ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]" ; do zcat
200threads-withoutrehash/*/server.log*.gz | grep -F $k | egrep "Failed to
acquire|Successfully acquired lock|Attempting to unlock" >
edgeDeviceId=45,serviceGroupNo=144.txt ; done
cat edgeDeviceId=45,serviceGroupNo=144.txt | sed -n 's/\(.*\) \(DEBUG\|TRACE\)
.*\(Fail\|acquired\|unlock\).*/\1 \3/p' | less
{noformat}
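To get a rough idea of how long each acquisition of that key was held, the reduced file can be post-processed, e.g. (a sketch, assuming all events fall on the same day and that acquire/unlock strictly alternate, as they should for an exclusive lock):
{noformat}
awk '
  # HH:MM:SS,mmm -> seconds since midnight
  function secs(t,  a) { split(t, a, "[:,]"); return a[1]*3600 + a[2]*60 + a[3] + a[4]/1000 }
  $3 == "acquired" { t0 = secs($2); when = $1 " " $2 }
  $3 == "unlock" && t0 { printf "%s held for %.1f s\n", when, secs($2) - t0; t0 = 0 }
' edgeDeviceId=45,serviceGroupNo=144.txt
{noformat}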
Single view change causes stale locks
-------------------------------------
Key: ISPN-1602
URL: https://issues.jboss.org/browse/ISPN-1602
Project: Infinispan
Issue Type: Bug
Components: Core API
Affects Versions: 5.1.0.CR1
Reporter: Erik Salter
Assignee: Dan Berindei
Priority: Critical
Fix For: 5.1.0.CR2
Attachments: erm_tcp.xml, session_udp.xml
During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views. We
know from ISPN-1581 that if the number of view changes is greater than 3, there could be
a stale lock on a failed commit. However, we're seeing stale locks occur on a single view
change. In the following logs, the affected cluster is erm-cluster-xxxx.
(We also don't know why JGroups 3.x is unstable. We suspected FLUSH and incorrect FD
settings, but we removed them and are still seeing dropped messages.)
The trace logs are at the links below (the issue occurs very early in the run, so they aren't long):
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.l...