[infinispan-issues] [JBoss JIRA] (ISPN-1602) Single view change causes stale locks

Dan Berindei (Commented) (JIRA) jira-events at lists.jboss.org
Tue Dec 20 10:39:10 EST 2011


    [ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652311#comment-12652311 ] 

Dan Berindei commented on ISPN-1602:
------------------------------------

Erik, I've investigated the {{ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]}} logs and the lock is never "leaked". I did find some transactions that kept it locked for ~1 minute, but that's because a remote get and a TxCompletionNotification each took ~30 seconds to start executing on the remote node:

{noformat}
2011-12-16 17:19:38,935 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-136,erm-cluster-0a0a0e3c,dht14-65317(site02)) Replication task sending ClusteredGetCommand{key=ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144], flags=null} to addresses [dht12-15582(site03), dht11-11001(site02), dht10-34698(site01)]
2011-12-16 17:20:04,634 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (Incoming-7,erm-cluster-0a0a0e3c,dht12-15582(site03)) Attempting to execute command: TxCompletionNotificationCommand{ xid=null, internalId=0, gtx=GlobalTransaction:<dht14-65317(site02)>:5524:local, cacheName=eigAllocation}  [sender=dht14-65317(site02)]

2011-12-16 17:20:04,744 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-136,erm-cluster-0a0a0e3c,dht14-65317(site02)) Replication task sending TxCompletionNotificationCommand{ xid=null, internalId=0, gtx=GlobalTransaction:<dht14-65317(site02)>:5065:local, cacheName=serviceGroup}  to addresses [dht11-11001(site02), dht10-34698(site01), dht12-15582(site03)]
2011-12-16 17:18:46,653 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (Incoming-7,erm-cluster-0a0a0e3c,dht12-15582(site03)) Attempting to execute command: TxCompletionNotificationCommand{ xid=null, internalId=0, gtx=GlobalTransaction:<dht14-65317(site02)>:160:local, cacheName=qamAllocation}  [sender=dht14-65317(site02)]
{noformat}
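
For reference, the gap shows up when following a single transaction across all the node logs, e.g. the TxCompletionNotificationCommand for gtx {{...:5524}} from the excerpt above. This is only a sketch: it assumes the same {{200threads-withoutrehash}} log layout used in the stale-lock check below.
{noformat}
# follow one gtx across all node logs; the log directory is an assumption
zcat 200threads-withoutrehash/*/server.log*.gz \
  | grep -F "GlobalTransaction:<dht14-65317(site02)>:5524:local" \
  | egrep "Replication task sending|Attempting to execute" \
  | sort
{noformat}
Since every line starts with the timestamp, sorting puts the "Replication task sending" line and the matching "Attempting to execute" lines in order, which is how the ~30 second gaps become visible.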


One possible reason is that the non-OOB thread pool has queueing enabled. However, with a queue size of 100 and only ClusteredGetCommands/TxCompletionNotifications executing on the non-OOB thread pool, I don't see how one element could take > 20 seconds to reach the front of the queue.
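
One way to double-check that is to look at the transport's thread pool attributes in the attached JGroups configuration. A sketch, assuming the pools are configured on the transport element in erm_tcp.xml with the usual JGroups 3.x attribute names:
{noformat}
# list the regular and OOB thread pool settings from the attached JGroups config
# (erm_tcp.xml is the attachment on this issue; attribute names assume JGroups 3.x)
egrep -o '(oob_)?thread_pool\.[a-z_]+="[^"]*"' erm_tcp.xml
{noformat}
If queue_enabled is true on the regular pool, disabling the queue or raising max_threads would be one thing to try in order to rule the queue out.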

To investigate this further, I suggest doing a shorter test run with TRACE enabled for JGroups as well as for Infinispan, and taking a thread dump every 5 seconds.
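
For the thread dumps, a simple jstack loop would do. This is only a sketch; SERVER_PID and the dumps/ directory are placeholders for your setup:
{noformat}
# dump the server's threads every 5 seconds; SERVER_PID and dumps/ are placeholders
mkdir -p dumps
while true ; do
  jstack $SERVER_PID > dumps/threads-$(date +%Y%m%d-%H%M%S).txt
  sleep 5
done
{noformat}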

PS: This is how I checked for stale locks:
{noformat}
zcat 200threads-withoutrehash/*/server.log*.gz | grep "Failed to acquire lock" | sed -n -e 's/.*Failed to acquire lock \(.*\), owner is.*/\1/p' | sort | uniq > keys.txt

for k in "ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]" ; do zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" | egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > tmp.txt ; tail -1 tmp.txt | grep -q "Failed to acquire" && (echo --- "$k" --- ; cat tmp.txt) ; done
{noformat}
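
The loop above has the single suspect key hard-coded; to run the same check over every key collected in keys.txt, the loop can read the file instead. A sketch, assuming the keys contain no whitespace (which holds for these ServiceGroupKey entries):
{noformat}
# run the stale-lock check for every key in keys.txt instead of a hard-coded key
while read -r k ; do
  zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" | egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > tmp.txt
  tail -1 tmp.txt | grep -q "Failed to acquire" && (echo --- "$k" --- ; cat tmp.txt)
done < keys.txt
{noformat}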

To see the lock acquisitions/releases for ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]:
{noformat}
for k in "ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]" ; do zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" | egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > edgeDeviceId=45,serviceGroupNo=144.txt ; done

cat edgeDeviceId=45,serviceGroupNo=144.txt | sed -n 's/\(.*\) \(DEBUG\|TRACE\) .*\(Fail\|acquired\|unlock\).*/\1 \3/p' | less
{noformat}
                
> Single view change causes stale locks
> -------------------------------------
>
>                 Key: ISPN-1602
>                 URL: https://issues.jboss.org/browse/ISPN-1602
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core API
>    Affects Versions: 5.1.0.CR1
>            Reporter: Erik Salter
>            Assignee: Dan Berindei
>            Priority: Critical
>             Fix For: 5.1.0.CR2
>
>         Attachments: erm_tcp.xml, session_udp.xml
>
>
> During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views.  We know that, due to ISPN-1581, if the number of view changes is > 3, there could be a stale lock on a failed commit.  However, we're seeing stale locks occur on a single view change.
> In the following logs, the affected cluster is erm-cluster-xxxx.
> (We also don't know why JGroups 3.x is unstable.  We suspected FLUSH and incorrect FD settings, but we removed them and we're still getting dropped messages.)
> The trace logs (the issue occurs not long into the run) are at:
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.log.gz
> http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.log.gz

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

