[ https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-1602:
------------------------------------
Erik, I've investigated the {{ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]}}
logs and the lock is never "leaked". I did find some transactions that kept it
locked for ~1 minute, but that's because a remote get and a TxCompletionNotification each
took ~30 seconds to start executing on the remote node:
{noformat}
2011-12-16 17:19:38,935 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-136,erm-cluster-0a0a0e3c,dht14-65317(site02)) Replication task sending
ClusteredGetCommand{key=ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144], flags=null}
to addresses [dht12-15582(site03), dht11-11001(site02), dht10-34698(site01)]
2011-12-16 17:20:04,634 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(Incoming-7,erm-cluster-0a0a0e3c,dht12-15582(site03)) Attempting to execute command:
TxCompletionNotificationCommand{ xid=null, internalId=0,
gtx=GlobalTransaction:<dht14-65317(site02)>:5524:local, cacheName=eigAllocation}
[sender=dht14-65317(site02)]
2011-12-16 17:20:04,744 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-136,erm-cluster-0a0a0e3c,dht14-65317(site02)) Replication task sending
TxCompletionNotificationCommand{ xid=null, internalId=0,
gtx=GlobalTransaction:<dht14-65317(site02)>:5065:local, cacheName=serviceGroup} to
addresses [dht11-11001(site02), dht10-34698(site01), dht12-15582(site03)]
2011-12-16 17:18:46,653 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(Incoming-7,erm-cluster-0a0a0e3c,dht12-15582(site03)) Attempting to execute command:
TxCompletionNotificationCommand{ xid=null, internalId=0,
gtx=GlobalTransaction:<dht14-65317(site02)>:160:local, cacheName=qamAllocation}
[sender=dht14-65317(site02)]
{noformat}
One possible reason is that the non-OOB thread pool has queueing enabled. However, with a
queue size of 100 and only ClusteredGetCommands/TxCompletionNotifications executing on the
non-OOB thread pool, I don't see how one element could take more than 20 seconds to reach
the front of the queue.
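A quick way to double-check those queue settings in the attached configs (assuming the JGroups 3.x transport attribute names, e.g. thread_pool.queue_enabled / thread_pool.queue_max_size):
{noformat}
# list the regular and OOB thread pool queue attributes in the attached JGroups configs
grep -o '[a-z_]*thread_pool\.queue_[a-z_]*="[^"]*"' erm_tcp.xml session_udp.xml
{noformat}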
I suggest doing a shorter test run with TRACE enabled for JGroups as well as for
Infinispan and taking a thread dump every 5 seconds in order to investigate this better.
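For the periodic thread dumps, something along these lines should do (just a sketch; the pgrep pattern for the server JVM is a guess, adjust it to whatever your startup script actually runs):
{noformat}
# dump the server JVM's threads every 5 seconds while it is alive
PID=$(pgrep -f org.jboss.Main | head -1)
while kill -0 "$PID" 2>/dev/null ; do
  jstack "$PID" >> "threaddumps-$(hostname).log"
  sleep 5
done
{noformat}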
PS: This is how I checked for stale locks:
{noformat}
# collect every key that ever failed to acquire a lock
zcat 200threads-withoutrehash/*/server.log*.gz | grep "Failed to acquire lock" |
  sed -n -e 's/.*Failed to acquire lock \(.*\), owner is.*/\1/p' | sort | uniq > keys.txt

# for a given key, dump its lock history only if the last recorded event is a failed acquisition
for k in "ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]" ; do
  zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" |
    egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > tmp.txt
  tail -1 tmp.txt | grep -q "Failed to acquire" && (echo --- "$k" --- ; cat tmp.txt)
done
{noformat}
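Since keys.txt already contains every key that ever failed to acquire a lock, the same check can be run over all of them (a sketch, assuming the keys contain no whitespace, which holds for these logs):
{noformat}
while read -r k ; do
  zcat 200threads-withoutrehash/*/server.log*.gz | grep -F "$k" |
    egrep "Failed to acquire|Successfully acquired lock|Attempting to unlock" > tmp.txt
  # report the key only if its last recorded event is a failed acquisition
  tail -1 tmp.txt | grep -q "Failed to acquire" && (echo --- "$k" --- ; cat tmp.txt)
done < keys.txt
{noformat}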
To see the lock acquisitions/releases for
ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]:
{noformat}
for k in "ServiceGroupKey[edgeDeviceId=45,serviceGroupNo=144]" ; do zcat
200threads-withoutrehash/*/server.log*.gz | grep -F $k | egrep "Failed to
acquire|Successfully acquired lock|Attempting to unlock" >
edgeDeviceId=45,serviceGroupNo=144.txt ; done
cat edgeDeviceId=45,serviceGroupNo=144.txt | sed -n 's/\(.*\) \(DEBUG\|TRACE\)
.*\(Fail\|acquired\|unlock\).*/\1 \3/p' | less
{noformat}
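To get a rough idea of how long each acquisition of that key was held, the reduced file can be post-processed, e.g. (a sketch, assuming all events fall on the same day and that acquire/unlock strictly alternate, as they should for an exclusive lock):
{noformat}
awk '
  # HH:MM:SS,mmm -> seconds since midnight
  function secs(t,  a) { split(t, a, "[:,]"); return a[1]*3600 + a[2]*60 + a[3] + a[4]/1000 }
  $3 == "acquired" { t0 = secs($2); when = $1 " " $2 }
  $3 == "unlock" && t0 { printf "%s held for %.1f s\n", when, secs($2) - t0; t0 = 0 }
' edgeDeviceId=45,serviceGroupNo=144.txt
{noformat}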
Single view change causes stale locks
-------------------------------------
Key: ISPN-1602
URL: https://issues.jboss.org/browse/ISPN-1602
Project: Infinispan
Issue Type: Bug
Components: Core API
Affects Versions: 5.1.0.CR1
Reporter: Erik Salter
Assignee: Dan Berindei
Priority: Critical
Fix For: 5.1.0.CR2
Attachments: erm_tcp.xml, session_udp.xml
During load testing of 5.1.0.CR1, we're encountering JGroups 3.x dropping views. We
know from ISPN-1581 that if the number of view changes is greater than 3, there could be
a stale lock on a failed commit. However, we're seeing stale locks occur on a single view
change. In the following logs, the affected cluster is erm-cluster-xxxx.
(We also don't know why JGroups 3.x is unstable. We suspected FLUSH and incorrect FD
settings, but we removed them and are still seeing dropped messages.)
The trace logs are at the links below (the issue occurs very early in the run, so they aren't long):
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.l...