<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Hi all,<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>This is related to ISPN-2291. I think I know what the root cause of this, and it’s pretty insidious. However, I don’t really know the best way to solve it.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Here’s the scenario. A cluster is happily running, acquiring locks, etc., and all of a sudden, a node is killed. The new view is passed up, the new CH is calculated, and the StaleTransactionCleanupService starts its job. However, it doesn’t account for registered transactions trying to acquire locks. Consider the following 3-node cluster:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>A transaction is started on a node 2:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.TransactionTable] (pool-40-thread-1) Created a new local transaction: LocalTransaction{remoteLockedNodes=null, isMarkedForRollback=false, lockedKeys=null, backupKeyLocks=null, viewId=4} org.infinispan.transaction.synchronization.SyncLocalTransaction@2ad<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,586 TRACE [org.infinispan.interceptors.locking.ClusteringDependentLogic] (pool-40-thread-1) My address is phl-dg2-45331(phl). Am I main owner? - false<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.LocalTransaction] (pool-40-thread-1) Adding remote locks on [phl-dg3-37382(phl)]. Remote locks are null<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,586 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (pool-40-thread-1) phl-dg2-45331(phl) broadcasting call LockControlCommand{cache=serviceGroup, keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null, unlock=false} to recipient list [phl-dg3-37382(phl)]<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>…<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,665 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (pool-40-thread-1) Response: SuccessfulResponse{responseValue=true} <o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,665 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (pool-40-thread-1) Response(s) to LockControlCommand{cache=serviceGroup, keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null, unlock=false} is {phl-dg3-37382(phl)=SuccessfulResponse{responseValue=true} }<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Node 3 receives this message and registers a RemoteTransaction and tries to acquire the lock.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-<b>05 17:28:24,603</b> TRACE [org.infinispan.transaction.TransactionTable] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Created and registered remote transaction RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys= null, backupKeyLocks null, missingLookedUpEntries false, tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote}<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>…<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,605 TRACE [org.infinispan.interceptors.locking.PessimisticLockingInterceptor] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Locking key ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201], no need to check for pending locks.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,605 TRACE [org.infinispan.util.concurrent.locks.LockManagerImpl] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Attempting to lock ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201] with acquisition timeout of 5000 millis<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 <b>17:28:24,660</b> TRACE [org.infinispan.util.concurrent.locks.LockManagerImpl] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Successfully acquired lock ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]!<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,660 TRACE [org.infinispan.transaction.AbstractCacheTransaction] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Registering locked key: ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'><o:p> </o:p></span></p><p class=MsoNormal>The bolded parts are important. The RemoteTransaction is in the TransactionTable, but the lock has not yet been acquired. The lock is not registered with the transaction until that time.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>The first node was killed around this time. Node 2 detects this and executes the StaleTransactionCleanupService. Here’s where the issue comes in:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,626 DEBUG [org.infinispan.topology.LocalTopologyManagerImpl] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Updating local consistent hash(es) for cache serviceGroup: …<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,628 TRACE [org.infinispan.transaction.StaleTransactionCleanupService] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Saw 1 leavers - kicking off a lock breaking task<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>…<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,628 TRACE [org.infinispan.transaction.StaleTransactionCleanupService] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Unlocking keys for which we are no longer an owner<o:p></o:p></span></p><p class=MsoNormal><b><span style='font-size:8.0pt'>2012-10-05 17:28:24,629 TRACE [org.infinispan.transaction.StaleTransactionCleanupService] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Killing remote transaction without any local keys GlobalTransaction:<phl-dg2-45331(phl)>:685:remote<o:p></o:p></span></b></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,629 TRACE [org.infinispan.commands.tx.AbstractTransactionBoundaryCommand] (OOB-15,erm-cluster,phl-dg3-37382(phl)) About to execute tx command RollbackCommand {gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote, cacheName='serviceGroup', topologyId=-1}<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,629 TRACE [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Invoked with command RollbackCommand {gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote, cacheName='serviceGroup', topologyId=-1} and InvocationContext [org.infinispan.context.impl.RemoteTxInvocationContext@a7d0d9ae]<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,629 TRACE [org.infinispan.transaction.TransactionTable] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys= null, backupKeyLocks null, missingLookedUpEntries false, tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote} from transaction table.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.AbstractCacheTransaction] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Transaction GlobalTransaction:<phl-dg2-45331(phl)>:685:remote has completed, notifying listening threads.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.TransactionTable] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed local transaction GlobalTransaction:<phl-dg2-45331(phl)>:685:remote? true<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,630 TRACE [org.infinispan.interceptors.CallInterceptor] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Suppressing invocation of method handleRollbackCommand. <o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'>2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.AbstractCacheTransaction] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Clearing locked keys: null<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'><o:p> </o:p></span></p><p class=MsoNormal>Aside – since the tx was started on node 2, why was the rollback suppressed? (Node 2 tries to commit the transaction)<o:p></o:p></p><p class=MsoNormal><span style='font-size:8.0pt'><o:p> </o:p></span></p><p class=MsoNormal>Since there isn’t a lock registered, it removes the transaction from the table. But when the lock is acquired, it’s placed in the LockContainer (with the owner being the cleaned up transaction). Subsequently, the lock never cleans up:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:8.0pt'>org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [5 seconds] on key [ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]] for requestor [GlobalTransaction:<phl-dg2-45331(phl)>:1100:remote]! Lock held by [GlobalTransaction:<phl-dg2-45331(phl)>:685:remote]<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:8.0pt'><o:p> </o:p></span></p><p class=MsoNormal>I was experimenting with exposing some of the internals of the LockManager/LockContainers to see if there’s a requestor on the stack before deciding to rollback the transaction. In this case, the lock would have still been owned by the same node, so it wouldn’t have been cleaned up. But this seems very, well, risky (like what happens if 2+ txs are requesting the same lock? We only know about 1 requestor, not any others queued.) Okay, more like “stupid”<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>What do you guys think?<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Thanks,<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Erik<o:p></o:p></p></div></body></html>