EDIT:

 

I left out a paragraph (going cross-eyed looking at TRACE logs)

 

The LockingInterceptors have code to wait on pending locks from old views, but at this point both nodes had already installed the new view info.  I was thinking about adding “pending” locks to the AbstractCacheTransaction, and when a lock is actually registered, atomically removing the key from the pending set and updating the lock list.  This way, the StaleTxCleanupService can skip over pending locks that will still belong to the same node.  Also, this could be forwarded with the TransactionInfo (before the StaleTx/Lock Breaking service?).  But this seems too simplistic.
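
Roughly, the shape I have in mind is something like this standalone sketch (the class and method names are mine, not the actual AbstractCacheTransaction/StaleTxCleanupService API):

// Standalone sketch of the "pending locks" idea; made-up class, not the
// real AbstractCacheTransaction.
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class PendingLockTransaction {

    // Keys we have asked the LockManager for but do not hold yet.
    private final Set<Object> pendingKeys =
            Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());

    // Keys whose locks this transaction actually holds.
    private final Set<Object> lockedKeys =
            Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());

    void registerPendingLock(Object key) {
        pendingKeys.add(key);
    }

    // Called once the LockManager grants the lock.  Add to lockedKeys before
    // removing from pendingKeys so a concurrent reader never sees the key in
    // neither set (seeing it briefly in both is harmless for the check below).
    void promoteToLocked(Object key) {
        lockedKeys.add(key);
        pendingKeys.remove(key);
    }

    // The lock-breaking task would skip a transaction that still has pending
    // or registered locks for keys this node keeps owning after the rebalance.
    boolean hasPendingOrLockedKeys() {
        return !pendingKeys.isEmpty() || !lockedKeys.isEmpty();
    }
}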

 

Thanks,

 

Erik

 

From: infinispan-dev-bounces@lists.jboss.org [mailto:infinispan-dev-bounces@lists.jboss.org] On Behalf Of Erik Salter
Sent: Friday, October 05, 2012 9:51 PM
To: 'infinispan -Dev List'
Subject: [infinispan-dev] How to solve ISPN-2291?

 

Hi all,

 

This is related to ISPN-2291.  I think I know what the root cause of this is, and it’s pretty insidious.  However, I don’t really know the best way to solve it.

 

Here’s the scenario.  A cluster is happily running, acquiring locks, etc., and all of a sudden a node is killed.  The new view is passed up, the new CH is calculated, and the StaleTransactionCleanupService starts its job.  However, it doesn’t account for registered transactions that are still trying to acquire locks.  Consider the following 3-node cluster:

 

A transaction is started on node 2:

 

2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.TransactionTable] (pool-40-thread-1) Created a new local transaction: LocalTransaction{remoteLockedNodes=null, isMarkedForRollback=false, lockedKeys=null, backupKeyLocks=null, viewId=4} org.infinispan.transaction.synchronization.SyncLocalTransaction@2ad

2012-10-05 17:28:24,586 TRACE [org.infinispan.interceptors.locking.ClusteringDependentLogic] (pool-40-thread-1) My address is phl-dg2-45331(phl). Am I main owner? - false

2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.LocalTransaction] (pool-40-thread-1) Adding remote locks on [phl-dg3-37382(phl)]. Remote locks are null

2012-10-05 17:28:24,586 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (pool-40-thread-1) phl-dg2-45331(phl) broadcasting call LockControlCommand{cache=serviceGroup, keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null, unlock=false} to recipient list [phl-dg3-37382(phl)]

2012-10-05 17:28:24,665 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (pool-40-thread-1) Response: SuccessfulResponse{responseValue=true}

2012-10-05 17:28:24,665 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (pool-40-thread-1) Response(s) to LockControlCommand{cache=serviceGroup, keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null, unlock=false} is {phl-dg3-37382(phl)=SuccessfulResponse{responseValue=true} }

 

Node 3 receives this message, registers a RemoteTransaction, and tries to acquire the lock.

 

2012-10-05 17:28:24,603 TRACE [org.infinispan.transaction.TransactionTable] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Created and registered remote transaction RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys= null, backupKeyLocks null, missingLookedUpEntries false, tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote}

2012-10-05 17:28:24,605 TRACE [org.infinispan.interceptors.locking.PessimisticLockingInterceptor] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Locking key ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201], no need to check for pending locks.

2012-10-05 17:28:24,605 TRACE [org.infinispan.util.concurrent.locks.LockManagerImpl] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Attempting to lock ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201] with acquisition timeout of 5000 millis

2012-10-05 17:28:24,660 TRACE [org.infinispan.util.concurrent.locks.LockManagerImpl] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Successfully acquired lock ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]!

2012-10-05 17:28:24,660 TRACE [org.infinispan.transaction.AbstractCacheTransaction] (OOB-16,erm-cluster,phl-dg3-37382(phl)) Registering locked key: ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]

 

The important parts are the transaction registration and the later lock registration: the RemoteTransaction is in the TransactionTable, but the lock has not yet been acquired, and the lock is not registered with the transaction until that point.

 

The first node was killed around this time.  Node 3 detects this and executes the StaleTransactionCleanupService.  Here’s where the issue comes in:

 

2012-10-05 17:28:24,626 DEBUG [org.infinispan.topology.LocalTopologyManagerImpl] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Updating local consistent hash(es) for cache serviceGroup: …

2012-10-05 17:28:24,628 TRACE [org.infinispan.transaction.StaleTransactionCleanupService] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Saw 1 leavers - kicking off a lock breaking task

2012-10-05 17:28:24,628 TRACE [org.infinispan.transaction.StaleTransactionCleanupService] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Unlocking keys for which we are no longer an owner

2012-10-05 17:28:24,629 TRACE [org.infinispan.transaction.StaleTransactionCleanupService] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Killing remote transaction without any local keys GlobalTransaction:<phl-dg2-45331(phl)>:685:remote

2012-10-05 17:28:24,629 TRACE [org.infinispan.commands.tx.AbstractTransactionBoundaryCommand] (OOB-15,erm-cluster,phl-dg3-37382(phl)) About to execute tx command RollbackCommand {gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote, cacheName='serviceGroup', topologyId=-1}

2012-10-05 17:28:24,629 TRACE [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Invoked with command RollbackCommand {gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote, cacheName='serviceGroup', topologyId=-1} and InvocationContext [org.infinispan.context.impl.RemoteTxInvocationContext@a7d0d9ae]

2012-10-05 17:28:24,629 TRACE [org.infinispan.transaction.TransactionTable] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys= null, backupKeyLocks null, missingLookedUpEntries false, tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote} from transaction table.

2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.AbstractCacheTransaction] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Transaction GlobalTransaction:<phl-dg2-45331(phl)>:685:remote has completed, notifying listening threads.

2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.TransactionTable] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed local transaction GlobalTransaction:<phl-dg2-45331(phl)>:685:remote? true

2012-10-05 17:28:24,630 TRACE [org.infinispan.interceptors.CallInterceptor] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Suppressing invocation of method handleRollbackCommand. 

2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.AbstractCacheTransaction] (OOB-15,erm-cluster,phl-dg3-37382(phl)) Clearing locked keys: null

 

Aside: since the tx was started on node 2, why was the rollback suppressed?  (Node 2 still tries to commit the transaction.)

 

Since there isn’t a lock registered yet, the cleanup removes the transaction from the table.  But when the lock is acquired afterwards, it’s placed in the LockContainer with the already-cleaned-up transaction as its owner.  Subsequently, that lock is never released:

 

org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [5 seconds] on key [ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]] for requestor [GlobalTransaction:<phl-dg2-45331(phl)>:1100:remote]! Lock held by [GlobalTransaction:<phl-dg2-45331(phl)>:685:remote]
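
To make the interleaving concrete, here is a tiny standalone model of the race (plain Java with made-up names; obviously not the real interceptor/cleanup code):

// Tiny standalone model of the race; the maps, lock and method names are
// invented for illustration, not the real Infinispan internals.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class StaleLockRace {

    // gtx -> key registered as locked by that tx ("" = nothing registered yet)
    final ConcurrentMap<String, String> txTable = new ConcurrentHashMap<String, String>();

    // Stands in for the LockContainer entry for the key.
    final ReentrantLock keyLock = new ReentrantLock();

    // Remote LockControlCommand path on node 3.
    boolean handleLockControl(String gtx, String key) throws InterruptedException {
        txTable.put(gtx, "");                               // (1) tx registered, no locked keys
        if (!keyLock.tryLock(5, TimeUnit.SECONDS)) {        // (2) may block here for a while
            txTable.remove(gtx);
            return false;
        }
        txTable.replace(gtx, key);                          // (3) key registered with the tx
        return true;
    }

    // Lock-breaking task triggered by the view change, running between (1) and (3).
    void staleTxCleanup(String gtx) {
        if ("".equals(txTable.get(gtx))) {
            // No locked keys registered, so it looks like a tx without local
            // keys and gets rolled back and removed from the table...
            txTable.remove(gtx);
        }
        // ...but when (3) runs later, replace() on the removed entry is a
        // no-op.  In the real system the lock acquired at (2) is now owned by
        // a tx nothing will ever commit, roll back or break, so every later
        // requestor times out.
    }
}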

 

I was experimenting with exposing some of the internals of the LockManager/LockContainers to see if there’s a requestor on the stack before deciding to roll back the transaction.  In this case, the lock would still have been owned by the same node, so it wouldn’t have been cleaned up.  But this seems very, well, risky (what happens if 2+ txs are requesting the same lock?  We only know about one requestor, not any others queued).  Okay, more like “stupid”.
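
For what it’s worth, the check I was playing with has roughly this shape (again, invented names, and it only ever sees a single requestor, which is exactly the problem):

// Sketch of that requestor check; the types and method names are invented
// for illustration only.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class RequestorAwareCleanup {

    // key -> the single in-flight requestor the LockManager could expose.
    final ConcurrentMap<Object, String> requestorByKey = new ConcurrentHashMap<Object, String>();

    // Called by the lock-breaking task before rolling back a remote tx that
    // has no locked keys registered yet.
    boolean safeToBreak(String gtx, Object key) {
        // If this tx is the known requestor for a key we still own, leave it alone.
        return !gtx.equals(requestorByKey.get(key));
    }

    // The catch: with 2+ txs queued on the same key, only one requestor is
    // visible here, so the others would still be broken incorrectly.
}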

 

What do you guys think?

 

Thanks,

 

Erik