[infinispan-dev] How to solve ISPN-2291?

Erik Salter an1310 at hotmail.com
Fri Oct 5 21:50:52 EDT 2012


Hi all,

 

This is related to ISPN-2291.  I think I know the root cause of this, and
it's pretty insidious.  However, I don't really know the best way to solve
it.

 

Here's the scenario.  A cluster is happily running, acquiring locks, etc.,
when all of a sudden a node is killed.  The new view is passed up, the new
CH is calculated, and the StaleTransactionCleanupService starts its job.
However, it doesn't account for remote transactions that are registered but
are still trying to acquire their locks.  Consider the following 3-node
cluster:

 

A transaction is started on node 2:

 

2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.TransactionTable]
(pool-40-thread-1) Created a new local transaction:
LocalTransaction{remoteLockedNodes=null, isMarkedForRollback=false,
lockedKeys=null, backupKeyLocks=null, viewId=4}
org.infinispan.transaction.synchronization.SyncLocalTransaction at 2ad

2012-10-05 17:28:24,586 TRACE
[org.infinispan.interceptors.locking.ClusteringDependentLogic]
(pool-40-thread-1) My address is phl-dg2-45331(phl). Am I main owner? -
false

2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.LocalTransaction]
(pool-40-thread-1) Adding remote locks on [phl-dg3-37382(phl)]. Remote locks
are null

2012-10-05 17:28:24,586 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(pool-40-thread-1) phl-dg2-45331(phl) broadcasting call
LockControlCommand{cache=serviceGroup,
keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null,
unlock=false} to recipient list [phl-dg3-37382(phl)]

.

2012-10-05 17:28:24,665 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(pool-40-thread-1) Response: SuccessfulResponse{responseValue=true} 

2012-10-05 17:28:24,665 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(pool-40-thread-1) Response(s) to LockControlCommand{cache=serviceGroup,
keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null,
unlock=false} is {phl-dg3-37382(phl)=SuccessfulResponse{responseValue=true}
}

 

Node 3 receives this message, registers a RemoteTransaction, and tries to
acquire the lock.

 

2012-10-05 17:28:24,603 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Created and registered remote
transaction RemoteTransaction{modifications=[], lookedUpEntries={},
lockedKeys= null, backupKeyLocks null, missingLookedUpEntries false,
tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote}

.

2012-10-05 17:28:24,605 TRACE
[org.infinispan.interceptors.locking.PessimisticLockingInterceptor]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Locking key
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201], no need to check for
pending locks.

2012-10-05 17:28:24,605 TRACE
[org.infinispan.util.concurrent.locks.LockManagerImpl]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Attempting to lock
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201] with acquisition timeout
of 5000 millis

2012-10-05 17:28:24,660 TRACE
[org.infinispan.util.concurrent.locks.LockManagerImpl]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Successfully acquired lock
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]!

2012-10-05 17:28:24,660 TRACE
[org.infinispan.transaction.AbstractCacheTransaction]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Registering locked key:
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]

 

The ordering here is the important part.  At 17:28:24,603 the
RemoteTransaction is already in the TransactionTable, but the lock has not
yet been acquired.  The lock is not registered with the transaction until it
is actually acquired, at 17:28:24,660.
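
To make that ordering concrete, here's a rough sketch of what node 3 does
when the LockControlCommand arrives.  This is a paraphrase, not the actual
Infinispan code; the class and field names only mirror the components that
show up in the trace above, and the bodies are placeholders:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Paraphrased sketch, NOT the real Infinispan code.
final class RemoteLockHandlingSketch {

    static final class RemoteTx {
        final String gtx;
        volatile Object lockedKey;     // stays null until the lock is acquired
        RemoteTx(String gtx) { this.gtx = gtx; }
    }

    // gtx -> remote transaction (stand-in for the TransactionTable)
    final ConcurrentMap<String, RemoteTx> remoteTxTable = new ConcurrentHashMap<>();
    // key -> owning gtx (stand-in for the LockContainer)
    final ConcurrentMap<Object, String> lockContainer = new ConcurrentHashMap<>();

    // What handling an incoming LockControlCommand looks like in the trace:
    void handleLockControl(String gtx, Object key) throws InterruptedException {
        // (1) 17:28:24,603 - the RemoteTransaction is created and registered first...
        RemoteTx tx = remoteTxTable.computeIfAbsent(gtx, RemoteTx::new);

        // (2) ...then the lock is acquired, which can block for up to the
        //     acquisition timeout (5000 ms in the trace)...
        acquireWithTimeout(key, gtx, 5000);

        // (3) 17:28:24,660 - ...and only now is the key registered with the tx.
        tx.lockedKey = key;

        // Anywhere between (1) and (3), a cleanup pass sees a
        // RemoteTransaction with no locked keys.
    }

    private void acquireWithTimeout(Object key, String owner, long millis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + millis;
        while (lockContainer.putIfAbsent(key, owner) != null) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("Unable to acquire lock on " + key);
            }
            Thread.sleep(10);          // crude wait, just for the sketch
        }
    }
}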

 

The first node was killed around this time.  Node 3 detects the view change
and executes the StaleTransactionCleanupService.  Here's where the issue
comes in:

 

2012-10-05 17:28:24,626 DEBUG
[org.infinispan.topology.LocalTopologyManagerImpl]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Updating local consistent hash(es)
for cache serviceGroup: .

2012-10-05 17:28:24,628 TRACE
[org.infinispan.transaction.StaleTransactionCleanupService]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Saw 1 leavers - kicking off a lock
breaking task

.

2012-10-05 17:28:24,628 TRACE
[org.infinispan.transaction.StaleTransactionCleanupService]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Unlocking keys for which we are no
longer an owner

2012-10-05 17:28:24,629 TRACE
[org.infinispan.transaction.StaleTransactionCleanupService]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Killing remote transaction without
any local keys GlobalTransaction:<phl-dg2-45331(phl)>:685:remote

2012-10-05 17:28:24,629 TRACE
[org.infinispan.commands.tx.AbstractTransactionBoundaryCommand]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) About to execute tx command
RollbackCommand {gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote,
cacheName='serviceGroup', topologyId=-1}

2012-10-05 17:28:24,629 TRACE
[org.infinispan.interceptors.InvocationContextInterceptor]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Invoked with command RollbackCommand
{gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote,
cacheName='serviceGroup', topologyId=-1} and InvocationContext
[org.infinispan.context.impl.RemoteTxInvocationContext at a7d0d9ae]

2012-10-05 17:28:24,629 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed
RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys= null,
backupKeyLocks null, missingLookedUpEntries false,
tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote} from transaction
table.

2012-10-05 17:28:24,630 TRACE
[org.infinispan.transaction.AbstractCacheTransaction]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Transaction
GlobalTransaction:<phl-dg2-45331(phl)>:685:remote has completed, notifying
listening threads.

2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed local transaction
GlobalTransaction:<phl-dg2-45331(phl)>:685:remote? true

2012-10-05 17:28:24,630 TRACE [org.infinispan.interceptors.CallInterceptor]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Suppressing invocation of method
handleRollbackCommand.  

2012-10-05 17:28:24,630 TRACE
[org.infinispan.transaction.AbstractCacheTransaction]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Clearing locked keys: null

 

Aside - since the tx was started on node 2, why was the rollback suppressed?
(Node 2 tries to commit the transaction)

 

Since no lock is registered with the transaction yet, the cleanup removes
the transaction from the table.  But when the lock is then acquired, it's
placed in the LockContainer with the already-cleaned-up transaction as its
owner.  From that point on, the lock is never released:

 

org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock
after [5 seconds] on key
[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]] for requestor
[GlobalTransaction:<phl-dg2-45331(phl)>:1100:remote]! Lock held by
[GlobalTransaction:<phl-dg2-45331(phl)>:685:remote]
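
In terms of the sketch above (same caveats: a paraphrase, not the real
StaleTransactionCleanupService), the cleanup decision that bites us looks
roughly like this - a remote transaction that hasn't registered its key yet
is indistinguishable from one "without any local keys":

// Continuation of the sketch above, same caveats apply.
final class StaleCleanupSketch {

    // Roughly what the lock breaking task does, in the sketch's terms:
    static void onLeaver(RemoteLockHandlingSketch state) {
        // "Killing remote transaction without any local keys..."
        state.remoteTxTable.values().removeIf(tx -> tx.lockedKey == null);

        // The blocked handleLockControl() call later succeeds and records
        // (key -> gtx) in the lock container, but its owner is no longer in
        // the transaction table, so nothing ever releases that lock and every
        // later requestor times out with "Lock held by ...:685:remote".
    }
}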

 

I was experimenting with exposing some of the internals of the
LockManager/LockContainers so the cleanup can check whether there's a
requestor already waiting on the lock before deciding to roll back the
transaction.  In this case, the lock would still have been owned by the same
node, so the transaction wouldn't have been cleaned up.  But this seems
very, well, risky (what happens if 2+ txs are requesting the same lock?  We
only know about one requestor, not any others queued behind it).  Okay,
more like "stupid".
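
For what it's worth, this is roughly the shape of the check I was playing
with.  The LockIntrospection interface below is hypothetical (it stands in
for the internals I was exposing), and it has exactly the weakness just
described: only the one requestor currently blocked on the lock is visible,
not anything queued behind it.

// Hypothetical sketch of the requestor check described above; the interface
// does not exist in Infinispan.
final class RequestorAwareCleanupSketch {

    interface LockIntrospection {
        // Is some thread currently blocked trying to acquire the lock for key?
        boolean hasPendingRequestor(Object key);
    }

    static boolean shouldKill(RemoteLockHandlingSketch.RemoteTx tx,
                              Object keyItMayBeAcquiring,
                              LockIntrospection locks) {
        // Only kill the remote tx if nobody is mid-acquisition on its key.
        // With 2+ transactions competing for the same key this is not enough,
        // because only the single active requestor is visible.
        return tx.lockedKey == null
                && !locks.hasPendingRequestor(keyItMayBeAcquiring);
    }
}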

 

What do you guys think?

 

Thanks,

 

Erik
