EDIT:
I left out a paragraph (going cross-eyed looking at TRACE logs)
The LockingInterceptors have code to wait on pending locks from old views,
but at this time both nodes had installed the new view info. I was thinking
about adding "pending" locks on a to the AbstractCacheTransaction and when a
lock is registered, atomically removing and updating the lock list. This
way, the StaleTxCleanupService can skip over pending locks that will belong
to the same node. Also, this could be forwarded with the TransactionInfo
(before the StaleTx/Lock Breaking service?). But this seems too simplistic.
Thanks,
Erik
From: infinispan-dev-bounces(a)lists.jboss.org
[mailto:infinispan-dev-bounces@lists.jboss.org] On Behalf Of Erik Salter
Sent: Friday, October 05, 2012 9:51 PM
To: 'infinispan -Dev List'
Subject: [infinispan-dev] How to solve ISPN-2291?
Hi all,
This is related to ISPN-2291. I think I know what the root cause of this,
and it's pretty insidious. However, I don't really know the best way to
solve it.
Here's the scenario. A cluster is happily running, acquiring locks, etc.,
and all of a sudden, a node is killed. The new view is passed up, the new
CH is calculated, and the StaleTransactionCleanupService starts its job.
However, it doesn't account for registered transactions trying to acquire
locks. Consider the following 3-node cluster:
A transaction is started on a node 2:
2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.TransactionTable]
(pool-40-thread-1) Created a new local transaction:
LocalTransaction{remoteLockedNodes=null, isMarkedForRollback=false,
lockedKeys=null, backupKeyLocks=null, viewId=4}
org.infinispan.transaction.synchronization.SyncLocalTransaction@2ad
2012-10-05 17:28:24,586 TRACE
[org.infinispan.interceptors.locking.ClusteringDependentLogic]
(pool-40-thread-1) My address is phl-dg2-45331(phl). Am I main owner? -
false
2012-10-05 17:28:24,586 TRACE [org.infinispan.transaction.LocalTransaction]
(pool-40-thread-1) Adding remote locks on [phl-dg3-37382(phl)]. Remote locks
are null
2012-10-05 17:28:24,586 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(pool-40-thread-1) phl-dg2-45331(phl) broadcasting call
LockControlCommand{cache=serviceGroup,
keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null,
unlock=false} to recipient list [phl-dg3-37382(phl)]
.
2012-10-05 17:28:24,665 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(pool-40-thread-1) Response: SuccessfulResponse{responseValue=true}
2012-10-05 17:28:24,665 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl]
(pool-40-thread-1) Response(s) to LockControlCommand{cache=serviceGroup,
keys=[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]], flags=null,
unlock=false} is {phl-dg3-37382(phl)=SuccessfulResponse{responseValue=true}
}
Node 3 receives this message and registers a RemoteTransaction and tries to
acquire the lock.
2012-10-05 17:28:24,603 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Created and registered remote
transaction RemoteTransaction{modifications=[], lookedUpEntries={},
lockedKeys= null, backupKeyLocks null, missingLookedUpEntries false,
tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote}
.
2012-10-05 17:28:24,605 TRACE
[org.infinispan.interceptors.locking.PessimisticLockingInterceptor]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Locking key
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201], no need to check for
pending locks.
2012-10-05 17:28:24,605 TRACE
[org.infinispan.util.concurrent.locks.LockManagerImpl]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Attempting to lock
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201] with acquisition timeout
of 5000 millis
2012-10-05 17:28:24,660 TRACE
[org.infinispan.util.concurrent.locks.LockManagerImpl]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Successfully acquired lock
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]!
2012-10-05 17:28:24,660 TRACE
[org.infinispan.transaction.AbstractCacheTransaction]
(OOB-16,erm-cluster,phl-dg3-37382(phl)) Registering locked key:
ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]
The bolded parts are important. The RemoteTransaction is in the
TransactionTable, but the lock has not yet been acquired. The lock is not
registered with the transaction until that time.
The first node was killed around this time. Node 2 detects this and
executes the StaleTransactionCleanupService. Here's where the issue comes
in:
2012-10-05 17:28:24,626 DEBUG
[org.infinispan.topology.LocalTopologyManagerImpl]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Updating local consistent hash(es)
for cache serviceGroup: .
2012-10-05 17:28:24,628 TRACE
[org.infinispan.transaction.StaleTransactionCleanupService]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Saw 1 leavers - kicking off a lock
breaking task
.
2012-10-05 17:28:24,628 TRACE
[org.infinispan.transaction.StaleTransactionCleanupService]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Unlocking keys for which we are no
longer an owner
2012-10-05 17:28:24,629 TRACE
[org.infinispan.transaction.StaleTransactionCleanupService]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Killing remote transaction without
any local keys GlobalTransaction:<phl-dg2-45331(phl)>:685:remote
2012-10-05 17:28:24,629 TRACE
[org.infinispan.commands.tx.AbstractTransactionBoundaryCommand]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) About to execute tx command
RollbackCommand {gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote,
cacheName='serviceGroup', topologyId=-1}
2012-10-05 17:28:24,629 TRACE
[org.infinispan.interceptors.InvocationContextInterceptor]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Invoked with command RollbackCommand
{gtx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote,
cacheName='serviceGroup', topologyId=-1} and InvocationContext
[org.infinispan.context.impl.RemoteTxInvocationContext@a7d0d9ae]
2012-10-05 17:28:24,629 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed
RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys= null,
backupKeyLocks null, missingLookedUpEntries false,
tx=GlobalTransaction:<phl-dg2-45331(phl)>:685:remote} from transaction
table.
2012-10-05 17:28:24,630 TRACE
[org.infinispan.transaction.AbstractCacheTransaction]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Transaction
GlobalTransaction:<phl-dg2-45331(phl)>:685:remote has completed, notifying
listening threads.
2012-10-05 17:28:24,630 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Removed local transaction
GlobalTransaction:<phl-dg2-45331(phl)>:685:remote? true
2012-10-05 17:28:24,630 TRACE [org.infinispan.interceptors.CallInterceptor]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Suppressing invocation of method
handleRollbackCommand.
2012-10-05 17:28:24,630 TRACE
[org.infinispan.transaction.AbstractCacheTransaction]
(OOB-15,erm-cluster,phl-dg3-37382(phl)) Clearing locked keys: null
Aside - since the tx was started on node 2, why was the rollback suppressed?
(Node 2 tries to commit the transaction)
Since there isn't a lock registered, it removes the transaction from the
table. But when the lock is acquired, it's placed in the LockContainer
(with the owner being the cleaned up transaction). Subsequently, the lock
never cleans up:
org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock
after [5 seconds] on key
[ServiceGroupKey[edgeDeviceId=2,serviceGroupNo=201]] for requestor
[GlobalTransaction:<phl-dg2-45331(phl)>:1100:remote]! Lock held by
[GlobalTransaction:<phl-dg2-45331(phl)>:685:remote]
I was experimenting with exposing some of the internals of the
LockManager/LockContainers to see if there's a requestor on the stack before
deciding to rollback the transaction. In this case, the lock would have
still been owned by the same node, so it wouldn't have been cleaned up. But
this seems very, well, risky (like what happens if 2+ txs are requesting the
same lock? We only know about 1 requestor, not any others queued.) Okay,
more like "stupid"
What do you guys think?
Thanks,
Erik