During state transfer, a pessimistic transaction can fail due to a
ClusteredGetCommand forcing a lock command
-------------------------------------------------------------------------------------------------------------
Key: ISPN-4872
URL:
https://issues.jboss.org/browse/ISPN-4872
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 5.2.7.Final, 6.0.2.Final, 7.0.0.CR2
Reporter: Erik Salter
Assignee: Dan Berindei
Fix For: 7.0.0.Final
This is a strange one:
A pessimistic transaction started on the primary owner during state transfer can fail
because a backup owner issues a ClusteredGetCommand to the primary while processing the
write command. This can happen if the backup needs to perform a remote get (due to a
DELTA_WRITE, conditional, or reliable return values). In this case, it's a
DELTA_WRITE.
In this chain of events, state transfer is ongoing, and the union CH is (east-dht5,
west-dht5, east-dht6). In this case, the pessimistic transaction originating on east-dht5
must be invoked on west-dht5 and east-dht6:
{code}
2014-10-21 02:15:01,309 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Created
a new local transaction: LocalXaTransaction{xid=null}
LocalTransaction{remoteLockedNodes=null, isMarkedForRollback=false, lockedKeys=null,
backupKeyLocks=null, isFromStateTransfer=false}
globalTx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local,
topologyId=33, age(ms)=0
2014-10-21 02:15:01,309 TRACE
[org.infinispan.util.concurrent.locks.containers.OwnableReentrantPerEntryLockContainer]
(OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO))
Creating and acquiring new lock instance for key
EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831]
2014-10-21 02:15:01,345 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO))
Replication task sending PrepareCommand
{modifications=[PutKeyValueCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831],
value=QamResource [qamId=431831, currentBandwidth=3750000, maxBandwidth=38810000,
maxPrograms=255, currentPrograms=1], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE],
putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1, successful=true,
ignorePreviousValue=false}], onePhaseCommit=true,
gtx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local,
cacheName='qamAllocation', topologyId=33} to addresses
[240-west-dht5.comcast.net-35898(CH2-Chicago-IL),
240-east-dht6.comcast.net-47049(CMC-Denver-CO)] with response mode GET_ALL
{code}
Let's examine east-dht6. It receives the write request and needs to perform a remote
get to pull the full key context, since it receives only the delta.
{code}
2014-10-21 02:15:01,362 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-2149,session-resource-cluster,240-east-dht6.comcast.net-47049(CMC-Denver-CO))
Attempting to execute command: PrepareCommand
{modifications=[PutKeyValueCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831],
value=QamResourceDelta [qamId=431831, basePort=2049, currentBandwidth=3750000,
maxBandwidth=38810000, maxPrograms=255, currentPrograms=1, baseProgramNumber=1,
portStep=1, program=Program [portNumber=2053, programNumber=5, status=UNUSED,
sessionToken=, lastUpdatedTime=1413872101310, bandwidth=0, inputId=-1]],
flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE], putIfAbsent=false, lifespanMillis=-1,
maxIdleTimeMillis=-1, successful=true, ignorePreviousValue=false}], onePhaseCommit=true,
gtx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local,
cacheName='qamAllocation', topologyId=33}
[sender=240-east-dht5.comcast.net-26106(CMC-Denver-CO)]
...
2014-10-21 02:15:01,362 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-2149,session-resource-cluster,240-east-dht6.comcast.net-47049(CMC-Denver-CO)) Created
and registered remote transaction RemoteTransaction{modifications=...
globalTx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:remote,
topologyId=33, age(ms)=0
2014-10-21 02:15:01,362 TRACE
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
(OOB-2149,session-resource-cluster,240-east-dht6.comcast.net-47049(CMC-Denver-CO))
Replication task sending
ClusteredGetCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831],
flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE]} to addresses
[240-east-dht5.comcast.net-26106(CMC-Denver-CO),
240-west-dht5.comcast.net-35898(CH2-Chicago-IL)] with response mode GET_FIRST
{code}
east-dht5 receives the ClusteredGetCommand request. Because the condition at line 325 of
TxDistributionInterceptor evaluates to true, the acquireRemoteLock flag is set to true.
{code}
boolean acquireRemoteLock = false;
if (ctx.isInTxScope()) {
   TxInvocationContext txContext = (TxInvocationContext) ctx;
   acquireRemoteLock = isWrite && isPessimisticCache && !txContext.getAffectedKeys().contains(key);
}
{code}
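The condition can be modeled in isolation (a minimal sketch with a hypothetical simplified signature; the real interceptor reads these values from the InvocationContext):

```java
import java.util.Set;

// Simplified model of the TxDistributionInterceptor check above. A pessimistic
// write that triggers a remote get for a key the transaction has not yet
// touched ends up requesting a remote lock as a side effect.
public class RemoteLockCheck {

    static boolean acquireRemoteLock(boolean inTxScope, boolean isWrite,
                                     boolean isPessimisticCache,
                                     Set<String> affectedKeys, String key) {
        boolean acquire = false;
        if (inTxScope) {
            acquire = isWrite && isPessimisticCache && !affectedKeys.contains(key);
        }
        return acquire;
    }

    public static void main(String[] args) {
        // Backup owner fetching the full value for a DELTA_WRITE: the flag is
        // set, so the primary will create a LockControlCommand when it serves
        // the ClusteredGetCommand.
        System.out.println(acquireRemoteLock(true, true, true, Set.of(), "k1"));
    }
}
```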
Thus, when east-dht5 runs the ClusteredGetCommand, it will create and invoke a
LockControlCommand. Because the LCC is created inline, it will not have its origin set,
but the LCC will still create a RemoteTxInvocationContext. When the LCC runs through the
interceptor chain, TxInterceptor::invokeNextInterceptorAndVerifyTransaction will check the
originator (because of the remote context). When it doesn't find one, it forces a
rollback.
{code}
2014-10-21 02:15:01,347 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl]
(OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Calling
perform() on
ClusteredGetCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831],
flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE]}
2014-10-21 02:15:01,347 TRACE [org.infinispan.transaction.TransactionTable]
(OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Created
and registered remote transaction RemoteTransaction{modifications=[], lookedUpEntries={},
lockedKeys=null, backupKeyLocks=null, missingLookedUpEntries=false,
isMarkedForRollback=false}
globalTx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local,
topologyId=33, age(ms)=0
2014-10-21 02:15:01,347 TRACE [org.infinispan.statetransfer.StateTransferInterceptor]
(OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO))
handleTopologyAffectedCommand for command LockControlCommand{cache=qamAllocation,
keys=[EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831]],
flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE], unlock=false}
2014-10-21 02:15:01,347 TRACE
[org.infinispan.interceptors.locking.PessimisticLockingInterceptor]
(OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Locking
key EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], no need to check for
pending locks.
2014-10-21 02:15:01,347 TRACE [org.infinispan.interceptors.TxInterceptor]
(OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO))
invokeNextInterceptorAndVerifyTransaction :: originatorMissing=true,
alreadyCompleted=false
2014-10-21 02:15:01,347 TRACE [org.infinispan.interceptors.TxInterceptor]
(OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Rolling
back remote transaction
GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local
because either already completed(false) or originator(null) no longer in the
cluster(true).
// ROLLBACK HAPPENS HERE
{code}
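The null-origin failure can be modeled abstractly (hypothetical simplified types, not Infinispan's API): because the inline LockControlCommand never has its origin set, the liveness check cannot distinguish "origin never set" from "originator left the cluster":

```java
import java.util.Set;

// Simplified model of TxInterceptor's originator check. A null origin (the
// inline-created LockControlCommand case) looks exactly like a departed node,
// so the remote transaction is rolled back.
public class OriginatorCheck {

    static boolean originatorMissing(String origin, Set<String> members) {
        return origin == null || !members.contains(origin);
    }

    public static void main(String[] args) {
        Set<String> members = Set.of("east-dht5", "west-dht5", "east-dht6");
        // Origin never set on the inline LCC -> spurious rollback:
        System.out.println(originatorMissing(null, members));        // true
        // If the LCC inherited the ClusteredGetCommand's origin instead:
        System.out.println(originatorMissing("east-dht6", members)); // false
    }
}
```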
When the transaction attempts to commit, it fails due to the spurious rollback. For
instance:
{code}
2014-10-21 02:15:01,384 WARN [org.infinispan.remoting.rpc.RpcManagerImpl]
(OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO))
ISPN000071: Caught exception when handling command PrepareCommand
{modifications=[PutKeyValueCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831],
value=QamResource [qamId=431831, currentBandwidth=3750000, maxBandwidth=38810000,
maxPrograms=255, currentPrograms=1], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE],
putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1, successful=true,
ignorePreviousValue=false}], onePhaseCommit=true,
gtx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local,
cacheName='qamAllocation', topologyId=33}
org.infinispan.remoting.RemoteException: ISPN000217: Received exception from
240-west-dht5.comcast.net-35898(CH2-Chicago-IL), see cause for remote stack trace
...
at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:124)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:343)
at
org.infinispan.transaction.TransactionCoordinator.commit(TransactionCoordinator.java:175)
at
org.infinispan.transaction.xa.TransactionXaAdapter.commit(TransactionXaAdapter.java:122)
...
com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelOnePhaseCommit(XAResourceRecord.java:636)
at com.arjuna.ats.arjuna.coordinator.BasicAction.onePhaseCommit(BasicAction.java:2619)
at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1779)
at
com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:88)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:177)
Caused by: org.infinispan.transaction.xa.InvalidTransactionException: This remote
transaction
GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local is
already rolled back
at
org.infinispan.transaction.RemoteTransaction.checkIfRolledBack(RemoteTransaction.java:137)
at
org.infinispan.transaction.RemoteTransaction.putLookedUpEntry(RemoteTransaction.java:73)
at
org.infinispan.context.impl.RemoteTxInvocationContext.putLookedUpEntry(RemoteTxInvocationContext.java:99)
at
org.infinispan.container.EntryFactoryImpl.wrapInternalCacheEntryForPut(EntryFactoryImpl.java:242)
at org.infinispan.container.EntryFactoryImpl.wrapEntryForPut(EntryFactoryImpl.java:166)
...
{code}
So it looks like there are a couple of ways to handle this. One would be to acquire
a remote lock in TxDistributionInterceptor only if the current node is the originator.
Another, possibly less intrusive, option would be to set the origin of the invoked LCC
to that of the ClusteredGetCommand. The former seems preferable, but this
code tends to be a bit labyrinthine with the various pessimistic use cases out there.
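A sketch of the first option, assuming the interceptor can ask the context whether this node originated the transaction (the originLocal parameter here is a hypothetical stand-in for that check; this is not the actual patch):

```java
import java.util.Set;

// Sketch of the first proposed fix: only request a remote lock when this node
// originated the transaction, so a backup owner's remote get never spawns a
// LockControlCommand on the primary.
public class OriginatorOnlyLock {

    static boolean acquireRemoteLock(boolean inTxScope, boolean originLocal,
                                     boolean isWrite, boolean isPessimisticCache,
                                     Set<String> affectedKeys, String key) {
        boolean acquire = false;
        if (inTxScope && originLocal) { // extra guard: originator only
            acquire = isWrite && isPessimisticCache && !affectedKeys.contains(key);
        }
        return acquire;
    }

    public static void main(String[] args) {
        // east-dht6 (a backup owner, not the originator) would no longer set
        // the flag while processing the replicated write:
        System.out.println(acquireRemoteLock(true, false, true, true, Set.of(), "k1"));
    }
}
```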