[jbosscache-dev] LockParentForChildInsertRemove and PessimisticLocking

Brian Stansberry brian.stansberry at redhat.com
Wed Aug 5 21:22:24 EDT 2009


I think I may have the WL figured out. Maybe. :)

The key thing is that DataGravitationCleanupCommand is sent to all nodes, 
not just the buddies. The result is that caches that never stored the fqn 
will be attempting to remove a node that doesn't exist.

In this case PessimisticNodeBasedLockManager.lock(...) will attempt to 
create the node to be removed, as well as all nodes leading to it, and 
will acquire a WL on every node it creates. It sets a local variable 
"created" to "true" when it creates a node, and once that variable is 
"true" it will try to acquire a WL on any node it locks.
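
Roughly, that loop looks like this (a paraphrase from memory, not the 
actual source; acquireNodeLock(...) and writeLockNeeded(...) are real 
names from the stack trace below, the rest is illustrative):

// Paraphrased sketch of PessimisticNodeBasedLockManager.lock(...) --
// illustrative only, not the real code.
boolean created = false;
for (int i = 0; i < fqn.size(); i++)
{
   Fqn currentNodeFqn = fqn.getSubFqn(0, i + 1);
   NodeSPI currentNode = dataContainer.peek(currentNodeFqn, true, true);
   if (currentNode == null)
   {
      currentNode = createNode(currentNodeFqn); // didn't exist; create it
      created = true;                           // note: never cleared again
   }
   // once "created" is true, every lock from here on is a WL, even on
   // pre-existing ancestors like / that only need a RL
   LockType lockType = (created || writeLockNeeded(ctx, currentNode))
         ? LockType.WRITE : LockType.READ;
   acquireNodeLock(ctx, currentNode, lockType);
}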

I suspect the problem comes here:

// make sure the lock we acquired isn't on a deleted node/is an orphan!!
// look into invalidated nodes as well
NodeSPI repeek = dataContainer.peek(currentNodeFqn, true, true);
if (currentNode != repeek)
{
   // A bunch of "reset" logic
}

The "reset logic" involves resetting variables so the loop will start 
over higher in the tree. But it doesn't "reset" the "created" flag, so 
when it starts over it starts trying to acquire WLs on nodes like / and 
/BUDDY_MANAGER where it only needs a RL.
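
If that's the bug, the fix may be as small as clearing the flag in that 
reset block. A sketch of what I have in mind (untested):

NodeSPI repeek = dataContainer.peek(currentNodeFqn, true, true);
if (currentNode != repeek)
{
   // ... the existing "reset" logic: back up the tree, restart the loop ...
   created = false; // proposed addition: forget creations from the aborted pass
}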

The unreset flag also explains why the thread is trying to "upgrade": on 
an earlier pass through the loop it got a RL on the node; on the second 
pass it's trying for a WL.
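
For anyone who hasn't hit this before, the deadlock shape is easy to 
reproduce outside JBC. Here's a standalone illustration with plain 
java.util.concurrent rather than JBC's ReadWriteLockWithUpgrade 
(ReentrantReadWriteLock refuses read-to-write upgrades even for a lone 
reader, but the two-reader case deadlocks the same way as in the log 
below):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class UpgradeDemo
{
   static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

   public static void main(String[] args) throws InterruptedException
   {
      Runnable reader = new Runnable()
      {
         public void run()
         {
            lock.readLock().lock(); // first pass through the loop: RL
            try
            {
               // second pass: try to "upgrade" to a WL while the other
               // thread still holds its RL; neither thread can ever succeed
               boolean upgraded = lock.writeLock().tryLock(1, TimeUnit.SECONDS);
               System.out.println(Thread.currentThread().getName()
                     + " upgraded? " + upgraded);
               if (upgraded) lock.writeLock().unlock();
            }
            catch (InterruptedException ignored)
            {
            }
            finally
            {
               lock.readLock().unlock();
            }
         }
      };
      Thread t1 = new Thread(reader, "gtx-69");
      Thread t2 = new Thread(reader, "gtx-70");
      t1.start();
      t2.start();
      t1.join();
      t2.join(); // both print "upgraded? false", like the UpgradeException
   }
}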

I'm not exactly sure why the if (currentNode != repeek) test is being 
tripped, but it's happening when a bunch of threads are all doing data 
gravitation cleanups, so there's a lot of concurrent creation of 
nodes/parents/grandparents just so they can be deleted.

I'll see if I can patch this. I won't commit any fix; I'll just attach 
it to the JIRA. I played with your DGTest and couldn't produce a 
failure, but I instrumented the code with logging and can definitely see 
WLs being acquired. Now that I understand this better I can probably 
make a variant of DGTest that fails.
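
Something along these lines, maybe (an untested sketch, not the real 
test; it just fires concurrent removes of nonexistent deep nodes at a 
pessimistically-locked cache, which is what the broadcast cleanup 
commands amount to on caches that never stored the session):

import java.util.concurrent.CountDownLatch;

import org.jboss.cache.Cache;
import org.jboss.cache.DefaultCacheFactory;
import org.jboss.cache.Fqn;
import org.jboss.cache.config.Configuration;

public class ConcurrentCleanupSketch
{
   public static void main(String[] args) throws Exception
   {
      Configuration cfg = new Configuration();
      cfg.setNodeLockingScheme(Configuration.NodeLockingScheme.PESSIMISTIC);
      final Cache<String, String> cache =
            new DefaultCacheFactory<String, String>().createCache(cfg);
      final CountDownLatch start = new CountDownLatch(1);

      Thread[] threads = new Thread[20];
      for (int i = 0; i < threads.length; i++)
      {
         final int id = i;
         threads[i] = new Thread("cleanup-" + id)
         {
            public void run()
            {
               try
               {
                  start.await();
                  // remove a node that was never created; the pessimistic
                  // lock manager has to create /JSESSION,
                  // /JSESSION/st_localhost and the leaf just to delete it,
                  // which is the path that sets the "created" flag
                  cache.removeNode(Fqn.fromString("/JSESSION/st_localhost/session" + id));
               }
               catch (Exception e)
               {
                  e.printStackTrace();
               }
            }
         };
         threads[i].start();
      }
      start.countDown(); // release all the threads at once
      for (Thread t : threads) t.join();
      cache.stop();
   }
}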


Brian Stansberry wrote:
> Just in case it matters, suggest you use a deeper Fqn:
> 
> for (int i = 0; i < DATA; i++) c1.put("/JSESSION/st_localhost/" + i, 
> "k", "v");
> 
> Perhaps add a listener to cause a thread to block on reading 
> /JSESSION/st_localhost/0? Confirm that the others proceed? That might 
> help expose if/why a WL is being acquired.
> 
> Manik Surtani wrote:
>>
>> On 5 Aug 2009, at 16:42, Brian Stansberry wrote:
>>
>>> Sorry, nope. :(
>>>
>>> I might (30% chance) be able to create one later today.
>>
>> I've started on a test (attached to JIRA) but no luck in reproducing 
>> this so far.  Care to cast your eye over it to validate the use case?
>>
>> Cheers
>> Manik
>>
>>>
>>> Manik Surtani wrote:
>>>> You wouldn't have this encapsulated as a unit test, would you?
>>>> On 5 Aug 2009, at 14:42, Brian Stansberry wrote:
>>>>> Paul and I have been poking through code trying to diagnose the issue
>>>>> that led to this thread for quite a while yesterday. Time to throw it
>>>>> out there and hope someone spots something:
>>>>>
>>>>> This is web session replication with buddy replication. The
>>>>> "LockParentForChildInsertRemove" property isn't configured on the 
>>>>> cache,
>>>>> so the default "false" should be in effect. A thread trying to 
>>>>> execute a
>>>>> remote DataGravitationCleanupCommand on a session node fails upgrading
>>>>> to a write lock on fqn /_BUDDY_BACKUP_.[1] We can't understand why it
>>>>> would want a WL.
>>>>>
>>>>> Looking at the PessimisticNodeBasedLockManager code, the only 
>>>>> reasons I
>>>>> see why a WL would be requested here are:
>>>>>
>>>>> 1) The invocation created the previously non-existent /BUDDY_MANAGER
>>>>> node in order to execute the remove of the
>>>>> great-great-great-great-grandchild. Highly unlikely in general and
>>>>> logging shows another thread successfully doing a put on a different
>>>>> great-great-great-great-grandchild a millisecond earlier. (Also other
>>>>> test runs w/ less verbose logging showed failures getting a WL on 
>>>>> fqn /
>>>>> which of course exists).
>>>>>
>>>>> 2) In the call to PNBLM.writeLockNeeded(...),
>>>>> currentNode.isLockForChildInsertRemove() is returning "true", which
>>>>> AFAICT it shouldn't be.
>>>>>
>>>>> [1] The logging from the thread:
>>>>>
>>>>> [JBoss] 20:29:27,488 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-5,10.34.32.154:46064) Executing command: DataGravitationCleanupCommand{fqn=/JSESSION/st_localhost/wUaU4FhUzWcuiDEjkJzDiQ__, backup=/_BUDDY_BACKUP_/10.34.32.155_41754:DEAD/1/JSESSION/st_localhost/wUaU4FhUzWcuiDEjkJzDiQ__} [sender=10.34.32.153:24595]
>>>>> [JBoss] 20:29:27,489 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-5,10.34.32.154:46064) This is a non-visitable command - so performing directly and not via the invoker.
>>>>> [JBoss] 20:29:42,497 ERROR [org.jboss.cache.lock.ReadWriteLockWithUpgrade] (Incoming-5,10.34.32.154:46064) upgradeLock(): failed
>>>>> [JBoss] 20:29:42,500 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-5,10.34.32.154:46064) Problems invoking command.
>>>>> [JBoss] org.jboss.cache.lock.UpgradeException: upgrade lock for /_BUDDY_BACKUP_ could not be acquired after 15000 ms. Lock map ownership
>>>>> Read lock owners: [GlobalTransaction:<10.34.32.153:24595>:70, GlobalTransaction:<10.34.32.153:24595>:69]
>>>>> [JBoss] Write lock owner: null
>>>>> [JBoss]  (caller=Thread[Incoming-5,10.34.32.154:46064,5,Thread Pools], lock info: read owners=[GlobalTransaction:<10.34.32.153:24595>:70, GlobalTransaction:<10.34.32.153:24595>:69] (activeReaders=2, activeWriter=null, waitingReaders=1, waitingWriters=0, waitingUpgrader=0))
>>>>> [JBoss]     at org.jboss.cache.lock.IdentityLock.acquireWriteLock0(IdentityLock.java:227)
>>>>> [JBoss]     at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:176)
>>>>> [JBoss]     at org.jboss.cache.lock.IdentityLock.acquire(IdentityLock.java:503)
>>>>> [JBoss]     at org.jboss.cache.lock.PessimisticNodeBasedLockManager.acquireNodeLock(PessimisticNodeBasedLockManager.java:260)
>>>>> [JBoss]     at org.jboss.cache.lock.PessimisticNodeBasedLockManager.lock(PessimisticNodeBasedLockManager.java:169)
>>>>> [JBoss]     at org.jboss.cache.lock.PessimisticNodeBasedLockManager.lockPessimistically(PessimisticNodeBasedLockManager.java:97)
>>>>> [JBoss]     at org.jboss.cache.interceptors.PessimisticLockInterceptor.handleRemoveNodeCommand(PessimisticLockInterceptor.java:265)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.PrePostProcessingCommandInterceptor.visitRemoveNodeCommand(PrePostProcessingCommandInterceptor.java:126)
>>>>> [JBoss]     at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss]     at org.jboss.cache.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:148)
>>>>> [JBoss]     at org.jboss.cache.interceptors.ReplicationInterceptor.visitRemoveNodeCommand(ReplicationInterceptor.java:125)
>>>>> [JBoss]     at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
>>>>> [JBoss]     at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>> [JBoss]     at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss]     at org.jboss.cache.interceptors.TxInterceptor.attachGtxAndPassUpChain(TxInterceptor.java:301)
>>>>> [JBoss]     at org.jboss.cache.interceptors.TxInterceptor.handleDefault(TxInterceptor.java:283)
>>>>> [JBoss]     at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>> [JBoss]     at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
>>>>> [JBoss]     at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>> [JBoss]     at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss]     at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss]     at org.jboss.cache.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:178)
>>>>> [JBoss]     at org.jboss.cache.interceptors.InvocationContextInterceptor.visitRemoveNodeCommand(InvocationContextInterceptor.java:88)
>>>>> [JBoss]     at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss]     at org.jboss.cache.interceptors.InterceptorChain.invoke(InterceptorChain.java:287)
>>>>> [JBoss]     at org.jboss.cache.commands.remote.DataGravitationCleanupCommand.executeRemove(DataGravitationCleanupCommand.java:151)
>>>>> [JBoss]     at org.jboss.cache.commands.remote.DataGravitationCleanupCommand.perform(DataGravitationCleanupCommand.java:105)
>>>>> [JBoss]     at org.jboss.cache.marshall.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:305)
>>>>> [JBoss]     at org.jboss.cache.marshall.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:242)
>>>>> [JBoss]     at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:637)
>>>>> [JBoss]     at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:545)
>>>>> [JBoss]     at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:368)
>>>>> [JBoss]     at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:775)
>>>>> [JBoss]     at org.jgroups.JChannel.up(JChannel.java:1339)
>>>>> [JBoss]     at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:462)
>>>>> [JBoss]     at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:432)
>>>>> [JBoss]     at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:144)
>>>>> [JBoss]     at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
>>>>> [JBoss]     at org.jgroups.protocols.FC.up(FC.java:473)
>>>>> [JBoss]     at org.jgroups.protocols.pbcast.GMS.up(GMS.java:824)
>>>>> [JBoss]     at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:233)
>>>>> [JBoss]     at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:616)
>>>>> [JBoss]     at org.jgroups.protocols.UNICAST.up(UNICAST.java:282)
>>>>> [JBoss]     at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:758)
>>>>> [JBoss]     at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:167)
>>>>> [JBoss]     at org.jgroups.protocols.FD.up(FD.java:284)
>>>>> [JBoss]     at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:309)
>>>>> [JBoss]     at org.jgroups.protocols.MERGE2.up(MERGE2.java:144)
>>>>> [JBoss]     at org.jgroups.protocols.Discovery.up(Discovery.java:264)
>>>>> [JBoss]     at org.jgroups.protocols.PING.up(PING.java:273)
>>>>> [JBoss]     at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2327)
>>>>> [JBoss]     at org.jgroups.protocols.TP.passMessageUp(TP.java:1261)
>>>>> [JBoss]     at org.jgroups.protocols.TP.access$100(TP.java:49)
>>>>> [JBoss]     at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1838)
>>>>> [JBoss]     at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1817)
>>>>> [JBoss]     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>> [JBoss]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>> [JBoss]     at java.lang.Thread.run(Thread.java:619)
>>>>>
>>>>> Manik Surtani wrote:
>>>>>>
>>>>>> On 4 Aug 2009, at 21:28, Brian Stansberry wrote:
>>>>>>
>>>>>>> Continuing dialogue with myself. :)
>>>>>>>
>>>>>>> I've found the logic in PessimisticNodeBasedLockManager where the
>>>>>>> LockParentForChildInsertRemove configuration is meant to come
>>>>>>> through via the Node.
>>>>>>
>>>>>> Yes, this is because you can override this on a per-node basis.
>>>>>>
>>>>>>> In our testing we're seeing cases where write
>>>>>>> locks are being acquired on parent nodes, which isn't the intent. 
>>>>>>> I'll
>>>>>>> dig further.
>>>>>>>
>>>>>>> Brian Stansberry wrote:
>>>>>>>> https://jira.jboss.org/jira/browse/JBCACHE-1527
>>>>>>>>
>>>>>>>> Brian Stansberry wrote:
>>>>>>>>> From looking at the JBC 3 code, it seems the
>>>>>>>>> LockParentForChildInsertRemove configuration is no longer 
>>>>>>>>> respected for
>>>>>>>>> pessimistic locking.  I can't trace any path from the property in
>>>>>>>>> Configuration to code that uses it.
>>>>>>>>> PessimisticLockInterceptor.handlePutCommand, handleMoveCommand and
>>>>>>>>> handleRemoveNodeCommand always tell the lock manager to lock
>>>>>>>>> parents; handleEvictFqnCommand always tells it not to.
>>>>>>>>> This is causing failures in buddy replication testing when nodes
>>>>>>>>> join/leave clusters under load. There's a lot of data gravitation,
>>>>>>>>> plus stuff like migrating defunct backup trees to "DEAD" regions:
>>>>>>>>> too much contention for parent-level locks.
>>>>>>>>>
>>>>>>>>> Plus locking on the structural parent to add/remove session nodes
>>>>>>>>> will suck for the session caching use case.
>>>>>>>>>


-- 
Brian Stansberry
Lead, AS Clustering
JBoss by Red Hat

