Patch attached to the JIRA. I was wrong in thinking I could adapt DGTest
to expose the issue, but in any case the testsuite passes with the patch.
Brian Stansberry wrote:
I think I may have the WL issue figured out. Maybe. :)
The key thing is that DataGravitationCleanupCommand is sent to all
nodes, not just buddies. The result is that caches which never stored
the fqn will be attempting to remove a node that doesn't exist.

In this case PessimisticNodeBasedLockManager.lock(...) will attempt to
create the node to be removed, as well as all nodes leading to it, and
will acquire a WL on every node it creates. It sets a local variable
"created" to "true" when it creates a node, and once that variable is
"true" it will try to acquire a WL on any node it locks.
I suspect the problem comes here:
// make sure the lock we acquired isn't on a deleted node/is an orphan!!
// look into invalidated nodes as well
NodeSPI repeek = dataContainer.peek(currentNodeFqn, true, true);
if (currentNode != repeek)
{
   // A bunch of "reset" logic
}
The "reset logic" involves resetting variables so the loop will start
over higher in the tree. But it doesn't "reset" the "created"
flag, so
when it starts over it starts trying to acquire WLs on nodes like / and
/BUDDY_MANAGER where it only needs a RL.
This explains why the thread is trying to "upgrade": on an earlier pass
through the loop it got a RL; on the second pass it's trying for a WL.
I'm not exactly sure why the if (currentNode != repeek) test fails, but
it's happening when a bunch of threads are all doing data gravitation
cleanups. So there's a lot of concurrent creation of
nodes/parents/grandparents just so they can be deleted.
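
If I'm right, the fix is probably just clearing the flag in the reset
branch. A rough sketch of what I mean; aside from "created", "repeek"
and the peek call quoted above, the loop structure here is paraphrased
from memory, so don't read this as the actual patch:

boolean created = false;
while (currentNode != null)
{
   // ... locate or create the next node on the path; creating one sets
   // created = true, which forces a WL on every node locked after that ...

   // make sure the lock we acquired isn't on a deleted node/is an orphan!!
   NodeSPI repeek = dataContainer.peek(currentNodeFqn, true, true);
   if (currentNode != repeek)
   {
      // ... existing "reset" logic rewinds the walk higher in the tree ...
      // Also clear the flag, so the restarted pass goes back to RLs on
      // pre-existing ancestors like / and /_BUDDY_BACKUP_:
      created = false;
   }
}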
I'll see if I can patch this. I won't commit any fix; I'll just attach
it to the JIRA. I played with your DGTest and couldn't produce a
failure, but I instrumented the code with logging and can definitely
see WLs being acquired. Now that I understand this better I can
probably make a variant of DGTest that fails.
Brian Stansberry wrote:
> Just in case it matters, suggest you use a deeper Fqn:
>
> for (int i = 0; i < DATA; i++) c1.put("/JSESSION/st_localhost/" + i, "k", "v");
>
> Perhaps add a listener to cause a thread to block on reading
> /JSESSION/st_localhost/0 ? Confirm that the others proceed? That might
> help expose if/why a WL is being acquired.
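
For what it's worth, a blocking listener along those lines might look
roughly like this. It's a hypothetical helper, not part of the attached
test, and it assumes the JBC 3 notification API (@CacheListener /
@NodeVisited), so adjust names as needed:

import java.util.concurrent.CountDownLatch;
import org.jboss.cache.Fqn;
import org.jboss.cache.notifications.annotation.CacheListener;
import org.jboss.cache.notifications.annotation.NodeVisited;
import org.jboss.cache.notifications.event.NodeVisitedEvent;

@CacheListener
public class BlockingReadListener
{
   private final CountDownLatch latch = new CountDownLatch(1);

   @NodeVisited
   public void nodeVisited(NodeVisitedEvent e)
   {
      // Park whichever thread reads /JSESSION/st_localhost/0
      // until release() is called.
      if (e.isPre() && Fqn.fromString("/JSESSION/st_localhost/0").equals(e.getFqn()))
      {
         try
         {
            latch.await();
         }
         catch (InterruptedException ignored)
         {
         }
      }
   }

   public void release()
   {
      latch.countDown();
   }
}

Register it with c1.addCacheListener(new BlockingReadListener()) before
starting the reader threads; if the other readers still proceed while
one is parked, the shared ancestors are presumably only read-locked.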
>
> Manik Surtani wrote:
>> On 5 Aug 2009, at 16:42, Brian Stansberry wrote:
>>
>>> Sorry, nope. :(
>>>
>>> I might (30% chance) be able to create one later today.
>> I've started on a test (attached to JIRA) but no luck in reproducing
>> this so far. Care to cast your eye over it to validate the use case?
>>
>> Cheers
>> Manik
>>
>>> Manik Surtani wrote:
>>>> You wouldn't have this encapsulated as a unit test, would you?
>>>> On 5 Aug 2009, at 14:42, Brian Stansberry wrote:
>>>>> Paul and I have been poking through code trying to diagnose the
>>>>> issue that led to this thread for quite a while yesterday. Time to
>>>>> throw it out there and see if anyone hopefully spots something:
>>>>>
>>>>> This is web session replication with buddy replication. The
>>>>> "LockParentForChildInsertRemove" property isn't configured on the
>>>>> cache, so the default "false" should be in effect. A thread trying
>>>>> to execute a remote DataGravitationCleanupCommand on a session node
>>>>> fails upgrading to a write lock on fqn /_BUDDY_BACKUP_.[1] We can't
>>>>> understand why it would want a WL.
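
(As context, the programmatic equivalent of what the cache is relying
on here; a minimal sketch, assuming the JBC 3 Configuration setters,
where false is the documented default for this flag:

import org.jboss.cache.config.Configuration;

Configuration cfg = new Configuration();
cfg.setNodeLockingScheme(Configuration.NodeLockingScheme.PESSIMISTIC);
// Not set in our cache config, so the default (false) should apply;
// spelled out here only for clarity:
cfg.setLockParentForChildInsertRemove(false);
)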
>>>>>
>>>>> Looking at the PessimisticNodeBasedLockManager code, the only
>>>>> reasons I see why a WL would be requested here are:
>>>>>
>>>>> 1) The invocation created the previously non-existent
>>>>> /_BUDDY_BACKUP_ node in order to execute the remove of the
>>>>> great-great-great-great-grandchild. Highly unlikely in general, and
>>>>> logging shows another thread successfully doing a put on a different
>>>>> great-great-great-great-grandchild a millisecond earlier. (Also,
>>>>> other test runs w/ less verbose logging showed failures getting a WL
>>>>> on fqn /, which of course exists.)
>>>>>
>>>>> 2) In the call to PNBLM.writeLockNeeded(...),
>>>>> currentNode.isLockForChildInsertRemove() is returning "true", which
>>>>> AFAICT it shouldn't be.
>>>>>
>>>>> [1] The logging from the thread:
>>>>>
>>>>> [JBoss] 20:29:27,488 TRACE
>>>>> [org.jboss.cache.marshall.CommandAwareRpcDispatcher]
>>>>> (Incoming-5,10.34.32.154:46064) Executing command:
>>>>> DataGravitationCleanupCommand{fqn=/JSESSION/st_localhost/wUaU4FhUzWcuiDEjkJzDiQ__,
>>>>> backup=/_BUDDY_BACKUP_/10.34.32.155_41754:DEAD/1/JSESSION/st_localhost/wUaU4FhUzWcuiDEjkJzDiQ__}
>>>>> [sender=10.34.32.153:24595]
>>>>> [JBoss] 20:29:27,489 TRACE
>>>>> [org.jboss.cache.marshall.CommandAwareRpcDispatcher]
>>>>> (Incoming-5,10.34.32.154:46064) This is a non-visitable command - so
>>>>> performing directly and not via the invoker.
>>>>> [JBoss] 20:29:42,497 ERROR
>>>>> [org.jboss.cache.lock.ReadWriteLockWithUpgrade]
>>>>> (Incoming-5,10.34.32.154:46064) upgradeLock(): failed
>>>>> [JBoss] 20:29:42,500 TRACE
>>>>> [org.jboss.cache.marshall.CommandAwareRpcDispatcher]
>>>>> (Incoming-5,10.34.32.154:46064) Problems invoking command.
>>>>> [JBoss] org.jboss.cache.lock.UpgradeException: upgrade lock for
>>>>> /_BUDDY_BACKUP_ could not be acquired after 15000 ms. Lock map
>>>>> ownership Read lock owners: [GlobalTransaction:<10.34.32.153:24595>:70,
>>>>> GlobalTransaction:<10.34.32.153:24595>:69]
>>>>> [JBoss] Write lock owner: null
>>>>> [JBoss] (caller=Thread[Incoming-5,10.34.32.154:46064,5,Thread Pools],
>>>>> lock info: read owners=[GlobalTransaction:<10.34.32.153:24595>:70,
>>>>> GlobalTransaction:<10.34.32.153:24595>:69] (activeReaders=2,
>>>>> activeWriter=null, waitingReaders=1, waitingWriters=0,
>>>>> waitingUpgrader=0))
>>>>> [JBoss] at org.jboss.cache.lock.IdentityLock.acquireWriteLock0(IdentityLock.java:227)
>>>>> [JBoss] at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:176)
>>>>> [JBoss] at org.jboss.cache.lock.IdentityLock.acquire(IdentityLock.java:503)
>>>>> [JBoss] at org.jboss.cache.lock.PessimisticNodeBasedLockManager.acquireNodeLock(PessimisticNodeBasedLockManager.java:260)
>>>>> [JBoss] at org.jboss.cache.lock.PessimisticNodeBasedLockManager.lock(PessimisticNodeBasedLockManager.java:169)
>>>>> [JBoss] at org.jboss.cache.lock.PessimisticNodeBasedLockManager.lockPessimistically(PessimisticNodeBasedLockManager.java:97)
>>>>> [JBoss] at org.jboss.cache.interceptors.PessimisticLockInterceptor.handleRemoveNodeCommand(PessimisticLockInterceptor.java:265)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.PrePostProcessingCommandInterceptor.visitRemoveNodeCommand(PrePostProcessingCommandInterceptor.java:126)
>>>>> [JBoss] at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss] at org.jboss.cache.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:148)
>>>>> [JBoss] at org.jboss.cache.interceptors.ReplicationInterceptor.visitRemoveNodeCommand(ReplicationInterceptor.java:125)
>>>>> [JBoss] at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
>>>>> [JBoss] at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>> [JBoss] at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss] at org.jboss.cache.interceptors.TxInterceptor.attachGtxAndPassUpChain(TxInterceptor.java:301)
>>>>> [JBoss] at org.jboss.cache.interceptors.TxInterceptor.handleDefault(TxInterceptor.java:283)
>>>>> [JBoss] at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>> [JBoss] at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
>>>>> [JBoss] at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>> [JBoss] at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss] at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>> [JBoss] at org.jboss.cache.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:178)
>>>>> [JBoss] at org.jboss.cache.interceptors.InvocationContextInterceptor.visitRemoveNodeCommand(InvocationContextInterceptor.java:88)
>>>>> [JBoss] at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>> [JBoss] at org.jboss.cache.interceptors.InterceptorChain.invoke(InterceptorChain.java:287)
>>>>> [JBoss] at org.jboss.cache.commands.remote.DataGravitationCleanupCommand.executeRemove(DataGravitationCleanupCommand.java:151)
>>>>> [JBoss] at org.jboss.cache.commands.remote.DataGravitationCleanupCommand.perform(DataGravitationCleanupCommand.java:105)
>>>>> [JBoss] at org.jboss.cache.marshall.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:305)
>>>>> [JBoss] at org.jboss.cache.marshall.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:242)
>>>>> [JBoss] at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:637)
>>>>> [JBoss] at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:545)
>>>>> [JBoss] at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:368)
>>>>> [JBoss] at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:775)
>>>>> [JBoss] at org.jgroups.JChannel.up(JChannel.java:1339)
>>>>> [JBoss] at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:462)
>>>>> [JBoss] at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:432)
>>>>> [JBoss] at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:144)
>>>>> [JBoss] at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
>>>>> [JBoss] at org.jgroups.protocols.FC.up(FC.java:473)
>>>>> [JBoss] at org.jgroups.protocols.pbcast.GMS.up(GMS.java:824)
>>>>> [JBoss] at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:233)
>>>>> [JBoss] at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:616)
>>>>> [JBoss] at org.jgroups.protocols.UNICAST.up(UNICAST.java:282)
>>>>> [JBoss] at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:758)
>>>>> [JBoss] at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:167)
>>>>> [JBoss] at org.jgroups.protocols.FD.up(FD.java:284)
>>>>> [JBoss] at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:309)
>>>>> [JBoss] at org.jgroups.protocols.MERGE2.up(MERGE2.java:144)
>>>>> [JBoss] at org.jgroups.protocols.Discovery.up(Discovery.java:264)
>>>>> [JBoss] at org.jgroups.protocols.PING.up(PING.java:273)
>>>>> [JBoss] at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2327)
>>>>> [JBoss] at org.jgroups.protocols.TP.passMessageUp(TP.java:1261)
>>>>> [JBoss] at org.jgroups.protocols.TP.access$100(TP.java:49)
>>>>> [JBoss] at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1838)
>>>>> [JBoss] at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1817)
>>>>> [JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>> [JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>> [JBoss] at java.lang.Thread.run(Thread.java:619)
>>>>>
>>>>> Manik Surtani wrote:
>>>>>> On 4 Aug 2009, at 21:28, Brian Stansberry wrote:
>>>>>>
>>>>>>> Continuing dialogue with myself. :)
>>>>>>>
>>>>>>> I've found the logic in PessimisticNodeBasedLockManager where the
>>>>>>> LockParentForChildInsertRemove configuration is meant to be coming
>>>>>>> through via the Node.
>>>>>> Yes, this is because you can override this on a per-node basis.
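
(For anyone following along, a minimal sketch of that per-node
override; "cache" is an existing org.jboss.cache.Cache instance, and
the setter name is assumed to match the isLockForChildInsertRemove()
getter mentioned below:

import org.jboss.cache.Fqn;
import org.jboss.cache.Node;

// Override the cache-wide LockParentForChildInsertRemove setting
// for one subtree only:
Node<String, String> parent =
      cache.getRoot().addChild(Fqn.fromString("/JSESSION/st_localhost"));
parent.setLockForChildInsertRemove(true);
)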
>>>>>>
>>>>>>> In our testing we're seeing cases where write locks are being
>>>>>>> acquired on parent nodes, which isn't the intent. I'll dig
>>>>>>> further.
>>>>>>>
>>>>>>> Brian Stansberry wrote:
>>>>>>>>
>>>>>>>> https://jira.jboss.org/jira/browse/JBCACHE-1527
>>>>>>>>
>>>>>>>> Brian Stansberry wrote:
>>>>>>>>> From looking at the JBC 3 code, it seems the
>>>>>>>>> LockParentForChildInsertRemove configuration is no longer
>>>>>>>>> respected for pessimistic locking. I can't trace any path from
>>>>>>>>> the property in Configuration to code that uses it.
>>>>>>>>> PessimisticLockInterceptor.handlePutCommand, handleMoveCommand
>>>>>>>>> and handleRemoveNodeCommand all always tell the lock manager to
>>>>>>>>> lock parents. handleEvictFqnCommand always tells the lock
>>>>>>>>> manager not to lock parents.
>>>>>>>>>
>>>>>>>>> This is causing failures in buddy replication testing when
>>>>>>>>> nodes join/leave clusters under load. There's a lot of data
>>>>>>>>> gravitation plus stuff like migrating defunct backup trees to
>>>>>>>>> "DEAD" regions. Too much contention for parent-level locks.
>>>>>>>>>
>>>>>>>>> Plus, locking on the structural parent to add/remove session
>>>>>>>>> nodes will suck for the session caching use case.