Thanks for the patch; it looks good.
I've just kicked off a test run now. If all goes well, would you guys
feel like kicking off another stress test run once I publish a snapshot?
Cheers
Manik
On 6 Aug 2009, at 04:25, Brian Stansberry wrote:
Patch attached to the JIRA. I was wrong in thinking I could adapt
DGTest to expose the issue. But anyway the testsuite passes with the
patch.
Brian Stansberry wrote:
> I think I may have the WL figured out. Maybe. :)
> The key thing is that DataGravitationCleanupCommand is sent to all
> nodes, not just buddies. The result is that caches that never stored
> the fqn will be attempting to remove a node that doesn't exist.
> In this case PessimisticNodeBasedLockManager.lock(...) will attempt
> to create the node to be removed as well as all nodes leading to it.
> It will acquire a WL on all nodes it creates. It sets a local
> variable "created" to "true" when it creates a node, and once that
> variable is "true" it will try to acquire a WL on any node it locks.
> I suspect the problem comes here:
> // make sure the lock we acquired isn't on a deleted node/is an orphan!!
> // look into invalidated nodes as well
> NodeSPI repeek = dataContainer.peek(currentNodeFqn, true, true);
> if (currentNode != repeek)
> {
>    // A bunch of "reset" logic
> }
> The "reset logic" involves resetting variables so the loop will
> start over higher in the tree. But it doesn't "reset" the
"created"
> flag, so when it starts over it starts trying to acquire WLs on
> nodes like / and /BUDDY_MANAGER where it only needs a RL.
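>
> In rough, self-contained Java, here's a model of what I think is
> going on (hypothetical names throughout -- this is my sketch, not the
> actual JBC source):
>
>    import java.util.List;
>
>    class LockLoopModel
>    {
>       enum LockType { READ, WRITE }
>       interface Tree { Node peek(String fqn); Node create(String fqn); }
>       interface Node { void acquire(LockType t); }
>
>       static void lock(List<String> pathFromRoot, Tree tree)
>       {
>          boolean created = false;
>          for (int i = 0; i < pathFromRoot.size(); i++)
>          {
>             String fqn = pathFromRoot.get(i);
>             Node n = tree.peek(fqn);
>             if (n == null)
>             {
>                n = tree.create(fqn);
>                created = true;       // set once, consulted for every later node
>             }
>             n.acquire(created ? LockType.WRITE : LockType.READ);
>
>             if (tree.peek(fqn) != n) // the "repeek" check
>             {
>                i = -1;               // "reset": start over from the root...
>                // ...but "created" keeps its value, so the second pass asks
>                // for WLs on nodes that only got RLs the first time around.
>             }
>          }
>       }
>    }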
> This explains why the thread is trying to "upgrade". On an earlier
> pass through the loop it got a RL, second pass it's trying for a WL.
> I'm not exactly sure why the if (currentNode != repeek) test fails,
> but it's happening when a bunch of threads are all doing data
> gravitation cleanups. So there's a lot of concurrent creation of
> nodes/parents/grandparents just so they can be deleted.
> I'll see if I can patch this. I won't commit any fix; will just
> attach to the JIRA. I played with your DGTest and couldn't produce
> a failure, but I instrumented the code with logging and for sure
> see WL's being acquired. Now that I understand this better I can
> probably make a variant of DGTest that fails.
> Brian Stansberry wrote:
>> Just in case it matters, suggest you use a deeper Fqn:
>>
>> for (int i = 0; i < DATA; i++)
>>    c1.put("/JSESSION/st_localhost/" + i, "k", "v");
>>
>> Perhaps add a listener to cause a thread to block on reading
>> /JSESSION/st_localhost/0? Confirm that the others proceed? That
>> might help expose if/why a WL is being acquired.
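>>
>> Something like this, maybe (a hypothetical, untested sketch; the
>> listener API is from memory and the latch wiring is made up):
>>
>>    import java.util.concurrent.CountDownLatch;
>>    import java.util.concurrent.TimeUnit;
>>    import org.jboss.cache.Fqn;
>>    import org.jboss.cache.notifications.annotation.CacheListener;
>>    import org.jboss.cache.notifications.annotation.NodeVisited;
>>    import org.jboss.cache.notifications.event.NodeVisitedEvent;
>>
>>    @CacheListener
>>    public class BlockingReadListener
>>    {
>>       private final CountDownLatch latch = new CountDownLatch(1);
>>       private final Fqn target = Fqn.fromString("/JSESSION/st_localhost/0");
>>
>>       @NodeVisited
>>       public void onVisit(NodeVisitedEvent e)
>>       {
>>          if (e.isPre() && target.equals(e.getFqn()))
>>          {
>>             try { latch.await(10, TimeUnit.SECONDS); } // hold the reader here
>>             catch (InterruptedException ignored) { }
>>          }
>>       }
>>
>>       public void release() { latch.countDown(); }
>>    }
>>
>> Register it with cache.addCacheListener(...), read that fqn from one
>> thread, and check whether puts on siblings go through before release().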
>>
>> Manik Surtani wrote:
>>> On 5 Aug 2009, at 16:42, Brian Stansberry wrote:
>>>
>>>> Sorry, nope. :(
>>>>
>>>> I might (30% chance) be able to create one later today.
>>> I've started on a test (attached to JIRA) but no luck in
>>> reproducing this so far. Care to cast your eye over it to
>>> validate the use case?
>>>
>>> Cheers
>>> Manik
>>>
>>>> Manik Surtani wrote:
>>>>> You wouldn't have this encapsulated as a unit test, would you?
>>>>> On 5 Aug 2009, at 14:42, Brian Stansberry wrote:
>>>>>> Paul and I have been poking through code trying to diagnose
>>>>>> the issue
>>>>>> that led to this thread for quite a while yesterday. Time to
>>>>>> throw it
>>>>>> out there and see if anyone hopefully spots something:
>>>>>>
>>>>>> This is web session replication with buddy replication. The
>>>>>> "LockParentForChildInsertRemove" property isn't configured on
>>>>>> the cache, so the default "false" should be in effect. A thread
>>>>>> trying to execute a remote DataGravitationCleanupCommand on a
>>>>>> session node fails upgrading to a write lock on fqn
>>>>>> /_BUDDY_BACKUP_.[1] We can't understand why it would want a WL.
>>>>>>
>>>>>> Looking at the PessimisticNodeBasedLockManager code, the only
>>>>>> reasons I
>>>>>> see why a WL would be requested here are:
>>>>>>
>>>>>> 1) The invocation created the previously non-existent
>>>>>> /_BUDDY_BACKUP_ node in order to execute the remove of the
>>>>>> great-great-great-great-grandchild. Highly unlikely in general,
>>>>>> and logging shows another thread successfully doing a put on a
>>>>>> different great-great-great-great-grandchild a millisecond
>>>>>> earlier. (Also, other test runs w/ less verbose logging showed
>>>>>> failures getting a WL on fqn /, which of course exists.)
>>>>>>
>>>>>> 2) In the call to PNBLM.writeLockNeeded(...),
>>>>>> currentNode.isLockForChildInsertRemove() is returning "true",
>>>>>> which AFAICT it shouldn't be.
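>>>>>>
>>>>>> For reference, my paraphrase of what that check amounts to (the
>>>>>> helper below is hypothetical; names and parameters are my
>>>>>> assumptions, not quoted from PNBLM):
>>>>>>
>>>>>>    static boolean writeLockNeeded(boolean created, boolean isTarget,
>>>>>>                                   boolean writeOp,
>>>>>>                                   org.jboss.cache.NodeSPI<?, ?> node)
>>>>>>    {
>>>>>>       return created                        // reason 1: this call created the node
>>>>>>           || (isTarget && writeOp)          // the node actually being removed
>>>>>>           || node.isLockForChildInsertRemove(); // reason 2
>>>>>>    }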
>>>>>>
>>>>>> [1] The logging from the thread:
>>>>>>
>>>>>> [JBoss] 20:29:27,488 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-5,10.34.32.154:46064) Executing command: DataGravitationCleanupCommand{fqn=/JSESSION/st_localhost/wUaU4FhUzWcuiDEjkJzDiQ__, backup=/_BUDDY_BACKUP_/10.34.32.155_41754:DEAD/1/JSESSION/st_localhost/wUaU4FhUzWcuiDEjkJzDiQ__} [sender=10.34.32.153:24595]
>>>>>> [JBoss] 20:29:27,489 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-5,10.34.32.154:46064) This is a non-visitable command - so performing directly and not via the invoker.
>>>>>> [JBoss] 20:29:42,497 ERROR [org.jboss.cache.lock.ReadWriteLockWithUpgrade] (Incoming-5,10.34.32.154:46064) upgradeLock(): failed
>>>>>> [JBoss] 20:29:42,500 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-5,10.34.32.154:46064) Problems invoking command.
>>>>>> [JBoss] org.jboss.cache.lock.UpgradeException: upgrade lock for /_BUDDY_BACKUP_ could not be acquired after 15000 ms. Lock map ownership Read lock owners: [GlobalTransaction:<10.34.32.153:24595>:70, GlobalTransaction:<10.34.32.153:24595>:69]
>>>>>> [JBoss] Write lock owner: null
>>>>>> [JBoss] (caller=Thread[Incoming-5,10.34.32.154:46064,5,Thread Pools], lock info: read owners=[GlobalTransaction:<10.34.32.153:24595>:70, GlobalTransaction:<10.34.32.153:24595>:69] (activeReaders=2, activeWriter=null, waitingReaders=1, waitingWriters=0, waitingUpgrader=0))
>>>>>> [JBoss]    at org.jboss.cache.lock.IdentityLock.acquireWriteLock0(IdentityLock.java:227)
>>>>>> [JBoss]    at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:176)
>>>>>> [JBoss]    at org.jboss.cache.lock.IdentityLock.acquire(IdentityLock.java:503)
>>>>>> [JBoss]    at org.jboss.cache.lock.PessimisticNodeBasedLockManager.acquireNodeLock(PessimisticNodeBasedLockManager.java:260)
>>>>>> [JBoss]    at org.jboss.cache.lock.PessimisticNodeBasedLockManager.lock(PessimisticNodeBasedLockManager.java:169)
>>>>>> [JBoss]    at org.jboss.cache.lock.PessimisticNodeBasedLockManager.lockPessimistically(PessimisticNodeBasedLockManager.java:97)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.PessimisticLockInterceptor.handleRemoveNodeCommand(PessimisticLockInterceptor.java:265)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.PrePostProcessingCommandInterceptor.visitRemoveNodeCommand(PrePostProcessingCommandInterceptor.java:126)
>>>>>> [JBoss]    at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:148)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.ReplicationInterceptor.visitRemoveNodeCommand(ReplicationInterceptor.java:125)
>>>>>> [JBoss]    at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
>>>>>> [JBoss]    at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>>> [JBoss]    at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.TxInterceptor.attachGtxAndPassUpChain(TxInterceptor.java:301)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.TxInterceptor.handleDefault(TxInterceptor.java:283)
>>>>>> [JBoss]    at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>>> [JBoss]    at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
>>>>>> [JBoss]    at org.jboss.cache.commands.AbstractVisitor.visitRemoveNodeCommand(AbstractVisitor.java:75)
>>>>>> [JBoss]    at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:178)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.InvocationContextInterceptor.visitRemoveNodeCommand(InvocationContextInterceptor.java:88)
>>>>>> [JBoss]    at org.jboss.cache.commands.write.RemoveNodeCommand.acceptVisitor(RemoveNodeCommand.java:125)
>>>>>> [JBoss]    at org.jboss.cache.interceptors.InterceptorChain.invoke(InterceptorChain.java:287)
>>>>>> [JBoss]    at org.jboss.cache.commands.remote.DataGravitationCleanupCommand.executeRemove(DataGravitationCleanupCommand.java:151)
>>>>>> [JBoss]    at org.jboss.cache.commands.remote.DataGravitationCleanupCommand.perform(DataGravitationCleanupCommand.java:105)
>>>>>> [JBoss]    at org.jboss.cache.marshall.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:305)
>>>>>> [JBoss]    at org.jboss.cache.marshall.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:242)
>>>>>> [JBoss]    at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:637)
>>>>>> [JBoss]    at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:545)
>>>>>> [JBoss]    at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:368)
>>>>>> [JBoss]    at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:775)
>>>>>> [JBoss]    at org.jgroups.JChannel.up(JChannel.java:1339)
>>>>>> [JBoss]    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:462)
>>>>>> [JBoss]    at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:432)
>>>>>> [JBoss]    at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:144)
>>>>>> [JBoss]    at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
>>>>>> [JBoss]    at org.jgroups.protocols.FC.up(FC.java:473)
>>>>>> [JBoss]    at org.jgroups.protocols.pbcast.GMS.up(GMS.java:824)
>>>>>> [JBoss]    at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:233)
>>>>>> [JBoss]    at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:616)
>>>>>> [JBoss]    at org.jgroups.protocols.UNICAST.up(UNICAST.java:282)
>>>>>> [JBoss]    at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:758)
>>>>>> [JBoss]    at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:167)
>>>>>> [JBoss]    at org.jgroups.protocols.FD.up(FD.java:284)
>>>>>> [JBoss]    at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:309)
>>>>>> [JBoss]    at org.jgroups.protocols.MERGE2.up(MERGE2.java:144)
>>>>>> [JBoss]    at org.jgroups.protocols.Discovery.up(Discovery.java:264)
>>>>>> [JBoss]    at org.jgroups.protocols.PING.up(PING.java:273)
>>>>>> [JBoss]    at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2327)
>>>>>> [JBoss]    at org.jgroups.protocols.TP.passMessageUp(TP.java:1261)
>>>>>> [JBoss]    at org.jgroups.protocols.TP.access$100(TP.java:49)
>>>>>> [JBoss]    at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1838)
>>>>>> [JBoss]    at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1817)
>>>>>> [JBoss]    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>> [JBoss]    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>> [JBoss]    at java.lang.Thread.run(Thread.java:619)
>>>>>>
>>>>>> Manik Surtani wrote:
>>>>>>> On 4 Aug 2009, at 21:28, Brian Stansberry wrote:
>>>>>>>
>>>>>>>> Continuing dialogue with myself. :)
>>>>>>>>
>>>>>>>> I've found the logic in PessimisticNodeBasedLockManager
>>>>>>>> where the LockParentForChildInsertRemove configuration is
>>>>>>>> meant to be coming through via the Node.
>>>>>>> Yes, this is because you can override this on a per-node basis.
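>>>>>>>
>>>>>>> E.g., roughly (from memory, and assuming a cache reference is in
>>>>>>> scope, so treat the exact calls as assumptions):
>>>>>>>
>>>>>>>    Node<?, ?> region = cache.getRoot()
>>>>>>>          .getChild(Fqn.fromString("/JSESSION"));
>>>>>>>    region.setLockForChildInsertRemove(true); // per-node override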
>>>>>>>
>>>>>>>> In our testing we're seeing cases where write locks are being
>>>>>>>> acquired on parent nodes, which isn't the intent. I'll dig
>>>>>>>> further.
>>>>>>>>
>>>>>>>> Brian Stansberry wrote:
>>>>>>>>> https://jira.jboss.org/jira/browse/JBCACHE-1527
>>>>>>>>>
>>>>>>>>> Brian Stansberry wrote:
>>>>>>>>>> From looking at the JBC 3 code, it seems the
>>>>>>>>>> LockParentForChildInsertRemove configuration is no longer
>>>>>>>>>> respected for pessimistic locking. I can't trace any path
>>>>>>>>>> from the property in Configuration to code that uses it.
>>>>>>>>>> PessimisticLockInterceptor.handlePutCommand,
>>>>>>>>>> handleMoveCommand and handleRemoveNodeCommand all always
>>>>>>>>>> tell the lock manager to lock parents.
>>>>>>>>>> handleEvictFqnCommand always tells the lock manager not to
>>>>>>>>>> lock parents.
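>>>>>>>>>>
>>>>>>>>>> I.e., my reading of the call sites is roughly this
>>>>>>>>>> (paraphrased; only lockPessimistically itself comes from the
>>>>>>>>>> stack traces, the parameter list here is my assumption):
>>>>>>>>>>
>>>>>>>>>>    // what the handlers do now -- parent locking hard-coded:
>>>>>>>>>>    lockManager.lockPessimistically(ctx, fqn, WRITE, true);
>>>>>>>>>>    // what you'd expect -- consult the configuration:
>>>>>>>>>>    lockManager.lockPessimistically(ctx, fqn, WRITE,
>>>>>>>>>>          configuration.isLockParentForChildInsertRemove());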
>>>>>>>>>>
>>>>>>>>>> This is causing failures in buddy replication testing when
>>>>>>>>>> nodes join/leave clusters under load. There's a lot of data
>>>>>>>>>> gravitation plus stuff like migrating defunct backup trees
>>>>>>>>>> to "DEAD" regions. Too much contention for parent-level
>>>>>>>>>> locks.
>>>>>>>>>>
>>>>>>>>>> Plus, locking on the structural parent to add/remove
>>>>>>>>>> session nodes will suck for the session caching use case.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Brian Stansberry
>>>>>>>> Lead, AS Clustering
>>>>>>>> JBoss by Red Hat
>>>>>>>> _______________________________________________
>>>>>>>> jbosscache-dev mailing list
>>>>>>>> jbosscache-dev(a)lists.jboss.org
>>>>>>>> https://lists.jboss.org/mailman/listinfo/jbosscache-dev
>>>>>>> --
>>>>>>> Manik Surtani
>>>>>>> manik(a)jboss.org
>>>>>>> Lead, Infinispan
>>>>>>> Lead, JBoss Cache
>>>>>>> http://www.infinispan.org
>>>>>>> http://www.jbosscache.org
>>>>>>
>>>>>> --
>>>>>> Brian Stansberry
>>>>>> Lead, AS Clustering
>>>>>> JBoss by Red Hat
>>>>>> _______________________________________________
>>>>>> jbosscache-dev mailing list
>>>>>> jbosscache-dev(a)lists.jboss.org
>>>>>> https://lists.jboss.org/mailman/listinfo/jbosscache-dev
>>>>> --
>>>>> Manik Surtani
>>>>> manik(a)jboss.org
>>>>> Lead, Infinispan
>>>>> Lead, JBoss Cache
>>>>> http://www.infinispan.org
>>>>> http://www.jbosscache.org
>>>>
>>>> --
>>>> Brian Stansberry
>>>> Lead, AS Clustering
>>>> JBoss by Red Hat
>>> --
>>> Manik Surtani
>>> manik(a)jboss.org
>>> Lead, Infinispan
>>> Lead, JBoss Cache
>>> http://www.infinispan.org
>>> http://www.jbosscache.org
>>
--
Brian Stansberry
Lead, AS Clustering
JBoss by Red Hat