I agree as well; just wanted to sound everyone out on potential breakage
in (poorly written) end-user code.
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik(a)jboss.org
Telephone: +44 7786 702 706
MSN: manik(a)surtani.org
Yahoo/AIM/Skype: maniksurtani
On 1 Dec 2006, at 08:38, Bela Ban wrote:
IMO correctness trumps speed. If developers map their data
structures to JBC incorrectly, creating lots of concurrency issues,
it is THEIR problem. This is the same as mapping apps to a DB, in
essence.
Folks can always opt for lower isolation levels and/or optimistic
locking...
Manik Surtani wrote:
> So we still haven't discussed my biggest concern here, which is
> item 5) below in the list of implications. Is this performance
> penalty and potential for deadlocks a small enough price to pay
> for the correctness of concurrent access on the root node? What
> do people think?
>
>
>
>> *From:* Manik Surtani [mailto:manik@jboss.org]
>> *Sent:* Monday, November 27, 2006 7:19 PM
>> *To:* Manik Surtani
>> *Cc:* Bela Ban; Ben Wang; Brian Stansberry; Vladimir Blagojevic;
>> Galder Zamarreno
>> *Subject:* Re: Fundamental problem with pessimistic locking
>>
>> Ok, this seems to work, making things a lot more 'correct'. But
>> before I roll this into an official release and start making
>> changes en masse, porting this to 1.4.x and 2.0.0, I'd like to
>> step back and think about whether this is what we really want.
>> Here is what I've effectively done with 1.3.0.SP4, all related to
>> pessimistic locking only:
>>
>> a) Added a mechanism for not removing nodes when remove() is
>> called, instead storing them in a map so that concurrent threads
>> can reference them and attempt locks on them. (Mutated version of
>> Brian's original fix to JBCACHE-871)
>> b) When locking nodes in PLI.lock(), added a mechanism to obtain
>> a WL on a node if the next node after it needs to be created or
>> removed. (JBCACHE-875)
>> c) Modified PLI.lock() to start with Fqn.ROOT rather than
>> Fqn.get(0), which applies the same cache-wide locking behaviour
>> to the root as well. Prior to this, the root was never locked for
>> anything.
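To make (a) concrete, here is a rough standalone sketch of the removed-nodes map idea - all names invented for illustration, not Brian's actual JBCACHE-871 code - where remove() tombstones the node into a shared map so concurrent threads still have something to attempt a lock on:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "keep removed nodes around" mechanism; invented
// names, not the actual JBCACHE-871 fix.
public class RemovedNodesSketch {
    static final class Node {
        volatile boolean removed;
    }

    final Map<String, Node> tree = new ConcurrentHashMap<>();
    final Map<String, Node> removedNodes = new ConcurrentHashMap<>();

    void remove(String fqn) {
        Node n = tree.remove(fqn);
        if (n != null) {
            n.removed = true;          // mark as a tombstone
            removedNodes.put(fqn, n);  // keep it so others can lock it
        }
    }

    /** Concurrent threads look here when a node is gone from the tree,
     *  so they can still attempt the node's lock instead of failing. */
    Node findForLocking(String fqn) {
        Node n = tree.get(fqn);
        return n != null ? n : removedNodes.get(fqn);
    }

    public static void main(String[] args) {
        RemovedNodesSketch cache = new RemovedNodesSketch();
        cache.tree.put("/a/b", new Node());
        cache.remove("/a/b");
        Node n = cache.findForLocking("/a/b");
        System.out.println(n != null && n.removed); // true
    }
}
```

The memory cost in implication 2) below is exactly this second map; entries presumably have to be purged on tx completion or they leak.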
>> The implications of these, for the sake of accuracy and
>> correctness, are possibly:
>>
>> 1) Performance impact on inspecting nodes in b) to decide on
>> whether WLs are needed
>> 2) Memory impact on maintaining a map of removed nodes in a)
>> 3) Reduced concurrency due to overall stronger locks in b)
>> 4) Much reduced concurrency because of the locking in c)
>> 5) Potential for more deadlocks/timeouts because of 3) and 4) above.
>>
>> Of the above, 5) manifests itself in a few unit tests that have
>> now started to fail (TxCacheLoaderTest, some state transfer
>> tests, etc.). A simple example, taken from one of the failing
>> tests, leads to a deadlock:
>>
>> 1: mgr.begin();
>> 2: Transaction tx = mgr.getTransaction();
>> 3: cache1.put("/one/two/three", "key1", "val1");
>> 4: assertNull(cache2.get("/one/two/three", "key1"));
>> 5: tx.commit();
>>
>> Line 3 obtains a WL on "/" on cache1, for GTX 1
>> Line 4 obtains a WL on "/" on cache2, for GTX 2
>> Line 5, on replication, tries to get a WL on "/" on cache2, for
>> GTX 1
>>
>> Both GTXs relate to the same TX since they are in the same
>> thread. Boom, deadlock.
>>
>> One more thing here though, in my opinion another bug in the
>> original PLI:
>>
>> When doing a get on a node that doesn't exist, intermediate nodes
>> are created. E.g., cache2.get("/one/two/three", "key1") actually
>> ends up creating /one/two/three first, and after the JBCACHE-875
>> fix, /, /one and /one/two will be WL'd for a get() on a
>> nonexistent node!! Shouldn't the loop just be short-circuited
>> such that, at any point, if the next node does not exist and the
>> lock_type requested is READ, we just return null? That saves us a
>> whole bunch of unnecessary WLs ...
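Roughly, the short-circuit would look like this (a standalone sketch with invented names, not the actual PLI code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed short-circuited lock loop; invented names,
// not the real PessimisticLockInterceptor.
public class LockLoopSketch {
    enum LockType { READ, WRITE }

    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int writeLocks; // stand-in for real WL acquisition
    }

    /** Walks the path, returning the target node, or null for a READ
     *  on a missing node -- with no intermediate nodes created and
     *  no write locks taken. */
    static Node lock(Node root, String[] path, LockType type) {
        Node current = root;
        for (String name : path) {
            Node child = current.children.get(name);
            if (child == null) {
                if (type == LockType.READ) return null; // short-circuit
                current.writeLocks++;                   // WL on parent to create
                child = new Node();
                current.children.put(name, child);
            }
            current = child;
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = new Node();
        String[] path = {"one", "two", "three"};
        // get() on a nonexistent node: nothing created, nothing WL'd.
        System.out.println(lock(root, path, LockType.READ)); // null
        System.out.println(root.children.isEmpty());         // true
    }
}
```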
>> Sorry about the long and rambling email. Thoughts and opinions?
>>
>>
>>
>> On 27 Nov 2006, at 10:16, Manik Surtani wrote:
>>
>>> Ok, take away the crap about it being a bug in the
>>> util.concurrent code. It's a bug in JBC, specifically on line
>>> 75 in TreeCache.java:
>>>
>>> protected DataNode root =
>>> NodeFactory.getInstance().createDataNode(NodeFactory.NODE_TYPE_TREENODE,
>>> SEPARATOR, Fqn.fromString(SEPARATOR), null, null, this);
>>>
>>> :-) The root node is initialised when new TreeCache() is
>>> called, well before isolation levels, etc. are set, which causes
>>> the root node to be created with an isolation level of NONE.
>>> Hence the insane behaviour when trying to contend for write
>>> locks on the root node.
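The same initialisation-order trap can be reproduced in a few lines of plain Java (hypothetical names, not the actual TreeCache code): a field initialiser runs during construction and freezes whatever default is in effect at that moment.

```java
// Minimal reproduction of the initialisation-order bug; invented
// names, not the actual TreeCache/NodeFactory code.
public class InitOrderDemo {
    enum IsolationLevel { NONE, REPEATABLE_READ }

    static final class Node {
        final IsolationLevel level;
        Node(IsolationLevel level) { this.level = level; }
    }

    static final class Cache {
        IsolationLevel isolationLevel = IsolationLevel.NONE; // default

        // The field initialiser runs inside the constructor, before any
        // setter can possibly be called -- so the root is frozen at NONE.
        final Node root = new Node(isolationLevel);

        void setIsolationLevel(IsolationLevel l) { isolationLevel = l; }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        cache.setIsolationLevel(IsolationLevel.REPEATABLE_READ);
        System.out.println(cache.root.level); // NONE, not REPEATABLE_READ
    }
}
```

The fix is presumably either deferring root creation until configuration is complete, or re-reading the isolation level at lock time rather than capturing it at construction.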
>>>
>>> Just fixed this, running a bunch of regressions now.
>>>
>>> Cheers,
>>>
>>>
>>>
>>> On 26 Nov 2006, at 19:04, Bela Ban wrote:
>>>
>>>> Forwarding to the entire group
>>>>
>>>> Manik Surtani wrote:
>>>>> Ok, boiled it down to a contention issue on locking Fqn.ROOT,
>>>>> which, prior to JBCACHE-875, was never locked - either for
>>>>> reading or writing. (I do this by changing the loop in the
>>>>> lock() method in PLI to first consider the root before the
>>>>> individual Fqn elements.) (This problem is also apparent in
>>>>> o.j.c.transaction.DeadlockTest on a multi-CPU box.)
>>>>>
>>>>> 2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL:
>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=<unlocked>
>>>>> <snip>
>>>>> 2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL:
>>>>> fqn=/, caller=GlobalTransaction:<null>:3, lock=<unlocked>
>>>>> <snip>
>>>>> 2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL:
>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=write
>>>>> owner=GlobalTransaction:<null>:2
>>>>> <snip>
>>>>> 2006-11-26 14:58:09,581 INFO [TxInterceptor] (Thread-3) There
>>>>> was a problem handling this request
>>>>> java.lang.IllegalStateException: there is already a writer
>>>>> holding the lock: GlobalTransaction:<null>:2 and caller is
>>>>> GlobalTransaction:<null>:3
>>>>> at org.jboss.cache.lock.LockMap.setWriterIfNotNull(LockMap.java:101)
>>>>> at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:187)
>>>>> at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
>>>>> at org.jboss.cache.Node.acquire(Node.java:517)
>>>>> <snip - lots>
>>>>> 2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child:
>>>>> fqn=/, child_name=NODE
>>>>>
>>>>> I can't understand why concurrent WL acquisition in
>>>>> IdentityLock.acquireWriteLock() behaves correctly for all
>>>>> nodes except the root node. As you can see in the log snippet
>>>>> above, both Thread-2 and Thread-3 call
>>>>> IdentityLock.acquireWriteLock (line 178) and get a 'true', and
>>>>> one of the threads causes an exception on line 187.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 26 Nov 2006, at 13:54, Manik Surtani wrote:
>>>>>
>>>>>> I didn't want to acquire the WL immediately since it involved
>>>>>> an additional test to check if the next node in the fqn
>>>>>> needed creation. But I went with that algorithm in the end
>>>>>> since the locks had problems with concurrent readers
>>>>>> attempting to upgrade to writers.
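For what it's worth, the JDK's ReentrantReadWriteLock (used here purely as a standalone stand-in; I haven't traced the util.concurrent internals JBC actually uses) has the same limitation: a write lock is never granted while the caller still holds its own read lock, so an in-place upgrade is impossible without releasing first.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demonstrates that read->write upgrades are not supported by
// java.util.concurrent's ReentrantReadWriteLock.
public class UpgradeDemo {
    public static void main(String[] args) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock(); // step 2: RL on b

        // Attempting the "upgrade" while still holding the RL:
        // writeLock().lock() would block forever here, so tryLock()
        // is the only safe probe.
        System.out.println(lock.writeLock().tryLock()); // false

        lock.readLock().unlock();
        System.out.println(lock.writeLock().tryLock()); // true, once the RL is gone
    }
}
```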
>>>>>>
>>>>>> So most of the regressions pass, as well as the new tests
>>>>>> introduced, and I am very close to something working, EXCEPT
>>>>>> one very strange problem with the IdentityLock and
>>>>>> ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modifications()
>>>>>> - get the latest on the 1.3.0 branch for this to make any
>>>>>> sense. The problem is between lines 178 and 187 of
>>>>>> IdentityLock, in the acquireWriteLock() method.
>>>>>> Attempting to get hold of a write lock returns true, but
>>>>>> setting the writer throws an exception since another writer
>>>>>> exists. Odd that this happens, since the calling thread
>>>>>> should have the semaphore by then; also odd that this only
>>>>>> seems to happen in this one test, which is meant to test
>>>>>> concurrency in the CacheLoaderInterceptor.
>>>>>> I'm still investigating, but if you have any ideas about how
>>>>>> and why this may happen, I'd love to hear it ... :-)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24 Nov 2006, at 15:25, Bela Ban wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Manik Surtani wrote:
>>>>>>>>
>>>>>>>> On 24 Nov 2006, at 14:44, Bela Ban wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> The first one you mentioned can lead to race conditions,
>>>>>>>> depending on whether the upgrade on b or the creation/WL on
>>>>>>>> c happens first. What I've implemented is more like:
>>>>>>>>
>>>>>>>> 1: Acquire RL on a
>>>>>>>> 2: Acquire RL on b
>>>>>>>> 3: Identify that we need to create c.
>>>>>>>> 4: Upgrade RL on b to WL
>>>>>>>> 5: *now* create c, and acquire WL on it.
>>>>>>>>
>>>>>>>> So there is a possibility that step 4 may block until other
>>>>>>>> readers on b release their locks, but no one else could
>>>>>>>> grab the WL since the current TX will have a RL.
>>>>>>>
>>>>>>> I see. Why don't you acquire a WL on b (step 2)
>>>>>>> *immediately*, rather than going through the upgrade, if you
>>>>>>> know you have to acquire a WL later anyway? You might still
>>>>>>> deadlock:
>>>>>>> 2: acquire RL on b
>>>>>>> (in the meantime): some other TX acquires a RL on b,
>>>>>>> possibly upgrades to WL
>>>>>>> 3: you want to acquire a WL on b and block on the other TX's
>>>>>>> RL or WL
>>>>>>>
>>>>>>> --
>>>>>>> Bela Ban
>>>>>>> Lead JGroups / JBoss Clustering team
>>>>>>> JBoss - a division of Red Hat
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>