I agree as well; just wanted to sound everyone on potential breakage  
in (poorly written) end-user code.
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik(a)jboss.org
Telephone: +44 7786 702 706
MSN: manik(a)surtani.org
Yahoo/AIM/Skype: maniksurtani
On 1 Dec 2006, at 08:38, Bela Ban wrote:
 IMO correctness trumps speed. If developers map their data  
 structures to JBC incorrectly, creating lots of concurrency issues  
 it is THEIR problem. This is the same with mapping apps to a DB in  
 essence.
 Folks can always opt for lower isolation levels and/or optimistic  
 locking...
 Manik Surtani wrote:
> So we still haven't discussed my biggest concern here, which is  
> item 5) below in the list of implications.  Is this performance  
> penalty and potential for deadlocks small enough a price to pay  
> for the correctness of concurrent access on the root node?  What  
> do people think?
>
>
>
>> *From:* Manik Surtani [mailto:manik@jboss.org]
>> *Sent:* Monday, November 27, 2006 7:19 PM
>> *To:* Manik Surtani
>> *Cc:* Bela Ban; Ben Wang; Brian Stansberry; Vladimir Blagojevic;  
>> Galder Zamarreno
>> *Subject:* Re: Fundamental problem with pessimistic locking
>>
>> Ok, this seems to work, making things a lot more 'correct'.  But  
>> before I roll this into an official release and start making  
>> changes en-masse, porting this to 1.4.x and 2.0.0, I'd like to  
>> step back and think about whether this is what we really want.   
>> Here is what I've effectively done with 1.3.0.SP4, all related to  
>> pessimistic locking only:
>>
>> a)  Added a mechanism for not removing nodes when remove() is  
>> called, and instead storing them in a map which can be referenced  
>> by concurrent threads and locks attempted.  (Mutated version of  
>> Brian's original fix to JBCACHE-871)
>> b)  When locking nodes in PLI.lock(), added a mechanism to obtain  
>> a WL on a node if the next node after it needs to be created or  
>> removed.  (JBCACHE-875)
>> c)  Modified PLI.lock() to start with Fqn.ROOT rather than Fqn.get 
>> (0), which applies the same cache-wide locking behaviour to the  
>> root as well.  Prior to this, the root never was locked for  
>> anything.
>> The implications of these, for the sake of accuracy and  
>> correctness, are possibly:
>>
>> 1)  Performance impact on inspecting nodes in b) to decide on  
>> whether WLs are needed
>> 2)  Memory impact on maintaining a map of removed nodes in a)
>> 3)  Reduced concurrency due to overall stronger locks in b)
>> 4)  Much reduced concurrency because of the locking in c)
>> 5)  Potential of more deadlocks/timeouts because of 3) and 4) above.
>>
>> Of the above, 5) manifests itself in a few unit tests that have  
>> now started to fail (TxCacheLoaderTest, some state transfer  
>> tests, etc.).  Simple example, taken from one of the failing  
>> tests, leads to a deadlock:
>>
>> 1: mgr.begin();
>> 2: Transaction tx=mgr.getTransaction();
>> 3: cache1.put("/one/two/three", "key1", "val1");
>> 4: assertNull(cache2.get("/one/two/three", "key1"));
>> 5: tx.commit();
>>
>> Line 3 obtains a WL on "/" on cache1, for GTX 1
>> Line 4 obtains a WL on "/" on cache2, for GTX 2
>> Line 5, on replication, tries to get a WL on "/" on cache2, for  
>> GTX 1
>>
>> Both GTXs relate to the same TX since they are in the same  
>> thread.  Boom, deadlock.
>>
>> One thing here though, in my opinion, another bug in the original  
>> PLI:
>>
>> When doing a get on a node that doesn't exist, intermediate nodes  
>> are created.  E.g., cache2.get("/one/two/three", "key1")
actually  
>> ends up creating /one/two/three first, and after the JBCACHE-875  
>> fix, /, /one and /one/two will be WL'd for a get() on a  
>> nonexistent node!!  Shouldn't the loop just be short-circuited  
>> such that at any point, if the next node does not exist and the  
>> lock_type requested is READ, just return a null?  Saves us a  
>> whole bunch of unnecessary WL's ...
>> Sorry about the long and rambling email.  Thoughts and opinions?
>> --
>> Manik Surtani
>>
>> Lead, JBoss Cache
>> JBoss, a division of Red Hat
>>
>> Email: manik(a)jboss.org <mailto:manik@jboss.org>
>> Telephone: +44 7786 702 706
>> MSN: manik(a)surtani.org <mailto:manik@surtani.org>
>> Yahoo/AIM/Skype: maniksurtani
>>
>>
>>
>> On 27 Nov 2006, at 10:16, Manik Surtani wrote:
>>
>>> Ok, take away the crap about it being a bug in the  
>>> util.concurrent code.  It's a bug in JBC, specifically on line  
>>> 75 in TreeCache.java:
>>>
>>> protected DataNode root = NodeFactory.getInstance 
>>> ().createDataNode(NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,  
>>> Fqn.fromString(SEPARATOR), null, null, this);
>>>
>>> :-)  The root node is initialised when new TreeCache() is  
>>> called, well before isolation levels, etc. are set, which causes  
>>> the root node to be created with isolation level of NONE.  Hence  
>>> the insane behaviour when trying to content for write locks on  
>>> the root node.
>>>
>>> Just fixed this, running a bunch of regressions now.
>>>
>>> Cheers,
>>> --
>>> Manik Surtani
>>>
>>> Lead, JBoss Cache
>>> JBoss, a division of Red Hat
>>>
>>> Email: manik(a)jboss.org <mailto:manik@jboss.org>
>>> Telephone: +44 7786 702 706
>>> MSN: manik(a)surtani.org <mailto:manik@surtani.org>
>>> Yahoo/AIM/Skype: maniksurtani
>>>
>>>
>>>
>>> On 26 Nov 2006, at 19:04, Bela Ban wrote:
>>>
>>>> Forwarding to the entire group
>>>>
>>>> Manik Surtani wrote:
>>>>> Ok, boiled it down to a contention issue on locking Fqn.ROOT,  
>>>>> which prior to JBCACHE-875, was never locked - either for  
>>>>> reading or writing.  (I do this by changing the loop in the  
>>>>> lock() method in PLI to first consider the root before the  
>>>>> individual Fqn elements).  (This problem is also apparent in  
>>>>> o.j.c.transaction.DeadlockTest on a multi-cpu box).
>>>>>
>>>>> 2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL:  
>>>>> fqn=/, caller=GlobalTransaction:<null>:2,
lock=<unlocked>
>>>>>  <snip>
>>>>> 2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL:  
>>>>> fqn=/, caller=GlobalTransaction:<null>:3,
lock=<unlocked>
>>>>>  <snip>
>>>>> 2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL:  
>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=write  
>>>>> owner=GlobalTransaction:<null>:2
>>>>>  <snip>
>>>>> 2006-11-26 14:58:09,581 INFO  [TxInterceptor] (Thread-3) There  
>>>>> was a problem handling this request
>>>>> java.lang.IllegalStateException: there is already a writer  
>>>>> holding the lock: GlobalTransaction:<null>:2 and caller is  
>>>>> GlobalTransaction:<null>:3
>>>>> at org.jboss.cache.lock.LockMap.setWriterIfNotNull 
>>>>> (LockMap.java:101)
>>>>> at org.jboss.cache.lock.IdentityLock.acquireWriteLock 
>>>>> (IdentityLock.java:187)
>>>>> at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
>>>>> at org.jboss.cache.Node.acquire(Node.java:517)
>>>>> < snip - lots>
>>>>> 2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child:  
>>>>> fqn=/, child_name=NODE
>>>>>
>>>>> I can't understand why concurrent WL acquisition in  
>>>>> IdentityLock.acquireWriteLock() behaves correctly for all  
>>>>> nodes except the root node.  As you can see in the log snippet  
>>>>> above, both Thread-2 and Thread-3 call  
>>>>> IdentityLock.acquireWriteLock (line 178) and get a 'true',
and  
>>>>> one of the threads cause an exception on line 187.
>>>>>
>>>>>
>>>>> --
>>>>> Manik Surtani
>>>>>
>>>>> Lead, JBoss Cache
>>>>> JBoss, a division of Red Hat
>>>>>
>>>>> Email: manik(a)jboss.org <mailto:manik@jboss.org>
>>>>> Telephone: +44 7786 702 706
>>>>> MSN: manik(a)surtani.org <mailto:manik@surtani.org>
>>>>> Yahoo/AIM/Skype: maniksurtani
>>>>>
>>>>>
>>>>>
>>>>> On 26 Nov 2006, at 13:54, Manik Surtani wrote:
>>>>>
>>>>>> I didn't want to acquire the WL immediately since it involved
 
>>>>>> an additional test to check if the next node in the fqn  
>>>>>> needed creation.  But I went with that algorithm in the end  
>>>>>> since the locks had problems with concurrent readers  
>>>>>> attempting to upgrade to writers.
>>>>>>
>>>>>> So most of the regressions pass, as well as the new tests  
>>>>>> introduced, and I am very close to something working, EXCEPT  
>>>>>> one very strange problem with the IdentityLock and  
>>>>>> ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modification
>>>>>> s() - get the latest on the 1.3.0 branch for this to make any  
>>>>>> sense.  The problem is between lines 178 and 187 of  
>>>>>> IdentityLock, in the acquireWriteLock() method.
>>>>>> Attempting to get a hold of a write lock returns true, but  
>>>>>> setting the writer throws an exception since another writer  
>>>>>> exists.  Odd that this happens since the calling thread  
>>>>>> should have the semaphore by then, also odd that this only  
>>>>>> seems to happen in this one test which is meant to test  
>>>>>> concurrency in the CacheLoaderInterceptor.
>>>>>> I'm still investigating, but if you have any ideas about how 
>>>>>> and why this may happen, I'd love to hear it ... :-)
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Manik Surtani
>>>>>>
>>>>>> Lead, JBoss Cache
>>>>>> JBoss, a division of Red Hat
>>>>>>
>>>>>> Email: manik(a)jboss.org <mailto:manik@jboss.org>
>>>>>> Telephone: +44 7786 702 706
>>>>>> MSN: manik(a)surtani.org <mailto:manik@surtani.org>
>>>>>> Yahoo/AIM/Skype: maniksurtani
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24 Nov 2006, at 15:25, Bela Ban wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Manik Surtani wrote:
>>>>>>>>
>>>>>>>> On 24 Nov 2006, at 14:44, Bela Ban wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> The first one you mentioned can lead to race conditions, 
>>>>>>>> depending on the order of whether the upgrade on b or the
 
>>>>>>>> creation/WL on c happens first.  What I've
implemented is  
>>>>>>>> more like:
>>>>>>>>
>>>>>>>> 1: Acquire RL on a
>>>>>>>> 2: Acquire RL on b
>>>>>>>> 3: Identify that we need to create c.
>>>>>>>> 4: Upgrade RL on b to WL
>>>>>>>> 5: *now* create c, and acquire WL on it.
>>>>>>>>
>>>>>>>> So there is a possibility that step 4 may block until
other  
>>>>>>>> readers on b release their locks, but no one else could 
>>>>>>>> grab the WL since the current TX will have a RL.
>>>>>>>
>>>>>>> I see. Why don't you acquire a WL on b (step 2)  
>>>>>>> *immediately* rather than going through the upgrade if you  
>>>>>>> know you have to acquire a WL later anyway ? You might still 
>>>>>>> deadlock:
>>>>>>> 2: acquire RL on b
>>>>>>> (in the meantime): some other TX acquires a RL on b,  
>>>>>>> possibly upgrades to WL
>>>>>>> 3: you want to acquire a WL on b and block on the other
TX's  
>>>>>>> RL or WL
>>>>>>>
>>>>>>> -- 
>>>>>>> Bela Ban
>>>>>>> Lead JGroups / JBoss Clustering team
>>>>>>> JBoss - a division of Red Hat
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> -- 
>>>> Bela Ban
>>>> Lead JGroups / JBoss Clustering team
>>>> JBoss - a division of Red Hat
>>>>
>>>
>>
>
 -- 
 Bela Ban
 Lead JGroups / JBoss Clustering team
 JBoss - a division of Red Hat