IMO correctness trumps speed. If developers map their data structures to
JBC incorrectly and create lots of concurrency issues, it is THEIR
problem. In essence this is the same as mapping apps to a DB.
Folks can always opt for lower isolation levels and/or optimistic locking...
Manik Surtani wrote:
So we still haven't discussed my biggest concern here, which is item
5) below in the list of implications. Is this performance penalty and
potential for deadlocks a small enough price to pay for the
correctness of concurrent access on the root node? What do people think?
> *From:* Manik Surtani [mailto:manik@jboss.org]
> *Sent:* Monday, November 27, 2006 7:19 PM
> *To:* Manik Surtani
> *Cc:* Bela Ban; Ben Wang; Brian Stansberry; Vladimir Blagojevic;
> Galder Zamarreno
> *Subject:* Re: Fundamental problem with pessimistic locking
>
> Ok, this seems to work, making things a lot more 'correct'. But
> before I roll this into an official release and start making changes
> en masse, porting this to 1.4.x and 2.0.0, I'd like to step back and
> think about whether this is what we really want. Here is what I've
> effectively done with 1.3.0.SP4, all related to pessimistic locking
> only:
>
> a) Added a mechanism for not removing nodes when remove() is called,
> instead storing them in a map so that concurrent threads can still
> reference them and attempt locks; see the sketch after this list. (A
> mutated version of Brian's original fix to JBCACHE-871)
> b) When locking nodes in PLI.lock(), added a mechanism to obtain a
> WL on a node if the next node after it needs to be created or
> removed. (JBCACHE-875)
> c) Modified PLI.lock() to start with Fqn.ROOT rather than
> Fqn.get(0), which applies the same cache-wide locking behaviour to
> the root as well. Prior to this, the root was never locked for
> anything.
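>
> For a), a minimal sketch of the removed-nodes map (the names here are
> made up for illustration - this is not the actual JBC code): instead
> of discarding a node on remove(), park it in a concurrent map keyed
> by Fqn so that a competing thread can still find it and contend for
> its lock.
>
>     import java.util.concurrent.ConcurrentHashMap;
>     import java.util.concurrent.ConcurrentMap;
>
>     // Illustrative sketch only; RemovedNodeRegistry is a hypothetical name
>     class RemovedNodeRegistry {
>         private final ConcurrentMap<Fqn, DataNode> removed =
>                 new ConcurrentHashMap<Fqn, DataNode>();
>
>         // remove() parks the node here instead of dropping it outright
>         void markRemoved(Fqn fqn, DataNode node) { removed.put(fqn, node); }
>
>         // concurrent threads can still locate the node and attempt its lock
>         DataNode findRemoved(Fqn fqn) { return removed.get(fqn); }
>
>         // cleared once the removing transaction commits or rolls back
>         void release(Fqn fqn) { removed.remove(fqn); }
>     }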
>
> The possible implications of these changes, made for the sake of
> accuracy and correctness, are:
>
> 1) Performance impact of inspecting nodes in b) to decide whether
> WLs are needed
> 2) Memory impact of maintaining a map of removed nodes in a)
> 3) Reduced concurrency due to overall stronger locks in b)
> 4) Much reduced concurrency because of the locking in c)
> 5) Potential for more deadlocks/timeouts because of 3) and 4) above.
>
> Of the above, 5) manifests itself in a few unit tests that have now
> started to fail (TxCacheLoaderTest, some state transfer tests,
> etc.). A simple example, taken from one of the failing tests, leads
> to a deadlock:
>
> 1: mgr.begin();
> 2: Transaction tx = mgr.getTransaction();
> 3: cache1.put("/one/two/three", "key1", "val1");
> 4: assertNull(cache2.get("/one/two/three", "key1"));
> 5: tx.commit();
>
> Line 3 obtains a WL on "/" on cache1, for GTX 1
> Line 4 obtains a WL on "/" on cache2, for GTX 2
> Line 5, on replication, tries to get a WL on "/" on cache2, for GTX 1
>
> Both GTXs relate to the same TX since they are in the same thread.
> Boom, deadlock.
>
> One more thing here though - in my opinion, another bug in the
> original PLI:
>
> When doing a get() on a node that doesn't exist, intermediate nodes
> are created. E.g., cache2.get("/one/two/three", "key1") actually ends
> up creating /one/two/three first, and after the JBCACHE-875 fix, /,
> /one and /one/two will be WL'd for a get() on a nonexistent node!!
> Shouldn't the loop just be short-circuited so that at any point, if
> the next node does not exist and the lock type requested is READ, we
> just return null? That saves us a whole bunch of unnecessary WLs ...
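>
> Something like this, roughly (a sketch of the loop I have in mind;
> the helper names and signatures are made up, this is not the actual
> PLI code):
>
>     // Sketch of a short-circuited lock loop; names are hypothetical
>     DataNode lock(Fqn fqn, GlobalTransaction gtx, int lockType, long timeout)
>             throws InterruptedException {
>         DataNode current = root;
>         for (int i = 0; i < fqn.size(); i++) {
>             DataNode child = current.getChild(fqn.get(i));
>             if (child == null) {
>                 if (lockType == READ) {
>                     // nothing to read and nothing to lock: return null
>                     // rather than creating and WL'ing intermediate nodes
>                     return null;
>                 }
>                 // writes still create the child under a WL on the parent
>                 current.acquireWriteLock(gtx, timeout);
>                 child = current.createChild(fqn.get(i));
>             }
>             child.acquire(gtx, timeout, lockType);
>             current = child;
>         }
>         return current;
>     }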
>
> Sorry about the long and rambling email. Thoughts and opinions?
> --
> Manik Surtani
>
> Lead, JBoss Cache
> JBoss, a division of Red Hat
>
> Email: manik@jboss.org
> Telephone: +44 7786 702 706
> MSN: manik@surtani.org
> Yahoo/AIM/Skype: maniksurtani
>
>
>
> On 27 Nov 2006, at 10:16, Manik Surtani wrote:
>
>> Ok, scratch the bit about it being a bug in the util.concurrent
>> code. It's a bug in JBC, specifically on line 75 of TreeCache.java:
>>
>> protected DataNode root =
>> NodeFactory.getInstance().createDataNode(NodeFactory.NODE_TYPE_TREENODE,
>> SEPARATOR, Fqn.fromString(SEPARATOR), null, null, this);
>>
>> :-) The root node is initialised when new TreeCache() is called,
>> well before isolation levels etc. are set, which causes the root
>> node to be created with an isolation level of NONE. Hence the insane
>> behaviour when trying to contend for write locks on the root node.
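>>
>> The shape of the fix, roughly (a sketch only - createRoot() is a
>> hypothetical hook, the real initialisation code may differ): defer
>> creating the root until after the isolation level has been
>> configured, instead of initialising it at field declaration.
>>
>>     protected DataNode root; // no longer initialised inline at line 75
>>
>>     // called after the isolation level has been set, so the root
>>     // picks up the configured level instead of NONE
>>     protected void createRoot() {
>>         root = NodeFactory.getInstance().createDataNode(
>>                 NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,
>>                 Fqn.fromString(SEPARATOR), null, null, this);
>>     }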
>>
>> Just fixed this, running a bunch of regressions now.
>>
>> Cheers,
>> --
>> Manik Surtani
>>
>> Lead, JBoss Cache
>> JBoss, a division of Red Hat
>>
>> Email: manik@jboss.org
>> Telephone: +44 7786 702 706
>> MSN: manik@surtani.org
>> Yahoo/AIM/Skype: maniksurtani
>>
>>
>>
>> On 26 Nov 2006, at 19:04, Bela Ban wrote:
>>
>>> Forwarding to the entire group
>>>
>>> Manik Surtani wrote:
>>>> Ok, boiled it down to a contention issue on locking Fqn.ROOT,
>>>> which, prior to JBCACHE-875, was never locked - either for reading
>>>> or writing. (I do this by changing the loop in the lock() method
>>>> in PLI to first consider the root before the individual Fqn
>>>> elements.) (This problem is also apparent in
>>>> o.j.c.transaction.DeadlockTest on a multi-cpu box.)
>>>>
>>>> 2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL:
>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=<unlocked>
>>>> <snip>
>>>> 2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL:
>>>> fqn=/, caller=GlobalTransaction:<null>:3, lock=<unlocked>
>>>> <snip>
>>>> 2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL:
>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=write
>>>> owner=GlobalTransaction:<null>:2
>>>> <snip>
>>>> 2006-11-26 14:58:09,581 INFO [TxInterceptor] (Thread-3) There was
>>>> a problem handling this request
>>>> java.lang.IllegalStateException: there is already a writer holding
>>>> the lock: GlobalTransaction:<null>:2 and caller is
>>>> GlobalTransaction:<null>:3
>>>> at org.jboss.cache.lock.LockMap.setWriterIfNotNull(LockMap.java:101)
>>>> at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:187)
>>>> at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
>>>> at org.jboss.cache.Node.acquire(Node.java:517)
>>>> < snip - lots>
>>>> 2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child:
>>>> fqn=/, child_name=NODE
>>>>
>>>> I can't understand why concurrent WL acquisition in
>>>> IdentityLock.acquireWriteLock() behaves correctly for all nodes
>>>> except the root node. As you can see in the log snippet above,
>>>> both Thread-2 and Thread-3 call IdentityLock.acquireWriteLock()
>>>> (line 178) and get a 'true', and one of the threads causes an
>>>> exception on line 187.
>>>>
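>>>> Illustratively, the failing check-then-set sequence looks like
>>>> this (a sketch of what lines 178/187 appear to do - not the actual
>>>> IdentityLock code, and the field names are guesses):
>>>>
>>>>     boolean rc = writeLock.attempt(timeout);  // ~line 178: both threads get true
>>>>     if (rc) {
>>>>         // ~line 187: throws IllegalStateException if another writer
>>>>         // got in first - which should be impossible while the write
>>>>         // semaphore is held
>>>>         lockMap.setWriterIfNotNull(caller);
>>>>     }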
>>>>
>>>> --
>>>> Manik Surtani
>>>>
>>>> Lead, JBoss Cache
>>>> JBoss, a division of Red Hat
>>>>
>>>> Email: manik@jboss.org
>>>> Telephone: +44 7786 702 706
>>>> MSN: manik@surtani.org
>>>> Yahoo/AIM/Skype: maniksurtani
>>>>
>>>>
>>>>
>>>> On 26 Nov 2006, at 13:54, Manik Surtani wrote:
>>>>
>>>>> I didn't want to acquire the WL immediately since it involved an
>>>>> additional test to check if the next node in the fqn needed
>>>>> creation. But I went with that algorithm in the end since the
>>>>> locks had problems with concurrent readers attempting to upgrade
>>>>> to writers.
>>>>>
>>>>> So most of the regressions pass, as well as the new tests
>>>>> introduced, and I am very close to something working, EXCEPT one
>>>>> very strange problem with the IdentityLock and
>>>>> ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modifications()
>>>>> - get the latest on the 1.3.0 branch for this to make any sense.
>>>>> The problem is between lines 178 and 187 of IdentityLock, in the
>>>>> acquireWriteLock() method.
>>>>>
>>>>> Attempting to acquire the write lock returns true, but setting
>>>>> the writer throws an exception because another writer exists. Odd
>>>>> that this happens, since the calling thread should hold the
>>>>> semaphore by then; odder still that it only seems to happen in
>>>>> this one test, which is meant to exercise concurrency in the
>>>>> CacheLoaderInterceptor.
>>>>>
>>>>> I'm still investigating, but if you have any ideas about how and
>>>>> why this may happen, I'd love to hear it ... :-)
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Manik Surtani
>>>>>
>>>>> Lead, JBoss Cache
>>>>> JBoss, a division of Red Hat
>>>>>
>>>>> Email: manik@jboss.org
>>>>> Telephone: +44 7786 702 706
>>>>> MSN: manik@surtani.org
>>>>> Yahoo/AIM/Skype: maniksurtani
>>>>>
>>>>>
>>>>>
>>>>> On 24 Nov 2006, at 15:25, Bela Ban wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Manik Surtani wrote:
>>>>>>>
>>>>>>> On 24 Nov 2006, at 14:44, Bela Ban wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> The first one you mentioned can lead to race conditions,
>>>>>>> depending on whether the upgrade on b or the creation/WL on c
>>>>>>> happens first. What I've implemented is more like:
>>>>>>>
>>>>>>> 1: Acquire RL on a
>>>>>>> 2: Acquire RL on b
>>>>>>> 3: Identify that we need to create c.
>>>>>>> 4: Upgrade RL on b to WL
>>>>>>> 5: *now* create c, and acquire WL on it.
>>>>>>>
>>>>>>> So there is a possibility that step 4 may block until other
>>>>>>> readers on b release their locks, but no one else can grab the
>>>>>>> WL in the meantime since the current TX already holds a RL.
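>>>>>>>
>>>>>>> In rough code, the idea is (a sketch with made-up helper names,
>>>>>>> not the actual PLI implementation):
>>>>>>>
>>>>>>>     acquireReadLock(a, gtx, timeout);         // 1: RL on a
>>>>>>>     acquireReadLock(b, gtx, timeout);         // 2: RL on b
>>>>>>>     if (b.getChild("c") == null) {            // 3: c must be created
>>>>>>>         upgradeToWriteLock(b, gtx, timeout);  // 4: may block until other
>>>>>>>                                               //    readers drain, but our RL
>>>>>>>                                               //    keeps the WL from others
>>>>>>>         DataNode c = b.createChild("c");      // 5: create c ...
>>>>>>>         acquireWriteLock(c, gtx, timeout);    //    ... and WL it
>>>>>>>     }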
>>>>>>
>>>>>> I see. Why don't you acquire a WL on b (step 2) *immediately*,
>>>>>> rather than going through the upgrade, if you know you have to
>>>>>> acquire a WL later anyway? You might still deadlock:
>>>>>> 2: acquire RL on b
>>>>>> (in the meantime): some other TX acquires a RL on b, possibly
>>>>>> upgrades to a WL
>>>>>> 3: you want to acquire a WL on b and block on the other TX's RL
>>>>>> or WL
>>>>>>
>>>>>> --
>>>>>> Bela Ban
>>>>>> Lead JGroups / JBoss Clustering team
>>>>>> JBoss - a division of Red Hat
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Bela Ban
>>> Lead JGroups / JBoss Clustering team
>>> JBoss - a division of Red Hat
>>>
>>
>
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat