[jbosscache-dev] Re: Fundamental problem with pessimistic locking
Manik Surtani
manik at jboss.org
Fri Dec 1 09:19:14 EST 2006
I agree as well; just wanted to sound everyone on potential breakage
in (poorly written) end-user code.
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik at jboss.org
Telephone: +44 7786 702 706
MSN: manik at surtani.org
Yahoo/AIM/Skype: maniksurtani
On 1 Dec 2006, at 08:38, Bela Ban wrote:
> IMO correctness trumps speed. If developers map their data
> structures to JBC incorrectly, creating lots of concurrency issues
> it is THEIR problem. This is the same with mapping apps to a DB in
> essence.
> Folks can always opt for lower isolation levels and/or optimistic
> locking...
>
> Manik Surtani wrote:
>> So we still haven't discussed my biggest concern here, which is
>> item 5) below in the list of implications. Is this performance
>> penalty and potential for deadlocks small enough a price to pay
>> for the correctness of concurrent access on the root node? What
>> do people think?
>>
>>
>>
>>> *From:* Manik Surtani [mailto:manik at jboss.org]
>>> *Sent:* Monday, November 27, 2006 7:19 PM
>>> *To:* Manik Surtani
>>> *Cc:* Bela Ban; Ben Wang; Brian Stansberry; Vladimir Blagojevic;
>>> Galder Zamarreno
>>> *Subject:* Re: Fundamental problem with pessimistic locking
>>>
>>> Ok, this seems to work, making things a lot more 'correct'. But
>>> before I roll this into an official release and start making
>>> changes en-masse, porting this to 1.4.x and 2.0.0, I'd like to
>>> step back and think about whether this is what we really want.
>>> Here is what I've effectively done with 1.3.0.SP4, all related to
>>> pessimistic locking only:
>>>
>>> a) Added a mechanism for not removing nodes when remove() is
>>> called, and instead storing them in a map which can be referenced
>>> by concurrent threads and locks attempted. (Mutated version of
>>> Brian's original fix to JBCACHE-871)
>>> b) When locking nodes in PLI.lock(), added a mechanism to obtain
>>> a WL on a node if the next node after it needs to be created or
>>> removed. (JBCACHE-875)
>>> c) Modified PLI.lock() to start with Fqn.ROOT rather than Fqn.get
>>> (0), which applies the same cache-wide locking behaviour to the
>>> root as well. Prior to this, the root never was locked for
>>> anything.
>>> The implications of these, for the sake of accuracy and
>>> correctness, are possibly:
>>>
>>> 1) Performance impact on inspecting nodes in b) to decide on
>>> whether WLs are needed
>>> 2) Memory impact on maintaining a map of removed nodes in a)
>>> 3) Reduced concurrency due to overall stronger locks in b)
>>> 4) Much reduced concurrency because of the locking in c)
>>> 5) Potential of more deadlocks/timeouts because of 3) and 4) above.
>>>
>>> Of the above, 5) manifests itself in a few unit tests that have
>>> now started to fail (TxCacheLoaderTest, some state transfer
>>> tests, etc.). Simple example, taken from one of the failing
>>> tests, leads to a deadlock:
>>>
>>> 1: mgr.begin();
>>> 2: Transaction tx=mgr.getTransaction();
>>> 3: cache1.put("/one/two/three", "key1", "val1");
>>> 4: assertNull(cache2.get("/one/two/three", "key1"));
>>> 5: tx.commit();
>>>
>>> Line 3 obtains a WL on "/" on cache1, for GTX 1
>>> Line 4 obtains a WL on "/" on cache2, for GTX 2
>>> Line 5, on replication, tries to get a WL on "/" on cache2, for
>>> GTX 1
>>>
>>> Both GTXs relate to the same TX since they are in the same
>>> thread. Boom, deadlock.
>>>
>>> One thing here though, in my opinion, another bug in the original
>>> PLI:
>>>
>>> When doing a get on a node that doesn't exist, intermediate nodes
>>> are created. E.g., cache2.get("/one/two/three", "key1") actually
>>> ends up creating /one/two/three first, and after the JBCACHE-875
>>> fix, /, /one and /one/two will be WL'd for a get() on a
>>> nonexistent node!! Shouldn't the loop just be short-circuited
>>> such that at any point, if the next node does not exist and the
>>> lock_type requested is READ, just return a null? Saves us a
>>> whole bunch of unnecessary WL's ...
>>> Sorry about the long and rambling email. Thoughts and opinions?
>>> --
>>> Manik Surtani
>>>
>>> Lead, JBoss Cache
>>> JBoss, a division of Red Hat
>>>
>>> Email: manik at jboss.org <mailto:manik at jboss.org>
>>> Telephone: +44 7786 702 706
>>> MSN: manik at surtani.org <mailto:manik at surtani.org>
>>> Yahoo/AIM/Skype: maniksurtani
>>>
>>>
>>>
>>> On 27 Nov 2006, at 10:16, Manik Surtani wrote:
>>>
>>>> Ok, take away the crap about it being a bug in the
>>>> util.concurrent code. It's a bug in JBC, specifically on line
>>>> 75 in TreeCache.java:
>>>>
>>>> protected DataNode root = NodeFactory.getInstance
>>>> ().createDataNode(NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,
>>>> Fqn.fromString(SEPARATOR), null, null, this);
>>>>
>>>> :-) The root node is initialised when new TreeCache() is
>>>> called, well before isolation levels, etc. are set, which causes
>>>> the root node to be created with isolation level of NONE. Hence
>>>> the insane behaviour when trying to content for write locks on
>>>> the root node.
>>>>
>>>> Just fixed this, running a bunch of regressions now.
>>>>
>>>> Cheers,
>>>> --
>>>> Manik Surtani
>>>>
>>>> Lead, JBoss Cache
>>>> JBoss, a division of Red Hat
>>>>
>>>> Email: manik at jboss.org <mailto:manik at jboss.org>
>>>> Telephone: +44 7786 702 706
>>>> MSN: manik at surtani.org <mailto:manik at surtani.org>
>>>> Yahoo/AIM/Skype: maniksurtani
>>>>
>>>>
>>>>
>>>> On 26 Nov 2006, at 19:04, Bela Ban wrote:
>>>>
>>>>> Forwarding to the entire group
>>>>>
>>>>> Manik Surtani wrote:
>>>>>> Ok, boiled it down to a contention issue on locking Fqn.ROOT,
>>>>>> which prior to JBCACHE-875, was never locked - either for
>>>>>> reading or writing. (I do this by changing the loop in the
>>>>>> lock() method in PLI to first consider the root before the
>>>>>> individual Fqn elements). (This problem is also apparent in
>>>>>> o.j.c.transaction.DeadlockTest on a multi-cpu box).
>>>>>>
>>>>>> 2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL:
>>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=<unlocked>
>>>>>> <snip>
>>>>>> 2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL:
>>>>>> fqn=/, caller=GlobalTransaction:<null>:3, lock=<unlocked>
>>>>>> <snip>
>>>>>> 2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL:
>>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=write
>>>>>> owner=GlobalTransaction:<null>:2
>>>>>> <snip>
>>>>>> 2006-11-26 14:58:09,581 INFO [TxInterceptor] (Thread-3) There
>>>>>> was a problem handling this request
>>>>>> java.lang.IllegalStateException: there is already a writer
>>>>>> holding the lock: GlobalTransaction:<null>:2 and caller is
>>>>>> GlobalTransaction:<null>:3
>>>>>> at org.jboss.cache.lock.LockMap.setWriterIfNotNull
>>>>>> (LockMap.java:101)
>>>>>> at org.jboss.cache.lock.IdentityLock.acquireWriteLock
>>>>>> (IdentityLock.java:187)
>>>>>> at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
>>>>>> at org.jboss.cache.Node.acquire(Node.java:517)
>>>>>> < snip - lots>
>>>>>> 2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child:
>>>>>> fqn=/, child_name=NODE
>>>>>>
>>>>>> I can't understand why concurrent WL acquisition in
>>>>>> IdentityLock.acquireWriteLock() behaves correctly for all
>>>>>> nodes except the root node. As you can see in the log snippet
>>>>>> above, both Thread-2 and Thread-3 call
>>>>>> IdentityLock.acquireWriteLock (line 178) and get a 'true', and
>>>>>> one of the threads cause an exception on line 187.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Manik Surtani
>>>>>>
>>>>>> Lead, JBoss Cache
>>>>>> JBoss, a division of Red Hat
>>>>>>
>>>>>> Email: manik at jboss.org <mailto:manik at jboss.org>
>>>>>> Telephone: +44 7786 702 706
>>>>>> MSN: manik at surtani.org <mailto:manik at surtani.org>
>>>>>> Yahoo/AIM/Skype: maniksurtani
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 26 Nov 2006, at 13:54, Manik Surtani wrote:
>>>>>>
>>>>>>> I didn't want to acquire the WL immediately since it involved
>>>>>>> an additional test to check if the next node in the fqn
>>>>>>> needed creation. But I went with that algorithm in the end
>>>>>>> since the locks had problems with concurrent readers
>>>>>>> attempting to upgrade to writers.
>>>>>>>
>>>>>>> So most of the regressions pass, as well as the new tests
>>>>>>> introduced, and I am very close to something working, EXCEPT
>>>>>>> one very strange problem with the IdentityLock and
>>>>>>> ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modification
>>>>>>> s() - get the latest on the 1.3.0 branch for this to make any
>>>>>>> sense. The problem is between lines 178 and 187 of
>>>>>>> IdentityLock, in the acquireWriteLock() method.
>>>>>>> Attempting to get a hold of a write lock returns true, but
>>>>>>> setting the writer throws an exception since another writer
>>>>>>> exists. Odd that this happens since the calling thread
>>>>>>> should have the semaphore by then, also odd that this only
>>>>>>> seems to happen in this one test which is meant to test
>>>>>>> concurrency in the CacheLoaderInterceptor.
>>>>>>> I'm still investigating, but if you have any ideas about how
>>>>>>> and why this may happen, I'd love to hear it ... :-)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> --
>>>>>>> Manik Surtani
>>>>>>>
>>>>>>> Lead, JBoss Cache
>>>>>>> JBoss, a division of Red Hat
>>>>>>>
>>>>>>> Email: manik at jboss.org <mailto:manik at jboss.org>
>>>>>>> Telephone: +44 7786 702 706
>>>>>>> MSN: manik at surtani.org <mailto:manik at surtani.org>
>>>>>>> Yahoo/AIM/Skype: maniksurtani
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 24 Nov 2006, at 15:25, Bela Ban wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Manik Surtani wrote:
>>>>>>>>>
>>>>>>>>> On 24 Nov 2006, at 14:44, Bela Ban wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The first one you mentioned can lead to race conditions,
>>>>>>>>> depending on the order of whether the upgrade on b or the
>>>>>>>>> creation/WL on c happens first. What I've implemented is
>>>>>>>>> more like:
>>>>>>>>>
>>>>>>>>> 1: Acquire RL on a
>>>>>>>>> 2: Acquire RL on b
>>>>>>>>> 3: Identify that we need to create c.
>>>>>>>>> 4: Upgrade RL on b to WL
>>>>>>>>> 5: *now* create c, and acquire WL on it.
>>>>>>>>>
>>>>>>>>> So there is a possibility that step 4 may block until other
>>>>>>>>> readers on b release their locks, but no one else could
>>>>>>>>> grab the WL since the current TX will have a RL.
>>>>>>>>
>>>>>>>> I see. Why don't you acquire a WL on b (step 2)
>>>>>>>> *immediately* rather than going through the upgrade if you
>>>>>>>> know you have to acquire a WL later anyway ? You might still
>>>>>>>> deadlock:
>>>>>>>> 2: acquire RL on b
>>>>>>>> (in the meantime): some other TX acquires a RL on b,
>>>>>>>> possibly upgrades to WL
>>>>>>>> 3: you want to acquire a WL on b and block on the other TX's
>>>>>>>> RL or WL
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bela Ban
>>>>>>>> Lead JGroups / JBoss Clustering team
>>>>>>>> JBoss - a division of Red Hat
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Bela Ban
>>>>> Lead JGroups / JBoss Clustering team
>>>>> JBoss - a division of Red Hat
>>>>>
>>>>
>>>
>>
>
> --
> Bela Ban
> Lead JGroups / JBoss Clustering team
> JBoss - a division of Red Hat
>
More information about the jbosscache-dev
mailing list