I agree as well; just wanted to sound everyone out on potential breakage
in (poorly written) end-user code.
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik(a)jboss.org
Telephone: +44 7786 702 706
MSN: manik(a)surtani.org
Yahoo/AIM/Skype: maniksurtani
On 1 Dec 2006, at 08:38, Bela Ban wrote:
IMO correctness trumps speed. If developers map their data
structures to JBC incorrectly, creating lots of concurrency issues,
it is THEIR problem. This is the same as mapping apps to a DB, in
essence.
Folks can always opt for lower isolation levels and/or optimistic
locking...
Manik Surtani wrote:
> So we still haven't discussed my biggest concern here, which is
> item 5) below in the list of implications. Is this performance
> penalty and potential for deadlocks a small enough price to pay
> for the correctness of concurrent access on the root node? What
> do people think?
>
>
>
>> *From:* Manik Surtani [mailto:manik@jboss.org]
>> *Sent:* Monday, November 27, 2006 7:19 PM
>> *To:* Manik Surtani
>> *Cc:* Bela Ban; Ben Wang; Brian Stansberry; Vladimir Blagojevic;
>> Galder Zamarreno
>> *Subject:* Re: Fundamental problem with pessimistic locking
>>
>> Ok, this seems to work, making things a lot more 'correct'. But
>> before I roll this into an official release and start making
>> changes en masse, porting this to 1.4.x and 2.0.0, I'd like to
>> step back and think about whether this is what we really want.
>> Here is what I've effectively done with 1.3.0.SP4, all related to
>> pessimistic locking only:
>>
>> a) Added a mechanism for not removing nodes when remove() is
>> called, instead storing them in a map so that concurrent threads
>> can reference them and attempt locks on them. (Mutated version of
>> Brian's original fix to JBCACHE-871)
>> b) When locking nodes in PLI.lock(), added a mechanism to obtain
>> a WL on a node if the next node after it needs to be created or
>> removed. (JBCACHE-875)
>> c) Modified PLI.lock() to start with Fqn.ROOT rather than
>> Fqn.get(0), which applies the same cache-wide locking behaviour
>> to the root as well. Prior to this, the root was never locked for
>> anything.
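To make (a) concrete, here is a rough standalone sketch of the removed-nodes map idea - all names invented for illustration, not Brian's actual JBCACHE-871 code - where remove() tombstones the node into a shared map so concurrent threads still have something to attempt a lock on:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "keep removed nodes around" mechanism; invented
// names, not the actual JBCACHE-871 fix.
public class RemovedNodesSketch {
    static final class Node {
        volatile boolean removed;
    }

    final Map<String, Node> tree = new ConcurrentHashMap<>();
    final Map<String, Node> removedNodes = new ConcurrentHashMap<>();

    void remove(String fqn) {
        Node n = tree.remove(fqn);
        if (n != null) {
            n.removed = true;          // mark as a tombstone
            removedNodes.put(fqn, n);  // keep it so others can lock it
        }
    }

    /** Concurrent threads look here when a node is gone from the tree,
     *  so they can still attempt the node's lock instead of failing. */
    Node findForLocking(String fqn) {
        Node n = tree.get(fqn);
        return n != null ? n : removedNodes.get(fqn);
    }

    public static void main(String[] args) {
        RemovedNodesSketch cache = new RemovedNodesSketch();
        cache.tree.put("/a/b", new Node());
        cache.remove("/a/b");
        Node n = cache.findForLocking("/a/b");
        System.out.println(n != null && n.removed); // true
    }
}
```

The memory cost in implication 2) below is exactly this second map; entries presumably have to be purged on tx completion or they leak.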
>> The implications of these, for the sake of accuracy and
>> correctness, are possibly:
>>
>> 1) Performance impact on inspecting nodes in b) to decide on
>> whether WLs are needed
>> 2) Memory impact on maintaining a map of removed nodes in a)
>> 3) Reduced concurrency due to overall stronger locks in b)
>> 4) Much reduced concurrency because of the locking in c)
>> 5) Potential for more deadlocks/timeouts because of 3) and 4) above.
>>
>> Of the above, 5) manifests itself in a few unit tests that have
>> now started to fail (TxCacheLoaderTest, some state transfer
>> tests, etc.). A simple example, taken from one of the failing
>> tests, leads to a deadlock:
>>
>> 1: mgr.begin();
>> 2: Transaction tx = mgr.getTransaction();
>> 3: cache1.put("/one/two/three", "key1", "val1");
>> 4: assertNull(cache2.get("/one/two/three", "key1"));
>> 5: tx.commit();
>>
>> Line 3 obtains a WL on "/" on cache1, for GTX 1
>> Line 4 obtains a WL on "/" on cache2, for GTX 2
>> Line 5, on replication, tries to get a WL on "/" on cache2, for
>> GTX 1
>>
>> Both GTXs relate to the same TX since they are in the same
>> thread. Boom, deadlock.
>>
>> One more thing here though, in my opinion another bug in the
>> original PLI:
>>
>> When doing a get on a node that doesn't exist, intermediate nodes
>> are created. E.g., cache2.get("/one/two/three", "key1") actually
>> ends up creating /one/two/three first, and after the JBCACHE-875
>> fix, /, /one and /one/two will be WL'd for a get() on a
>> nonexistent node!! Shouldn't the loop just be short-circuited
>> such that, at any point, if the next node does not exist and the
>> lock_type requested is READ, we just return null? That saves us a
>> whole bunch of unnecessary WLs ...
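Roughly, the short-circuit would look like this (a standalone sketch with invented names, not the actual PLI code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed short-circuited lock loop; invented names,
// not the real PessimisticLockInterceptor.
public class LockLoopSketch {
    enum LockType { READ, WRITE }

    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int writeLocks; // stand-in for real WL acquisition
    }

    /** Walks the path, returning the target node, or null for a READ
     *  on a missing node -- with no intermediate nodes created and
     *  no write locks taken. */
    static Node lock(Node root, String[] path, LockType type) {
        Node current = root;
        for (String name : path) {
            Node child = current.children.get(name);
            if (child == null) {
                if (type == LockType.READ) return null; // short-circuit
                current.writeLocks++;                   // WL on parent to create
                child = new Node();
                current.children.put(name, child);
            }
            current = child;
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = new Node();
        String[] path = {"one", "two", "three"};
        // get() on a nonexistent node: nothing created, nothing WL'd.
        System.out.println(lock(root, path, LockType.READ)); // null
        System.out.println(root.children.isEmpty());         // true
    }
}
```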
>> Sorry about the long and rambling email. Thoughts and opinions?
>>
>>
>>
>> On 27 Nov 2006, at 10:16, Manik Surtani wrote:
>>
>>> Ok, take away the crap about it being a bug in the
>>> util.concurrent code. It's a bug in JBC, specifically on line
>>> 75 in TreeCache.java:
>>>
>>> protected DataNode root =
>>> NodeFactory.getInstance().createDataNode(NodeFactory.NODE_TYPE_TREENODE,
>>> SEPARATOR, Fqn.fromString(SEPARATOR), null, null, this);
>>>
>>> :-) The root node is initialised when new TreeCache() is
>>> called, well before isolation levels, etc. are set, which causes
>>> the root node to be created with an isolation level of NONE.
>>> Hence the insane behaviour when trying to contend for write
>>> locks on the root node.
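The same initialisation-order trap can be reproduced in a few lines of plain Java (hypothetical names, not the actual TreeCache code): a field initialiser runs during construction and freezes whatever default is in effect at that moment.

```java
// Minimal reproduction of the initialisation-order bug; invented
// names, not the actual TreeCache/NodeFactory code.
public class InitOrderDemo {
    enum IsolationLevel { NONE, REPEATABLE_READ }

    static final class Node {
        final IsolationLevel level;
        Node(IsolationLevel level) { this.level = level; }
    }

    static final class Cache {
        IsolationLevel isolationLevel = IsolationLevel.NONE; // default

        // The field initialiser runs inside the constructor, before any
        // setter can possibly be called -- so the root is frozen at NONE.
        final Node root = new Node(isolationLevel);

        void setIsolationLevel(IsolationLevel l) { isolationLevel = l; }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        cache.setIsolationLevel(IsolationLevel.REPEATABLE_READ);
        System.out.println(cache.root.level); // NONE, not REPEATABLE_READ
    }
}
```

The fix is presumably either deferring root creation until configuration is complete, or re-reading the isolation level at lock time rather than capturing it at construction.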
>>>
>>> Just fixed this, running a bunch of regressions now.
>>>
>>> Cheers,
>>>
>>>
>>>
>>> On 26 Nov 2006, at 19:04, Bela Ban wrote:
>>>
>>>> Forwarding to the entire group
>>>>
>>>> Manik Surtani wrote:
>>>>> Ok, boiled it down to a contention issue on locking Fqn.ROOT,
>>>>> which, prior to JBCACHE-875, was never locked - either for
>>>>> reading or writing. (I do this by changing the loop in the
>>>>> lock() method in PLI to first consider the root before the
>>>>> individual Fqn elements.) (This problem is also apparent in
>>>>> o.j.c.transaction.DeadlockTest on a multi-CPU box.)
>>>>>
>>>>> 2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL:
>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=<unlocked>
>>>>> <snip>
>>>>> 2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL:
>>>>> fqn=/, caller=GlobalTransaction:<null>:3, lock=<unlocked>
>>>>> <snip>
>>>>> 2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL:
>>>>> fqn=/, caller=GlobalTransaction:<null>:2, lock=write
>>>>> owner=GlobalTransaction:<null>:2
>>>>> <snip>
>>>>> 2006-11-26 14:58:09,581 INFO [TxInterceptor] (Thread-3) There
>>>>> was a problem handling this request
>>>>> java.lang.IllegalStateException: there is already a writer
>>>>> holding the lock: GlobalTransaction:<null>:2 and caller is
>>>>> GlobalTransaction:<null>:3
>>>>> at org.jboss.cache.lock.LockMap.setWriterIfNotNull(LockMap.java:101)
>>>>> at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:187)
>>>>> at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
>>>>> at org.jboss.cache.Node.acquire(Node.java:517)
>>>>> <snip - lots>
>>>>> 2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child:
>>>>> fqn=/, child_name=NODE
>>>>>
>>>>> I can't understand why concurrent WL acquisition in
>>>>> IdentityLock.acquireWriteLock() behaves correctly for all
>>>>> nodes except the root node. As you can see in the log snippet
>>>>> above, both Thread-2 and Thread-3 call
>>>>> IdentityLock.acquireWriteLock (line 178) and get a 'true', and
>>>>> one of the threads causes an exception on line 187.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 26 Nov 2006, at 13:54, Manik Surtani wrote:
>>>>>
>>>>>> I didn't want to acquire the WL immediately since it involved
>>>>>> an additional test to check if the next node in the fqn
>>>>>> needed creation. But I went with that algorithm in the end
>>>>>> since the locks had problems with concurrent readers
>>>>>> attempting to upgrade to writers.
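For what it's worth, the JDK's ReentrantReadWriteLock (used here purely as a standalone stand-in; I haven't traced the util.concurrent internals JBC actually uses) has the same limitation: a write lock is never granted while the caller still holds its own read lock, so an in-place upgrade is impossible without releasing first.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demonstrates that read->write upgrades are not supported by
// java.util.concurrent's ReentrantReadWriteLock.
public class UpgradeDemo {
    public static void main(String[] args) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock(); // step 2: RL on b

        // Attempting the "upgrade" while still holding the RL:
        // writeLock().lock() would block forever here, so tryLock()
        // is the only safe probe.
        System.out.println(lock.writeLock().tryLock()); // false

        lock.readLock().unlock();
        System.out.println(lock.writeLock().tryLock()); // true, once the RL is gone
    }
}
```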
>>>>>>
>>>>>> So most of the regressions pass, as well as the new tests
>>>>>> introduced, and I am very close to something working, EXCEPT
>>>>>> one very strange problem with the IdentityLock and
>>>>>> ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modifications()
>>>>>> - get the latest on the 1.3.0 branch for this to make any
>>>>>> sense. The problem is between lines 178 and 187 of
>>>>>> IdentityLock, in the acquireWriteLock() method.
>>>>>> Attempting to get hold of a write lock returns true, but
>>>>>> setting the writer throws an exception since another writer
>>>>>> exists. Odd that this happens, since the calling thread
>>>>>> should have the semaphore by then; also odd that this only
>>>>>> seems to happen in this one test, which is meant to test
>>>>>> concurrency in the CacheLoaderInterceptor.
>>>>>> I'm still investigating, but if you have any ideas about how
>>>>>> and why this may happen, I'd love to hear it ... :-)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24 Nov 2006, at 15:25, Bela Ban wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Manik Surtani wrote:
>>>>>>>>
>>>>>>>> On 24 Nov 2006, at 14:44, Bela Ban wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> The first one you mentioned can lead to race conditions,
>>>>>>>> depending on whether the upgrade on b or the creation/WL on
>>>>>>>> c happens first. What I've implemented is more like:
>>>>>>>>
>>>>>>>> 1: Acquire RL on a
>>>>>>>> 2: Acquire RL on b
>>>>>>>> 3: Identify that we need to create c.
>>>>>>>> 4: Upgrade RL on b to WL
>>>>>>>> 5: *now* create c, and acquire WL on it.
>>>>>>>>
>>>>>>>> So there is a possibility that step 4 may block until other
>>>>>>>> readers on b release their locks, but no one else could
>>>>>>>> grab the WL since the current TX will have a RL.
>>>>>>>
>>>>>>> I see. Why don't you acquire a WL on b (step 2)
>>>>>>> *immediately*, rather than going through the upgrade, if you
>>>>>>> know you have to acquire a WL later anyway? You might still
>>>>>>> deadlock:
>>>>>>> 2: acquire RL on b
>>>>>>> (in the meantime): some other TX acquires a RL on b,
>>>>>>> possibly upgrades to WL
>>>>>>> 3: you want to acquire a WL on b and block on the other TX's
>>>>>>> RL or WL
>>>>>>>
>>>>>>> --
>>>>>>> Bela Ban
>>>>>>> Lead JGroups / JBoss Clustering team
>>>>>>> JBoss - a division of Red Hat
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>